Automatic Machine Learning Algorithm Configuration in R

Introduction

automlr is an R package for automatically configuring mlr machine learning algorithms so that they perform well. It is designed to be simple to use and to run with minimal user intervention.

automlr is currently under development. You can see the current status in the ‘develop’ branch, but that branch may or may not be functional.

What does `automlr` offer?

automlr complements mlr to make optimization over multiple learners and their respective hyperparameters possible. It offers:

- A database of sensible hyperparameter search ranges for many mlr learners and preprocessing operators.
- A process for combining these learners and preprocessing operations into a single Learner object, and their hyperparameters into a single search space.
- A unified interface to optimization algorithms, designed to make it easy to continue optimization runs.

Installation

Note: Installation of automlr is currently only tested on Linux systems. Installation on other systems, especially MS Windows, might not work.

Installation will fail if roxygen2 is not installed, so make sure it is:

if (!require("roxygen2")) install.packages("roxygen2")

automlr furthermore depends on my CPO extension to mlr, which must be installed from GitHub. Install my private branch, which is kept in a state compatible with automlr:

devtools::install_github("mb706/mlr",
    ref = "mb706_CPO")

Then you can install automlr from source. automlr can work with many mlr learners; however, many of them have package dependencies. automlr doesn’t need these packages and will skip learners that are not installed, but without them, the search space will be incomplete (and a warning will be given). To install automlr and all referenced learner packages (this can take a while!), do

devtools::install_github("mlr-org/automlr",
    dependencies = c("Depends", "Imports", "Suggests"))

Alternatively, to only install automlr (and its essential dependencies), do

devtools::install_github("mlr-org/automlr")

It is highly recommended to install my forks of the e1071, kernlab and mda packages, which fix bugs that otherwise regularly lead to R crashing or hanging:

devtools::install_github("mb706/e1071")
devtools::install_github("mb706/kernlab")
devtools::install_github("mb706/mda")
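
Because automlr skips learners whose packages are missing, it can help to check beforehand which packages are available. The following is a minimal sketch in base R. The vector only lists packages named in this README, not the full set that the automlr search space refers to.

```r
# Packages mentioned in this README; extend the vector as needed.
pkgs = c("roxygen2", "devtools", "mlr", "automlr", "e1071", "kernlab", "mda")

# requireNamespace() returns TRUE or FALSE without attaching the package.
available = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report what is missing, if anything.
missing.pkgs = names(available)[!available]
if (length(missing.pkgs) > 0) {
  message("Not installed: ", paste(missing.pkgs, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```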

Usage

To run a small example to fit some learners on the mlr-provided pid.task, execute

library("mlr")
library("automlr")
# depending on your RNG luck, this can take tens of minutes
amrun = automlr(pid.task, backend = "random",
  budget = c(evals = 10), verbosity = 1)
result = amfinish(amrun)
print(result, verbose = TRUE)

This already shows all the mandatory arguments of automlr: The task for which to optimize, the backend to use (may be “random”, “irace” or “mbo”), and a computational budget. The resulting object can be given to another automlr call with a different budget to continue optimizing, or to amfinish to finalize the run.
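
Continuing a run could look like the following sketch. It assumes that the object returned by automlr can be passed back as the first argument together with a new budget, as described above. Whether the new budget counts in total or in addition to the evaluations already done is not stated here, so check ?automlr. The variable names are only illustrative.

```r
first.budget = c(evals = 10)
second.budget = c(evals = 20)

# Runs only if mlr and automlr are installed.
if (requireNamespace("mlr", quietly = TRUE) &&
    requireNamespace("automlr", quietly = TRUE)) {
  library("mlr")
  library("automlr")

  # First leg of the optimization.
  amrun = automlr(pid.task, backend = "random",
    budget = first.budget, verbosity = 1)

  # Second leg: hand the returned object back with a different budget.
  amrun = automlr(amrun, budget = second.budget)

  # Finalize and inspect the result.
  result = amfinish(amrun)
  print(result, verbose = TRUE)
}
```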

You can subset the search space like so:

amrun2 = automlr(pid.task, backend = "random", budget = c(evals = 10),
    searchspace = list(mlrLearners$classif.randomForest, mlrLearners$classif.svm))

or

amrun2 = automlr(pid.task, backend = "random", budget = c(evals = 10),
    searchspace = list(mlrLearners[c("classif.randomForest", "classif.svm")]))

The functions and data exported by automlr that will be of interest to the user:

automlr invocation

automlr
The main entry point; can be called with a task and a backend, or with an object that was returned by a previous automlr invocation, or even with a file name that was used by automlr to save the state. The user can choose:
- which backend to use (backend)
- the computational budget (budget)
- a possible savefile (savefile) and the interval in which to save to a file (save.interval)
- a measure for which to optimize, if not the task’s default measure (measure)
- the verbosity level (verbosity), ranging from 0 to 6, with 0 being the least verbose. Level 6 also stops the optimization process when a learner returns an error.

makeBackendconfRandom, makeBackendconfMbo, makeBackendconfIrace
Make backend configuration objects to use instead of the backend strings (“random” etc.).

amfinish
Generates an AMResult object that contains information about the optimization result.

mlrLearners, mlrLearnersNoWrappers
A collection of mlr learners with corresponding search space. mlrLearnersNoWrappers does not contain preprocessing wrappers.

mlrLightweight, mlrLightweightNoWrappers
These are search spaces similar to mlrLearners and mlrLearnersNoWrappers, but with the slowest learners removed. This decreases evaluation time and is also necessary for the “mbo” backend to work.

searchspace definition

autolearner
Defines your own mlr learner to put in a search space.

autoWrapper
Defines an mlr wrapper to use in a search space.

sp
Defines parameters that are given to autolearner.

See their respective R documentation for more information and additional arguments.

Troubleshooting

Segfaults

Unfortunately, some learners, especially ones that use native code, may crash the whole R session. In addition, a recent Linux kernel release apparently caused problems with rJava packages. If you see segfaults, try the following:

- Run export _JAVA_OPTIONS="-Xss2560k -Xmx2g" in your shell before starting R; alternatively, run options(java.parameters = c("-Xss2560k", "-Xmx2g")) at the beginning of your R session. This may help even if the crash happens in a non-Java learner.
- Use setDefaultRWTBackend("fork"). This causes all learners to be run in a separate process. See the issue concerning the “fork” backend below, however.
- Run automlr with a small value for save.interval and have a process in place that restarts R from the savefile after a segfault.
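
The third suggestion can be sketched as a small supervisor written in base R. This is only an illustration: the function runUntilSuccess and the script name worker.R are hypothetical and not part of automlr.

```r
# Rerun a command until it exits with status 0, up to max.restarts times.
# Returns the number of attempts that were needed.
runUntilSuccess = function(command, args, max.restarts = 20) {
  for (attempt in seq_len(max.restarts)) {
    # system2() returns the exit status; a crashed R session gives non-zero.
    status = system2(command, args)
    if (identical(as.integer(status), 0L)) {
      return(attempt)
    }
    message("Attempt ", attempt, " exited with status ", status, "; restarting")
  }
  stop("giving up after ", max.restarts, " attempts")
}

# Hypothetical usage: "worker.R" would start automlr with a savefile and a
# small save.interval on its first run, and call automlr with the savefile
# name on later runs so that the optimization resumes from the saved state.
# rscript = file.path(R.home("bin"), "Rscript")
# runUntilSuccess(rscript, "worker.R")
```

The supervisor runs in its own R process, so a segfault in the worker does not take it down.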

Timeout Overrun

The default “native” backend for interrupting learners that run over time cannot stop learners that spend a long time in native (C/Fortran) code routines. Use setDefaultRWTBackend("fork") to kill slow learners effectively, at the cost of some performance. However, see the following issue.

setDefaultRWTBackend(“fork”) causes hangs

This happens if you use automlr with the “fork” backend and a learner uses Java. Currently, there is no way of using the fork backend with Java-based learners. Use the mlrLightweightNoJava searchspace to exclude all Java-based learners.
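
Put together, a run with the “fork” backend and without Java-based learners might look like this sketch. It uses only names mentioned in this README; the variable rwt.backend and the budget are illustrative.

```r
rwt.backend = "fork"

# Runs only if mlr and automlr are installed.
if (requireNamespace("mlr", quietly = TRUE) &&
    requireNamespace("automlr", quietly = TRUE)) {
  library("mlr")
  library("automlr")

  # Run every learner in a separate process so that slow ones can be killed.
  setDefaultRWTBackend(rwt.backend)

  # Leave out the Java-based learners, which hang under the "fork" backend.
  amrun = automlr(pid.task, backend = "random", budget = c(evals = 10),
    searchspace = mlrLightweightNoJava)
}
```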

Empty result when using “walltime” budget

If you are running automlr with a “walltime” budget, be aware that a hard execution time limit is set to 10% of the walltime budget + 10 minutes, after which the current irace or mlrMBO cycle is killed. To avoid this behaviour, set max.walltime.overrun to a larger value, possibly Inf.
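
As a worked example of that limit, the sketch below computes the default allowance for a one-hour budget. It rests on three assumptions: that the walltime budget is given in seconds, that the limit is an allowance on top of the budget (as the name max.walltime.overrun suggests), and that max.walltime.overrun is an argument of automlr. The helper defaultOverrun is hypothetical.

```r
# Default allowance as described above: 10% of the budget plus 10 minutes.
# Assumes the walltime budget is given in seconds.
defaultOverrun = function(walltime) 0.1 * walltime + 600

walltime = 3600                      # one hour
overrun = defaultOverrun(walltime)   # 960 seconds, i.e. 16 minutes
message("Default overrun allowance: ", overrun, " seconds")

# Runs only if mlr and automlr are installed.
if (requireNamespace("mlr", quietly = TRUE) &&
    requireNamespace("automlr", quietly = TRUE)) {
  library("mlr")
  library("automlr")

  # Disable the hard limit altogether (assumed argument placement).
  amrun = automlr(pid.task, backend = "irace",
    budget = c(walltime = walltime), max.walltime.overrun = Inf)
}
```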

Optimization Takes Too Long

Unfortunately, the runtime of different learners varies widely. To exclude the most problematic learners, use searchspace = mlrLightweight when calling automlr.

If a single evaluation is stuck in a loop and does not finish, it is possible that this is a bug in the learner. If you can provide useful information about a bug, please open an “Issue” on GitHub. Gather this information using gdb or your debugger of choice (if you know your way around one); otherwise try to find a way to reproduce the behaviour. I (and probably the learner package’s developer) would be very happy to track down and fix these kinds of bugs.

Maximal Number of DLLs reached

This happens because R is very conservative about how many DLLs it allows to be loaded. If you are using R >= 3.4, one solution is to set the environment variable `R_MAX_NUM_DLLS` to something greater than 100, as found out here. Otherwise, reduce the number of learners you are using in your searchspace.

If you do this, note that your `ulimit -n` might also need adjusting.
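
To see how close a session is to the limit, the loaded DLLs can be counted with base R, as in this small sketch. The value 150 in the comment is an arbitrary example.

```r
# Number of DLLs currently loaded in this session.
n.loaded = length(getLoadedDLLs())

# Value of R_MAX_NUM_DLLS as seen by this session ("" if it was not set).
max.dlls = Sys.getenv("R_MAX_NUM_DLLS")

message("Loaded DLLs: ", n.loaded,
  "; R_MAX_NUM_DLLS: ", if (nzchar(max.dlls)) max.dlls else "(not set)")

# The variable is read when R starts, so set it beforehand, e.g. with a line
# such as `R_MAX_NUM_DLLS=150` in ~/.Renviron.
```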

Project status

The project is currently undergoing heavy development; while the spirit of the application is expected to be stable, the user interface may undergo slight changes in the future. Expect the internals of automlr to be changing regularly.

Notes

- The “irace” backend’s behaviour deviates slightly from that of the irace package insofar as the number of evaluations per generation, and the slimming of the sampling distribution, are independent of the budget.
- The “mbo” backend currently uses an inferior imputation method for the surrogate model, and its performance should not be seen as representative of mlrMBO.
- For tasks with tens of features and thousands of rows, expect automlr to use about 0.5-2MB of memory per row of data.

Project TODO

(under consideration, subject to change)

- [ ] release 0.3
  - [ ] integration of wrapper CPOs
- [ ] release 0.4
  - [ ] nicer printing of results
  - [ ] consistent randomness
    - [ ] test that execution with same seed gets same result
    - [ ] use seeds in learners that use external RNGs
  - [ ] memory handling
  - [ ] searchspace
    - [ ] respect parameter equality IDs
    - [ ] automatically recognize absence of learner (in a hypothetical future mlr version) and don’t throw an error
  - [ ] tests
    - [ ] 100% test coverage
    - [ ] test for all possible wrong arguments
    - [ ] other things?
  - [ ] regression learners
  - [ ] installation on Win32
  - [ ] more empirical grounding for mlrLightweight
- [ ] release 0.5
  - [ ] more sophisticated search space extensions
    - [ ] metalearner wrappers
- [ ] release 0.6
  - [ ] cleaning up
    - [ ] consistent solution for timeouts, the current one is not stable
    - [ ] remove Ctrl-C handler, R does not work like this
  - [ ] CPOs
    - [ ] do CPO wrapping the correct way
    - [ ] use Meta-CPO
    - [ ] make CPO types etc. work together
- [ ] release 1.0
  - [ ] everything is really, really stable
- [ ] possible future releases
  - [ ] other backends?
  - [ ] simultaneous multiple task optimization
  - [ ] batchJobs integration? (e.g. break run down into smaller jobs automatically)
  - [ ] priors for learners?

