automlr is an R-package for automatically configuring mlr machine learning algorithms so that they perform well. It is designed for simplicity of use and able to run with minimal user intervention.
`automlr` is currently under development. You can see the current status in the ‘develop’ branch, but that branch may or may not be functional.
`automlr` complements `mlr` to make optimization over multiple learners and their respective hyperparameters possible. It offers:
- A database of sensible hyperparameter search ranges for many `mlr` learners and preprocessing operators.
- A process of combining these learners and preprocessing operations into a single `Learner` object, and the respective hyperparameters into a single searchspace.
- A unified interface to optimization algorithms designed to enable easy optimization run continuations.
Note: Installation of automlr is currently only tested on Linux systems. Installation on other systems, especially MS Windows, might not work.
Installation will fail if roxygen2 is not installed, so make sure it is:
if (!require("roxygen2")) install.packages("roxygen2")
`automlr` furthermore depends on my CPO extension to mlr, which must be installed from GitHub. Install my private branch, which is kept in a state compatible with `automlr`.
devtools::install_github("mb706/mlr",
ref = "mb706_CPO")
Then you can install `automlr` from source. `automlr` can work with many mlr learners; however, many of them have package dependencies. `automlr` doesn’t need these packages and will skip learners that are not installed, but without them, the search space will be incomplete (and a warning will be given). To install `automlr` and all referenced learner packages (this can take a while!), do
devtools::install_github("mlr-org/automlr",
dependencies = c("Depends", "Imports", "Suggests"))
Alternatively, to only install `automlr` (and its essential dependencies), do
devtools::install_github("mlr-org/automlr")
It is highly recommended to install my forks of the `e1071`, `kernlab` and `mda` packages, which fix bugs that otherwise regularly lead to R crashing or hanging:
devtools::install_github("mb706/e1071")
devtools::install_github("mb706/kernlab")
devtools::install_github("mb706/mda")
To run a small example to fit some learners on the mlr-provided `pid.task`, execute
library("mlr")
library("automlr")
# depending on your RNG luck, this can take tens of minutes
amrun = automlr(pid.task, backend = "random",
budget = c(evals = 10), verbosity = 1)
result = amfinish(amrun)
print(result, verbose = TRUE)
This already shows all the mandatory arguments of `automlr`: the task for which to optimize, the backend to use (may be “random”, “irace” or “mbo”), and a computational budget. The resulting object can be given to another `automlr` call with a different budget to continue optimizing, or to `amfinish` to finalize the run.
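As a minimal sketch of such a continuation, the helper below hands a previous run back to `automlr` with a new budget and then finalizes it. The name `continueAndFinish` is hypothetical and not part of `automlr`. Whether the new `evals` value counts in total or on top of the evaluations already done is not stated here, so check the R documentation before relying on it.

```r
# Hypothetical helper (not part of automlr): continue a previous run, then
# finalize it. `amrun` is the object returned by an earlier automlr() call.
continueAndFinish = function(amrun, evals) {
  # passing the old run object together with a different budget resumes it
  amrun = automlr::automlr(amrun, budget = c(evals = evals))
  # amfinish() finalizes the run and returns the result object
  automlr::amfinish(amrun)
}
```

With `amrun` from the example above, `result = continueAndFinish(amrun, evals = 20)` would continue and finalize in one step. The value 20 is only an illustration.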
You can subset the search space like so:
amrun2 = automlr(pid.task, backend = "random", budget = c(evals = 10),
searchspace = list(mlrLearners$classif.randomForest, mlrLearners$classif.svm))
or
amrun2 = automlr(pid.task, backend = "random", budget = c(evals = 10),
searchspace = list(mlrLearners[c("classif.randomForest", "classif.svm")]))
The functions and data exported by `automlr` that will be of interest to the user:
- `automlr` invocation
  - `automlr`: The main entry point; can be called with a task and a backend, or with an object that was returned by a previous `automlr` invocation, or even with a file name that was used by `automlr` to save the state. The user can choose:
    - which backend to use (`backend`)
    - the computational budget (`budget`)
    - a possible savefile (`savefile`) and the interval in which to save to a file (`save.interval`)
    - a measure for which to optimize, if not the task’s default measure (`measure`)
    - the verbosity level (`verbosity`) ranging from 0 to 6, 0 being the least verbose. Level 6 also stops the optimization process when a learner returns an error.
  - `makeBackendconfRandom`, `makeBackendconfMbo`, `makeBackendconfIrace`: Make backend configuration objects to use instead of the backend strings (“random” etc.)
  - `amfinish`: Generates an `AMResult` object that contains information about the optimization result.
  - `mlrLearners`, `mlrLearnersNoWrappers`: A collection of mlr learners with corresponding search space. `mlrLearnersNoWrappers` does not contain preprocessing wrappers.
  - `mlrLightweight`, `mlrLightweightNoWrappers`: Similar to `mlrLearners` and `mlrLearnersNoWrappers`; these are search spaces, but with the slowest learners removed. This decreases evaluation time and is also necessary for the “mbo” backend to work.
- searchspace definition
  - `autolearner`: define your own mlr learner to put in a search space
  - `autoWrapper`: define an mlr wrapper to use in a search space
  - `sp`: for defining parameters that are given to `autolearner`
See their respective R documentation for more information and additional arguments.
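The savefile options listed above lend themselves to a start-and-resume pattern, sketched below. The helper names, the `evals` default, and the choice to pass a new budget when resuming from a file are assumptions made for illustration. The accepted values and the unit of `save.interval` are described in `?automlr`.

```r
# Hypothetical helper: start a run that periodically writes its state to disk.
startSavedRun = function(task, savefile, save.interval, evals = 10) {
  automlr::automlr(task, backend = "random",
    budget = c(evals = evals),
    savefile = savefile,            # where automlr saves its state
    save.interval = save.interval,  # how often to save; unit: see ?automlr
    verbosity = 1)
}

# Hypothetical helper: pick a run up again from the file automlr saved to.
# Passing a budget here is an assumption; it may not be required.
resumeSavedRun = function(savefile, evals) {
  automlr::automlr(savefile, budget = c(evals = evals))
}
```

A run started with `startSavedRun(pid.task, savefile, save.interval)` could then be picked up in a fresh R session with `resumeSavedRun(savefile, evals)`, using the same file name in both calls.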
Unfortunately some learners, especially ones that use native code, may crash the whole R session. Also, apparently a recent linux kernel release caused problems with rJava packages. If you see segfaults happening, try the following:
- Run `export _JAVA_OPTIONS="-Xss2560k -Xmx2g";` before running R; alternatively, run `options(java.parameters = c("-Xss2560k", "-Xmx2g"))` at the beginning of your R session. This may help even if the crash happens in a non-java learner.
- Use `setDefaultRWTBackend("fork")`. This causes all learners to be run in a separate process. See the issue concerning the “fork” backend, however.
- Run `automlr` with a small value for `save.interval` and have a process in place to resurrect R after a segfault with the savefile.
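
One way to have such a resurrection process in place is a small supervisor that reruns a worker script until it exits cleanly. The sketch below uses only base R. The function name `runUntilSuccess`, the restart limit, and the worker script are hypothetical and not part of `automlr`. The worker script itself would call `automlr` with a savefile and resume from it if the file already exists.

```r
# Hypothetical supervisor: rerun a command until it exits with status 0.
# A segfault kills the child R process, which shows up as a non-zero status.
runUntilSuccess = function(command, args, max.restarts = 5) {
  for (attempt in seq_len(max.restarts + 1)) {
    status = system2(command, args)
    if (status == 0) {
      # report how many attempts were needed
      return(invisible(attempt))
    }
    message("attempt ", attempt, " exited with status ", status)
  }
  stop("giving up after ", max.restarts, " restarts")
}
```

For example, `runUntilSuccess(file.path(R.home("bin"), "Rscript"), "run_automlr.R")` would keep restarting a (hypothetical) worker script `run_automlr.R` in a fresh R process each time.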
The default “native” backend of interrupting learners that run over time is not able to stop learners that take a long time in native (C/Fortran) code routines. Use `setDefaultRWTBackend("fork")` to kill slow learners effectively, at the cost of some performance. However, see the following issue.
This happens if you use `automlr` with the “fork” backend and a learner uses java. Currently, there is no way of using the fork backend with java based learners. Use the `mlrLightweightNoJava` searchspace to exclude all java based learners.
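Combining the “fork” backend with a searchspace that has no java based learners could be sketched like this. The helper name `runForked` is hypothetical, and the sketch assumes that `setDefaultRWTBackend` is exported by the `automlr` package, as the text suggests.

```r
# Hypothetical helper: run automlr with learners executed in forked processes.
runForked = function(task, searchspace, evals = 10) {
  # run every learner in a separate process so slow native code can be killed
  automlr::setDefaultRWTBackend("fork")
  automlr::automlr(task, backend = "random",
    budget = c(evals = evals),
    searchspace = searchspace)
}
```

After `library("mlr")` and `library("automlr")`, a call such as `amrun = runForked(pid.task, mlrLightweightNoJava)` would then avoid the java based learners.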
If you are running `automlr` with a “walltime” budget, beware that there is a hard limit on how far a run may exceed that budget: 10% of the walltime budget + 10 minutes, after which the current `irace` or `mlrMBO` cycle is killed. To avoid this behaviour, set `max.walltime.overrun` to a larger value, possibly `Inf`.
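To make the arithmetic concrete, the sketch below computes that default limit and shows how it might be lifted. It rests on three assumptions that should be checked against `?automlr`: that the walltime budget is given in seconds, that the limit is an allowed overrun beyond the budget (as the name `max.walltime.overrun` suggests), and that `max.walltime.overrun` is an argument of `automlr()`. Both function names are hypothetical.

```r
# Default limit as described above: 10% of the walltime budget plus 10 minutes.
# Assumes the walltime budget is given in seconds.
defaultOverrun = function(walltime) {
  0.1 * walltime + 10 * 60
}

defaultOverrun(3600)  # a one-hour budget gives 360 + 600 = 960 seconds

# Hypothetical helper: run with a walltime budget and no hard overrun limit.
runWithoutOverrunLimit = function(task, walltime) {
  automlr::automlr(task, backend = "irace",
    budget = c(walltime = walltime),
    max.walltime.overrun = Inf)  # assumed to be an automlr() argument
}
```

Setting the overrun to `Inf` means a slow `irace` or `mlrMBO` cycle is never killed, so the run can take arbitrarily longer than the budget.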
Unfortunately, the runtime of different learners varies widely. To exclude the most problematic learners, use `searchspace = mlrLightweight` when calling `automlr`.
If a single evaluation is stuck in a loop and does not finish, it is possible that this is a bug in the learner. If you can provide useful information about a bug, please open an “Issue” on GitHub. Gather this information using `gdb` or your debugger of choice (if you know your way around one); otherwise try to find a way to reproduce the behaviour. I (and probably the learner package’s developer) am very happy to track down and fix these kinds of bugs.
If R reports that the maximal number of DLLs has been reached, this is because R is very conservative about how many DLLs it allows to be loaded. If you are using R >= 3.4, one solution is to set the environment variable `R_MAX_NUM_DLLS` to something greater than 100. Otherwise, reduce the number of learners you are using in your searchspace.
If you raise the limit, your `ulimit -n` (the limit on open files) might also need adjusting.
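To see how close a session is to the limit, the following base R sketch reports the number of loaded DLLs and the value of the variable as seen by the session. The variable is read when R starts, so it has to be set beforehand, for example in the shell or in an `.Renviron` file. Changing it from inside a running session is not expected to help.

```r
# Number of DLLs currently loaded in this R session
n.loaded = length(getLoadedDLLs())

# Value of the limit variable as seen by this session ("" if it was never set)
dll.limit = Sys.getenv("R_MAX_NUM_DLLS")

message("loaded DLLs: ", n.loaded,
  "; R_MAX_NUM_DLLS: ", if (nzchar(dll.limit)) dll.limit else "(not set)")
```

Running this after loading `automlr` with a large searchspace gives a rough idea of whether a smaller searchspace or a higher limit is needed.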
The project is currently undergoing heavy development; while the spirit of the application is expected to be stable, the user interface may undergo slight changes in the future. Expect the internals of `automlr` to be changing regularly.
- The “irace” backend’s behaviour deviates slightly from that of the `irace` package insofar as the number of evaluations per generation, and the slimming of the sampling distribution, are independent of the budget.
- The “mbo” backend currently uses an inferior imputation method for the surrogate model, and its performance should not be seen as representative of `mlrMBO`.
- For tasks with tens of features and thousands of rows, expect `automlr` to use about 0.5-2MB of memory per row of data.
(under consideration, subject to change)
- [ ] release 0.3
  - [ ] integration of wrapper CPOs
- [ ] release 0.4
  - [ ] nicer printing of results
  - [ ] consistent randomness
  - [ ] test that execution with same seed gets same result
  - [ ] use seeds in learners that use external RNGs
  - [ ] memory handling
  - [ ] searchspace
  - [ ] respect parameter equality IDs
  - [ ] automatically recognize absence of learner (in a hypothetical future mlr version) and don’t throw an error
  - [ ] tests
  - [ ] 100% test coverage
  - [ ] test for all possible wrong arguments
  - [ ] other things?
  - [ ] regression learners
  - [ ] installation on Win32
  - [ ] more empirical grounding for mlrLightweight.
- [ ] release 0.5
  - [ ] more sophisticated search space extensions
  - [ ] metalearner wrappers
  - [ ] more sophisticated search space extensions
- [ ] release 0.6
  - [ ] cleaning up
  - [ ] Consistent solution for timeouts, the current one is not stable
  - [ ] Remove Ctrl-C handler, R does not work like this
  - [ ] CPOs
  - [ ] do CPO wrapping the correct way
  - [ ] use Meta-CPO
  - [ ] make CPO types etc. work together
  - [ ] cleaning up
- [ ] release 1.0
  - [ ] everything is really, really stable
- [ ] possible future releases
  - [ ] other backends?
  - [ ] simultaneous multiple task optimization
  - [ ] batchJobs integration? (e.g. break run down into smaller jobs automatically)
  - [ ] priors for learners?