add mlr3 and renv gallery post #141
base: main
Changes from 6 commits
93c2cde
44c4d84
f56dfe4
a78030f
b326b73
3cec192
c2c5978
adbc0d3
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
{ | ||
"hash": "f0492f5628b224e19cebaaa4b66e05db", | ||
"result": { | ||
"markdown": "---\ntitle: \"Why is mlr3 Eating my Disk / RAM?\"\ndescription: |\n Possible explanations for exuberant memory usage of mlr3 objects.\ncategories:\n - tuning\n - classification\nauthor:\n - name: Marc Becker\n url: https://github.com/be-marc\n - name: Sebastian Fischer\n url: https://github.com/sebffischer\ndate: 2023-11-09\nknitr:\n opts_chunk:\n R.options:\n datatable.print.nrows: 6\n datatable.print.trunc.cols: TRUE\nfreeze: true\n---\n\n\n\n\n\n\n\nWhen serializing `mlr3` objects, it can happen that their size is much larger than expected.\nWhile we are trying our best to keep such cases from happening, there are things that are beyond our control.\nThis gallery post serves as a technical trouble-shooting guide that covers various issues and offers solutions where possible.\nWe will update this post as new problems come to our attention.\nNote that while some of these issues might seem negligible, they can cause serious problems when running large benchmark experiments, e.g. using `mlr3batchmark`.\n\n## Avoid Installing Packages With Source References\n\nSome objects in `mlr3` have parameters that can be functions.\nOne example for that is `po(\"colapply\")`'s `applicator` parameter.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(\"mlr3verse\")\nlibrary(\"pryr\")\npo_center = po(\"colapply\", applicator = function(x) x - mean(x))\n```\n:::\n\n\nBecause `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to ensure that their size is small.\nOne cause for large sizes of parameter values is the presence of source references in the function's attributes.\nSource references are kept when installing packages with the `--with-keep.source` option.\nNote that this option is enabled by default when installing packages with `renv`.\nYou can disable it by setting the following option before installing packages, e.g. 
by adding it to your `.Rprofile`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\noptions(\"install.opts\" = \"--without-keep.source\")\n```\n:::\n\n\n## Duplication of Data When Serializing\n\nAnother cause for increased object size is how R duplicates data when serializing objects.\nConsider the simple example below:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx = rnorm(1000000)\ny = x\nlx = list(x)\nlxy = list(x, y)\nobject_size(lx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.00 MB\n```\n:::\n\n```{.r .cell-code}\nobject_size(lxy)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.00 MB\n```\n:::\n:::\n\n\nBecause of R's copy-on-write semantics, data is only copied when it is modified, i.e. the list `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data.\nHowever, when serializing `lxy`, both `x` and `y` are serialized independently and its memory footprint is doubled.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nobject_size(serialize(lxy, NULL))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n16.00 MB\n```\n:::\n:::\n\n\nBecause data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage.\nWhile we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this to some extent, it is impossible to get rid of the problem completely.\n\n## Setting the Correct Flags\n\nAnother -- easily amendable -- source for large object sizes is forgetting to set the right flags.\nThe list below contains some important configuration options that can be used to reduce the size of important `mlr3` objects:\n\n* `benchmark()` and `resample()` have the flags `store_backends` and `store_models`\n* `auto_tuner` has flags `store_tuning_instance` and `store_benchmark_result`\n* `tune()` has 
flags `store_benchmark_result` and `store_models`\n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
--- | ||
title: "Why is mlr3 Eating my Disk / RAM?" | ||
description: | | ||
  Possible explanations for excessive memory usage of mlr3 objects. | ||
categories: | ||
  - tuning | ||
  - classification | ||
author: | ||
  - name: Marc Becker | ||
    url: https://github.com/be-marc | ||
  - name: Sebastian Fischer | ||
    url: https://github.com/sebffischer | ||
date: 2023-11-09 | ||
knitr: | ||
  opts_chunk: | ||
    R.options: | ||
      datatable.print.nrows: 6 | ||
      datatable.print.trunc.cols: TRUE | ||
--- | ||
|
||
{{< include ../../_setup.qmd >}} | ||
|
||
```{r} | ||
#| include: false | ||
lgr::get_logger("mlr3")$set_threshold("warn") | ||
``` | ||
|
||
|
||
When serializing `mlr3` objects, their size can be much larger than expected. | ||
While we are trying our best to keep such cases from happening, some causes are beyond our control. | ||
This gallery post serves as a technical troubleshooting guide that covers various issues and offers solutions where possible. | ||
We will update this post as new problems come to our attention. | ||
Note that while some of these issues might seem negligible, they can cause serious problems when running large benchmark experiments, e.g. using `mlr3batchmark`. | ||
|
||
## Avoid Installing Packages With Source References | ||
Maybe mention renv right in the heading. |
||
|
||
Some objects in `mlr3` have parameters that can be functions. | ||
One example is the `applicator` parameter of `po("colapply")`. | ||
|
||
```{r, output = FALSE} | ||
library("mlr3verse") | ||
library("pryr") | ||
po_center = po("colapply", applicator = function(x) x - mean(x)) | ||
``` | ||
|
||
Because `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to ensure that their size is small. | ||
One cause of large parameter values is the presence of source references in the function's attributes. | ||
Source references are kept when installing packages with the `--with-keep.source` option. | ||
Note that this option is enabled by default when installing packages with `renv`. | ||
Mention that installing with source refs is not the normal case but I would point out renv earlier. How can I check that mlr3 was installed with source refs? |
||
You can disable it by setting the following option before installing packages, e.g. by adding it to your `.Rprofile`. | ||
|
||
```{r, eval = FALSE} | ||
options("install.opts" = "--without-keep.source") | ||
``` | ||
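
One way to check whether a function carries source references is to look at its `srcref` attribute. The sketch below uses only base R. The function `f` and the helper `has_srcref()` are made up for illustration, and applying the same check to a function from an installed package (the commented line at the end) assumes that package functions keep this attribute when installed with `--with-keep.source`.

```r
# Hypothetical helper: does a function carry source references?
has_srcref = function(fn) !is.null(attr(fn, "srcref"))

# Create a function with source references kept, similar to what
# `--with-keep.source` does for package code.
f = eval(parse(text = "function(x) x - mean(x)", keep.source = TRUE))
has_srcref(f)

# Strip the source references and compare the serialized sizes in bytes.
f_lean = utils::removeSource(f)
size_with = length(serialize(f, NULL))
size_without = length(serialize(f_lean, NULL))

# Assumed to work the same way for an installed package function, e.g.:
# has_srcref(mlr3::resample)
```

If `has_srcref()` returns `TRUE` for functions of an installed package, reinstalling the package with the option shown above should remove the references.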
|
||
## Duplication of Data When Serializing | ||
|
||
Another cause for increased object size is how R duplicates data when serializing objects. | ||
sebffischer marked this conversation as resolved.
|
||
Consider the simple example below: | ||
|
||
```{r} | ||
x = rnorm(1000000) | ||
y = x | ||
lx = list(x) | ||
lxy = list(x, y) | ||
object_size(lx) | ||
object_size(lxy) | ||
``` | ||
I like the example because it is easy to understand but it is not clear where this happens in mlr3. Maybe mention the repeated storing of our objects in benchmark results. |
||
|
||
Because of R's copy-on-write semantics, data is only copied when it is modified, so the list `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data. | ||
However, when serializing `lxy`, `x` and `y` are serialized independently, which doubles the memory footprint. | ||
|
||
```{r} | ||
object_size(serialize(lxy, NULL)) | ||
``` | ||
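
The effect grows with the number of objects that refer to the same data. The sketch below uses plain lists as a hypothetical stand-in for repeatedly stored results; it does not use actual `mlr3` classes and only illustrates how the serialized size scales.

```r
# Shared data: one large numeric vector referenced by every result.
x = rnorm(1e5)

# Hypothetical stand-in for 10 results that all reference the same data.
results = lapply(1:10, function(i) list(iteration = i, data = x))

# Serialized size in bytes of one result versus all ten.
size_one = length(serialize(results[1], NULL))
size_all = length(serialize(results, NULL))

# In memory the vector exists once; the serialized form contains it ten times.
size_all / size_one
```

The ratio is close to the number of results, because each element's copy of `x` is written out separately.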
|
||
Because data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage. | ||
While we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this to some extent, it is impossible to get rid of the problem completely. | ||
sebffischer marked this conversation as resolved.
|
||
|
||
## Setting the Correct Flags | ||
|
||
Another, easily fixable, source of large objects is forgetting to set the right flags. | ||
The list below contains some important configuration options that can be used to reduce the size of important `mlr3` objects: | ||
|
||
Link to the book. This is explained in detail there. |
||
* `benchmark()` and `resample()` have the flags `store_backends` and `store_models` | ||
* `auto_tuner()` has the flags `store_tuning_instance` and `store_benchmark_result` | ||
* `tune()` has the flags `store_benchmark_result` and `store_models` |
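
As a rough illustration of the first item, the sketch below runs the same resampling twice, once keeping models and backends and once discarding both, and compares the serialized sizes. It assumes `mlr3` and `rpart` are installed; the task, learner and resampling are arbitrary choices for the example, and the exact sizes depend on package versions, so none are shown.

```r
library("mlr3")

task = tsk("sonar")
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)

# Keep fitted models and data backends in the result.
rr_full = resample(task, learner, resampling,
  store_models = TRUE, store_backends = TRUE)

# Discard both to keep the result small.
rr_lean = resample(task, learner, resampling,
  store_models = FALSE, store_backends = FALSE)

# Compare the serialized sizes in bytes.
size_full = length(serialize(rr_full, NULL))
size_lean = length(serialize(rr_lean, NULL))
```

Which flags can safely be switched off depends on what is needed afterwards: inspecting fitted models, for instance, requires `store_models = TRUE`.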
Maybe start with non-technical term like storing mlr3 objects.