
add mlr3 and renv gallery post #141

Open
wants to merge 8 commits into
base: main
@@ -0,0 +1,14 @@
{
"hash": "f0492f5628b224e19cebaaa4b66e05db",
"result": {
"markdown": "---\ntitle: \"Why is mlr3 Eating my Disk / RAM?\"\ndescription: |\n Possible explanations for exuberant memory usage of mlr3 objects.\ncategories:\n - tuning\n - classification\nauthor:\n - name: Marc Becker\n url: https://github.com/be-marc\n - name: Sebastian Fischer\n url: https://github.com/sebffischer\ndate: 2023-11-09\nknitr:\n opts_chunk:\n R.options:\n datatable.print.nrows: 6\n datatable.print.trunc.cols: TRUE\nfreeze: true\n---\n\n\n\n\n\n\n\nWhen serializing `mlr3` objects, it can happen that their size is much larger than expected.\nWhile we are trying our best to keep such cases from happening, there are things that are beyond our control.\nThis gallery post serves as a technical trouble-shooting guide that covers various issues and offers solutions where possible.\nWe will update this post as new problems come to our attention.\nNote that while some of these issues might seem neglibile, they can cause serious problems when running large benchmrk experiments, e.g. using `mlr3batchmark`.\n\n## Avoid Installating Packages With Source References\n\nSome objects in `mlr3` have parameters that can be functions.\nOne example for that is `po(\"colapply\")`'s `applicator` parameter.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(\"mlr3verse\")\nlibrary(\"pryr\")\npo_center = po(\"colapply\", applicator = function(x) x - mean(x))\n```\n:::\n\n\nBecause `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to ensure that their size is small.\nOne cause for large sizes of parameter values is the presence of source references in the function's attributes.\nSource references are kept when installing packages with the `--with-keep.source` option.\nNote that this option is enabled by default when installing packages with `renv`.\nYou can disble it by setting the following option before installing packages, e.g. 
by adding it to your `.Rprofile`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\noptions(\"install.opts\" = \"--without-keep.source\")\n```\n:::\n\n\n## Duplication of Data When Serializing\n\nAnother cause for increased object size is how R duplicates data when serializing objects.\nConsider the simple example below:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx = rnorm(1000000)\ny = x\nlx = list(x)\nlxy = list(x, y)\nobject_size(lx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.00 MB\n```\n:::\n\n```{.r .cell-code}\nobject_size(lxy)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.00 MB\n```\n:::\n:::\n\n\nBecause of R's copy-on-write semantics, data is only copied when it is modified, i.e. the list `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data.\nHowever, when serializing `lxy`, `x` and `y` are serialized independently and the memory footprint is doubled.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nobject_size(serialize(lxy, NULL))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n16.00 MB\n```\n:::\n:::\n\n\nBecause data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage.\nWhile we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this to some extent, it is impossible to get rid of the problem completely.\n\n## Setting the Correct Flags\n\nAnother -- easily fixable -- source of large object sizes is forgetting to set the right flags.\nThe list below contains some important configuration options that can be used to reduce the size of central `mlr3` objects:\n\n* `benchmark()` and `resample()` have the flags `store_backends` and `store_models`\n* `auto_tuner()` has the flags `store_tuning_instance` and `store_benchmark_result`\n* `tune()` has 
flags `store_benchmark_result` and `store_models`\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
6 changes: 6 additions & 0 deletions mlr-org/faq.qmd
@@ -9,6 +9,7 @@ toc: false

* [What is the purpose of the `OMP_THREAD_LIMIT` environment variable?](#omp-thread-limit)
* [Why is tuning slow despite quick model fitting?](#tuning-slow)
* [Why is `mlr3` using so much disk space / memory?](#memory-disk)

## What is the purpose of the `OMP_THREAD_LIMIT` environment variable? {#omp-thread-limit}

@@ -43,3 +44,8 @@ Refer to the [OpenMP Thread Limit](#omp-thread-limit) section in this FAQ for gu
5. **Nested Resampling Strategies:** When employing nested resampling, choosing an effective parallelization strategy is crucial.
The wrong strategy can lead to inefficiencies.
For a deeper understanding, refer to the [nested resampling section](https://mlr3book.mlr-org.com/chapters/chapter10/advanced_technical_aspects_of_mlr3.html#sec-nested-resampling-parallelization) in our book.

## Why is `mlr3` using so much disk space / memory? {#memory-disk}

There are various reasons why `mlr3` might use more memory / disk space than expected.
We have written a [gallery post](https://mlr-org.com/gallery/technical/2023-11-09-memory-disk/) that covers several of these and offers solutions where possible.
87 changes: 87 additions & 0 deletions mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
@@ -0,0 +1,87 @@
---
title: "Why is mlr3 Eating my Disk / RAM?"
description: |
Possible explanations for exuberant memory usage of mlr3 objects.
categories:
- tuning
- classification
author:
- name: Marc Becker
url: https://github.com/be-marc
- name: Sebastian Fischer
url: https://github.com/sebffischer
date: 2023-11-09
knitr:
opts_chunk:
R.options:
datatable.print.nrows: 6
datatable.print.trunc.cols: TRUE
---

{{< include ../../_setup.qmd >}}

```{r}
#| include: false
lgr::get_logger("mlr3")$set_threshold("warn")
```


When serializing `mlr3` objects, their size can turn out to be much larger than expected.
> Review comment: Maybe start with a non-technical term, like "storing" `mlr3` objects.

While we try our best to prevent such cases, some of them are beyond our control.
This gallery post serves as a technical troubleshooting guide that covers various issues and offers solutions where possible.
We will update this post as new problems come to our attention.
Note that while some of these issues might seem negligible, they can cause serious problems when running large benchmark experiments, e.g. using `mlr3batchmark`.

## Avoid Installing Packages With Source References
> Review comment: Maybe mention `renv` right in the heading.


Some objects in `mlr3` have parameters that can be functions.
One example is `po("colapply")`'s `applicator` parameter.

```{r, output = FALSE}
library("mlr3verse")
library("pryr")
po_center = po("colapply", applicator = function(x) x - mean(x))
```

Because `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to keep the parameter values small.
One cause for large sizes of parameter values is the presence of source references in the function's attributes.
Source references are kept when installing packages with the `--with-keep.source` option.
Note that this option is enabled by default when installing packages with `renv`.
> Review comment: Mention that installing with source references is not the normal case, but I would point out `renv` earlier. How can I check that `mlr3` was installed with source references?

You can disable it by setting the following option before installing packages, e.g. by adding it to your `.Rprofile`.

```{r, eval = FALSE}
options("install.opts" = "--without-keep.source")
```
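
To check whether a function drags source references along, inspect its `srcref` attribute and compare serialized sizes. The following is a minimal base-R sketch; it parses a function with `keep.source = TRUE` to simulate one coming from a package installed with source references:

```{r}
# Simulate a function that carries source references:
f = eval(parse(text = "function(x) x - mean(x)", keep.source = TRUE))
is.null(attr(f, "srcref"))                # FALSE: source references are present
length(serialize(f, NULL))                # serialized size with source references
# removeSource() strips them, shrinking the serialized representation:
length(serialize(removeSource(f), NULL))
```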

## Duplication of Data When Serializing

Another cause for increased object size is how R duplicates data when serializing objects.
Consider the simple example below:

```{r}
x = rnorm(1000000)
y = x
lx = list(x)
lxy = list(x, y)
object_size(lx)
object_size(lxy)
```
> Review comment: I like the example because it is easy to understand, but it is not clear where this happens in `mlr3`. Maybe mention the repeated storing of our objects in benchmark results.


Because of R's copy-on-write semantics, data is only copied when it is modified, i.e. the list `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data.
However, when serializing `lxy`, `x` and `y` are serialized independently and the memory footprint is doubled.

```{r}
object_size(serialize(lxy, NULL))
```

Because data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage.
While we have some mechanisms (like `mlr3misc::leanify` and database normalization) in place to counteract this to some extent, it is impossible to get rid of the problem completely.
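
The duplication can also be observed directly on the serialized byte stream: a list that holds the same vector twice is written out twice. A minimal base-R sketch:

```{r}
x = rnorm(1000000)
length(serialize(list(x), NULL))     # roughly 8 MB
length(serialize(list(x, x), NULL))  # roughly 16 MB: x is written twice
```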

## Setting the Correct Flags

Another -- easily fixable -- source of large object sizes is forgetting to set the right flags.
The list below contains some important configuration options that can be used to reduce the size of central `mlr3` objects:

> Review comment: Link to the book. This is explained in detail there.

* `benchmark()` and `resample()` have the flags `store_backends` and `store_models`
* `auto_tuner()` has the flags `store_tuning_instance` and `store_benchmark_result`
* `tune()` has the flags `store_benchmark_result` and `store_models`
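
As an illustration of the first bullet (a sketch assuming the `penguins` task and the `classif.rpart` learner are available), disabling both stores keeps the result object lean:

```{r}
library(mlr3)
# Keep the ResampleResult lean: store neither fitted models nor data backends
rr = resample(
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  store_models = FALSE,
  store_backends = FALSE
)
# Predictions are still stored, so scores can be computed as usual:
rr$aggregate(msr("classif.ce"))
```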