
add mlr3 and renv gallery post #141

Open
wants to merge 8 commits into
base: main
@@ -0,0 +1,14 @@
{
"hash": "f0492f5628b224e19cebaaa4b66e05db",
"result": {
"markdown": "---\ntitle: \"Why is mlr3 Eating my Disk / RAM?\"\ndescription: |\n Possible explanations for exuberant memory usage of mlr3 objects.\ncategories:\n - tuning\n - classification\nauthor:\n - name: Marc Becker\n url: https://github.com/be-marc\n - name: Sebastian Fischer\n url: https://github.com/sebffischer\ndate: 2023-11-09\nknitr:\n opts_chunk:\n R.options:\n datatable.print.nrows: 6\n datatable.print.trunc.cols: TRUE\nfreeze: true\n---\n\n\n\n\n\n\n\nWhen serializing `mlr3` objects, it can happen that their size is much larger than expected.\nWhile we are trying our best to keep such cases from happening, there are things that are beyond our control.\nThis gallery post serves as a technical trouble-shooting guide that covers various issues and offers solutions where possible.\nWe will update this post as new problems come to our attention.\nNote that while some of these issues might seem neglibile, they can cause serious problems when running large benchmrk experiments, e.g. using `mlr3batchmark`.\n\n## Avoid Installating Packages With Source References\n\nSome objects in `mlr3` have parameters that can be functions.\nOne example for that is `po(\"colapply\")`'s `applicator` parameter.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(\"mlr3verse\")\nlibrary(\"pryr\")\npo_center = po(\"colapply\", applicator = function(x) x - mean(x))\n```\n:::\n\n\nBecause `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to ensure that their size is small.\nOne cause for large sizes of parameter values is the presence of source references in the function's attributes.\nSource references are kept when installing packages with the `--with-keep.source` option.\nNote that this option is enabled by default when installing packages with `renv`.\nYou can disble it by setting the following option before installing packages, e.g. 
by adding it to your `.Rprofile`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\noptions(\"install.opts\" = \"--without-keep.source\")\n```\n:::\n\n\n## Duplication of Data When Serializing\n\nAnother cause for increased object size is how R duplicates data when serializing objects.\nConsider the simple example below:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx = rnorm(1000000)\ny = x\nlx = list(x)\nlxy = list(x, y)\nobject_size(lx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.00 MB\n```\n:::\n\n```{.r .cell-code}\nobject_size(lxy)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.00 MB\n```\n:::\n:::\n\n\nBecause of R's copy-on-write semantics, data is only copied when it is modified, i.e. the list `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data.\nHowever, when serializing `lxy`, `x` and `y` are serialized independently and the memory footprint is doubled.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nobject_size(serialize(lxy, NULL))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n16.00 MB\n```\n:::\n:::\n\n\nBecause data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage.\nWhile we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this to some extent, it is impossible to get rid of the problem completely.\n\n## Setting the Correct Flags\n\nAnother -- easily fixable -- source of large object sizes is forgetting to set the right flags.\nThe list below contains some important configuration options that can be used to reduce the size of central `mlr3` objects:\n\n* `benchmark()` and `resample()` have the flags `store_backends` and `store_models`\n* `auto_tuner()` has the flags `store_tuning_instance` and `store_benchmark_result`\n* `tune()` has 
flags `store_benchmark_result` and `store_models`\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
6 changes: 6 additions & 0 deletions mlr-org/faq.qmd
@@ -9,6 +9,7 @@ toc: false

* [What is the purpose of the `OMP_THREAD_LIMIT` environment variable?](#omp-thread-limit)
* [Why is tuning slow despite quick model fitting?](#tuning-slow)
* [Why is `mlr3` using so much disk space / memory?](#memory-disk)

## What is the purpose of the `OMP_THREAD_LIMIT` environment variable? {#omp-thread-limit}

@@ -43,3 +44,8 @@ Refer to the [OpenMP Thread Limit](#omp-thread-limit) section in this FAQ for gu
5. **Nested Resampling Strategies:** When employing nested resampling, choosing an effective parallelization strategy is crucial.
The wrong strategy can lead to inefficiencies.
For a deeper understanding, refer to the [nested resampling section](https://mlr3book.mlr-org.com/chapters/chapter10/advanced_technical_aspects_of_mlr3.html#sec-nested-resampling-parallelization) in our book.

## Why is `mlr3` using so much disk space / memory? {#memory-disk}

There are various reasons why `mlr3` might use more memory / disk space than expected.
We have written a [gallery post](https://mlr-org.com/gallery/technical/2023-11-09-memory-disk/) that covers several of these and offers solutions where possible.
87 changes: 87 additions & 0 deletions mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
@@ -0,0 +1,87 @@
---
title: "Why is mlr3 Eating my Disk / RAM?"
description: |
Possible explanations for exuberant memory usage of mlr3 objects.
categories:
- tuning
- classification
author:
- name: Marc Becker
url: https://github.com/be-marc
- name: Sebastian Fischer
url: https://github.com/sebffischer
date: 2023-11-09
knitr:
opts_chunk:
R.options:
datatable.print.nrows: 6
datatable.print.trunc.cols: TRUE
---

{{< include ../../_setup.qmd >}}

```{r}
#| include: false
lgr::get_logger("mlr3")$set_threshold("warn")
```


When serializing `mlr3` objects, their size can turn out to be much larger than expected.
> Review comment: Maybe start with a non-technical term, like "storing" `mlr3` objects.

While we try our best to prevent such cases, some of them are beyond our control.
This gallery post serves as a technical troubleshooting guide that covers various issues and offers solutions where possible.
We will update this post as new problems come to our attention.
Note that while some of these issues might seem negligible, they can cause serious problems when running large benchmark experiments, e.g. using `mlr3batchmark`.

## Avoid Installing Packages With Source References
> Review comment: Maybe mention `renv` right in the heading.


Some objects in `mlr3` have parameters that can be functions.
One example is `po("colapply")`'s `applicator` parameter.

```{r, output = FALSE}
library("mlr3verse")
library("pryr")
po_center = po("colapply", applicator = function(x) x - mean(x))
```

Because `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to keep the parameter values small.
One cause for large sizes of parameter values is the presence of source references in the function's attributes.
Source references are kept when installing packages with the `--with-keep.source` option.
Note that this option is enabled by default when installing packages with `renv`.
> Review comment: Mention that installing with source references is not the normal case, but I would point out `renv` earlier. How can I check that `mlr3` was installed with source references?

You can disable it by setting the following option before installing packages, e.g. by adding it to your `.Rprofile`.

```{r, eval = FALSE}
options("install.opts" = "--without-keep.source")
```
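
To check whether a function drags source references along, inspect its `srcref` attribute and compare serialized sizes. The following is a minimal base-R sketch; it parses a function with `keep.source = TRUE` to simulate one coming from a package installed with source references:

```{r}
# Simulate a function that carries source references:
f = eval(parse(text = "function(x) x - mean(x)", keep.source = TRUE))
is.null(attr(f, "srcref"))                # FALSE: source references are present
length(serialize(f, NULL))                # serialized size with source references
# removeSource() strips them, shrinking the serialized representation:
length(serialize(removeSource(f), NULL))
```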

## Duplication of Data When Serializing

Another cause for increased object size is how R duplicates data when serializing objects.
Consider the simple example below:

```{r}
x = rnorm(1000000)
y = x
lx = list(x)
lxy = list(x, y)
object_size(lx)
object_size(lxy)
```
> Review comment: I like the example because it is easy to understand, but it is not clear where this happens in `mlr3`. Maybe mention the repeated storing of our objects in benchmark results.


Because of R's copy-on-write semantics, data is only copied when it is modified, i.e. the list `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data.
However, when serializing `lxy`, `x` and `y` are serialized independently and the memory footprint is doubled.

```{r}
object_size(serialize(lxy, NULL))
```

Because data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage.
While we have some mechanisms (like `mlr3misc::leanify` and database normalization) in place to counteract this to some extent, it is impossible to get rid of the problem completely.
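
The duplication can also be observed directly on the serialized byte stream: a list that holds the same vector twice is written out twice. A minimal base-R sketch:

```{r}
x = rnorm(1000000)
length(serialize(list(x), NULL))     # roughly 8 MB
length(serialize(list(x, x), NULL))  # roughly 16 MB: x is written twice
```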

## Setting the Correct Flags

Another -- easily fixable -- source of large object sizes is forgetting to set the right flags.
The list below contains some important configuration options that can be used to reduce the size of central `mlr3` objects:

> Review comment: Link to the book. This is explained in detail there.

* `benchmark()` and `resample()` have the flags `store_backends` and `store_models`
* `auto_tuner()` has the flags `store_tuning_instance` and `store_benchmark_result`
* `tune()` has the flags `store_benchmark_result` and `store_models`
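
As an illustration of the first bullet (a sketch assuming the `penguins` task and the `classif.rpart` learner are available), disabling both stores keeps the result object lean:

```{r}
library(mlr3)
# Keep the ResampleResult lean: store neither fitted models nor data backends
rr = resample(
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  store_models = FALSE,
  store_backends = FALSE
)
# Predictions are still stored, so scores can be computed as usual:
rr$aggregate(msr("classif.ce"))
```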