From 93c2cdea5a0fdd3645f37f3c30bd939cc2db57e3 Mon Sep 17 00:00:00 2001
From: be-marc
Date: Sat, 11 Nov 2023 10:26:27 +0100
Subject: [PATCH 1/7] draft

---
 .../technical/2023-11-09-renv/index.qmd | 54 +++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 mlr-org/gallery/technical/2023-11-09-renv/index.qmd

diff --git a/mlr-org/gallery/technical/2023-11-09-renv/index.qmd b/mlr-org/gallery/technical/2023-11-09-renv/index.qmd
new file mode 100644
index 00000000..d944eb33
--- /dev/null
+++ b/mlr-org/gallery/technical/2023-11-09-renv/index.qmd
@@ -0,0 +1,54 @@
+---
+title: "mlr3 and renv"
+description: |
+  Fix renv to work with mlr3.
+categories:
+  - tuning
+  - classification
+author:
+  - name: Marc Becker
+    url: https://github.com/be-marc
+  - name: Sebastian Fischer
+    url: https://github.com/sebffischer
+date: 2023-11-09
+knitr:
+  opts_chunk:
+    R.options:
+      datatable.print.nrows: 6
+      datatable.print.trunc.cols: TRUE
+---
+
+{{< include ../../_setup.qmd >}}
+
+```{r}
+options("install.opts" = "--without-keep.source")
+```
+
+Reproducibility is the ability to generate the same results given the same input data and code.
+In data science, this also includes the software environment.
+The `r ref_pkg("renv")` package is widely used to create reproducible environments in R.
+We recommend using `renv` for all your `mlr3` projects.
+However, the `r ref_pkg("renv")` team recently changed the installation process of packages.
+This change affects the saving of results obtained with `mlr3`, e.g. saving a `r ref("Learner")` or `r ref("BenchmarkResult")` object.
+When the object is loaded into an R session, the memory usage can be much larger than expected.
+
+```{r}
+#| eval: false
+renv::init()
+renv::install("mlr3")
+
+library(mlr3)
+
+task = tsk("sonar")
+learner = lrn("classif.rpart")
+
+learner$train(task)
+
+pryr::object_size(learner)
+
+saveRDS(learner, "learner.rds")
+
+learner = readRDS("learner.rds")
+
+

From 44c4d846e66289e14d30e6ac9ca5ad1b6e513b47 Mon Sep 17 00:00:00 2001
From: Sebastian Fischer
Date: Wed, 15 Nov 2023 18:33:08 +0100
Subject: [PATCH 2/7] work on memory issues

---
 .../technical/2023-11-09-renv/index.qmd | 142 +++++++++++++++---
 1 file changed, 122 insertions(+), 20 deletions(-)

diff --git a/mlr-org/gallery/technical/2023-11-09-renv/index.qmd b/mlr-org/gallery/technical/2023-11-09-renv/index.qmd
index d944eb33..ddbf30fd 100644
--- a/mlr-org/gallery/technical/2023-11-09-renv/index.qmd
+++ b/mlr-org/gallery/technical/2023-11-09-renv/index.qmd
@@ -1,7 +1,7 @@
 ---
-title: "mlr3 and renv"
+title: "Why is mlr3 Eating my Disk"
 description: |
-  Fix renv to work with mlr3.
+  Possible explanations for excessive memory usage of mlr3 objects.
 categories:
   - tuning
   - classification
@@ -20,35 +20,137 @@ knitr:
 
 {{< include ../../_setup.qmd >}}
 
+When saving and loading `mlr3` objects it can happen that the sizes of the objects are much larger than expected.
+While we are trying our best to keep such cases from happening, there are things that are beyond our control.
+This gallery post serves as a trouble-shooting guide that we will update as new issues come to our attention.
+We will list and explain various problems and give suggestions on how they can be solved or mitigated if possible.
+
+
+## Duplication of Data after Deserialization
+
+One source of increased object sizes that happens after serializing and unserializing `mlr3` objects is the duplication of data.
+The example below illustrates this issue.
+
+```{r}
+library("mlr3verse")
+library("pryr")
+
+# train a decision tree on the mtcars dataset
+learner = lrn("regr.rpart")
+task = tsk("mtcars")
+learner$train(task)
+state = learner$state
+
+path = tempfile()
+saveRDS(learner$state, path)
+state_reloaded = readRDS(path)
+
+object_size(state)
+object_size(state_reloaded)
+```
+
+The example shows how the size of the state object increases when the object is saved and read back in.
+
+This is not related to `mlr3` but is due to the way objects are serialized in R.
+Because of R's copy-on-write semantics, data is only copied when it is modified.
+In the code below, `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data.
+
+```{r}
+x = rnorm(1000000)
+y = x
+lx = list(x)
+lxy = list(x, y)
+object_size(lx)
+object_size(lxy)
+```
+
+However, when serializing `lxy`, both `x` and `y` are serialized independently, and when loading the object again, its size in memory is doubled.
+
+```{r}
+saveRDS(lxy, path)
+lxy_reloaded = readRDS(path)
+object_size(lxy_reloaded)
+```
+
+While we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this issue, the problem is still present to some degree.
+
+
+## Serializing Closures
+
+Another possible source of increased object sizes is the serialization of closures.
+Some objects in `mlr3` can be configured with closures.
+One example is `PipeOpColApply`, which has a parameter `applicator` that is applied to the columns of a task.
+Let's say we want to use it to center the columns of a task.
+
+```{r}
+center = function(x) x - mean(x)
+po_center = po("colapply", applicator = center)
+```
+
+The graph has one of its parameters now set to a closure.
+This means that when we save the parameter values (which we usually want to do, so that we know afterwards how the model was trained), the environment of the closure is also serialized.
+However, this is not the case if the enclosing environment is either a package environment or the `.GlobalEnv`.
+In your scripts, however, the enclosing environment of a function might be neither a package environment nor the global environment.
+To show the effect, we construct the closure using a function factory.
+
+```{r}
+make_center_large = function() {
+  some_stuff = rnorm(1000000)
+
+  function(x) {
+    x - mean(x)
+  }
+}
+center_large = make_center_large()
+```
+
+Even though the `center_large` function itself is quite small in terms of source code, its enclosing environment is not.
+
+```{r}
+object_size(center_large)
+object_size(environment(center_large)$some_stuff)
+```
+
+The size of the earlier defined `center` function is negligible, because its enclosing environment is the global environment.
+
+```{r}
+object_size(center)
+```
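+If the applicator does not actually use anything from its enclosing environment, one pragmatic mitigation is to reset that environment before handing the function to `mlr3`; the snippet below is only a sketch under exactly that assumption (`mlr3misc::crate()` is another option for building such lean functions).
+
+```{r}
+# Sketch: drop the large enclosing environment.
+# This is only safe because the function body uses nothing from that
+# environment and relies on base functions only.
+environment(center_large) = globalenv()
+object_size(center_large)
+```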
+Such large closures can be especially problematic when running large benchmark experiments, e.g. via `r ref_pkg("mlr3batchmark")`, where the learner states (which include the parameter values) of each resample iteration are written to disk independently.
+
+```{r}
+library("mlr3batchmark")
+library("batchtools")
+
+reg = makeExperimentRegistry(NA, seed = 1)
+
+glrn_large = as_learner(
+  po("colapply", applicator = center_large) %>>%
+  lrn("regr.rpart")
+)
+glrn_large$id = "large"
+
+glrn = as_learner(
+  po("colapply", applicator = function(x) x - mean(x)) %>>%
+  lrn("regr.rpart")
+)
+
+design = benchmark_grid(task, list(glrn, glrn_large), rsmp("cv"))
+
+batchmark(design)
+submitJobs()
+```
+
+We can compare the sizes of the resulting benchmark results after reassembling them and notice a tremendous difference.
+
+```{r}
+object_size(reduceResultsBatchmark(1:10))
+object_size(reduceResultsBatchmark(11:20))
+```
+
+
+## Sourcerefs
+
+

From a78030f8ed3c895f4a6a318e02f1081a4887ff0c Mon Sep 17 00:00:00 2001
From: Sebastian Fischer
Date: Tue, 5 Dec 2023 12:12:38 +0100
Subject: [PATCH 3/7] update memory usage post

---
 mlr-org/faq.qmd | 6 +
 .../2023-11-09-memory-disk/index.qmd | 78 +++++++++
 .../technical/2023-11-09-renv/index.qmd | 156 ------------------
 3 files changed, 84 insertions(+), 156 deletions(-)
 create mode 100644 mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
 delete mode 100644 mlr-org/gallery/technical/2023-11-09-renv/index.qmd

diff --git a/mlr-org/faq.qmd b/mlr-org/faq.qmd
index 9ee70833..a0e74756 100644
--- a/mlr-org/faq.qmd
+++ b/mlr-org/faq.qmd
@@ -9,6 +9,7 @@ toc: false
 
 * [What is the purpose of the `OMP_THREAD_LIMIT` environment variable?](#omp-thread-limit)
 * [Why is tuning slow despite quick model fitting?](#tuning-slow)
+* [Why is `mlr3` using so much disk space / memory?](#memory-disk)
 
 ## What is the purpose of the `OMP_THREAD_LIMIT` environment variable? {#omp-thread-limit}
 
@@ -43,3 +44,8 @@ Refer to the [OpenMP Thread Limit](#omp-thread-limit) section in this FAQ for gu
 5. **Nested Resampling Strategies:** When employing nested resampling, choosing an effective parallelization strategy is crucial.
 The wrong strategy can lead to inefficiencies.
 For a deeper understanding, refer to the [nested resampling section](https://mlr3book.mlr-org.com/chapters/chapter10/advanced_technical_aspects_of_mlr3.html#sec-nested-resampling-parallelization) in our book.
+
+## Why is `mlr3` using so much disk space / memory? {#memory-disk}
+
+There are various reasons why `mlr3` might use more memory / disk space than expected.
+We have written a [gallery post](https://mlr-org.com/gallery/technical/2023-11-09-memory-disk/) that covers various aspects of this topic and offers solutions where possible.
diff --git a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
new file mode 100644
index 00000000..c7df1487
--- /dev/null
+++ b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
@@ -0,0 +1,78 @@
+---
+title: "Why is mlr3 Eating my Disk / RAM?"
+description: |
+  Possible explanations for excessive memory usage of mlr3 objects.
+categories:
+  - tuning
+  - classification
+author:
+  - name: Marc Becker
+    url: https://github.com/be-marc
+  - name: Sebastian Fischer
+    url: https://github.com/sebffischer
+date: 2023-11-09
+knitr:
+  opts_chunk:
+    R.options:
+      datatable.print.nrows: 6
+      datatable.print.trunc.cols: TRUE
+---
+
+{{< include ../../_setup.qmd >}}
+
+When serializing `mlr3` objects, it can happen that their size is much larger than expected.
+While we are trying our best to keep such cases from happening, there are things that are beyond our control.
+This gallery post serves as a technical trouble-shooting guide that covers various issues and offers solutions where possible.
+We will update this post as new problems come to our attention.
+Note that while some of these issues might seem negligible, they can cause serious problems when running large benchmark experiments, e.g. using `mlr3batchmark`.
+
+## Avoid Installing Packages With Source References
+
+Some objects in `mlr3` have parameters that can be functions.
+One example for that is `po("colapply")`'s `applicator` parameter.
+
+```{r}
+po_center = po("colapply", applicator = function(x) x - mean(x))
+```
+
+Because `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to ensure that their size is small.
+One cause for large sizes of parameter values is the use of source references, i.e. when installing packages with the `--with-keep.source` option.
+Note that this option is enabled by default when installing packages with `renv`.
+You can disable it by setting the following option before installing packages, e.g. by adding the following to your `.Rprofile`.
+
+```{r, eval = FALSE}
+options("install.opts" = "--without-keep.source")
+```
+
+## Duplication of Data When Serializing
+
+Anjor cause for increased object size is how R duplicates data when serializing objects.
+Consider the simple example below:
+
+```{r}
+x = rnorm(1000000)
+y = x
+lx = list(x)
+lxy = list(x, y)
+object_size(lx)
+object_size(lxy)
+```
+
+Because of R's copy-on-write semantics, data is only copied when it is modified, i.e. the list `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data.
+However, when serializing `lxy`, both `x` and `y` are serialized independently and when loading the object again, its size in memory is doubled.
+
+```{r}
+pryr::object_size(serialie(lxy, NULL))
+```
+
+Because data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage.
+While we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this to some extent, it is impossible to get rid of the problem completely.
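+To make the duplication visible in memory as well, a round trip through `serialize()` and `unserialize()` can be compared directly; the following is only a small sketch and the printed sizes will vary slightly.
+
+```{r}
+# Sketch: after the round trip, x and y no longer share their data,
+# so the reloaded list needs roughly twice the memory.
+lxy2 = unserialize(serialize(lxy, NULL))
+pryr::object_size(lxy)
+pryr::object_size(lxy2)
+```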
+
+## Setting the Correct Flags
+
+Another -- easily amendable -- source for large object sizes is forgetting to set the right flags.
+The list below contains some important configuration options that can be used to reduce the size of important `mlr3` objects:
+
+* `benchmark()` and `resample()` have the flags `store_backends` and `store_models`
+* `auto_tuner()` has the flags `store_tuning_instance` and `store_benchmark_result`
+* `tune()` has the flags `store_benchmark_result` and `store_models`
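+As a rough, illustrative sketch of what these flags do to object sizes (the toy task, learner, and resampling below are only assumptions for demonstration):
+
+```{r}
+# Sketch: dropping stored models and backends shrinks a resample result.
+library("mlr3")
+task = tsk("mtcars")
+learner = lrn("regr.rpart")
+rr_lean = resample(task, learner, rsmp("cv", folds = 3),
+  store_models = FALSE, store_backends = FALSE)
+rr_full = resample(task, learner, rsmp("cv", folds = 3), store_models = TRUE)
+pryr::object_size(rr_lean)
+pryr::object_size(rr_full)
+```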
diff --git a/mlr-org/gallery/technical/2023-11-09-renv/index.qmd b/mlr-org/gallery/technical/2023-11-09-renv/index.qmd
deleted file mode 100644
index ddbf30fd..00000000
--- a/mlr-org/gallery/technical/2023-11-09-renv/index.qmd
+++ /dev/null
@@ -1,156 +0,0 @@
----
-title: "Why is mlr3 Eating my Disk"
-description: |
-  Possible explanations for excessive memory usage of mlr3 objects.
-categories:
-  - tuning
-  - classification
-author:
-  - name: Marc Becker
-    url: https://github.com/be-marc
-  - name: Sebastian Fischer
-    url: https://github.com/sebffischer
-date: 2023-11-09
-knitr:
-  opts_chunk:
-    R.options:
-      datatable.print.nrows: 6
-      datatable.print.trunc.cols: TRUE
----
-
-{{< include ../../_setup.qmd >}}
-
-When saving and loading `mlr3` objects it can happen that the sizes of the objects are much larger than expected.
-While we are trying our best to keep such cases from happening, there are things that are beyond our control.
-This gallery post serves as a trouble-shooting guide that we will update as new issues come to our attention.
-We will list and explain various problems and give suggestions on how they can be solved or mitigated if possible.
-
-
-## Duplication of Data after Deserialization
-
-One source of increased object sizes that happens after serializing and unserializing `mlr3` objects is the duplication of data.
-The example below illustrates this issue.
-
-```{r}
-library("mlr3verse")
-library("pryr")
-
-# train a decision tree on the mtcars dataset
-learner = lrn("regr.rpart")
-task = tsk("mtcars")
-learner$train(task)
-state = learner$state
-
-path = tempfile()
-saveRDS(learner$state, path)
-state_reloaded = readRDS(path)
-
-object_size(state)
-object_size(state_reloaded)
-```
-
-The example shows how the size of the state object increases when the object is saved and read back in.
-
-This is not related to `mlr3` but is due to the way objects are serialized in R.
-Because of R's copy-on-write semantics, data is only copied when it is modified.
-In the code below, `lx` has the same size as `lxy` because `x` and `y` both point to the same underlying data.
-
-```{r}
-x = rnorm(1000000)
-y = x
-lx = list(x)
-lxy = list(x, y)
-object_size(lx)
-object_size(lxy)
-```
-
-However, when serializing `lxy`, both `x` and `y` are serialized independently, and when loading the object again, its size in memory is doubled.
-
-```{r}
-saveRDS(lxy, path)
-lxy_reloaded = readRDS(path)
-object_size(lxy_reloaded)
-```
-
-While we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this issue, the problem is still present to some degree.
-
-
-## Serializing Closures
-
-Another possible source of increased object sizes is the serialization of closures.
-Some objects in `mlr3` can be configured with closures.
-One example is `PipeOpColApply`, which has a parameter `applicator` that is applied to the columns of a task.
-Let's say we want to use it to center the columns of a task.
-
-```{r}
-center = function(x) x - mean(x)
-po_center = po("colapply", applicator = center)
-```
-
-The graph has one of its parameters now set to a closure.
-This means that when we save the parameter values (which we usually want to do, so that we know afterwards how the model was trained), the environment of the closure is also serialized.
-However, this is not the case if the enclosing environment is either a package environment or the `.GlobalEnv`.
-In your scripts, however, the enclosing environment of a function might be neither a package environment nor the global environment.
-To show the effect, we construct the closure using a function factory.
-
-```{r}
-make_center_large = function() {
-  some_stuff = rnorm(1000000)
-
-  function(x) {
-    x - mean(x)
-  }
-}
-center_large = make_center_large()
-```
-
-Even though the `center_large` function itself is quite small in terms of source code, its enclosing environment is not.
-
-```{r}
-object_size(center_large)
-object_size(environment(center_large)$some_stuff)
-```
-
-The size of the earlier defined `center` function is negligible, because its enclosing environment is the global environment.
-
-```{r}
-object_size(center)
-```
-
-Such large closures can be especially problematic when running large benchmark experiments, e.g. via `r ref_pkg("mlr3batchmark")`, where the learner states (which include the parameter values) of each resample iteration are written to disk independently.
-
-```{r}
-library("mlr3batchmark")
-library("batchtools")
-
-reg = makeExperimentRegistry(NA, seed = 1)
-
-glrn_large = as_learner(
-  po("colapply", applicator = center_large) %>>%
-  lrn("regr.rpart")
-)
-glrn_large$id = "large"
-
-glrn = as_learner(
-  po("colapply", applicator = function(x) x - mean(x)) %>>%
-  lrn("regr.rpart")
-)
-
-design = benchmark_grid(task, list(glrn, glrn_large), rsmp("cv"))
-
-batchmark(design)
-submitJobs()
-```
-
-We can compare the sizes of the resulting benchmark results after reassembling them and notice a tremendous difference.
- -```{r} -object_size(reduceResultsBatchmark(1:10)) -object_size(reduceResultsBatchmark(11:20)) -``` - - -## Sourcerefs - - From b326b730f1fffcd883e5e3b421bf726d5e152fd6 Mon Sep 17 00:00:00 2001 From: Sebastian Fischer Date: Tue, 5 Dec 2023 13:02:41 +0100 Subject: [PATCH 4/7] render post --- .../index/execute-results/html.json | 14 +++++++++++++ .../2023-11-09-memory-disk/index.qmd | 20 ++++++++++++++----- 2 files changed, 29 insertions(+), 5 deletions(-) create mode 100644 mlr-org/_freeze/gallery/technical/2023-11-09-memory-disk/index/execute-results/html.json diff --git a/mlr-org/_freeze/gallery/technical/2023-11-09-memory-disk/index/execute-results/html.json b/mlr-org/_freeze/gallery/technical/2023-11-09-memory-disk/index/execute-results/html.json new file mode 100644 index 00000000..caad9d97 --- /dev/null +++ b/mlr-org/_freeze/gallery/technical/2023-11-09-memory-disk/index/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "f0492f5628b224e19cebaaa4b66e05db", + "result": { + "markdown": "---\ntitle: \"Why is mlr3 Eating my Disk / RAM?\"\ndescription: |\n Possible explanations for exuberant memory usage of mlr3 objects.\ncategories:\n - tuning\n - classification\nauthor:\n - name: Marc Becker\n url: https://github.com/be-marc\n - name: Sebastian Fischer\n url: https://github.com/sebffischer\ndate: 2023-11-09\nknitr:\n opts_chunk:\n R.options:\n datatable.print.nrows: 6\n datatable.print.trunc.cols: TRUE\nfreeze: true\n---\n\n\n\n\n\n\n\nWhen serializing `mlr3` objects, it can happen that their size is much larger than expected.\nWhile we are trying our best to keep such cases from happening, there are things that are beyond our control.\nThis gallery post serves as a technical trouble-shooting guide that covers various issues and offers solutions where possible.\nWe will update this post as new problems come to our attention.\nNote that while some of these issues might seem neglibile, they can cause serious problems when running large benchmrk experiments, e.g. using `mlr3batchmark`.\n\n## Avoid Installating Packages With Source References\n\nSome objects in `mlr3` have parameters that can be functions.\nOne example for that is `po(\"colapply\")`'s `applicator` parameter.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(\"mlr3verse\")\nlibrary(\"pryr\")\npo_center = po(\"colapply\", applicator = function(x) x - mean(x))\n```\n:::\n\n\nBecause `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to ensure that their size is small.\nOne cause for large sizes of parameter values is the presence of source references in the function's attributes.\nSource references are kept when installing packages with the `--with-keep.source` option.\nNote that this option is enabled by default when installing packages with `renv`.\nYou can disble it by setting the following option before installing packages, e.g. 
by adding it to your `.Rprofile`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\noptions(\"install.opts\" = \"--without-keep.source\")\n```\n:::\n\n\n## Duplication of Data When Serializing\n\nAnjor cause for increased object size is how R duplicates data when serializing objects.\nConsider the simple example below:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx = rnorm(1000000)\ny = x\nlx = list(x)\nlxy = list(x, y)\nobject_size(lx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.00 MB\n```\n:::\n\n```{.r .cell-code}\nobject_size(lxy)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.00 MB\n```\n:::\n:::\n\n\nBecause of R's copy-on-write semantics, data is only copied when it is modified, i.e. the list `lx` has the same size as `lxy` because `x` and `y` all point to the same underlying data.\nHowever, when serializing `lxy`, both `x` and `y` are serialized independently and its memory footprint is doubled.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nobject_size(serialize(lxy, NULL))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n16.00 MB\n```\n:::\n:::\n\n\nBecause data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage.\nWhile we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this to some extent, it is impossible to get rid of the problem completely.\n\n## Setting the Correct Flags\n\nAnother -- easily amendable -- source for large object sizes is forgetting to set the right flags.\nThe list below contains some important configuration options that can be used to reduce the size of important `mlr3` objects:\n\n* `benchmark()` and `resample()` have the flags `store_backends` and `store_models`\n* `auto_tuner` has flags `store_tuning_instance` and `store_benchmark_result`\n* `tune()` has flags `store_benchmark_result` and `store_models`\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd index c7df1487..731df48c 100644 --- a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd +++ b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd @@ -16,10 +16,17 @@ knitr: R.options: datatable.print.nrows: 6 datatable.print.trunc.cols: TRUE +freeze: true --- {{< include ../../_setup.qmd >}} +```{r} +#| include: false +lgr::get_logger("mlr3")$set_threshold("warn") +``` + + When serializing `mlr3` objects, it can happen that their size is much larger than expected. While we are trying our best to keep such cases from happening, there are things that are beyond our control. This gallery post serves as a technical trouble-shooting guide that covers various issues and offers solutions where possible. @@ -31,14 +38,17 @@ Note that while some of these issues might seem neglibile, they can cause seriou Some objects in `mlr3` have parameters that can be functions. One example for that is `po("colapply")`'s `applicator` parameter. 
-```{r} +```{r, output = FALSE} +library("mlr3verse") +library("pryr") po_center = po("colapply", applicator = function(x) x - mean(x)) ``` Because `Learner`s store the hyperparameters that were used for training in their `$state`, it is important to ensure that their size is small. -One cause for large sizes of parameter values is the use of source references, i.e. when installing packages with the `--with-keep.source` option. +One cause for large sizes of parameter values is the presence of source references in the function's attributes. +Source references are kept when installing packages with the `--with-keep.source` option. Note that this option is enabled by default when installing packages with `renv`. -You can disble it by setting the following option before installing packages, e.g. by adding the following to your `.Rprofile`. +You can disble it by setting the following option before installing packages, e.g. by adding it to your `.Rprofile`. ```{r, eval = FALSE} options("install.opts" = "--without-keep.source") @@ -59,10 +69,10 @@ object_size(lxy) ``` Because of R's copy-on-write semantics, data is only copied when it is modified, i.e. the list `lx` has the same size as `lxy` because `x` and `y` all point to the same underlying data. -However, when serializing `lxy`, both `x` and `y` are serialized independently and when loading the object again, its size in memory is doubled. +However, when serializing `lxy`, both `x` and `y` are serialized independently and its memory footprint is doubled. ```{r} -pryr::object_size(serialie(lxy, NULL)) +object_size(serialize(lxy, NULL)) ``` Because data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage. From 3cec192b3c8f24d70c8de2e0f24c85666bb26a5b Mon Sep 17 00:00:00 2001 From: Sebastian Fischer Date: Tue, 5 Dec 2023 13:03:05 +0100 Subject: [PATCH 5/7] ... 
---
 mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
index 731df48c..4d1145dc 100644
--- a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
+++ b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
@@ -16,7 +16,6 @@ knitr:
     R.options:
       datatable.print.nrows: 6
       datatable.print.trunc.cols: TRUE
-freeze: true
 ---
 
 {{< include ../../_setup.qmd >}}

From c2c597840892d65737b57ef88c2390be626883c1 Mon Sep 17 00:00:00 2001
From: Sebastian Fischer
Date: Wed, 6 Dec 2023 12:25:13 +0100
Subject: [PATCH 6/7] Update mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd

Co-authored-by: Marc Becker <33069354+be-marc@users.noreply.github.com>
---
 mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
index 4d1145dc..585f8497 100644
--- a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
+++ b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
@@ -75,7 +75,7 @@ object_size(serialize(lxy, NULL))
 ```
 
 Because data is serialized not only when manually saving objects, but also when parallelizing execution via `future` or when using encapsulation, this can cause the same information to be duplicated many times and blow up both RAM and disk usage.
-While we have some mechanisms (like `mlr3misc::leanify`) in place to counteract this to some extent, it is impossible to get rid of the problem completely.
+While we have some mechanisms (like `mlr3misc::leanify` and database normalization) in place to counteract this to some extent, it is impossible to get rid of the problem completely.
 
 ## Setting the Correct Flags

From adbc0d3848ae0b628a49293872673162577f63fc Mon Sep 17 00:00:00 2001
From: Sebastian Fischer
Date: Wed, 6 Dec 2023 12:25:19 +0100
Subject: [PATCH 7/7] Update mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd

Co-authored-by: Marc Becker <33069354+be-marc@users.noreply.github.com>
---
 mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
index 585f8497..50dd7863 100644
--- a/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
+++ b/mlr-org/gallery/technical/2023-11-09-memory-disk/index.qmd
@@ -55,7 +55,7 @@
 
 ## Duplication of Data When Serializing
 
-Anjor cause for increased object size is how R duplicates data when serializing objects.
+Another cause for increased object size is how R duplicates data when serializing objects.
 Consider the simple example below:
 
 ```{r}