-
Notifications
You must be signed in to change notification settings - Fork 27
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add documentation for caching feature (#237)
- Loading branch information
1 parent
dcd09a0
commit 504d293
Showing
6 changed files
with
282 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -66,3 +66,4 @@ target/ | |
# tests | ||
.pytest_cache/* | ||
.mypy_cache/ | ||
.ipynb_checkpoints/ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
.. _contributing: | ||
|
||
****************** | ||
Contributing guide | ||
Contributing Guide | ||
****************** | ||
|
||
.. note:: | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,268 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Xbatcher Caching Feature \n", | ||
"\n", | ||
"This notebook demonstrates the new caching feature added to xbatcher's `BatchGenerator`. This feature allows you to cache batches, potentially improving performance for repeated access to the same batches. \n", | ||
"\n", | ||
"\n", | ||
"## Introduction\n", | ||
"\n", | ||
"The caching feature in xbatcher's `BatchGenerator` allows you to store generated batches in a cache, which can significantly speed up subsequent accesses to the same batches. This is particularly useful in scenarios where you need to iterate over the same dataset multiple times. \n", | ||
"\n", | ||
"\n", | ||
"The cache is pluggable, meaning you can use any dict-like object to store the cache. This flexibility allows for various storage backends, including local storage, distributed storage systems, or cloud storage solutions.\n", | ||
"\n", | ||
"## Installation \n", | ||
"\n", | ||
"To use the caching feature, you'll need to have xbatcher installed, along with zarr for serialization. If you haven't already, you can install these using pip:\n", | ||
"\n", | ||
"```bash\n", | ||
"python -m pip install xbatcher zarr\n", | ||
"```\n", | ||
"\n", | ||
"or \n", | ||
"\n", | ||
"using conda:\n", | ||
"\n", | ||
"```bash\n", | ||
"conda install -c conda-forge xbatcher zarr\n", | ||
"```\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Basic Usage \n", | ||
"\n", | ||
"Let's start with a basic example of how to use the caching feature:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import tempfile\n", | ||
"\n", | ||
"import xarray as xr\n", | ||
"import zarr\n", | ||
"\n", | ||
"import xbatcher" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# create a cache using Zarr's DirectoryStore\n", | ||
"directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n", | ||
"print(directory)\n", | ||
"cache = zarr.storage.DirectoryStore(directory)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In this example, we're using a local directory to store the cache, but you could use any zarr-compatible store, such as S3, Redis, etc." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# load a sample dataset\n", | ||
"ds = xr.tutorial.open_dataset('air_temperature', chunks={})\n", | ||
"ds" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# create a BatchGenerator with caching enabled\n", | ||
"gen = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10}, cache=cache)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Performance Comparison\n", | ||
"\n", | ||
"\n", | ||
"Let's compare the performance with and without caching:\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import time\n", | ||
"\n", | ||
"\n", | ||
"def time_iteration(gen):\n", | ||
" start = time.time()\n", | ||
" for batch in gen:\n", | ||
" pass\n", | ||
" end = time.time()\n", | ||
" return end - start" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n", | ||
"cache = zarr.storage.DirectoryStore(directory)\n", | ||
"\n", | ||
"# Without cache\n", | ||
"gen_no_cache = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10})\n", | ||
"time_no_cache = time_iteration(gen_no_cache)\n", | ||
"print(f'Time without cache: {time_no_cache:.2f} seconds')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# With cache\n", | ||
"gen_with_cache = xbatcher.BatchGenerator(\n", | ||
" ds, input_dims={'lat': 10, 'lon': 10}, cache=cache\n", | ||
")\n", | ||
"time_first_run = time_iteration(gen_with_cache)\n", | ||
"print(f'Time with cache (first run): {time_first_run:.2f} seconds')\n", | ||
"\n", | ||
"\n", | ||
"time_second_run = time_iteration(gen_with_cache)\n", | ||
"print(f'Time with cache (second run): {time_second_run:.2f} seconds')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"You should see that the second run with cache is significantly faster than both the first run and the run without cache." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Advanced Usage \n", | ||
"\n", | ||
"### Custom Cache Preprocessing\n", | ||
"\n", | ||
"You can also specify a custom preprocessing function to be applied to batches before they are cached:\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# create a cache using Zarr's DirectoryStore\n", | ||
"directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n", | ||
"cache = zarr.storage.DirectoryStore(directory)\n", | ||
"\n", | ||
"\n", | ||
"def preprocess_batch(batch):\n", | ||
" # example: add a new variable to each batch\n", | ||
" batch['new_var'] = batch['air'] * 2\n", | ||
" return batch\n", | ||
"\n", | ||
"\n", | ||
"gen_with_preprocess = xbatcher.BatchGenerator(\n", | ||
" ds,\n", | ||
" input_dims={'lat': 10, 'lon': 10},\n", | ||
" cache=cache,\n", | ||
" cache_preprocess=preprocess_batch,\n", | ||
")\n", | ||
"\n", | ||
"# Now, each cached batch will include the 'new_var' variable\n", | ||
"for batch in gen_with_preprocess:\n", | ||
" print(batch)\n", | ||
" break" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Using Different Storage Backends\n", | ||
"\n", | ||
"While we've been using a local directory for caching, you can use any dict-like that is compatible with zarr. For example, you could use an S3 bucket as the cache storage backend:\n", | ||
"\n", | ||
"```python\n", | ||
"import s3fs\n", | ||
"import zarr \n", | ||
"\n", | ||
"# Set up S3 filesystem (you'll need appropriate credentials)\n", | ||
"s3 = s3fs.S3FileSystem(anon=False)\n", | ||
"store = s3.get_mapper('s3://my-bucket/my-cache.zarr')\n", | ||
"\n", | ||
"# Use this cache with BatchGenerator\n", | ||
"gen_s3 = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10}, cache=cache)\n", | ||
"```\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Considerations and Best Practices \n", | ||
"\n", | ||
"- **Storage Space**: Be mindful of the storage space required for your cache, especially when working with large datasets.\n", | ||
"- **Cache Invalidation**: The current implementation doesn't handle cache invalidation. If your source data changes, you'll need to manually clear or update the cache.\n", | ||
"- **Performance Tradeoffs**: While caching can significantly speed up repeated access to the same data, the initial caching process may be slower than processing without a cache. Consider your use case to determine if caching is beneficial.\n", | ||
"- **Storage Backend**: Choose a storage backend that's appropriate for your use case. Local storage might be fastest for single-machine applications, while distributed or cloud storage might be necessary for cluster computing or cloud-based workflows.\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.9" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
User Guide | ||
=========== | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
:caption: Contents: | ||
|
||
caching |