Skip to content

Commit

Permalink
Add documentation for caching feature (#237)
Browse files Browse the repository at this point in the history
  • Loading branch information
andersy005 authored Sep 17, 2024
1 parent dcd09a0 commit 504d293
Show file tree
Hide file tree
Showing 6 changed files with 282 additions and 1 deletion.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -66,3 +66,4 @@ target/
# tests
.pytest_cache/*
.mypy_cache/
.ipynb_checkpoints/
3 changes: 3 additions & 0 deletions ci/requirements/doc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -20,9 +20,12 @@ dependencies:
- adlfs
- ipykernel
- nbsphinx
- netcdf4
- pooch
- zarr
# Editable xbatcher installation
- pip

- pip:
# relative to this file. Needs to be editable to be accepted.
- -e ../..
2 changes: 1 addition & 1 deletion doc/contributing.rst
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
.. _contributing:

******************
Contributing guide
Contributing Guide
******************

.. note::
Expand Down
1 change: 1 addition & 0 deletions doc/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,7 @@ or via a built-in `Xarray accessor <http://xarray.pydata.org/en/stable/internals
:caption: Contents:

api
user-guide/index
tutorials-and-presentations
roadmap
contributing
268 changes: 268 additions & 0 deletions doc/user-guide/caching.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,268 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Xbatcher Caching Feature \n",
"\n",
"This notebook demonstrates the new caching feature added to xbatcher's `BatchGenerator`. This feature allows you to cache batches, potentially improving performance for repeated access to the same batches. \n",
"\n",
"\n",
"## Introduction\n",
"\n",
"The caching feature in xbatcher's `BatchGenerator` allows you to store generated batches in a cache, which can significantly speed up subsequent accesses to the same batches. This is particularly useful in scenarios where you need to iterate over the same dataset multiple times. \n",
"\n",
"\n",
"The cache is pluggable, meaning you can use any dict-like object to store the cache. This flexibility allows for various storage backends, including local storage, distributed storage systems, or cloud storage solutions.\n",
"\n",
"## Installation \n",
"\n",
"To use the caching feature, you'll need to have xbatcher installed, along with zarr for serialization. If you haven't already, you can install these using pip:\n",
"\n",
"```bash\n",
"python -m pip install xbatcher zarr\n",
"```\n",
"\n",
"or \n",
"\n",
"using conda:\n",
"\n",
"```bash\n",
"conda install -c conda-forge xbatcher zarr\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Usage \n",
"\n",
"Let's start with a basic example of how to use the caching feature:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tempfile\n",
"\n",
"import xarray as xr\n",
"import zarr\n",
"\n",
"import xbatcher"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create a cache using Zarr's DirectoryStore\n",
"directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n",
"print(directory)\n",
"cache = zarr.storage.DirectoryStore(directory)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we're using a local directory to store the cache, but you could use any zarr-compatible store, such as S3, Redis, etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load a sample dataset\n",
"ds = xr.tutorial.open_dataset('air_temperature', chunks={})\n",
"ds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create a BatchGenerator with caching enabled\n",
"gen = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10}, cache=cache)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Performance Comparison\n",
"\n",
"\n",
"Let's compare the performance with and without caching:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"\n",
"def time_iteration(gen):\n",
" start = time.time()\n",
" for batch in gen:\n",
" pass\n",
" end = time.time()\n",
" return end - start"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n",
"cache = zarr.storage.DirectoryStore(directory)\n",
"\n",
"# Without cache\n",
"gen_no_cache = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10})\n",
"time_no_cache = time_iteration(gen_no_cache)\n",
"print(f'Time without cache: {time_no_cache:.2f} seconds')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# With cache\n",
"gen_with_cache = xbatcher.BatchGenerator(\n",
" ds, input_dims={'lat': 10, 'lon': 10}, cache=cache\n",
")\n",
"time_first_run = time_iteration(gen_with_cache)\n",
"print(f'Time with cache (first run): {time_first_run:.2f} seconds')\n",
"\n",
"\n",
"time_second_run = time_iteration(gen_with_cache)\n",
"print(f'Time with cache (second run): {time_second_run:.2f} seconds')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see that the second run with cache is significantly faster than both the first run and the run without cache."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advanced Usage \n",
"\n",
"### Custom Cache Preprocessing\n",
"\n",
"You can also specify a custom preprocessing function to be applied to batches before they are cached:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create a cache using Zarr's DirectoryStore\n",
"directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n",
"cache = zarr.storage.DirectoryStore(directory)\n",
"\n",
"\n",
"def preprocess_batch(batch):\n",
" # example: add a new variable to each batch\n",
" batch['new_var'] = batch['air'] * 2\n",
" return batch\n",
"\n",
"\n",
"gen_with_preprocess = xbatcher.BatchGenerator(\n",
" ds,\n",
" input_dims={'lat': 10, 'lon': 10},\n",
" cache=cache,\n",
" cache_preprocess=preprocess_batch,\n",
")\n",
"\n",
"# Now, each cached batch will include the 'new_var' variable\n",
"for batch in gen_with_preprocess:\n",
" print(batch)\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Different Storage Backends\n",
"\n",
"While we've been using a local directory for caching, you can use any dict-like that is compatible with zarr. For example, you could use an S3 bucket as the cache storage backend:\n",
"\n",
"```python\n",
"import s3fs\n",
"import zarr \n",
"\n",
"# Set up S3 filesystem (you'll need appropriate credentials)\n",
"s3 = s3fs.S3FileSystem(anon=False)\n",
"store = s3.get_mapper('s3://my-bucket/my-cache.zarr')\n",
"\n",
"# Use this cache with BatchGenerator\n",
"gen_s3 = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10}, cache=cache)\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Considerations and Best Practices \n",
"\n",
"- **Storage Space**: Be mindful of the storage space required for your cache, especially when working with large datasets.\n",
"- **Cache Invalidation**: The current implementation doesn't handle cache invalidation. If your source data changes, you'll need to manually clear or update the cache.\n",
"- **Performance Tradeoffs**: While caching can significantly speed up repeated access to the same data, the initial caching process may be slower than processing without a cache. Consider your use case to determine if caching is beneficial.\n",
"- **Storage Backend**: Choose a storage backend that's appropriate for your use case. Local storage might be fastest for single-machine applications, while distributed or cloud storage might be necessary for cluster computing or cloud-based workflows.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
8 changes: 8 additions & 0 deletions doc/user-guide/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
User Guide
===========

.. toctree::
:maxdepth: 2
:caption: Contents:

caching

0 comments on commit 504d293

Please sign in to comment.