Chunked writing of h5py.Dataset and zarr.Array #1624

Open
wants to merge 5 commits into
base: main

@ivirshup (Member) commented Aug 28, 2024

This PR fixes #1623 by writing backed dense arrays in chunks.

Very open to feedback on the logic for selecting the chunking pattern of the writes. Maybe we should prioritize the chunking of the destination array over the chunking of the source array?
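To make the question concrete, here is a minimal sketch of one possible policy. It is not the PR's `_iter_chunks_for_copy` logic: it assumes a row-wise copy between two objects that support slice indexing (as `h5py.Dataset` and `zarr.Array` do), and the names `iter_row_slices`, `copy_in_chunks`, `prefer`, and `fallback_rows` are hypothetical.

```python
from __future__ import annotations

from collections.abc import Iterator


def iter_row_slices(
    n_rows: int,
    src_chunk_rows: int | None = None,
    dest_chunk_rows: int | None = None,
    prefer: str = "dest",
    fallback_rows: int = 1000,
) -> Iterator[slice]:
    """Yield row slices that cover range(n_rows) without overlap.

    Hypothetical policy: use the preferred side's chunk height if it is
    known, otherwise the other side's, otherwise `fallback_rows`.
    """
    if prefer == "dest":
        candidates = (dest_chunk_rows, src_chunk_rows)
    elif prefer == "src":
        candidates = (src_chunk_rows, dest_chunk_rows)
    else:
        raise ValueError(f"prefer must be 'dest' or 'src', got {prefer!r}")
    # First chunk height that is set and non-zero, else the fallback.
    step = next((c for c in candidates if c), fallback_rows)
    for start in range(0, n_rows, step):
        yield slice(start, min(start + step, n_rows))


def copy_in_chunks(src, dest, n_rows: int, **kwargs) -> None:
    """Copy src into dest one row block at a time.

    Only one block is held in memory at once, which is the point of
    chunked writing.
    """
    for rows in iter_row_slices(n_rows, **kwargs):
        dest[rows] = src[rows]
```

Aligning writes with the destination's chunks means each destination chunk is written (and, if compressed, encoded) once, at the cost of possibly reading a source chunk more than once; preferring the source makes the opposite trade.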

cc: @ebezzi

Some proof it works:

%load_ext memory_profiler

import h5py
from anndata.experimental import write_elem
import numpy as np

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

%memit write_elem(f, "X", X)
# peak memory: 945.67 MiB, increment: 0.56 MiB

%memit write_elem(f, "X2", f["X"])
# peak memory: 1047.00 MiB, increment: 101.12 MiB

%memit write_elem(f, "X3", f["X"], dataset_kwargs={"compression":"gzip"})
# peak memory: 1068.03 MiB, increment: 6.14 MiB
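For scale, a back-of-the-envelope calculation, assuming `np.ones` used its default `float64` dtype (8 bytes per element): the full array is about 763 MiB, so both increments reported above are well below the cost of loading the whole dataset into memory.

```python
# Assumption: X = np.ones((10_000, 10_000)) is float64, i.e. 8 bytes per element.
n_rows, n_cols, itemsize = 10_000, 10_000, 8

full_mib = n_rows * n_cols * itemsize / 2**20
print(f"full array: {full_mib:.2f} MiB")  # full array: 762.94 MiB
```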

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.99%. Comparing base (8cc5a18) to head (32e008d).
Report is 41 commits behind head on main.

Files with missing lines Patch % Lines
src/anndata/_io/specs/methods.py 93.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
- Coverage   86.74%   83.99%   -2.75%     
==========================================
  Files          37       40       +3     
  Lines        5988     6031      +43     
==========================================
- Hits         5194     5066     -128     
- Misses        794      965     +171     
Flag Coverage Δ
83.99% <93.33%> (?)

Files with missing lines Coverage Δ
src/anndata/_io/specs/methods.py 88.77% <93.33%> (+0.37%) ⬆️

... and 23 files with indirect coverage changes

@@ -392,6 +389,46 @@ def write_basic(
    f.create_dataset(k, data=elem, **dataset_kwargs)


def _iter_chunks_for_copy(elem, dest):
Contributor

Typing please!

src/anndata/_io/specs/methods.py
def write_chunked_dense_array(
    f: GroupStorageType,
    k: str,
    elem,
Contributor

Suggested change
-     elem,
+     elem: GroupStorageType,

no?

Member Author

ArrayStorageType, I think.

Contributor

Yes!

Contributor

Suggested change
-     elem,
+     elem: ArrayStorageType,

@ilan-gold ilan-gold added this to the 0.10.10 milestone Aug 29, 2024
Development

Successfully merging this pull request may close these issues.

Writing a h5py.Dataset loads the whole thing into memory
2 participants