Chunked writing of h5py.Dataset and zarr.Array #1624

ivirshup · 2024-08-28T23:44:39Z

- Closes "Writing a h5py.Dataset loads the whole thing into memory" #1623
- Tests added
  - This might already be good on tests, but I should check
- Release note added (or unnecessary)
- Add benchmarks

This PR fixes #1623 by writing backed dense arrays in chunks.
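
As a rough illustration of what "writing in chunks" means here, the sketch below walks an N-dimensional array block by block and copies one block at a time, so only a single block is held in memory. The names `iter_chunk_slices` and `copy_in_chunks` are hypothetical and are not the functions added in this PR; the sketch only assumes that source and destination expose `.shape` and accept a tuple of slices as an index, as `h5py.Dataset`, `zarr.Array`, and NumPy arrays do.

```python
from itertools import product


def iter_chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk-aligned block of an array.

    Hypothetical helper for illustration only, not the PR's implementation.
    `chunks` must hold one positive block extent per dimension of `shape`.
    """
    # Start offsets of every block along each dimension.
    ranges = [range(0, dim, step) for dim, step in zip(shape, chunks)]
    for starts in product(*ranges):
        # Clip the last block along each axis to the array boundary.
        yield tuple(
            slice(start, min(start + step, dim))
            for start, step, dim in zip(starts, chunks, shape)
        )


def copy_in_chunks(src, dest, chunks):
    """Copy `src` into `dest` one block at a time.

    Only one block is materialized in memory per iteration.
    """
    for block in iter_chunk_slices(src.shape, chunks):
        dest[block] = src[block]
```

With a 5 x 4 array and blocks of shape (2, 4), the iterator yields three blocks, the last of which is clipped to a single row.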

Very open to feedback on the logic for selecting the chunking pattern of the writes. Maybe we should prioritize the chunking of the destination array over that of the source array?
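
To make the question concrete, here is one possible policy written out as code. It is a sketch of the alternative being asked about (destination chunking first), not the logic in this PR. The function name `choose_copy_chunks` and the 1000-row fallback are assumptions made for illustration. It relies on the fact that a contiguous `h5py.Dataset` reports `chunks` as `None`.

```python
def choose_copy_chunks(shape, src_chunks=None, dest_chunks=None, fallback_rows=1000):
    """Pick a block shape for a chunked copy of an array with at least one dimension.

    Hypothetical policy for discussion: prefer the destination's chunking,
    then the source's, then fall back to bands of rows.
    """
    for candidate in (dest_chunks, src_chunks):
        if candidate is not None:
            # Clamp each block extent to the array shape, keeping it >= 1.
            return tuple(max(1, min(c, dim)) for c, dim in zip(candidate, shape))
    # Neither side is chunked: copy whole rows, `fallback_rows` at a time.
    # The default of 1000 rows is an arbitrary illustrative value.
    rows = max(1, min(shape[0], fallback_rows))
    return (rows, *shape[1:])
```

Aligning writes to the destination's chunks means each destination chunk is written (and, with compression, encoded) once, at the cost of possibly reading a source chunk more than once; aligning to the source has the opposite trade-off.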

cc: @ebezzi

Some proof it works:

%load_ext memory_profiler

import h5py
from anndata.experimental import write_elem
import numpy as np

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

%memit write_elem(f, "X", X)
# peak memory: 945.67 MiB, increment: 0.56 MiB

%memit write_elem(f, "X2", f["X"])
# peak memory: 1047.00 MiB, increment: 101.12 MiB

%memit write_elem(f, "X3", f["X"], dataset_kwargs={"compression":"gzip"})
# peak memory: 1068.03 MiB, increment: 6.14 MiB

codecov · 2024-08-29T00:10:41Z

Codecov Report

Attention: Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.99%. Comparing base (8cc5a18) to head (32e008d).
Report is 41 commits behind head on main.

Files with missing lines	Patch %	Lines
src/anndata/_io/specs/methods.py	93.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
- Coverage   86.74%   83.99%   -2.75%     
==========================================
  Files          37       40       +3     
  Lines        5988     6031      +43     
==========================================
- Hits         5194     5066     -128     
- Misses        794      965     +171

Flag	Coverage Δ
	`83.99% <93.33%> (?)`

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
src/anndata/_io/specs/methods.py	`88.77% <93.33%> (+0.37%)`	⬆️

... and 23 files with indirect coverage changes

ilan-gold · 2024-08-29T08:32:13Z

src/anndata/_io/specs/methods.py

@@ -392,6 +389,46 @@ def write_basic(
    f.create_dataset(k, data=elem, **dataset_kwargs)


+def _iter_chunks_for_copy(elem, dest):


Typing please!

ilan-gold · 2024-08-29T08:33:50Z

src/anndata/_io/specs/methods.py

+def write_chunked_dense_array(
+    f: GroupStorageType,
+    k: str,
+    elem,


Suggested change

-    elem,
+    elem: GroupStorageType,

no?

ArrayStorageType I think

Suggested change

-    elem,
+    elem: ArrayStorageType,

ivirshup and others added 3 commits August 28, 2024 16:40

Chunked writing of h5py.Dataset and zarr.Array

d60c3ab

[pre-commit.ci] auto fixes from pre-commit.com hooks

232bee4

for more information, see https://pre-commit.ci

Make n-dimensional

c43c5e2

Add some tests, which fail :(

749880b

ilan-gold reviewed Aug 29, 2024

ilan-gold added the skip-gpu-ci label Aug 29, 2024

ilan-gold added this to the 0.10.10 milestone Aug 29, 2024

Fix up chunking algorithm + add some types

32e008d

ilan-gold Aug 29, 2024

ilan-gold Aug 29, 2024

ivirshup Aug 30, 2024

ilan-gold Sep 2, 2024

ilan-gold Sep 2, 2024
