Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batches to zarr #40

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

leifdenby
Copy link

Add BatchGenerator.to_zarr and BatchGenerator.from_zarr to make it possible to save generated batches to zarr and later load them from zarr. By chunking along the batch dimension this enables fast data-loading at training time.

Add `BatchGenerator.to_zarr` and `BatchGenerator.from_zarr` to make it
possible to save generated batches to zarr and later load them from
zarr. By chunking along the batch dimension this enables fast
data-loading at training time.
@codecov
Copy link

codecov bot commented Feb 3, 2022

Codecov Report

Merging #40 (08a9e94) into main (34ca3f9) will decrease coverage by 3.04%.
The diff coverage is 85.71%.

@@             Coverage Diff             @@
##              main      #40      +/-   ##
===========================================
- Coverage   100.00%   96.95%   -3.05%     
===========================================
  Files            3        3              
  Lines          134      164      +30     
  Branches        30       38       +8     
===========================================
+ Hits           134      159      +25     
- Misses           0        2       +2     
- Partials         0        3       +3     
Impacted Files Coverage Δ
xbatcher/generators.py 95.65% <85.71%> (-4.35%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34ca3f9...08a9e94. Read the comment docs.

@jhamman
Copy link
Contributor

jhamman commented Feb 3, 2022

@leifdenby - thanks for opening this PR and apologies the review wasn't picked up sooner.

My main question/comment is about whether or not we want to think about serializing some of the batch-generator's attributes in the Zarr dataset. It seems like without too much effort, we could effectively reconstruct the full BatchGenerator attribute namespace.

Also, as an aside, this fits nicely within the caching-api's element in the xbatcher roadmap: https://xbatcher.readthedocs.io/en/latest/roadmap.html#caching-apis. We hadn't gotten there yet but I'm glad to see this moving.

@leifdenby
Copy link
Author

leifdenby commented Feb 8, 2022

My main question/comment is about whether or not we want to think about serializing some of the batch-generator's attributes in the Zarr dataset. It seems like without too much effort, we could effectively reconstruct the full BatchGenerator attribute namespace.

Yes, that sounds like a good idea. Do you have list of attributes in mind? Looking at BatchGenerator.__init__ maybe I should start with input_dims, input_overlap, batch_dims and concat_input_dims. I don't think it makes sense to store preload_batch and of course not the source dataset ds. What do you think?

Also, as an aside, this fits nicely within the caching-api's element in the xbatcher roadmap: https://xbatcher.readthedocs.io/en/latest/roadmap.html#caching-apis. We hadn't gotten there yet but I'm glad to see this moving.

Great! :) I've used this a few times now and it works well for me.

@jhamman
Copy link
Contributor

jhamman commented Feb 9, 2022

Looking at BatchGenerator.init maybe I should start with input_dims, input_overlap, batch_dims and concat_input_dims

Yes, this is exactly what I was thinking.

Also, looking at the code coverage report, it looks like we're in pretty good shape but could use a bit more testing on the edge cases. I'll leave a few more comments in the code to highlight areas that could use a test.

Copy link
Contributor

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few pointers for possible test improvements

ds_all = xr.concat(batch_datasets, dim='batch_number').reset_index(
'sample'
)
if 'batch' in chunks:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test when 'batch' not in chunks

if 'batch' in chunks:
chunks['batch_number'] = chunks.pop('batch')

if len(chunks) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test when len(chunks) == 0

self.path = path

def __iter__(self):
for batch_id in self.ds_batches.batch_number.values:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly why but codecov think something in this for loop is not being covered by the existing tests. Perhaps its the empty iterable (.values) or it could be the if` statement in line 194. Any thoughts?

Copy link
Contributor

@RichardScottOZ RichardScottOZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in computere by the way

@jhamman jhamman mentioned this pull request Oct 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants