-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Batches to zarr #40
base: main
Are you sure you want to change the base?
Batches to zarr #40
Conversation
Add `BatchGenerator.to_zarr` and `BatchGenerator.from_zarr` to make it possible to save generated batches to zarr and later load them from zarr. By chunking along the batch dimension this enables fast data-loading at training time.
Codecov Report
@@ Coverage Diff @@
## main #40 +/- ##
===========================================
- Coverage 100.00% 96.95% -3.05%
===========================================
Files 3 3
Lines 134 164 +30
Branches 30 38 +8
===========================================
+ Hits 134 159 +25
- Misses 0 2 +2
- Partials 0 3 +3
Continue to review full report at Codecov.
|
@leifdenby - thanks for opening this PR and apologies the review wasn't picked up sooner. My main question/comment is about whether or not we want to think about serializing some of the batch-generator's attributes in the Zarr dataset. It seems like without too much effort, we could effectively reconstruct the full BatchGenerator attribute namespace. Also, as an aside, this fits nicely within the caching-api's element in the xbatcher roadmap: https://xbatcher.readthedocs.io/en/latest/roadmap.html#caching-apis. We hadn't gotten there yet but I'm glad to see this moving. |
Yes, that sounds like a good idea. Do you have list of attributes in mind? Looking at
Great! :) I've used this a few times now and it works well for me. |
Yes, this is exactly what I was thinking. Also, looking at the code coverage report, it looks like we're in pretty good shape but could use a bit more testing on the edge cases. I'll leave a few more comments in the code to highlight areas that could use a test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few pointers for possible test improvements
ds_all = xr.concat(batch_datasets, dim='batch_number').reset_index( | ||
'sample' | ||
) | ||
if 'batch' in chunks: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test when 'batch' not in chunks
if 'batch' in chunks: | ||
chunks['batch_number'] = chunks.pop('batch') | ||
|
||
if len(chunks) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test when len(chunks) == 0
self.path = path | ||
|
||
def __iter__(self): | ||
for batch_id in self.ds_batches.batch_number.values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not exactly why but codecov think something in this for loop is not being covered by the existing tests. Perhaps its the empty iterable (.values) or it could be the
if` statement in line 194. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in computere by the way
Add
BatchGenerator.to_zarr
andBatchGenerator.from_zarr
to make it possible to save generated batches to zarr and later load them from zarr. By chunking along the batch dimension this enables fast data-loading at training time.