ray dataset not handled well post downloading deltas #75

valiantljk · 2023-02-09T22:02:05Z

https://github.com/ray-project/deltacat/blob/main/deltacat/compute/compactor/steps/hash_bucket.py#L121

If storage_type is StorageType.DISTRIBUTED, the returned tables object is a distributed Ray Dataset rather than a list of local tables.
Several changes are needed to handle a distributed dataset:

replace len() with ds.count(), since len() is not supported on a Ray Dataset
figure out how to enumerate a Ray Dataset, since a 'Dataset' object is not iterable
figure out how to avoid a memory copy if the Ray Dataset has to be converted into a pa.Table, etc.
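
A minimal sketch of how the first two items might be handled, not the actual hash_bucket.py change. The helper names (is_distributed, total_rows, iter_tables) are hypothetical, and it assumes the len() call is used to count rows. The distributed branch relies on ds.count() and ds.iter_batches(batch_format="pyarrow") from the Ray Dataset API. For the third item, ds.to_arrow_refs() may be worth evaluating, since Ray documents it as zero-copy when the blocks are already in Arrow format; that is not exercised here.

```python
from typing import Any, Iterator


def is_distributed(tables: Any) -> bool:
    """Duck-typed check (assumption): a Ray Dataset exposes both
    count() and iter_batches(); a plain list of tables does not
    have iter_batches()."""
    return callable(getattr(tables, "count", None)) and callable(
        getattr(tables, "iter_batches", None)
    )


def total_rows(tables: Any) -> int:
    """Row count that works for both storage types."""
    if is_distributed(tables):
        # len(ds) is not supported on a Ray Dataset; count() is.
        return tables.count()
    # Local case: a list of tables, each of which supports len().
    return sum(len(t) for t in tables)


def iter_tables(tables: Any) -> Iterator[Any]:
    """Yield table-like chunks so callers can use enumerate()."""
    if is_distributed(tables):
        # A Dataset is not iterable directly; iterate its batches.
        yield from tables.iter_batches(batch_format="pyarrow")
    else:
        yield from tables
```

With these in place, a loop such as `for i, table in enumerate(iter_tables(tables))` would work for either storage type. Note that ds.count() may trigger execution of the dataset, so it is best called once rather than inside a loop.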


valiantljk self-assigned this Feb 9, 2023

valiantljk added the bug (Something isn't working) and P1 (Resolve if not working on P0, < 2 weeks) labels Feb 9, 2023
