[BUG] Incorrect read_parquet on spark distributed parquet files. #16968

matt7salomon · 2024-10-01T19:55:51Z

When I read in a parquet dataset saved with Spark on a databricks catalog I get lots of .
I tried

import glob

import cudf

cudf_dfs = [cudf.read_parquet(file) for file in glob.glob("/Volumes/path/*.parquet")]
cudf_df = cudf.concat(cudf_dfs, ignore_index=True)

and

import dask_cudf

dask_df = dask_cudf.read_parquet("/Volumes/path/*.parquet", chunksize='50MB')
cudf_df = dask_df.compute()

I also tried loading with the pyarrow engine, which overflows.

Let me know if you need code to reproduce this, but it should be easy: any Spark-written Parquet dataset should do.
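A minimal sketch of how the mismatch could be narrowed down, assuming the unexpected values show up as nulls and that pyarrow can read the same files on the CPU as a reference. The helper names (`null_counts_pyarrow`, `null_counts_cudf`, `diff_counts`) are hypothetical and only for illustration; the path is the placeholder used above.

```python
import glob


def null_counts_pyarrow(path):
    """Per-column null counts as read on the CPU by pyarrow (the reference)."""
    import pyarrow.parquet as pq  # lazy import: only needed when actually reading

    table = pq.read_table(path)
    return {name: table.column(name).null_count for name in table.column_names}


def null_counts_cudf(path):
    """Per-column null counts as read on the GPU by cudf."""
    import cudf  # lazy import: requires a GPU environment

    df = cudf.read_parquet(path)
    return {name: int(df[name].isnull().sum()) for name in df.columns}


def diff_counts(reference, candidate):
    """Return {column: (reference_count, candidate_count)} where the two disagree."""
    return {
        col: (reference.get(col), candidate.get(col))
        for col in sorted(set(reference) | set(candidate))
        if reference.get(col) != candidate.get(col)
    }


if __name__ == "__main__":
    # Placeholder path from the report; adjust to the real volume location.
    for path in sorted(glob.glob("/Volumes/path/*.parquet")):
        mismatches = diff_counts(null_counts_pyarrow(path), null_counts_cudf(path))
        if mismatches:
            print(path, mismatches)
```

Any file and column printed by this loop is one where cudf reports a different number of nulls than pyarrow, which would help isolate a single small file for a reproducer.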

matt7salomon added the "bug" (Something isn't working) label on Oct 1, 2024
