Describe the bug
Reading a parquet dataset on GPU throws a "cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']" error, while reading the same data on CPU succeeds:
distributed.worker - WARNING - Compute Failed
Key: _sample_row_group-1d48e09b-5b56-4e62-92ec-860ff2f9dd40
Function: execute_task
args: ((<function apply at 0x7f6b6e47f010>, <function _sample_row_group at 0x7f692006f130>, ['path/to/parquet_files/000000000000.parquet', <gcsfs.core.GCSFileSystem object at 0x7f6ae65ec6d0>], (<class 'dict'>, [['cpu', False], ['memory_usage', True]])))
kwargs: {}
Exception: 'ValueError("cudf engine doesn\'t support the following keyword arguments: [\'strings_to_categorical\']")'
Steps/Code to reproduce bug
dataset = 'path/to/parquet_files/*.parquet'
dataset_nvt = nvt.Dataset(dataset, engine='parquet', cpu=False)
# Fails with ValueError: cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']
dataset_nvt = nvt.Dataset(dataset, engine='parquet', cpu=True)
# Runs successfully
Expected behavior
The dataset should be read from file under both cpu=True and cpu=False.
Environment details (please complete the following information):
Environment location: GCP vertex ai notebook (GPU: NVIDIA V100 x 1)
Method of NVTabular install: conda
nvtabular == 23.8.00
cudf == 23.10.02 (above error was also present under 23.12.01)
dask == 2023.9.2
workflow = nvt.Workflow.load(f"path/to/workflow")
workflow.transform(dataset_nvt)
# Exception: "TypeError('String Arrays is not yet implemented in cudf')"
Full error:
Key: ('transform-bdc9b5878b9eff9e4e8eb287f652e68a', 63)
Function: subgraph_callable-6a50eb3e-1830-40d8-bff7-0a6db4e7
args: ([<Node SelectionOp>], 'read-parquet-070e46c56ae3f13e04d07d8cae7b3f14', {'piece': ('path/to/parquet_files/000000000000.parquet', None, None)})
kwargs: {}
Exception: "TypeError('String Arrays is not yet implemented in cudf')"
The workflow includes nvt.ops.Categorify and nvt.ops.Groupby operations to create a string array of sequential events per grouped entity.
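Conceptually, that Groupby aggregation yields list-of-string columns (one sequence of events per entity), which is the "string array" dtype cudf rejects. A minimal pandas sketch of the same aggregation shape (column names here are hypothetical, not from the failing workflow):

```python
import pandas as pd

# Toy event log: several events per entity
df = pd.DataFrame({
    "entity_id": [1, 1, 2, 2, 2],
    "event": ["view", "click", "view", "view", "buy"],
})

# Collapse each entity's events into an ordered list of strings --
# the per-group "string array" the workflow produces.
sequences = df.groupby("entity_id")["event"].agg(list).reset_index()
```

On CPU this produces plain Python lists per row; it is the GPU-side representation of such nested string columns that triggers the `TypeError` above.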
Sorry for this ridiculously late response @orlev2 - Just coming across this now.
As far as I can tell, the rapids/dask pinning in Merlin has been far too loose. NVTabular 23.8 was definitely not tested with cudf>=23.08 or dask>=2023.8.
The merlin 23.08 containers use cudf-23.04 (which uses dask-2023.1.1), so using that is your best bet.
NOTE: The lack of upper pinning in NVTabular is indeed a "bug" of sorts - I apologize for that.
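Until tighter pins land, a small guard in the environment-setup script can catch the mismatch early. This is a sketch based on the versions mentioned above (NVTabular 23.8 tested with cudf < 23.08 and dask < 2023.8); the thresholds are inferred from this thread, not official pins:

```python
def _major_minor(version):
    # Compare only the first two numeric components, e.g. "23.04.01" -> (23, 4)
    return tuple(int(part) for part in version.split(".")[:2])

def compatible_with_nvtabular_23_8(cudf_version, dask_version):
    # Assumed upper bounds for NVTabular 23.8, per the maintainer's comment
    return (_major_minor(cudf_version) < (23, 8)
            and _major_minor(dask_version) < (2023, 8))
```

For example, the reporter's combination (cudf 23.10.02, dask 2023.9.2) fails this check, while the merlin 23.08 container's combination (cudf 23.04, dask 2023.1.1) passes.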