[QST] How can I fit a Workflow to a large dataset? #1761
Comments
I am facing the same problem. I am running the merlin ts container using this command:
@Azilyss may I ask how you are creating your nvt.Dataset? You can always set a smaller part_size; you can change part_size when constructing the dataset.
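(For readers hitting the same issue, a minimal sketch of what this suggestion amounts to; the paths and the 128MB value below are placeholders, not values from this thread.)

```python
import nvtabular as nvt

# Ask NVTabular's dask backend to read the parquet data in smaller partitions
# so that each partition fits comfortably in GPU memory. The part_size value
# here is a placeholder; tune it to your GPU.
dataset = nvt.Dataset(
    ["part.0.parquet", "part.1.parquet"],  # placeholder paths
    engine="parquet",
    part_size="128MB",
)
```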
Hi @rnyak, setting part_size as follows does not help:
@Azilyss where do you get the OOM? During workflow.fit()? You can simply read one parquet file via nvt.Dataset and test with that. And please be sure the GPU memory is free when you start using NVTabular. Does your notebook/script have any torch imports?
Hi @rnyak, yes, the error occurs during workflow.fit(), at:
The GPU memory consumption is 246MB for a single uncompressed file. I am using PyTorch for this project, but this script does not import torch. Using part_size = "1000MB", workflow.fit() still fails at Categorify.
@Azilyss can you try to feed only 1 parquet file to NVT and see if you are getting OOM? Also, does your dataset have very long list/session lengths?
Hi @rnyak, I am not getting OOM when loading only one file. The dataset contains np.int64 columns item_id and session_id. The NVT workflow mainly categorifies these 2 columns and groups the item_ids by session_id into a list. A session typically has 60 items, there are about 200k sessions per part, and 200k unique items in the entire dataset. The workflow code is as follows:

```python
import nvtabular as nvt
from merlin.dag import ColumnSelector
from merlin.schema import Schema, Tags
from nvtabular import ops

input_paths = ["part.0.parquet", "part.1.parquet", ..., "part.99.parquet"]
output_path = "output"
input_features = ["session_id", "item_id"]
max_len = 200
min_len = 2

cat_features = (
    ColumnSelector(input_features)
    >> ops.Categorify()
    >> nvt.ops.AddMetadata(tags=[Tags.CATEGORICAL])
)

groupby_features = cat_features >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={"item_id": ["list", "count"]},
    name_sep="-",
)

seq_feats_list = (
    groupby_features["item_id-list"]
    >> nvt.ops.ListSlice(-max_len, pad=True, pad_value=0)
    >> nvt.ops.Rename(postfix="_seq")
    >> nvt.ops.AddMetadata(tags=[Tags.LIST])
    >> nvt.ops.ValueCount()
)
seq_feats_list = seq_feats_list >> nvt.ops.AddMetadata(tags=[Tags.ITEM, Tags.ID, Tags.ITEM_ID])

selected_features = seq_feats_list + groupby_features["item_id-count"]

features = selected_features >> nvt.ops.Filter(
    f=lambda df: df["item_id-count"] >= min_len
)

dataset = nvt.Dataset(input_paths, engine="parquet", part_size="1000MB")
dataset = dataset.shuffle_by_keys(keys=["session_id"])

workflow = nvt.Workflow(features)
workflow.fit(dataset)
```
@Azilyss can you please set a smaller max_len, since you mentioned sessions typically have around 60 items?
Apologies, I meant that most sessions have around 60 items, but a session can contain between 2 and 200 items.
@Azilyss you can still reduce the max_len and the part_size and see if that helps. Is your data loaded from a local disk?
Hi @rnyak, I did, but without success. The data is loaded from a local disk. I modified the code to do the shuffle_by_keys with dask beforehand, so it does not pile up with the other operations. At this point, I am not encountering an OOM error if I remove the ValueCount() operation from the workflow, but the schema is then generated without the valuecount.min and valuecount.max fields, which in this example should be:
Is there any reason why this operation would be costly in this case, and how could these fields be generated otherwise, given that the sequence length is constant thanks to the ListSlice padding?
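(A minimal sketch of what "doing the shuffle_by_keys beforehand" can look like; this is an assumed reconstruction rather than the exact code from the comment, and the output directory is a placeholder. It reuses input_paths, features, and workflow from the snippet above.)

```python
import nvtabular as nvt

# Shuffle the raw parquet files by session_id in a separate pass and persist
# the result, so the expensive shuffle does not run in the same dask graph as
# Categorify/Groupby during workflow.fit().
raw = nvt.Dataset(input_paths, engine="parquet", part_size="1000MB")
raw.shuffle_by_keys(keys=["session_id"]).to_parquet("shuffled_parts/")  # placeholder output dir

# Fit the workflow on the pre-shuffled data afterwards.
workflow = nvt.Workflow(features)
workflow.fit(nvt.Dataset("shuffled_parts/", engine="parquet"))
```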
@Azilyss I am not sure why this is required. With the latest changes in the repo, if you are using ListSlice with padding, the value counts should already be set in the output schema.
I got the same issue when I try to load data with nvt.Dataset(data_path). I used the way you suggested to check num_row_groups:

```python
DATA_FOLDER = os.environ.get("DATA_FOLDER", "/home/ec2-user/SageMaker/p2p_two_tower_data/")
num_rows, num_row_groups, names = cudf.io.read_parquet_metadata(DATA_FOLDER + "raw_train/data.parquet")
```

So, as you were saying, the row-group memory size is pretty big. Note that we expect the row-group memory size to be smaller than part_size. I then tried to use part_size as [1] suggests, but it does not work; I got an error:

```python
DATA_FOLDER = os.environ.get("DATA_FOLDER", "/home/ec2-user/SageMaker/p2p_two_tower_data/")
train_dataset = nvt.Dataset(DATA_FOLDER + "/raw_train/*.parquet", part_size="256MB")
```

Anything I can try? I would be excited to use more data for training and see if it works.
Note:
I solved my issue by doing the steps below.

Step 1: set the row group size of the raw parquet files. You can use most DataFrame frameworks to set the row group size (number of rows) for your parquet files. In the Pandas and cuDF examples from [1], row_group_size is the number of rows that will be stored in each row group (an internal structure within the parquet file).

Step 2: change part_size when creating the nvt.Dataset.

[1] https://nvidia-merlin.github.io/NVTabular/v0.7.0/resources/troubleshooting.html
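(The Pandas/cuDF snippets referenced in the comment above were stripped during export; the sketch below follows the idea in the linked troubleshooting document. The file paths and the 10000-row value are placeholder assumptions, not values from this thread; recent cuDF versions expose a similar row_group_size_rows option on cudf.DataFrame.to_parquet.)

```python
import pandas as pd

# Rewrite the raw parquet file with smaller row groups (10000 rows per group
# is a placeholder value) so that each row group stays well below the
# part_size later passed to nvt.Dataset.
df = pd.read_parquet("raw_train/data.parquet")  # placeholder path
df.to_parquet(
    "raw_train/data_small_rg.parquet",
    engine="pyarrow",
    row_group_size=10000,
)
```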
Yes, one can reset the row_group_size when writing out the parquet files. Note that we expect the row-group memory size to be smaller than the part_size given to nvt.Dataset. Please check the troubleshooting document linked above for further info.
Yes, sessions could be divided between different parquet files, but this is no longer the case. I had installed nvtabular==1.8.1 via pip, and the ListSlice() changes are not included in that release. Removing the ValueCount() operation actually solved my OOM issue. I did not see any direct improvement from increasing row_group_size in the raw parquet files nor from specifying a part_size in nvt.Dataset, but I did take these suggestions into account when resolving this issue. Thanks for your time and helpful suggestions during this resolution.
Hi all, I got similar OOM issues. I first encountered this problem when building the dataset (nvt.Dataset) and solved it by adding row_group_size at that stage. But in the model training stage I encountered the same problem again: when using model.fit(train_dataset, validation_data=valid_dataset, batch_size=128), the OOM error appeared again. Any suggestions would be appreciated!
@ShengyuanSW can you please give us some info about your HW, GPU memory size, and your env? Are you getting OOM when you run your NVTabular workflow, or when you train your model? What you shared above looks like an error from model training; is that correct? If yes, what model is that?
3. Two-Tower model and DLRM in Merlin Models. Thanks!
@ShengyuanSW thanks. A couple of follow-up questions:
Just to confirm, do you have ~800 input features, and are you using the Two-Tower model? How many input features are you feeding to the user tower and the item tower? Do you mind sharing your model script, please?
You are able to run the NVTabular pipeline, is that correct? You can process your dataset and then export your processed parquet files, is that correct?
I used user embeddings and item embeddings; each embedding is a 384-dimensional BERT embedding, which I split into 384 columns. Besides these, there are some other user and item properties used as features.
About 11GB.
Yes, the NVTabular pipeline runs with no error, and exporting and importing the data works with no error. I only get the OOM error in the Two-Tower model.fit().
@ShengyuanSW can you calculate the model size in GB? What's the size of the embedding tables? Can you share your schema file, please? Also be sure you are adding this line at the very beginning of your TT model code before doing anything else:
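(Since the question above asks for the model and embedding-table size, here is a rough back-of-the-envelope sketch of how that can be estimated; the cardinalities and dimension below are placeholder assumptions, not numbers from this thread.)

```python
# Rough estimate: a float32 embedding table holds cardinality x dim values,
# i.e. cardinality * dim * 4 bytes. Placeholder numbers below.
def embedding_table_gb(cardinality: int, dim: int, bytes_per_value: int = 4) -> float:
    return cardinality * dim * bytes_per_value / 1e9

user_table_gb = embedding_table_gb(cardinality=5_000_000, dim=64)  # ~1.28 GB
item_table_gb = embedding_table_gb(cardinality=1_000_000, dim=64)  # ~0.26 GB
print(f"user table ~{user_table_gb:.2f} GB, item table ~{item_table_gb:.2f} GB")
```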
Closing this issue due to low activity. Please reopen if you still have an issue.
Encountered the same problem: .fit() cannot handle a large dataset.
@Chevolier thanks for your interest in Merlin. We would need more information than that to be able to help. If your embedding table cannot fit in a single GPU's memory, you will naturally run into OOM.
What is the best way to fit a Workflow to a large dataset?
When fitting the workflow, I run into an OOM error:
Setup
GPU - 1 x NVIDIA Tesla V100
Docker image : nvcr.io/nvidia/merlin/merlin-pytorch:22.12
Platform: Ubuntu 20.04
Python version: 3.8
Dataset
1.7M rows - 5 columns
4GB
Operations include nvt.ops.Categorify() with 250k unique items in a column.
I have come across this troubleshooting document but did not succeed in implementing a working solution. Is setting up a LocalCUDACluster necessary to handle this issue?
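(For reference, a minimal sketch of spinning up a LocalCUDACluster for a single V100 and running the workflow against it. The memory limits are placeholder assumptions, and input_paths/features stand in for the dataset and ops defined elsewhere in this thread; depending on the NVTabular version, the dask client is either picked up automatically or passed to the Workflow explicitly.)

```python
import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker for the single GPU; device_memory_limit caps GPU usage so dask
# spills device->host instead of running out of memory. Sizes are placeholders
# for a 16GB V100; tune them to your hardware.
cluster = LocalCUDACluster(n_workers=1, device_memory_limit="12GB")
client = Client(cluster)  # NVTabular's dask backend can use this active client
                          # (older releases also accepted nvt.Workflow(features, client=client))

dataset = nvt.Dataset(input_paths, engine="parquet", part_size="256MB")
workflow = nvt.Workflow(features)
workflow.fit(dataset)
```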