[QST] How can I fit a Workflow to a large dataset? #1761
Comments
I am facing the same problem. I am running the merlin ts container using this command:
@Azilyss may I ask how you are creating your nvt.Dataset? You can always set a smaller part_size; you can change part_size when constructing the dataset.
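(For readers hitting the same issue, a minimal sketch of what this suggestion amounts to; the paths and the 128MB value below are placeholders, not values from this thread.)

```python
import nvtabular as nvt

# Ask NVTabular's dask backend to read the parquet data in smaller partitions
# so that each partition fits comfortably in GPU memory. The part_size value
# here is a placeholder; tune it to your GPU.
dataset = nvt.Dataset(
    ["part.0.parquet", "part.1.parquet"],  # placeholder paths
    engine="parquet",
    part_size="128MB",
)
```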
Hi @rnyak, setting part_size as follows does not help:
@Azilyss where do you get the OOM? During workflow.fit()? You can simply read one parquet file via nvt.Dataset and test with that. And please be sure the GPU memory is free when you start using NVTabular. Does your notebook/script have any torch imports?
Hi @rnyak, yes, the error occurs during workflow.fit(), at:
The GPU memory consumption is 246MB for a single uncompressed file. I am using PyTorch for this project, but this script does not import torch. Using part_size = "1000MB", workflow.fit() still fails at Categorify.
@Azilyss can you try to feed only 1 parquet file to NVT and see if you are getting OOM? Also, does your dataset have very long list/session lengths?
Hi @rnyak, I am not getting OOM when loading only one file. The dataset contains np.int64 columns item_id and session_id. The NVT workflow mainly categorifies these 2 columns and groups the item_ids by session_id into a list. A session typically has 60 items, there are about 200k sessions per part, and 200k unique items in the entire dataset. The workflow code is as follows:

```python
import nvtabular as nvt
from merlin.dag import ColumnSelector
from merlin.schema import Schema, Tags
from nvtabular import ops

input_paths = ["part.0.parquet", "part.1.parquet", ..., "part.99.parquet"]
output_path = "output"
input_features = ["session_id", "item_id"]
max_len = 200
min_len = 2

cat_features = (
    ColumnSelector(input_features)
    >> ops.Categorify()
    >> nvt.ops.AddMetadata(tags=[Tags.CATEGORICAL])
)

groupby_features = cat_features >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={"item_id": ["list", "count"]},
    name_sep="-",
)

seq_feats_list = (
    groupby_features["item_id-list"]
    >> nvt.ops.ListSlice(-max_len, pad=True, pad_value=0)
    >> nvt.ops.Rename(postfix="_seq")
    >> nvt.ops.AddMetadata(tags=[Tags.LIST])
    >> nvt.ops.ValueCount()
)
seq_feats_list = seq_feats_list >> nvt.ops.AddMetadata(tags=[Tags.ITEM, Tags.ID, Tags.ITEM_ID])

selected_features = seq_feats_list + groupby_features["item_id-count"]

features = selected_features >> nvt.ops.Filter(
    f=lambda df: df["item_id-count"] >= min_len
)

dataset = nvt.Dataset(input_paths, engine="parquet", part_size="1000MB")
dataset = dataset.shuffle_by_keys(keys=["session_id"])

workflow = nvt.Workflow(features)
workflow.fit(dataset)
```
@Azilyss can you please set a smaller max_len, since you mentioned sessions typically have around 60 items?
Apologies, I meant that most sessions have around 60 items, but a session can contain between 2 and 200 items.
@Azilyss you can still reduce the max_len and the part_size and see if that helps. Is your data loaded from a local disk?
Hi @rnyak, I did, but without success. The data is loaded from a local disk. I modified the code to do the shuffle_by_keys with dask beforehand, so it does not pile up with the other operations. At this point, I am not encountering an OOM error if I remove the ValueCount() operation from the workflow, but the schema is then generated without the valuecount.min and valuecount.max fields, which in this example should be:
Is there any reason why this operation would be costly in this case, and how could these fields be generated otherwise, given that the sequence length is constant thanks to the ListSlice padding?
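(A minimal sketch of what "doing the shuffle_by_keys beforehand" can look like; this is an assumed reconstruction rather than the exact code from the comment, and the output directory is a placeholder. It reuses input_paths, features, and workflow from the snippet above.)

```python
import nvtabular as nvt

# Shuffle the raw parquet files by session_id in a separate pass and persist
# the result, so the expensive shuffle does not run in the same dask graph as
# Categorify/Groupby during workflow.fit().
raw = nvt.Dataset(input_paths, engine="parquet", part_size="1000MB")
raw.shuffle_by_keys(keys=["session_id"]).to_parquet("shuffled_parts/")  # placeholder output dir

# Fit the workflow on the pre-shuffled data afterwards.
workflow = nvt.Workflow(features)
workflow.fit(nvt.Dataset("shuffled_parts/", engine="parquet"))
```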
@Azilyss I am not sure why this is required. With the latest changes in the repo, if you are using ListSlice with padding, the value counts should already be set in the output schema.
I got the same issue when I try to load data with nvt.Dataset(data_path). I used the way you suggested to check num_row_groups:

```python
DATA_FOLDER = os.environ.get("DATA_FOLDER", "/home/ec2-user/SageMaker/p2p_two_tower_data/")
num_rows, num_row_groups, names = cudf.io.read_parquet_metadata(DATA_FOLDER + "raw_train/data.parquet")
```

So, as you were saying, the row-group memory size is pretty big. Note that we expect the row-group memory size to be smaller than part_size. I then tried to use part_size as [1] suggests, but it does not work; I got an error:

```python
DATA_FOLDER = os.environ.get("DATA_FOLDER", "/home/ec2-user/SageMaker/p2p_two_tower_data/")
train_dataset = nvt.Dataset(DATA_FOLDER + "/raw_train/*.parquet", part_size="256MB")
```

Anything I can try? I would be excited to use more data for training and see if it works.
Note:
I solved my issue by doing the steps below.

Step 1: set the row group size of the raw parquet files. You can use most DataFrame frameworks to set the row group size (number of rows) for your parquet files. In the Pandas and cuDF examples from [1], row_group_size is the number of rows that will be stored in each row group (an internal structure within the parquet file).

Step 2: change part_size when creating the nvt.Dataset.

[1] https://nvidia-merlin.github.io/NVTabular/v0.7.0/resources/troubleshooting.html
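(The Pandas/cuDF snippets referenced in the comment above were stripped during export; the sketch below follows the idea in the linked troubleshooting document. The file paths and the 10000-row value are placeholder assumptions, not values from this thread; recent cuDF versions expose a similar row_group_size_rows option on cudf.DataFrame.to_parquet.)

```python
import pandas as pd

# Rewrite the raw parquet file with smaller row groups (10000 rows per group
# is a placeholder value) so that each row group stays well below the
# part_size later passed to nvt.Dataset.
df = pd.read_parquet("raw_train/data.parquet")  # placeholder path
df.to_parquet(
    "raw_train/data_small_rg.parquet",
    engine="pyarrow",
    row_group_size=10000,
)
```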
Yes, one can reset the row_group_size when writing out the parquet files. Note that we expect the row-group memory size to be smaller than the part_size given to nvt.Dataset. Please check the troubleshooting document linked above for further info.
Yes, sessions could be divided between different parquet files, but this is no longer the case. I had installed nvtabular==1.8.1 via pip, and the ListSlice() changes are not included in that release. Removing the ValueCount() operation actually solved my OOM issue. I did not see any direct improvement from increasing row_group_size in the raw parquet files nor from specifying a part_size in nvt.Dataset, but I did take these suggestions into account when resolving this issue. Thanks for your time and helpful suggestions during this resolution.
Hi all, I got similar OOM issues. I first encountered this problem when building the dataset (nvt.Dataset) and solved it by adding row_group_size at that stage. But in the model training stage I encountered the same problem again: when using model.fit(train_dataset, validation_data=valid_dataset, batch_size=128), the OOM error appeared again. Any suggestions would be appreciated!
@ShengyuanSW can you please give us some info about your HW, GPU memory size, and your env? Are you getting OOM when you run your NVTabular workflow, or when you train your model? What you shared above looks like an error from model training; is that correct? If yes, what model is that?
3. Two-Tower model and DLRM in Merlin Models. Thanks!
@ShengyuanSW thanks. A couple of follow-up questions:
Just to confirm, do you have ~800 input features, and are you using the Two-Tower model? How many input features are you feeding to the user tower and the item tower? Do you mind sharing your model script, please?
You are able to run the NVTabular pipeline, is that correct? You can process your dataset and then export your processed parquet files, is that correct?
I used user embeddings and item embeddings; each embedding is a 384-dimensional BERT embedding, which I split into 384 columns. Besides these, there are some other user and item properties used as features.
About 11GB.
Yes, the NVTabular pipeline runs with no error, and exporting and importing the data works with no error. I only get the OOM error in the Two-Tower model.fit().
@ShengyuanSW can you calculate the model size in GB? What's the size of the embedding tables? Can you share your schema file, please? Also be sure you are adding this line at the very beginning of your TT model code before doing anything else:
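(Since the question above asks for the model and embedding-table size, here is a rough back-of-the-envelope sketch of how that can be estimated; the cardinalities and dimension below are placeholder assumptions, not numbers from this thread.)

```python
# Rough estimate: a float32 embedding table holds cardinality x dim values,
# i.e. cardinality * dim * 4 bytes. Placeholder numbers below.
def embedding_table_gb(cardinality: int, dim: int, bytes_per_value: int = 4) -> float:
    return cardinality * dim * bytes_per_value / 1e9

user_table_gb = embedding_table_gb(cardinality=5_000_000, dim=64)  # ~1.28 GB
item_table_gb = embedding_table_gb(cardinality=1_000_000, dim=64)  # ~0.26 GB
print(f"user table ~{user_table_gb:.2f} GB, item table ~{item_table_gb:.2f} GB")
```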
Closing this issue due to low activity. Please reopen if you still have an issue.
Encountered the same problem: .fit() cannot handle a large dataset.
@Chevolier thanks for your interest in Merlin. We would need more information than that to be able to help. If your embedding table cannot fit in a single GPU's memory, you will naturally run into OOM.
What is the best way to fit a Workflow to a large dataset?
When fitting the workflow, I run into an OOM error:
Setup
GPU - 1 x NVIDIA Tesla V100
Docker image : nvcr.io/nvidia/merlin/merlin-pytorch:22.12
Platform: Ubuntu 20.04
Python version: 3.8
Dataset
1.7M rows - 5 columns
4GB
Operations include nvt.ops.Categorify() with 250k unique items in a column.
I have come across this troubleshooting document but did not succeed in implementing a working solution. Is setting up a LocalCUDACluster necessary to handle this issue?
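(For reference, a minimal sketch of spinning up a LocalCUDACluster for a single V100 and running the workflow against it. The memory limits are placeholder assumptions, and input_paths/features stand in for the dataset and ops defined elsewhere in this thread; depending on the NVTabular version, the dask client is either picked up automatically or passed to the Workflow explicitly.)

```python
import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker for the single GPU; device_memory_limit caps GPU usage so dask
# spills device->host instead of running out of memory. Sizes are placeholders
# for a 16GB V100; tune them to your hardware.
cluster = LocalCUDACluster(n_workers=1, device_memory_limit="12GB")
client = Client(cluster)  # NVTabular's dask backend can use this active client
                          # (older releases also accepted nvt.Workflow(features, client=client))

dataset = nvt.Dataset(input_paths, engine="parquet", part_size="256MB")
workflow = nvt.Workflow(features)
workflow.fit(dataset)
```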