[BUG] When dataset increases, NVTabular workflow causes a dtype discrepancy #1790

When I increase the dataset duration from 28 days to 91 days, I'm getting a dtype discrepancy. The same code works with 28 days.

What would be the best way to troubleshoot this issue? Should I share the pkl files for both workflows?

Comments
@ldane hello. Do you mind clarifying the original dtype of the column in question? If you can share the workflow script that gives you this error, together with the parquet file for the day that triggers it, we can take a look. You can send them via email. Thanks.
@rnyak, this is coming from the NVT workflow. My training completes successfully (NVTabular + T4Rec). After training, I get this error while doing inference: I load the workflow and the dataset and run a transform operation on a pandas DataFrame. Is there any way to figure out which row causes this exception? Maybe I could somehow increase the verbosity of the logs generated by NVTabular? I'm attaching some files for your consideration (ETL-code.ipynb.zip), along with the full stack trace.
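For reference, a minimal sketch of the inference-time flow described above, assuming the workflow was saved during ETL with `workflow.save()`; the paths and file names are hypothetical:

```python
import pandas as pd
import nvtabular as nvt

# Load the workflow that was fit during ETL (hypothetical path).
workflow = nvt.Workflow.load("/models/nvt_workflow")

# One day of raw data, already in memory as a pandas DataFrame (hypothetical path).
raw_df = pd.read_parquet("/data/inference/day.parquet")

# Transform with the fitted workflow; this is where the reported dtype-discrepancy
# error would surface if the incoming dtypes do not match what the workflow
# saw at fit time.
transformed = workflow.transform(nvt.Dataset(raw_df)).to_ddf().compute()
print(transformed.dtypes)
```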
@ldane From your workflow code I see that you are trying to use the feed_date column in the workflow. I see you have it included among the selected columns.
@rnyak I shouldn't be using the feed_date column at all; it is only there to keep track of the data partitions. If I were applying any operations to it, I could understand how it might cause issues. Are you saying that even though I don't do any modifications, it is not advised to use a SelectOperator with unsupported data types? Is there any documentation that shows which data types are supported by NVTabular and Triton?
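To make the pass-through scenario concrete, a sketch (not the actual notebook code) of how a column like feed_date can end up in a workflow without any op applied to it; the feature names and ops are illustrative:

```python
import nvtabular as nvt
from nvtabular import ops

# Features that are actually transformed (illustrative names).
cat_features = ["item_id", "category"] >> ops.Categorify()

# feed_date is only selected: no op touches it, but it still becomes part of
# the workflow's schema, so its dtype is recorded at fit time and checked
# again when transform() is called on new data.
workflow = nvt.Workflow(cat_features + ["feed_date"])
```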
There are two points to be considered here: …

Hope that helps. @karlhigley, would you like to add anything else?
@rnyak I'll explain it better. In these two cases (28 days vs. 91 days), the test set is exactly the same. There aren't multiple files; I'm using only one day's worth of data to verify that inference is working. The data is retrieved and kept in memory as a pandas DataFrame. I verified that all rows of feed_date have exactly the same value, so feed_date isn't corrupted as you described. Let's see if we can find a way to figure out what is going on before trying to solve this with workarounds.
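A quick way to double-check that on the in-memory DataFrame; the path is hypothetical, and only the feed_date column name comes from the thread:

```python
import pandas as pd

# The single day of test data held in memory as a pandas DataFrame (hypothetical path).
raw_df = pd.read_parquet("/data/inference/day.parquet")

print(raw_df["feed_date"].dtype)         # e.g. object (string) vs. dbdate
print(raw_df["feed_date"].nunique())     # a single day should give 1
print(raw_df["feed_date"].isna().sum())  # nulls here would also be suspicious
```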
@ldane thanks for the extra info. Can you please confirm my understanding: your ETL pipeline reads data (parquet files) from a path like the one in your notebook. Can you please print and share the train data dtypes and the sessions_gdf dtypes here? I am asking these questions so that I can generate an example pipeline with a synthetic dataset to reproduce your error.
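A sketch of how that output could be produced; the path is hypothetical, and sessions_gdf refers to the session-grouped frame built later in the notebook:

```python
import pandas as pd

# Raw training data as read by the ETL pipeline (hypothetical path).
train = pd.read_parquet("/data/train/")
print(train.dtypes)

# In the notebook, sessions_gdf is built later in the ETL; printing its dtypes
# at that point gives the second piece of requested output:
# print(sessions_gdf.dtypes)
```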
@ldane thanks. Your raw train data feed_date column is a string. I am not a SQL person, but does that mean you are generating your feed_date column in the test set with a SQL DATE type?

As a debugging step, do you mind removing the feed_date column from the workflow and trying again?

(Note: in your ETL notebook you are not filtering the sequences based on min_seq_length. A Transformer-based model requires at least two items per session for training and eval steps. I assume you don't have sessions with fewer than 2 items, and that's why you don't add the filtering step in the NVT workflow, but I just wanted to mention it in any case.)
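For the filtering note above, a sketch of the kind of min_seq_length step that typical Transformers4Rec ETL examples use; the GroupBy configuration and the item_id-count column name are assumptions here, not taken from the notebook:

```python
import nvtabular as nvt
from nvtabular import ops

MINIMUM_SESSION_LENGTH = 2  # Transformer-based models need >= 2 items per session

# Group the interactions into sessions (illustrative configuration).
groupby_features = ["session_id", "item_id", "timestamp"] >> ops.Groupby(
    groupby_cols=["session_id"],
    aggs={"item_id": ["list", "count"], "timestamp": ["list"]},
    sort_cols=["timestamp"],
)

# Keep only sessions with at least MINIMUM_SESSION_LENGTH interactions.
filtered_sessions = groupby_features >> ops.Filter(
    f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH
)

workflow = nvt.Workflow(filtered_sessions)
```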
@rnyak Yes, the feed_date column in the train data set is a string; I'm generating that field as a string from BigQuery. On my test dataset, I'm using the DATE data type, so in the data frame feed_date has a dtype of dbdate. I've just tried casting feed_date to string, and the 91-day case now works as well. Now my question is: how does the 28-day case work with the dbdate dtype without causing the discrepancy, and how does the workflow for 91 days check the dtypes and fail?
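For reference, a sketch of the cast described above applied before the transform; the paths are hypothetical:

```python
import pandas as pd
import nvtabular as nvt

raw_df = pd.read_parquet("/data/inference/day.parquet")

# BigQuery DATE columns arrive as the db-dtypes "dbdate" extension dtype;
# casting to string matches the dtype the workflow was fit on for feed_date.
raw_df["feed_date"] = raw_df["feed_date"].astype("str")

workflow = nvt.Workflow.load("/models/nvt_workflow")
transformed = workflow.transform(nvt.Dataset(raw_df)).to_ddf().compute()
```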