In my testing with large datasets, there is at least one array of objects that is not being reported with --schema when the array begins on line 1,326,612,715 out of 1,495,055,188 lines in the 11GB file.
Is it possible that schema only reviews the first X lines or bytes of a file? If so, is there any way that I can override that?
Reproduction steps
With an 11GB (or larger) file:
dsq --schema --pretty LARGE_FILE.json
Versions
OS: Ubuntu 22.04 LTS, AMD EPYC 7R32
Shell: bash
dsq version: dsq 0.20.2 from apt
Hey! Thanks for the report. Yeah, datastation/dsq samples the file to get reasonable performance. Maybe it makes sense to sample more of a large file, but then performance is going to get much worse. Overall I don't yet have a great strategy for dealing with very large files.
Before I discovered Datastation, I had imagined building my own tool that would stream-read the file and, whenever it saw an array, read only the first 3 of the array's children into memory, then count but discard all other objects in the array until it captured the last 3.
The flaw in my plan is that if an array child did not conform to the structure of the first and last 3 in the array, my report would not include it in the schema. It would, however, have found the schema element that datastation/dsq is missing.
Perhaps a hybrid of your approach and mine which can be activated by an --array_depth=3 argument?
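To make the head/tail idea above concrete, here is a minimal Python sketch. It is not how datastation/dsq works. The names `sample_head_tail` and `infer_keys` are hypothetical, the "schema" is reduced to a set of object keys, and the input is assumed to be any iterable of already-parsed array children (for example, the output of a streaming JSON parser).

```python
from collections import deque
from itertools import islice


def sample_head_tail(items, k=3):
    """Return (head, tail, count) while holding at most 2*k items in memory.

    Hypothetical helper: `items` is any iterable of parsed array children.
    """
    it = iter(items)
    head = list(islice(it, k))   # keep the first k children
    tail = deque(maxlen=k)       # rolling window over the remaining children
    count = len(head)
    for item in it:
        tail.append(item)        # older entries fall out automatically
        count += 1
    return head, list(tail), count


def infer_keys(samples):
    """Union of keys across sampled objects (a crude stand-in for a schema)."""
    keys = set()
    for obj in samples:
        if isinstance(obj, dict):
            keys.update(obj)
    return sorted(keys)


# 1000 children; only the one at index 500 carries an extra "odd" field.
rows = ({"id": i, "odd": True} if i == 500 else {"id": i} for i in range(1000))
head, tail, count = sample_head_tail(rows)
keys = infer_keys(head + tail)   # ['id'], so the "odd" field is missed
```

The array itself is discovered and counted no matter how late it starts in the file, but the non-conforming child in the middle is not reflected in the inferred keys, which is the flaw described above.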