Pattern for organizing (large) datasets into data packages #662
Replies: 2 comments
-
I didn't see a way in the Tabular Data Resource/Package specifications to link tables across data packages with foreign key relationships. If that's true, then a data package describing a set of interlinked tabular resources would have to keep all of those resources in the same package, partitioning them into separate, manageable files if it wanted to preserve the linkage information. Does that work cleanly as things stand? I also notice that multi-file support is part of the Data Resource spec, rather than the Tabular Data Resource spec.
Semantic partitioning would be most useful for letting users pick and choose only the chunks of data they want, and it might also reduce how often any individual file needs updating (e.g. all the older years of data are probably fairly static, with only the newest year getting updated regularly). Row-based chunking, on the other hand, gives you evenly sized files that can be easier to deal with in some cases. It would also be good to provide a little guidance on how big a file/resource/package should be. Is the maximum 100MB per file? 1GB per file? The overall Data Package / DataHub "small to medium (MB to GB)" guideline doesn't make this very clear.
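To make the "all linked resources in one package, partitioned into manageable files" idea concrete, here is a minimal sketch of a single-package descriptor (shown as a plain Python dict; all names, paths, and fields are hypothetical) where yearly partitions of one table each carry a foreign key to a shared lookup table in the same package:

```python
# Hypothetical single data package: three yearly partitions of a "generation"
# table, each linked by foreign key to one shared "plants" lookup table.
years = [2015, 2016, 2017]

descriptor = {
    "name": "generation-data",
    "resources": [
        {
            "name": f"generation-{y}",
            "path": f"data/generation-{y}.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "date", "type": "date"},
                    {"name": "mwh", "type": "number"},
                ],
                # Every yearly partition points at the shared lookup table,
                # so the linkage survives the file-level partitioning.
                "foreignKeys": [
                    {
                        "fields": "plant_id",
                        "reference": {"resource": "plants", "fields": "plant_id"},
                    }
                ],
            },
        }
        for y in years
    ]
    + [
        {
            "name": "plants",
            "path": "data/plants.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "state", "type": "string"},
                ]
            },
        }
    ],
}

print(len(descriptor["resources"]))  # 4 resources: 3 partitions + 1 lookup table
```

The point is just that nothing in the partitioning prevents each partition from declaring the same foreign key, since `foreignKeys` lives on each resource's schema.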
-
@zaneselvans there is an approach for linking across data packages as detailed in "Table Schema: Foreign Keys to Data Packages" in http://frictionlessdata.io/specs/patterns/.
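As I read that pattern, the foreign-key `reference` object gains a `package` property identifying the external data package that holds the referenced resource. A sketch (the URL and names here are made up for illustration):

```python
# Sketch of the "Foreign Keys to Data Packages" pattern: the "reference"
# object names an external package via a "package" property (a URL or a
# registered package name), alongside the usual "resource" and "fields".
schema = {
    "fields": [{"name": "plant_id", "type": "integer"}],
    "foreignKeys": [
        {
            "fields": "plant_id",
            "reference": {
                # Hypothetical URL of the external package's descriptor:
                "package": "https://example.org/plants/datapackage.json",
                "resource": "plants",
                "fields": "plant_id",
            },
        }
    ],
}
```

Check the patterns page itself for the authoritative shape; this is only a reading of it.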
Yes on these points. Would you like to have a go at a proposal?
-
As a user, I want to know patterns (and best practices) for structuring (large) datasets as data packages, so that I can follow established best practice and a common approach.
Example question: suppose I have a 5GB time series dataset of daily rainfall observations, covering 30 years at 10k geographic locations (grouped by locality, then state, then country). How should it be split into resources and packages?
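One plausible answer for the rainfall example, sketched below with hypothetical names and layout, is to partition semantically by (country, year): users can fetch only the slices they need, and resources for old years stay static while only the current year's resource gets updated.

```python
# Hypothetical semantic partitioning of the 5GB rainfall dataset:
# one resource per (country, year) pair.
countries = ["AU", "NZ"]
years = range(1990, 2020)  # 30 years of daily observations

resources = [
    {
        "name": f"rainfall-{c.lower()}-{y}",
        "path": f"data/{c}/rainfall-{y}.csv",
    }
    for c in countries
    for y in years
]

print(len(resources))  # 2 countries x 30 years = 60 resources
```

Finer grouping levels (state, locality) could become directory levels in `path` rather than separate resources, to keep the resource count manageable.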
See also the support for chunking/partitioning resources already in data packages frictionlessdata/datapackage#228
Research & Reading
This idea of partitioning shares much in common with partitioning in databases (or, more accurately, database tables).
Essentially we are asking what partitioning criteria people should use to split their dataset into resources (or even to split individual resources into files).
See https://en.wikipedia.org/wiki/Partition_(database)
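In database terms, the two options discussed above map onto range partitioning (split on a meaningful column such as year) versus fixed-size chunking (split every N rows). A toy sketch of the row-count variant:

```python
# Even, row-count-based chunking: uniform file sizes, but chunk boundaries
# carry no meaning (contrast with range partitioning by e.g. year).
def chunk_by_rows(n_rows, chunk_size):
    """Return (start, end) row index pairs covering n_rows in even chunks."""
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]

print(chunk_by_rows(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

A pattern proposal would presumably need to say when each criterion is appropriate, and how the chosen criterion is recorded in the descriptor.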