Pattern for organizing (large) datasets into data packages #662
Replies: 2 comments
-
I didn't see a way in the Tabular Data Resource/Package specifications to link tables across data packages with foreign key relationships. If that's true, then a data package describing a set of interlinked tabular resources would have to keep all of those resources in the same package, partitioning them into separate, manageable files if it wanted to preserve the linkage information. Does that work cleanly as things stand? I also notice that multi-file support is part of the Data Resource spec, rather than the Tabular Data Resource spec.
Semantic partitioning would be most useful for letting users pick and choose only the chunks of data they want, and it might also reduce how often any individual file needs updating (e.g. all the older years of data are probably fairly static, with only the newest year getting updated regularly). Row-based chunking, on the other hand, gives you evenly sized files that can be easier to deal with in some cases. It would also be good to provide a little guidance on how big a file/resource/package should be. Is the maximum 100MB per file? 1GB per file? The overall Data Package / DataHub "small to medium (MB to GB)" guideline doesn't make this very clear.
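To make the "all linked resources in one package, partitioned into manageable files" idea concrete, here is a minimal sketch of a single-package descriptor (shown as a plain Python dict; all names, paths, and fields are hypothetical) where yearly partitions of one table each carry a foreign key to a shared lookup table in the same package:

```python
# Hypothetical single data package: three yearly partitions of a "generation"
# table, each linked by foreign key to one shared "plants" lookup table.
years = [2015, 2016, 2017]

descriptor = {
    "name": "generation-data",
    "resources": [
        {
            "name": f"generation-{y}",
            "path": f"data/generation-{y}.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "date", "type": "date"},
                    {"name": "mwh", "type": "number"},
                ],
                # Every yearly partition points at the shared lookup table,
                # so the linkage survives the file-level partitioning.
                "foreignKeys": [
                    {
                        "fields": "plant_id",
                        "reference": {"resource": "plants", "fields": "plant_id"},
                    }
                ],
            },
        }
        for y in years
    ]
    + [
        {
            "name": "plants",
            "path": "data/plants.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "state", "type": "string"},
                ]
            },
        }
    ],
}

print(len(descriptor["resources"]))  # 4 resources: 3 partitions + 1 lookup table
```

The point is just that nothing in the partitioning prevents each partition from declaring the same foreign key, since `foreignKeys` lives on each resource's schema.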
-
@zaneselvans there is an approach for linking across data packages as detailed in "Table Schema: Foreign Keys to Data Packages" in http://frictionlessdata.io/specs/patterns/.
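As I read that pattern, the foreign-key `reference` object gains a `package` property identifying the external data package that holds the referenced resource. A sketch (the URL and names here are made up for illustration):

```python
# Sketch of the "Foreign Keys to Data Packages" pattern: the "reference"
# object names an external package via a "package" property (a URL or a
# registered package name), alongside the usual "resource" and "fields".
schema = {
    "fields": [{"name": "plant_id", "type": "integer"}],
    "foreignKeys": [
        {
            "fields": "plant_id",
            "reference": {
                # Hypothetical URL of the external package's descriptor:
                "package": "https://example.org/plants/datapackage.json",
                "resource": "plants",
                "fields": "plant_id",
            },
        }
    ],
}
```

Check the patterns page itself for the authoritative shape; this is only a reading of it.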
Yes on these points. Would you like to have a go at a proposal?
-
As a user, I want to know patterns (and best practices) for structuring (large) datasets as data packages, so that I can follow established best practice and a common approach.
Example question: suppose I have a 5GB time series dataset of daily rainfall observations, covering 30 years at 10k geographic locations (grouped by locality, then state, then country). How should it be split into resources and packages?
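One plausible answer for the rainfall example, sketched below with hypothetical names and layout, is to partition semantically by (country, year): users can fetch only the slices they need, and resources for old years stay static while only the current year's resource gets updated.

```python
# Hypothetical semantic partitioning of the 5GB rainfall dataset:
# one resource per (country, year) pair.
countries = ["AU", "NZ"]
years = range(1990, 2020)  # 30 years of daily observations

resources = [
    {
        "name": f"rainfall-{c.lower()}-{y}",
        "path": f"data/{c}/rainfall-{y}.csv",
    }
    for c in countries
    for y in years
]

print(len(resources))  # 2 countries x 30 years = 60 resources
```

Finer grouping levels (state, locality) could become directory levels in `path` rather than separate resources, to keep the resource count manageable.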
See also the support for chunking/partitioning resources already in data packages frictionlessdata/datapackage#228
Research & Reading
This idea of partitioning shares much in common with partitioning in databases (or, more accurately, database tables).
Essentially we are asking what partitioning criteria people should use to split their dataset into resources (or even to split individual resources into files).
See https://en.wikipedia.org/wiki/Partition_(database)
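In database terms, the two options discussed above map onto range partitioning (split on a meaningful column such as year) versus fixed-size chunking (split every N rows). A toy sketch of the row-count variant:

```python
# Even, row-count-based chunking: uniform file sizes, but chunk boundaries
# carry no meaning (contrast with range partitioning by e.g. year).
def chunk_by_rows(n_rows, chunk_size):
    """Return (start, end) row index pairs covering n_rows in even chunks."""
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]

print(chunk_by_rows(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

A pattern proposal would presumably need to say when each criterion is appropriate, and how the chosen criterion is recorded in the descriptor.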