GH-455: Add Variant specification docs #456
Conversation
+1 for getting this PR in with the basics so that we can start working on smaller, more focused PRs to get the shredding spec into a usable form. There are definitely some changes to make, but I'd prefer not to hold up the initial addition waiting for them.
minor formatting Co-authored-by: Ryan Blue <[email protected]>
+1. Thanks @gene-db for working on it. So we will include the preliminary shredding spec as well? I'm fine with that.
Add license
@rdblue I updated the PR to add licenses to the docs. I think that should make the tests pass.
This looks good to me. I have left some comments.
As a follow-up, it would be nice to have more explanations of the rationale for the decisions in this spec. While the spec is precise, it doesn't always explain why it is that way.
- The length of the ith string can be computed as `offset[i+1] - offset[i]`.
- The offset of the first string is always equal to 0 and is therefore redundant. It is included in the spec to simplify in-memory processing.
- `offset_size_minus_one` indicates the number of bytes per `dictionary_size` and `offset` entry. That is, a value of 0 indicates 1-byte offsets, 1 indicates 2-byte offsets, 2 indicates 3-byte offsets, and 3 indicates 4-byte offsets.
- If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
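To make the offset arithmetic concrete, here is a minimal sketch (mine, not part of the spec or this PR) of how a reader might slice the ith string out of the dictionary, assuming the `dictionary_size + 1` offsets have already been decoded as little-endian unsigned integers of `offset_size_minus_one + 1` bytes each; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class MetadataDictionarySketch {
    // Decodes a little-endian unsigned integer of `width` bytes at `pos`.
    static int readUnsignedLE(byte[] buf, int pos, int width) {
        int value = 0;
        for (int i = 0; i < width; i++) {
            value |= (buf[pos + i] & 0xFF) << (8 * i);
        }
        return value;
    }

    // Returns the ith dictionary string. `offsets` holds dictionary_size + 1
    // entries and `stringBytes` is the concatenated UTF-8 string data.
    static String dictionaryString(byte[] stringBytes, int[] offsets, int i) {
        int start = offsets[i];
        int length = offsets[i + 1] - offsets[i]; // length of the ith string
        return new String(stringBytes, start, length, StandardCharsets.UTF_8);
    }
}
```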
Does this assume any kind of encoding, or is it byte-wise?
All strings are UTF-8, but I think clarifying that is a follow-up.
Yes, they are all UTF-8. I can add a follow-up to clarify that point.
So lexicographic order is defined by Unicode code points.
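A quick illustration (mine, not from this thread) of why comparing the raw UTF-8 bytes as unsigned values is equivalent to comparing Unicode code points, and why, for example, Java's `String.compareTo`, which compares UTF-16 code units, is not:

```java
import java.nio.charset.StandardCharsets;

public class Utf8OrderSketch {
    // Compares two strings by their UTF-8 bytes, unsigned, byte by byte.
    static int compareUtf8Bytes(String a, String b) {
        byte[] ba = a.getBytes(StandardCharsets.UTF_8);
        byte[] bb = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(ba.length, bb.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(ba[i] & 0xFF, bb[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(ba.length, bb.length);
    }

    // Compares two strings by Unicode code points.
    static int compareCodePoints(String a, String b) {
        int[] ca = a.codePoints().toArray();
        int[] cb = b.codePoints().toArray();
        int n = Math.min(ca.length, cb.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(ca[i], cb[i]);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(ca.length, cb.length);
    }

    public static void main(String[] args) {
        String a = "\uFB00";        // U+FB00, 3-byte UTF-8 (EF AC 80)
        String b = "\uD83D\uDE00";  // U+1F600, 4-byte UTF-8 (F0 9F 98 80)
        System.out.println(compareUtf8Bytes(a, b) < 0);  // true: code-point order
        System.out.println(compareCodePoints(a, b) < 0); // true: code-point order
        System.out.println(a.compareTo(b) < 0);          // false: surrogate 0xD83D < 0xFB00
    }
}
```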
# Shredding Semantics

Reconstruction of a Variant value from a shredded representation is not expected to produce a bit-for-bit identical binary to the original unshredded value.
For example, the order of fields in the binary may change, as may the physical representation of scalar values.
Is the order of fields going to change? If we use the same order in the Parquet schema, then the order should be maintained, no?
Also, it seems that we could add metadata to the Parquet footer to make sure we can have an identity-preserving round trip. That seems like an important property to have in order to verify correctness.
In a Variant object, the field ids and field offsets have a strict ordering defined by the specification, but the field data (what the offsets point to) does not have to be in the same order. Therefore, reconstruction may not preserve the same order of the field data as the original binary.
We can validate correctness by recursively inspecting Variant values (and checking that field ids/offsets are valid according to the spec), rather than by bitwise-comparing the results.
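As a rough sketch of what that value-level comparison could look like, here is one way to compare two decoded Variant values recursively; the types below are hypothetical stand-ins for a decoded Variant (Java 17, records and pattern matching), not an existing API:

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical in-memory model; a real implementation would first decode the
// Variant binary (metadata + value) into something like this.
sealed interface VariantValue permits VariantObject, VariantArray, VariantScalar {}
record VariantObject(Map<String, VariantValue> fields) implements VariantValue {}
record VariantArray(List<VariantValue> elements) implements VariantValue {}
record VariantScalar(Object value) implements VariantValue {}

final class VariantEquals {
    // Returns true if the two values are logically equal, ignoring the
    // physical order of object field data in the underlying binary.
    static boolean logicallyEqual(VariantValue a, VariantValue b) {
        if (a instanceof VariantObject oa && b instanceof VariantObject ob) {
            if (!oa.fields().keySet().equals(ob.fields().keySet())) return false;
            return oa.fields().entrySet().stream()
                .allMatch(e -> logicallyEqual(e.getValue(), ob.fields().get(e.getKey())));
        }
        if (a instanceof VariantArray la && b instanceof VariantArray lb) {
            if (la.elements().size() != lb.elements().size()) return false;
            for (int i = 0; i < la.elements().size(); i++) {
                if (!logicallyEqual(la.elements().get(i), lb.elements().get(i))) return false;
            }
            return true;
        }
        if (a instanceof VariantScalar sa && b instanceof VariantScalar sb) {
            return Objects.equals(sa.value(), sb.value());
        }
        return false;
    }
}
```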
Co-authored-by: Julien Le Dem <[email protected]>
@julienledem Thanks! I clarified some of the comments, and I will address them in a follow-up PR.
Does anyone know of Parquet implementations that implement the Variant type? I would like to try to organize getting this into the Rust implementation (see apache/arrow-rs#6736), but I couldn't find any example data or implementations while writing that up.
Rationale for this change
The Spark and Parquet communities have agreed to move the Spark Variant spec to Parquet.
What changes are included in this PR?
Added the Variant specification docs.
Do these changes have PoC implementations?
Closes #455