
GH-455: Add Variant specification docs #456

Merged: 5 commits into apache:master on Oct 9, 2024

Conversation

@gene-db (Contributor) commented Sep 30, 2024

Rationale for this change

The Spark and Parquet communities have agreed to move the Spark Variant spec to Parquet.

What changes are included in this PR?

Added the Variant specification docs.

Do these changes have PoC implementations?

Closes #455

VariantEncoding.md (review thread outdated, resolved)
VariantShredding.md (review thread outdated, resolved)
@rdblue (Contributor) left a comment


+1 for getting this PR in with the basics so that we can start working on smaller, more focused PRs to get the shredding spec into a usable form. There are definitely some changes to make, but I'd prefer not to hold up the initial addition waiting for them.

Commit: minor formatting
Co-authored-by: Ryan Blue <[email protected]>
@sfc-gh-aixu

+1. Thanks @gene-db for working on this. So we will include the preliminary shredding spec as well? I'm fine with that.

@gene-db (Contributor, Author) commented Oct 4, 2024

@rdblue I updated the PR to add licenses to the docs. I think that should make the tests pass.

@julienledem (Member) left a comment


This looks good to me. I have left some comments.
As a follow-up, it would be nice to have more explanation of the rationale behind the decisions in this spec. The spec is precise, but it doesn't always explain why things are the way they are.

VariantEncoding.md (review thread outdated, resolved)
- The length of the ith string can be computed as `offset[i+1] - offset[i]`.
- The offset of the first string is always equal to 0 and is therefore redundant. It is included in the spec to simplify in-memory processing.
- `offset_size_minus_one` indicates the number of bytes per `dictionary_size` and `offset` entry. I.e. a value of 0 indicates 1-byte offsets, 1 indicates 2-byte offsets, 2 indicates 3-byte offsets, and 3 indicates 4-byte offsets.
- If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
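
To make the offset arithmetic concrete, here is a minimal Python sketch of the rules quoted above. It is an illustration, not code from the spec or this PR; the helper names are hypothetical, and the UTF-8 assumption comes from the discussion below.

```python
# Hypothetical helpers illustrating the quoted rules; names are not from the spec.

def offset_width(offset_size_minus_one: int) -> int:
    # 0 -> 1-byte entries, 1 -> 2-byte, 2 -> 3-byte, 3 -> 4-byte
    return offset_size_minus_one + 1

def dictionary_strings(offsets: list[int], string_bytes: bytes) -> list[str]:
    # offsets[0] is always 0; the i-th string spans
    # string_bytes[offsets[i]:offsets[i + 1]], so its length is
    # offsets[i + 1] - offsets[i]. Strings are assumed to be UTF-8.
    return [
        string_bytes[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]

# Example: three strings "a", "bc", "def" packed back to back.
assert dictionary_strings([0, 1, 3, 6], b"abcdef") == ["a", "bc", "def"]
```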
Comment (Member):

Does this assume any kind of encoding, or is it byte-wise?

Reply (Contributor):

All strings are UTF-8, but I think clarifying that is a follow-up.

Reply (Contributor, Author):

Yes, they are all UTF-8. I can add a follow-up to clarify that point.

Reply (Member):

So lexicographic order is defined by Unicode code points.
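
A quick illustration of that point (my own, not from the thread): because of how UTF-8 is constructed, sorting strings by their UTF-8 bytes gives the same order as sorting by Unicode code points, so a byte-wise comparator suffices for the sorted-dictionary check.

```python
# Sorting by UTF-8 bytes matches sorting by Unicode code points.
words = ["zebra", "Zürich", "apple", "éclair", "日本"]
by_code_point = sorted(words)                                 # Python compares str by code point
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
assert by_code_point == by_utf8_bytes
```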

# Shredding Semantics

Reconstruction of a Variant value from a shredded representation is not expected to produce a bit-for-bit identical binary to the original unshredded value.
For example, the order of fields in the binary may change, as may the physical representation of scalar values.
Comment (Member):

Is the order of fields going to change? If we use the same order in the Parquet schema, then the order should be maintained, no?
Also, it seems we could add metadata to the Parquet footer to make sure we can have an identity-preserving round trip. That seems like an important property to have for verifying correctness.

Reply (Contributor, Author):

In a Variant object, the field ids and field offsets have a strict ordering defined by the specification, but the field data (what the offsets point to) does not have to be in the same order. Therefore, reconstruction may not preserve the same order of the field data as the original binary.

We can validate correctness by recursively inspecting Variant values (and checking that field ids and offsets are valid according to the spec) rather than by bitwise comparing the results.
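
As a sketch of that kind of check (my own illustration, not from the PR): once both binaries are decoded into plain values, logical equality can be verified recursively instead of comparing bytes. `decode_variant` below is an assumed helper, not a real API.

```python
# Hypothetical sketch: compare two decoded Variant values structurally, so that
# differences in field-data order or physical scalar encoding do not matter.

def variants_equal(a, b) -> bool:
    if isinstance(a, dict) and isinstance(b, dict):
        # Object fields are matched by name, not by their byte order in the binary.
        return a.keys() == b.keys() and all(variants_equal(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        # Array elements must keep their order.
        return len(a) == len(b) and all(variants_equal(x, y) for x, y in zip(a, b))
    return a == b

# Usage (decode_variant is assumed, not a real API here):
# assert variants_equal(decode_variant(original_bytes), decode_variant(reconstructed_bytes))
```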

Co-authored-by: Julien Le Dem <[email protected]>
@gene-db (Contributor, Author) commented Oct 7, 2024

@julienledem Thanks! I clarified some of the comments, and I will address them in a follow-up PR.

@julienledem merged commit 4f20815 into apache:master on Oct 9, 2024
3 checks passed
@alamb (Contributor) commented Nov 15, 2024

Does anyone know of Parquet implementations that implement the Variant type?

I would like to try to organize getting this into the Rust implementation (see apache/arrow-rs#6736), but I couldn't find any example data or implementations while writing that up.
