
GH-455: Add Variant specification docs #456

Merged: 5 commits into apache:master on Oct 9, 2024

Conversation

@gene-db (Contributor) commented Sep 30, 2024

Rationale for this change

The Spark and Parquet communities have agreed to move the Spark Variant spec to Parquet.

What changes are included in this PR?

Added the Variant specification docs.

Do these changes have PoC implementations?

Closes #455

VariantEncoding.md (review thread outdated, resolved)
VariantShredding.md (review thread outdated, resolved)
@rdblue (Contributor) left a comment


+1 for getting this PR in with the basics so that we can start working on smaller, more focused PRs to get the shredding spec into a usable form. There are definitely some changes to make, but I'd prefer not to hold up the initial addition waiting for them.

Commit: minor formatting
Co-authored-by: Ryan Blue <[email protected]>
@sfc-gh-aixu

+1. Thanks @gene-db for working on this. So we will include the preliminary shredding spec as well? I'm fine with that.

@gene-db (Contributor, Author) commented Oct 4, 2024

@rdblue I updated the PR to add licenses to the docs. I think that should make the tests pass.

@julienledem (Member) left a comment


This looks good to me. I have left some comments.
As a follow-up, it would be nice to have more explanation of the rationale behind the decisions in this spec. The spec is precise, but it doesn't always explain why things are the way they are.

VariantEncoding.md (review thread outdated, resolved)
- The length of the ith string can be computed as `offset[i+1] - offset[i]`.
- The offset of the first string is always equal to 0 and is therefore redundant. It is included in the spec to simplify in-memory processing.
- `offset_size_minus_one` indicates the number of bytes per `dictionary_size` and `offset` entry. I.e. a value of 0 indicates 1-byte offsets, 1 indicates 2-byte offsets, 2 indicates 3-byte offsets, and 3 indicates 4-byte offsets.
- If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
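
To make the offset arithmetic concrete, here is a minimal Python sketch of the rules quoted above. It is an illustration, not code from the spec or this PR; the helper names are hypothetical, and the UTF-8 assumption comes from the discussion below.

```python
# Hypothetical helpers illustrating the quoted rules; names are not from the spec.

def offset_width(offset_size_minus_one: int) -> int:
    # 0 -> 1-byte entries, 1 -> 2-byte, 2 -> 3-byte, 3 -> 4-byte
    return offset_size_minus_one + 1

def dictionary_strings(offsets: list[int], string_bytes: bytes) -> list[str]:
    # offsets[0] is always 0; the i-th string spans
    # string_bytes[offsets[i]:offsets[i + 1]], so its length is
    # offsets[i + 1] - offsets[i]. Strings are assumed to be UTF-8.
    return [
        string_bytes[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]

# Example: three strings "a", "bc", "def" packed back to back.
assert dictionary_strings([0, 1, 3, 6], b"abcdef") == ["a", "bc", "def"]
```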
Comment (Member):

Does this assume any kind of encoding, or is it byte-wise?

Reply (Contributor):

All strings are UTF-8, but I think clarifying that is a follow-up.

Reply (Contributor, Author):

Yes, they are all UTF-8. I can add a follow-up to clarify that point.

Reply (Member):

So lexicographic order is defined by Unicode code points.
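
A quick illustration of that point (my own, not from the thread): because of how UTF-8 is constructed, sorting strings by their UTF-8 bytes gives the same order as sorting by Unicode code points, so a byte-wise comparator suffices for the sorted-dictionary check.

```python
# Sorting by UTF-8 bytes matches sorting by Unicode code points.
words = ["zebra", "Zürich", "apple", "éclair", "日本"]
by_code_point = sorted(words)                                 # Python compares str by code point
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
assert by_code_point == by_utf8_bytes
```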

# Shredding Semantics

Reconstruction of a Variant value from a shredded representation is not expected to produce a bit-for-bit identical binary to the original unshredded value.
For example, the order of fields in the binary may change, as may the physical representation of scalar values.
Comment (Member):

Is the order of fields going to change? If we use the same order in the Parquet schema, then the order should be maintained, no?
Also, it seems we could add metadata to the Parquet footer to make sure we can have an identity-preserving round trip. That seems like an important property to have for verifying correctness.

Reply (Contributor, Author):

In a Variant object, the field ids and field offsets have a strict ordering defined by the specification, but the field data (what the offsets point to) does not have to be in the same order. Therefore, reconstruction may not preserve the same order of the field data as the original binary.

We can validate correctness by recursively inspecting Variant values (and checking that field ids and offsets are valid according to the spec) rather than by bitwise comparing the results.
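
As a sketch of that kind of check (my own illustration, not from the PR): once both binaries are decoded into plain values, logical equality can be verified recursively instead of comparing bytes. `decode_variant` below is an assumed helper, not a real API.

```python
# Hypothetical sketch: compare two decoded Variant values structurally, so that
# differences in field-data order or physical scalar encoding do not matter.

def variants_equal(a, b) -> bool:
    if isinstance(a, dict) and isinstance(b, dict):
        # Object fields are matched by name, not by their byte order in the binary.
        return a.keys() == b.keys() and all(variants_equal(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        # Array elements must keep their order.
        return len(a) == len(b) and all(variants_equal(x, y) for x, y in zip(a, b))
    return a == b

# Usage (decode_variant is assumed, not a real API here):
# assert variants_equal(decode_variant(original_bytes), decode_variant(reconstructed_bytes))
```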

Co-authored-by: Julien Le Dem <[email protected]>
@gene-db (Contributor, Author) commented Oct 7, 2024

@julienledem Thanks! I clarified some of the comments, and I will address them in a follow-up PR.

@julienledem merged commit 4f20815 into apache:master on Oct 9, 2024
3 checks passed
@alamb (Contributor) commented Nov 15, 2024

Does anyone know of Parquet implementations that implement the Variant type?

I would like to try to organize getting this into the Rust implementation (see apache/arrow-rs#6736), but I couldn't find any example data or implementations while writing that up.
