DRAFT: Extension types #451

Draft: wants to merge 1 commit into base: master

Member

@pitrou pitrou commented Sep 19, 2024

Rationale for this change

What changes are included in this PR?

Do these changes have PoC implementations?

When a reader encounters an extension type in a Parquet schema, it should try
to match it by name to its known extension types. If it does not recognize
the extension type, then it should read it as the underlying physical type
and should not try to interpret the column's statistics. It may however
Contributor

Only the min/max statistics, right? The others should still be valid.

Member Author

Oops, yes, you're right.

Member

Perhaps this should also include the column index?
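
To make the behaviour discussed in this thread concrete, here is a minimal Python sketch of how a reader might fall back when it does not recognize an extension type. Everything in it is hypothetical: `ColumnInfo`, `KNOWN_EXTENSIONS`, `resolve_column` and the extension names are illustration-only and are not part of the PR or of any Parquet library. It assumes the correction agreed above, namely that only min/max statistics are left uninterpreted while other statistics such as the null count stay usable. It does not settle whether the column index should be treated the same way.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ColumnInfo:
    """Hypothetical, simplified view of one column's schema and statistics."""
    physical_type: str
    extension_name: Optional[str] = None
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None
    null_count: Optional[int] = None


# Hypothetical registry mapping extension names to reader-side logical types.
KNOWN_EXTENSIONS: Dict[str, str] = {"example.uuid": "UUID"}


def resolve_column(col: ColumnInfo) -> Tuple[str, dict]:
    """Return (type to read the column as, statistics the reader may use)."""
    # Statistics that do not depend on value ordering stay valid either way.
    stats = {"null_count": col.null_count}

    if col.extension_name is None:
        # Plain column: read as the physical type, all statistics usable.
        read_as = col.physical_type
        interpret_min_max = True
    elif col.extension_name in KNOWN_EXTENSIONS:
        # Recognized extension type: matched by name.
        read_as = KNOWN_EXTENSIONS[col.extension_name]
        interpret_min_max = True
    else:
        # Unrecognized extension type: fall back to the physical type and
        # do not interpret min/max, whose ordering the reader cannot know.
        read_as = col.physical_type
        interpret_min_max = False

    if interpret_min_max:
        stats["min"] = col.min_value
        stats["max"] = col.max_value
    return read_as, stats
```

With this sketch, a column tagged with an unregistered name such as `example.unknown` is read as its physical type and yields only `null_count`, while a column tagged `example.uuid` keeps its min/max values.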

@emkornfield
Contributor

Generally seems reasonable to me.

*
* If the extension type is not parametric, then `serialization` is empty.
*/
struct ExtensionTypeDescription {
Member

Why choose a dedicated ExtensionTypeDescription struct over list<KeyValue>? I'm afraid that a binary-typed field may invite misuse by users.

Member Author

What would the list<KeyValue> contain and where would it reside? I'm not following you.

Member

struct ExtensionTypeDescription {
  1: optional list<KeyValue> metadata
}

And specify the required keys for each extension type, pretty much like what Arrow does.

Member Author

This does not make sense, does it? The keys will always be the same, so why not reify them in the Thrift spec as the PR currently does?

Member Author

Or are you thinking about extension-specific parameter keys as in https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html ?

Member Author

Note we would still need the extension name, so this would be:

struct ExtensionTypeDescription {
  1: required string name
  2: optional list<KeyValue> parameters
}

Member

> Or are you thinking about extension-specific parameter keys as in https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html ?

Yes, I mean something like this.
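
For readers following the thread, the two shapes under discussion can be contrasted with a small Python sketch. This is only an illustration under assumptions: the class names `OpaqueDescription` and `KeyValueDescription`, the helper `get_parameter`, and the example extension names and keys are all hypothetical. The first class assumes the PR's struct pairs a name with a binary `serialization` field, which the thread implies but does not show in full.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class OpaqueDescription:
    """Assumed shape of the PR's approach: a name plus opaque bytes.

    Per the quoted doc comment, `serialization` is empty when the
    extension type is not parametric.
    """
    name: str
    serialization: bytes = b""


@dataclass
class KeyValueDescription:
    """Shape suggested in the thread: a name plus key/value parameters."""
    name: str
    parameters: List[Tuple[str, str]] = field(default_factory=list)


def get_parameter(desc: KeyValueDescription, key: str) -> Optional[str]:
    """Return the first value stored under `key`, or None if it is absent."""
    for k, v in desc.parameters:
        if k == key:
            return v
    return None


# A non-parametric type carries no payload in either representation.
uuid_opaque = OpaqueDescription(name="example.uuid")
uuid_kv = KeyValueDescription(name="example.uuid")

# A parametric type: the opaque form needs an extension-specific decoder,
# whereas the key/value form can be inspected by any generic reader.
tensor_kv = KeyValueDescription(
    name="example.fixed_shape_tensor",
    parameters=[("shape", "[2, 3]")],
)
shape = get_parameter(tensor_kv, "shape")
```

The trade-off in the sketch mirrors the discussion: opaque bytes keep the Thrift definition small but push all interpretation onto each extension, while key/value parameters are self-describing at the cost of every extension having to specify its required keys.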

3 participants