Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-758: Add Float16/Half-float logical type #184

Merged
merged 3 commits into from
Oct 27, 2023

Conversation

anjakefala
Copy link
Contributor

@anjakefala anjakefala commented Aug 26, 2022

In the Mailing List, I proposed the addition of a Half Float (float16) type in Parquet: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5

This type is becoming increasingly popular in Machine Learning, and there are a bundle of issues requesting its support in Parquet:
https://issues.apache.org/jira/browse/PARQUET-1647
https://issues.apache.org/jira/browse/PARQUET-758
https://issues.apache.org/jira/browse/ARROW-17464
apache/arrow#2691

This is my first logical type proposal! I followed this PR as inspiration, but really welcome feedback from the community.

Implementation PRs:


Make sure you have checked all steps below.

Jira

  • My PR addresses the following Parquet Jira 1 and 2 issues and references them in the PR title.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

LogicalTypes.md Outdated Show resolved Hide resolved
LogicalTypes.md Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Aug 29, 2022

@anjakefala You need to add to the LogicalType union, not to the Type enum (which is for physical types).

Also cc @emkornfield

Type involves a trade-off of reduced precision,
in exchange for more efficient storage.
@emkornfield
Copy link
Contributor

emkornfield commented Aug 30, 2022

We should probably specify that using the Byte Split Encodings can be used for this type as well?

Also, in general, if possible try to avoid force pushing, as it makes it harder to compare iterative changes (this might not be the style in this repo, though so if you found instructions elsewhere on this, please ignore).

@emkornfield
Copy link
Contributor

It isn't clear to me if this should be a logical type or a physical type. We would need understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?

@@ -232,6 +232,7 @@ struct MapType {} // see LogicalTypes.md
struct ListType {} // see LogicalTypes.md
struct EnumType {} // allowed for BINARY, must be encoded with UTF-8
struct DateType {} // allowed for INT32
struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not allow bit splitting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield What do you mean here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, perhaps you mean the BYTE_STREAM_SPLIT encoding?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. BYTE_STREAM_SPLIT

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I guess it wouldn't cost much to allow it (implementations would not support it at the start anyway).

@gszadovszky
Copy link
Contributor

It isn't clear to me if this should be a logical type or a physical type. We would need understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be be). I think C++ might be lenient here, but don't know about parquet-mr @gszadovszky thoughts?

I think the basic idea behind having physical and logical types is to support forward compatibility since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical vale (a Binary with two bytes in this case).
So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type.

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type. (BTW the sorting order of floating point numbers are still an open issue: PARQUET-1222)

@pitrou
Copy link
Member

pitrou commented Aug 30, 2022

It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type.

While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy:
https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190
(npy_half is just an unsigned 16-bit integer in this context)

@gszadovszky
Copy link
Contributor

gszadovszky commented Aug 30, 2022

It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type.

While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190 (npy_half is just an unsigned 16-bit integer in this context)

It is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See PARQUET-1222 for details.

@emkornfield
Copy link
Contributor

t is not that trivial. For the half-precision floating point numbers we do not have native support for either cpp or java so we can define the total ordering as we want. But we shall do the same for the existing floating point numbers that most languages have native support. Even though they are following the same standard the total ordering either does not exist or have different implementations. See PARQUET-1222 for details.

I think these are orthogonal. I might be missing something but it seems like it would not be to hard to case float16 to float in java/cpp and do the comparison in that space and cast it back down. This might not be the most efficient implementation but would be straightforward? I am probably missing something. It would be nice to resolve PARQUET-1222 so the same semantics would apply to all floating point numbers.

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics).

It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

@gszadovszky
Copy link
Contributor

I've came up with this ordering thing because we specify it for every logical types. (Unfortunately we don't do this for primitive types.) Therefore, I would expect to have the order specified for this new logical type as well which is not trivial and requires to solve PARQUET-1222. At least we should add a note about this issue.

It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

I do not have the proper environment to test it but based on the code we do not handle unknown logical types well in parquet-mr. I think it handles unknown logical types as if they were not there at all which is fine from the data point of view but we would blindly use the statistics which may cause data loss. Created PARQUET-2182 to track this.

@emkornfield
Copy link
Contributor

I think Parquet C++ probably has the same issue. Let me reread PARQUET-1222. to see what the current state is and if we can push it forward.

@pitrou
Copy link
Member

pitrou commented Sep 7, 2022

I agree with @emkornfield that ordering issues seem largely orthogonal, as they also affect FLOAT32 and FLOAT64 types.

@anjakefala
Copy link
Contributor Author

anjakefala commented Sep 29, 2022

@pitrou @emkornfield @gszadovszky

Is there anything I can do to move this addition forward? Can I help with any code?

In terms of design, my understanding from reading the comments is that @gszadovszky brought up an ordering concern (valid, but not a blocker?), and that a decision needs to be made on whether float16 would be implemented as a logical or physical type?

@emkornfield
Copy link
Contributor

Sorry for the delay, it sounds like PARQUET-1222 is blocker, let me make a proposal there and see if we can at least come to consensus on approach and maybe this feature can be the first test-case for it.

@emkornfield
Copy link
Contributor

Sorry for the delay but PARQUET-1222 has now been merged, so I think this is unblocked.

@anjakefala
Copy link
Contributor Author

Thanks so much for the update @emkornfield!

What is the next step I can take?

@emkornfield
Copy link
Contributor

@anjakefala IIUC, I think the prior objection was around not properly floating point sorting for statistics correctly. I think the main thing is to update the specification to say this requires the same sorting logic as float32 and float64. I need to rereview the current state of things, but then I think we can probably try to vote on the mailing list to see if this type is acceptable. I'm not sure on the exact process here (I don't know if implementations are required before a vote). @gszadovszky thoughts?

@anjakefala
Copy link
Contributor Author

Thank you @emkornfield! I added the sort order to the spec.

@anjakefala
Copy link
Contributor Author

Hey @emkornfield! Is it reasonable for me to send a proposal to the mailing list for a vote? It seems @gszadovszky is not available for insight; is there anyone else that can provide it?

@emkornfield
Copy link
Contributor

@shangxinli are there guidelines for what needs to happen to accept this addition?

@pitrou
Copy link
Member

pitrou commented Feb 1, 2023

@shangxinli are there guidelines for what needs to happen to accept this addition?

I suppose it needs a discussion and then a formal vote on the ML?

@shangxinli
Copy link

As @julienledem mentioned in the email discussion, let's have the corresponding PRs for support in the Java and C++ implementation ready before we merge this PR. We would like to have implementation support when the new type is released.

@anjakefala
Copy link
Contributor Author

It seems that both the Java implementation and the C++ implementation are in a state of readiness.

Has the vote started? Can anyone with visibility update me on the status?

@benibus
Copy link

benibus commented Oct 4, 2023

@anjakefala Agreed that everything seems to be in place. I'll be starting the vote on the ML later today.

@anjakefala
Copy link
Contributor Author

@pitrou @emkornfield @gszadovszky @JFinis @julienledem @shangxinli

The vote has been started by @benibus here: https://lists.apache.org/thread/gyvqcx9ssxkjlrwogqwy7n4z6ofdm871 Please chime in! I would also appreciate anyone forwarding the vote to the private listserv.

@benibus
Copy link

benibus commented Oct 16, 2023

@pitrou @gszadovszky @julienledem

Given that the vote for this has just passed, I believe we should be good to merge this now? (pending a final review pass, of course)

@wgtmac
Copy link
Member

wgtmac commented Oct 17, 2023

@pitrou @gszadovszky @julienledem

Given that the vote for this has just passed, I believe we should be good to merge this now? (pending a final review pass, of course)

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

@anjakefala
Copy link
Contributor Author

@wgtmac

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

This is the parquet-format PR!

There are too many PRs. xD

@wgtmac
Copy link
Member

wgtmac commented Oct 17, 2023

@wgtmac

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

This is the parquet-format PR!

There are too many PRs. xD

My bad! I got lost in these PRs.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've suggested the name FLOAT_16 in the vote like we already have logical types INT_8 etc. But this is not a strong opinion, I am fine with as is.

I agree with @emkornfield that we should allow the encoding BYTE_STREAM_SPLIT to be used for this new logical type. It is fine to handle it separately, though.

LogicalTypes.md Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Oct 17, 2023

I agree with @emkornfield that we should allow the encoding BYTE_STREAM_SPLIT to be used for this new logical type. It is fine to handle it separately, though.

I would contend that perhaps BYTE_STREAM_SPLIT wouldn't yield very interesting results on FLOAT16. It would be interesting to get numbers.

@benibus
Copy link

benibus commented Oct 17, 2023

I've suggested the name FLOAT_16 in the vote like we already have logical types INT_8 etc

I think this was only the convention for legacy ConvertedType enums. We could theoretically deviate from that since there are no sized integral logical types (they're all rolled into INTEGER/IntType).

As for BYTE_STREAM_SPLIT, my feeling is that we'll probably want it (for parity with FLOAT/DOUBLE, at least), but it could be added as a follow-up - along with an implementation + benchmarks, if necessary. There's also some ambiguity about whether to support BSS for FixedLenByteArray generally, which may warrant a separate discussion.

@anjakefala
Copy link
Contributor Author

@gszadovszky What is the merging process once it has approval and passed voting? =)

@gszadovszky
Copy link
Contributor

@benibus, could you officially close the vote on the mailing list so it is clear that it has passed?
@anjakefala, since we have 3 approvals already on this PR, any committer can push it. I would wait for the official closing of the vote, thought.

@benibus
Copy link

benibus commented Oct 26, 2023

For the record, I've announced the vote's passing in the original ML thread itself (apologies if the [RESULT] thread wasn't sufficient).

@gszadovszky
Copy link
Contributor

Sorry, @benibus. My bad. Thank you for managing the vote!
I'm pushing this PR...

@gszadovszky gszadovszky merged commit 46cc3a0 into apache:master Oct 27, 2023
3 checks passed
@gszadovszky
Copy link
Contributor

@anjakefala, do you have a jira user so I can assign it to you?

@anjakefala
Copy link
Contributor Author

I really appreciate everyone who took time out of their lives to give this PR attention! :)) Thanks for the final merge @gszadovszky! And yes, my apache arrow JIRA handle is the same as github @anjakefala.

zeroshade pushed a commit to apache/arrow that referenced this pull request Nov 13, 2023
### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: #37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…37599)

### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: apache#37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…37599)

### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: apache#37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
kou pushed a commit to apache/arrow-go that referenced this pull request Aug 30, 2024
### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache/arrow#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: #37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants