PARQUET-758: Add Float16/Half-float logical type #184
@anjakefala You need to add to the Also cc @emkornfield
Type involves a trade-off of reduced precision, in exchange for more efficient storage.
We should probably specify that the Byte Split Encodings can be used for this type as well? Also, in general, if possible, try to avoid force pushing, as it makes it harder to compare iterative changes (this might not be the style in this repo, though, so if you found instructions elsewhere on this, please ignore).
It isn't clear to me if this should be a logical type or a physical type. We would need to understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be). I think C++ might be lenient here, but I don't know about parquet-mr. @gszadovszky thoughts?
@@ -232,6 +232,7 @@ struct MapType {} // see LogicalTypes.md
struct ListType {} // see LogicalTypes.md
struct EnumType {} // allowed for BINARY, must be encoded with UTF-8
struct DateType {} // allowed for INT32
struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
why not allow bit splitting?
@emkornfield What do you mean here?
Ah, perhaps you mean the BYTE_STREAM_SPLIT encoding?
yes. BYTE_STREAM_SPLIT
Well, I guess it wouldn't cost much to allow it (implementations would not support it at the start anyway).
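For readers unfamiliar with the encoding under discussion, here is a minimal sketch of what BYTE_STREAM_SPLIT would look like for 2-byte values. It assumes the values are stored as little-endian IEEE half-precision bytes; the function names are hypothetical and this is not code from any Parquet implementation.

```python
import struct

def byte_stream_split_encode(values):
    """Pack floats as float16, then scatter byte k of every value into stream k."""
    raw = [struct.pack("<e", v) for v in values]  # 2 little-endian bytes per value
    # Stream 0 holds the low bytes, stream 1 the high bytes; streams are concatenated.
    return b"".join(bytes(r[k] for r in raw) for k in range(2))

def byte_stream_split_decode(data):
    """Inverse of the encoder: regroup one byte from each stream per value."""
    n = len(data) // 2
    return [struct.unpack("<e", bytes((data[i], data[n + i])))[0] for i in range(n)]

encoded = byte_stream_split_encode([1.0, -2.0, 0.5])
decoded = byte_stream_split_decode(encoded)
```

Grouping the high bytes (sign, exponent, top mantissa bits) together tends to help a later compression step, which is the usual motivation for the encoding.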
I think the basic idea behind having physical and logical types is to support forward compatibility, since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading it back as an un-annotated physical value (a The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types, we still need to care about the logical types for the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic, since Java does not have such a primitive type. (BTW, the sorting order of floating point numbers is still an open issue: PARQUET-1222)
While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as NumPy:
It is not that trivial. For half-precision floating point numbers we do not have native support in either C++ or Java, so we can define the total ordering as we want. But we should do the same for the existing floating point types, which most languages support natively. Even though they follow the same standard, the total ordering either does not exist or has different implementations. See PARQUET-1222 for details.
I think these are orthogonal. I might be missing something, but it seems like it would not be too hard to cast float16 to float in Java/C++, do the comparison in that space, and cast it back down. This might not be the most efficient implementation, but would be straightforward? I am probably missing something. It would be nice to resolve PARQUET-1222 so the same semantics would apply to all floating point numbers.
It seems this would require Parquet implementations to null out statistics for logical types that they don't support. Does parquet-mr do that today?
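The "widen, compare, narrow" idea from the comment above can be sketched as follows. This is an illustration only, assuming little-endian half-precision bytes; `half_to_float` and `compare_half` are hypothetical names. Widening is lossless because every float16 value is exactly representable as a wider float.

```python
import struct

def half_to_float(raw: bytes) -> float:
    """Decode 2 little-endian bytes as an IEEE half-precision value."""
    return struct.unpack("<e", raw)[0]

def compare_half(a: bytes, b: bytes) -> int:
    """Return -1, 0 or 1 by comparing in the wider float space.

    NaN compares as 'equal' to everything here, which is exactly the
    ordering gap that PARQUET-1222 is about; this sketch does not solve it.
    """
    x, y = half_to_float(a), half_to_float(b)
    return (x > y) - (x < y)

pos = struct.pack("<e", 1.5)   # bytes 00 3E
neg = struct.pack("<e", -3.0)  # bytes 00 C2
result = compare_half(pos, neg)
```

Note that a plain byte-wise comparison of the same two values gives the opposite answer (`pos < neg`), which is why the statistics cannot simply reuse the ordering of the underlying FIXED[2] type.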
I brought up this ordering issue because we specify it for every logical type. (Unfortunately we don't do this for primitive types.) Therefore, I would expect the order to be specified for this new logical type as well, which is not trivial and requires solving PARQUET-1222. At least we should add a note about this issue.
I do not have the proper environment to test it, but based on the code we do not handle unknown logical types well in parquet-mr. I think it handles unknown logical types as if they were not there at all, which is fine from the data point of view, but we would blindly use the statistics, which may cause data loss. Created PARQUET-2182 to track this.
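To make the data-loss concern concrete, here is a hedged sketch of a reader that prunes row groups with byte-wise min/max checks. Everything in it is hypothetical (the `KNOWN_LOGICAL_TYPES` set, the function names, the string type tags); it is not how parquet-mr or Parquet C++ are written. It only shows why statistics written in float order must be ignored by a reader that does not recognise the annotation.

```python
import struct

# Hypothetical set of logical types this (older) reader understands.
KNOWN_LOGICAL_TYPES = {"STRING", "DATE", "DECIMAL"}

def trusted_min_max(logical_type, min_raw, max_raw):
    """Return (min, max) only when the reader knows how they were ordered."""
    if logical_type is not None and logical_type not in KNOWN_LOGICAL_TYPES:
        return None  # unknown annotation: ordering unknown, so ignore the stats
    return min_raw, max_raw

def may_contain(logical_type, min_raw, max_raw, probe_raw):
    """Byte-wise range check, as used for an un-annotated FIXED[2] column."""
    stats = trusted_min_max(logical_type, min_raw, max_raw)
    if stats is None:
        return True  # cannot prune; the row group must be read
    return stats[0] <= probe_raw <= stats[1]

# Statistics written by a float16-aware writer for values in [-3.0, 1.5].
min_raw = struct.pack("<e", -3.0)
max_raw = struct.pack("<e", 1.5)
probe = struct.pack("<e", 0.5)

blind = may_contain(None, min_raw, max_raw, probe)       # annotation dropped
safe = may_contain("FLOAT16", min_raw, max_raw, probe)   # annotation seen but unknown
```

With the annotation silently dropped, the byte-wise check wrongly reports that 0.5 cannot be in the row group; treating the unknown annotation as "no usable statistics" avoids that.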
I think Parquet C++ probably has the same issue. Let me reread PARQUET-1222 to see what the current state is and if we can push it forward.
I agree with @emkornfield that ordering issues seem largely orthogonal, as they also affect FLOAT32 and FLOAT64 types.
@pitrou @emkornfield @gszadovszky Is there anything I can do to move this addition forward? Can I help with any code? In terms of design, my understanding from reading the comments is that @gszadovszky brought up an ordering concern (valid, but not a blocker?), and that a decision needs to be made on whether float16 would be implemented as a logical or physical type?
Sorry for the delay. It sounds like PARQUET-1222 is a blocker; let me make a proposal there and see if we can at least come to consensus on an approach, and maybe this feature can be the first test case for it.
Sorry for the delay but PARQUET-1222 has now been merged, so I think this is unblocked.
Thanks so much for the update @emkornfield! What is the next step I can take?
@anjakefala IIUC, I think the prior objection was around not sorting floating point values correctly for statistics. I think the main thing is to update the specification to say this requires the same sorting logic as float32 and float64. I need to re-review the current state of things, but then I think we can probably try to vote on the mailing list to see if this type is acceptable. I'm not sure of the exact process here (I don't know if implementations are required before a vote). @gszadovszky thoughts?
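As an illustration of what "the same sorting logic as float32 and float64" could mean for a writer, here is a small sketch of min/max computation for float16 statistics. It assumes the writer-side rules as I understand them from the parquet-format notes on floating point column order (NaN is never written to min or max, a zero minimum is written as -0.0, a zero maximum is written as +0.0); please verify against the specification. The function name is hypothetical.

```python
import math
import struct

def float16_min_max(raw_values):
    """Compute (min, max) statistics for little-endian float16 byte strings."""
    vals = [struct.unpack("<e", r)[0] for r in raw_values]
    vals = [v for v in vals if not math.isnan(v)]  # NaN must not reach min/max
    if not vals:
        return None  # nothing comparable: write no min/max at all
    lo, hi = min(vals), max(vals)
    if lo == 0.0:
        lo = -0.0  # a zero min is written as -0.0, whatever its sign was
    if hi == 0.0:
        hi = 0.0   # a zero max is written as +0.0, whatever its sign was
    return struct.pack("<e", lo), struct.pack("<e", hi)

column = [struct.pack("<e", v) for v in (float("nan"), 0.0, 1.0)]
stats = float16_min_max(column)
```

The zero handling matters because -0.0 and +0.0 compare equal as numbers but have different bytes, so a reader filtering on either sign of zero must still find the row group.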
Thank you @emkornfield! I added the sort order to the spec.
Hey @emkornfield! Is it reasonable for me to send a proposal to the mailing list for a vote? It seems @gszadovszky is not available for insight; is there anyone else that can provide it?
@shangxinli are there guidelines for what needs to happen to accept this addition?
I suppose it needs a discussion and then a formal vote on the ML?
As @julienledem mentioned in the email discussion, let's have the corresponding PRs for support in the Java and C++ implementations ready before we merge this PR. We would like to have implementation support when the new type is released.
It seems that both the Java implementation and the C++ implementation are in a state of readiness. Has the vote started? Can anyone with visibility update me on the status?
@anjakefala Agreed that everything seems to be in place. I'll be starting the vote on the ML later today.
@pitrou @emkornfield @gszadovszky @JFinis @julienledem @shangxinli The vote has been started by @benibus here: https://lists.apache.org/thread/gyvqcx9ssxkjlrwogqwy7n4z6ofdm871 Please chime in! I would also appreciate anyone forwarding the vote to the private listserv.
@pitrou @gszadovszky @julienledem Given that the vote for this has just passed, I believe we should be good to merge this now? (pending a final review pass, of course)
Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of
This is the parquet-format PR! There are too many PRs. xD
My bad! I got lost in these PRs.
I've suggested the name FLOAT_16 in the vote, like we already have logical types INT_8 etc. But this is not a strong opinion; I am fine with it as is.
I agree with @emkornfield that we should allow the encoding BYTE_STREAM_SPLIT to be used for this new logical type. It is fine to handle it separately, though.
I would contend that perhaps
I think this was only the convention for legacy As for
@gszadovszky What is the merging process once it has approval and passed voting? =)
@benibus, could you officially close the vote on the mailing list so it is clear that it has passed?
For the record, I've announced the vote's passing in the original ML thread itself (apologies if the
Sorry, @benibus. My bad. Thank you for managing the vote!
@anjakefala, do you have a jira user so I can assign it to you?
I really appreciate everyone who took time out of their lives to give this PR attention! :)) Thanks for the final merge @gszadovszky! And yes, my apache arrow JIRA handle is the same as github |
### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: #37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
In the Mailing List, I proposed the addition of a Half Float (float16) type in Parquet: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5
This type is becoming increasingly popular in Machine Learning, and there are a bundle of issues requesting its support in Parquet:
https://issues.apache.org/jira/browse/PARQUET-1647
https://issues.apache.org/jira/browse/PARQUET-758
https://issues.apache.org/jira/browse/ARROW-17464
apache/arrow#2691
This is my first logical type proposal! I followed this PR as inspiration, but really welcome feedback from the community.
Implementation PRs: