-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2352: Allow truncation of row group min_values/max_value statistics #216
Conversation
db24acc
to
be4998e
Compare
src/main/thrift/parquet.thrift
Outdated
@@ -216,7 +216,12 @@ struct Statistics { | |||
/** count of distinct values occurring */ | |||
4: optional i64 distinct_count; | |||
/** | |||
* Min and max values for the column, determined by its ColumnOrder. | |||
* lower and upper bound values for the column, determined by its ColumnOrder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, you want to allow truncation by adding a flag to indicate whether the min_value/max_value is truncated or not, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the feature in https://issues.apache.org/jira/browse/PARQUET-1685, I want to assume that all existing stats are truncated. Going forward we should have a flag to explicitly indicate whether or not truncation took place and applications should perform aggregation pushdown only if that flag is found to indicate no truncation. But I think adding that flag can be tackled separately as a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they are relevant and can be discussed together.
What do you think? @gszadovszky @shangxinli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might not directly related to this question. If the column is a utf-8 column ( with LogicalType UTF8
), can the binary here be a non-utf8 bytes?
e.g. s[...100]
is a valid utf-8, however, s[...50]
is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU very good point, we should not allow truncation to produce a value that is invalid for given logical type
i think this is what @raunaqmorarka meant with Such more compact values must still be valid values within the column's logical type.
in the code comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this PR to now include is_max_value_exact/is_min_value_exact for specifying when the max_value/min_value are the actual max and min values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac @gszadovszky can you please take a look at current changes ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac @gszadovszky is anything more needed to merge this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just merged it. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Thanks @raunaqmorarka, this lgtm |
be4998e
to
868c6da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Rest LGTM
…alue statistics This updates the spec to allow truncation of row group min_values/max_value statistics so that readers can take advantage of row group pruning for predicates on columns containing long strings. https://issues.apache.org/jira/browse/PARQUET-1685 already introduced a feature to parquet-mr which allows users to deviate from the current spec and configure truncation of row group statistics. This change also adds is_max_value_exact/is_min_value_exact to allow writers to specify when the max_value/min_value are the actual max and min values found on the column chunk.
1595109
to
ce68775
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
So seems that for Also, we doesn't have page-index level statistics here |
Also, seems that this property only works for ByteArray, right (maybe FLBA also included) https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L965-L979 The syntax like this is not included in |
…-format 2.10 (#15412) [PARQUET-2352](apache/parquet-format#216) added fields to the `Statistics` struct to indicate whether the min and max values were exact or had been truncated. This was somewhat ambiguous in the past. One reason to want to know this is to allow avoiding the decoding of pages (or column chunks) that contain a single value (if the min and max are the same value, and are known to be exact values, and there are no nulls, then the only valid value for the page will be that value). This PR adds these new fields, which will always be `true` in cuDF since cuDF does not support truncating min and max values in the statistics (but does support truncation in the page indexes). Authors: - Ed Seidl (https://github.com/etseidl) - Nghia Truong (https://github.com/ttnghia) Approvers: - Paul Mattione (https://github.com/pmattione-nvidia) - Karthikeyan (https://github.com/karthikeyann) URL: #15412
Jira
Description
This updates the spec to allow truncation of row group min_values/max_value statistics so that readers can take advantage of row group pruning for predicates on columns containing long strings.
https://issues.apache.org/jira/browse/PARQUET-1685 already introduced a feature to parquet-mr which allows users to deviate from the current spec and configure truncation of row group statistics.
This change also adds is_max_value_exact/is_min_value_exact to allow writers to specify
when the max_value/min_value are the actual max and min values found on the column chunk.
Since the possibility of truncation exists and is not possible to explicitly detect, attempts to pushdown min/max aggregation to parquet have avoided implementing it for string columns (e.g. https://issues.apache.org/jira/browse/SPARK-36645)
Given the above situation, the spec should be updated to allow truncation of min/max row group stats. This would align the spec with current reality that string column min/max row group stats could be truncated.