PARQUET-2352: Allow truncation of row group min_values/max_value statistics #216

raunaqmorarka · 2023-09-21T03:34:53Z

Jira

https://issues.apache.org/jira/browse/PARQUET-2352

Description

This updates the spec to allow truncation of row group min_value/max_value statistics so that readers can use row group pruning for predicates on columns containing long strings.
https://issues.apache.org/jira/browse/PARQUET-1685 already introduced a feature to parquet-mr which allows users to deviate from the current spec and configure truncation of row group statistics.
This change also adds is_max_value_exact/is_min_value_exact to allow writers to specify
when the max_value/min_value are the actual max and min values found on the column chunk.

Since truncation is possible and cannot be detected explicitly, attempts to push down min/max aggregation to Parquet have avoided implementing it for string columns (e.g. https://issues.apache.org/jira/browse/SPARK-36645).
Given this, the spec should be updated to allow truncation of min/max row group stats. This would align the spec with the current reality that min/max row group stats for string columns may be truncated.
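
To illustrate why truncated bounds are still enough for row group pruning, here is a minimal Python sketch. `ColumnStats` and `may_contain` are hypothetical names invented for this example, not part of any Parquet library, and the sketch assumes binary values are compared as unsigned bytes (which is how Python compares `bytes`).

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    # Hypothetical stand-in for the min_value/max_value fields of Statistics.
    min_value: bytes  # lower bound, possibly truncated
    max_value: bytes  # upper bound, possibly truncated and rounded up


def may_contain(stats: ColumnStats, target: bytes) -> bool:
    """Return False only when the row group provably cannot contain target."""
    return stats.min_value <= target <= stats.max_value


# Assume the real values range from b"apple_pie_recipe" to b"banana_split_sundae".
# A writer truncating to 4 bytes could store b"appl" as the lower bound and
# b"banb" as the upper bound (last byte incremented so it stays >= the real max).
stats = ColumnStats(min_value=b"appl", max_value=b"banb")

skip_cherry = not may_contain(stats, b"cherry")      # safe to skip this row group
skip_aardvark = not may_contain(stats, b"aardvark")  # safe to skip this row group
must_read_avocado = may_contain(stats, b"avocado")   # cannot be ruled out
```

The bounds are looser than the exact min and max, so a reader may occasionally read a row group it could have skipped, but it never skips one that contains a match.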

wgtmac · 2023-09-21T15:50:54Z

src/main/thrift/parquet.thrift

@@ -216,7 +216,12 @@ struct Statistics {
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
   /**
-    * Min and max values for the column, determined by its ColumnOrder.
+    * lower and upper bound values for the column, determined by its ColumnOrder.


IIRC, you want to allow truncation by adding a flag to indicate whether the min_value/max_value is truncated or not, right?

Given the feature in https://issues.apache.org/jira/browse/PARQUET-1685, I want to assume that all existing stats are truncated. Going forward we should have a flag to explicitly indicate whether or not truncation took place and applications should perform aggregation pushdown only if that flag is found to indicate no truncation. But I think adding that flag can be tackled separately as a follow-up.
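
A rough sketch of the reader-side rule described here, in Python. `pushdown_min` is a hypothetical helper, not an existing API; it treats a missing flag the same as "possibly truncated", which matches the assumption that all existing stats may be truncated.

```python
from typing import Optional


def pushdown_min(min_value: Optional[bytes],
                 is_min_value_exact: Optional[bool]) -> Optional[bytes]:
    """Answer MIN(col) from statistics only when the writer declared them exact.

    Returning None means the caller has to scan the column data instead.
    """
    if min_value is not None and is_min_value_exact is True:
        return min_value
    return None


from_old_file = pushdown_min(b"appl", None)      # flag absent: fall back to a scan
from_truncated = pushdown_min(b"appl", False)    # explicitly truncated: fall back
from_exact = pushdown_min(b"apple", True)        # exact: usable for pushdown
```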

I think they are relevant and can be discussed together.

What do you think? @gszadovszky @shangxinli

This might not be directly related to this question, but if the column is a UTF-8 column (with LogicalType UTF8), can the binary here be non-UTF-8 bytes?

e.g. s[...100] is valid UTF-8, but s[...50] is not.

@mapleFU Very good point; we should not allow truncation to produce a value that is invalid for the given logical type.
I think this is what @raunaqmorarka meant by "Such more compact values must still be valid values within the column's logical type." in the code comment.
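
One possible way to keep truncated bounds valid for a UTF-8 column, sketched in Python. Both function names are hypothetical and this is not taken from any existing writer; it relies on the fact that UTF-8 byte order matches code point order under unsigned comparison.

```python
from typing import Optional


def truncate_min_utf8(value: str, max_bytes: int) -> bytes:
    """Longest prefix of value that fits in max_bytes and is still valid UTF-8."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = max_bytes
    # Continuation bytes look like 0b10xxxxxx; back up to a code point start.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]


def truncate_max_utf8(value: str, max_bytes: int) -> Optional[bytes]:
    """Valid UTF-8 upper bound of at most max_bytes, or None if none was found."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded  # nothing to truncate, the bound is exact
    chars = list(truncate_min_utf8(value, max_bytes).decode("utf-8"))
    while chars:
        cp = ord(chars[-1]) + 1
        if cp == 0xD800:
            cp = 0xE000  # skip the surrogate range, which is not encodable
        if cp <= 0x10FFFF:
            candidate = "".join(chars[:-1] + [chr(cp)]).encode("utf-8")
            if len(candidate) <= max_bytes:
                return candidate
        chars.pop()  # cannot increment here, try a shorter prefix
    return None


lower = truncate_min_utf8("héllo", 2)  # cutting at 2 bytes would split "é"
upper = truncate_max_utf8("héllo", 3)
```

A writer that gets `None` for the upper bound would have to store the full value or omit the statistic.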

I've updated this PR to include is_max_value_exact/is_min_value_exact for specifying when the max_value/min_value are the actual max and min values.

@wgtmac @gszadovszky can you please take a look at the current changes?

@wgtmac @gszadovszky is anything more needed to merge this?

I just merged it. Thank you!

findepi · 2023-09-21T20:52:39Z

Thanks @raunaqmorarka, this lgtm

mapleFU

The rest LGTM


wgtmac

+1

wgtmac · 2023-10-14T14:19:46Z

cc @pitrou @emkornfield @shangxinli

mapleFU · 2023-11-23T10:56:56Z

So it seems that the PageHeader statistics (though if PageIndex is enabled, we might not write page statistics) might also have to write these two statistics?

Also, we don't have page-index level statistics here.

mapleFU · 2023-12-13T07:38:10Z

Also, it seems that this property only applies to ByteArray, right? (Maybe FLBA is also included.)

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L965-L979

Syntax like this is not covered by the exact flags, right (at least in the current arrow-rs impl)? @raunaqmorarka @tustvold

…-format 2.10 (#15412)

[PARQUET-2352](apache/parquet-format#216) added fields to the `Statistics` struct to indicate whether the min and max values were exact or had been truncated. This was somewhat ambiguous in the past. One reason to want to know this is to allow avoiding the decoding of pages (or column chunks) that contain a single value (if the min and max are the same value, and are known to be exact values, and there are no nulls, then the only valid value for the page will be that value). This PR adds these new fields, which will always be `true` in cuDF since cuDF does not support truncating min and max values in the statistics (but does support truncation in the page indexes).

Authors:
- Ed Seidl (https://github.com/etseidl)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Paul Mattione (https://github.com/pmattione-nvidia)
- Karthikeyan (https://github.com/karthikeyann)

URL: #15412
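
The single-value shortcut mentioned above could look roughly like this in Python. `PageStats` and `constant_value` are hypothetical names for this sketch and are not cuDF or Parquet library APIs; a missing flag is treated as "not known to be exact".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    # Hypothetical mirror of the relevant Statistics fields.
    min_value: Optional[bytes]
    max_value: Optional[bytes]
    null_count: Optional[int]
    is_min_value_exact: Optional[bool] = None
    is_max_value_exact: Optional[bool] = None


def constant_value(stats: PageStats) -> Optional[bytes]:
    """Return the single value of the page if the statistics prove there is one."""
    if stats.min_value is None or stats.max_value is None:
        return None
    if stats.is_min_value_exact is not True or stats.is_max_value_exact is not True:
        return None  # bounds may be truncated, so equality proves nothing
    if stats.null_count != 0:
        return None  # nulls present or null count unknown
    if stats.min_value != stats.max_value:
        return None
    return stats.min_value


exact = PageStats(b"US", b"US", 0, True, True)
unknown = PageStats(b"US", b"US", 0)          # flags absent
with_nulls = PageStats(b"US", b"US", 3, True, True)
```

When `constant_value` returns a value, a reader could fill the output with that value instead of decoding the page.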

raunaqmorarka mentioned this pull request Sep 21, 2023

Write min/max statistics with truncation for strings in parquet trinodb/trino#19038

Merged

raunaqmorarka force-pushed the PARQUET-2352 branch from db24acc to be4998e Compare September 21, 2023 04:16

wgtmac reviewed Sep 21, 2023


raunaqmorarka force-pushed the PARQUET-2352 branch from be4998e to 868c6da Compare September 30, 2023 04:16

raunaqmorarka requested review from wgtmac and gszadovszky September 30, 2023 04:19

mapleFU reviewed Oct 3, 2023


mapleFU approved these changes Oct 3, 2023


findepi mentioned this pull request Oct 6, 2023

Add aggregation pushdown support for count using Iceberg Metrics trinodb/trino#15832

Open

wgtmac requested changes Oct 8, 2023


raunaqmorarka force-pushed the PARQUET-2352 branch from 1595109 to ce68775 Compare October 8, 2023 17:46

raunaqmorarka requested a review from wgtmac October 8, 2023 17:46

wgtmac approved these changes Oct 9, 2023


gszadovszky approved these changes Oct 16, 2023


wgtmac merged commit 31f92c7 into apache:master Oct 18, 2023
3 checks passed

raunaqmorarka deleted the PARQUET-2352 branch October 18, 2023 09:41

tustvold mentioned this pull request Nov 5, 2023

Binary columns do not receive truncated statistics apache/arrow-rs#5037

Closed

etseidl mentioned this pull request Mar 28, 2024

Add fields to Parquet Statistics structure that were added in parquet-format 2.10 rapidsai/cudf#15412

Merged

3 tasks

mapleFU mentioned this pull request Aug 7, 2024

GH-43594: [C++] Remove std::optional from arrow::ArrayStatistics::is_{min,max}_exact apache/arrow#43595

Merged


wgtmac Sep 21, 2023

raunaqmorarka Sep 21, 2023

wgtmac Sep 22, 2023

mapleFU Sep 22, 2023

findepi Sep 22, 2023

raunaqmorarka Sep 30, 2023

raunaqmorarka Oct 5, 2023

raunaqmorarka Oct 18, 2023

wgtmac Oct 18, 2023

raunaqmorarka Oct 18, 2023
