New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

PARQUET-2473: Clarify records can not be split across v2 pages or PageIndex #244

Merged

wgtmac merged 2 commits into apache:master from alamb:alamb/clarify_reptition_levels

May 31, 2024

Contributor

alamb commented May 19, 2024 •

edited

Loading

This was sparked by a mailing list discussion: https://lists.apache.org/thread/rd8twnvg4bg3558r507rzpxckcxt5wdn

Several implementors of the Parquet spec have been confused by this point

Notes

There seems to be clear consensus that record boundaries can't span pages in V2 pages or if there is a page index, so let's make that clear in the spec to avoid future confusion

There also seemed to be consensus that Row Groups must start on record boundaries, and that the existing spec language was clear on this point, so I did not propose any changes to that language

https://issues.apache.org/jira/browse/PARQUET-2473

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-2473
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

This PR is only a clarification

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does


          PARQUET-2473: Clarify records can not be split across v2 pages or Pag…

5e506c6

…eIndex

alamb changed the title ~~PARQUET-2473: Clarify records can not be split across boundaries for "v2 data pages" or if there is a page index~~ PARQUET-2473: Clarify records can not be split across v2 pages or PageIndex

alamb commented

View reviewed changes

src/main/thrift/parquet.thrift

@@ @@ -625,7 +631,11 @@ struct DataPageHeaderV2 { @@
                 /** Number of NULL values, in this data page.
                     Number of non-null = num_values - num_nulls which is also the number of values in the data section **/
 : required i32 num_nulls
-                /** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/

Contributor Author

alamb May 19, 2024

I also propose expanding the r = 0 terminology to be explicitly repetition_level = 0 to match the conventions in the rest of the documentation

mapleFU reviewed

View reviewed changes

Member

mapleFU left a comment

LGTM, just a nit

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved

tustvold reviewed

View reviewed changes

src/main/thrift/parquet.thrift

+                 *
+                 * If a ColumnIndex is present, a page must begin at a record
+                 * boundary (repetition_level = 0). Otherwise, pages may begin
+                 * within a record (repetition_level > 0).

Contributor

tustvold May 21, 2024

I wonder if we could instead state that records should not be split across pages, but that readers should tolerate this in the abscence of a page index or v2 data pages, as certain writers historically wrote such data?

Contributor Author

alamb May 21, 2024

I agree with you that splitting pages will make the file less compatible, even though some writers historically did that.

However, I would like to ensure there is consensus on updating the spec to say that writers should do something before changing the spec

etseidl reviewed

View reviewed changes

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved


          Apply suggestions from code review

07d1dfa

Co-authored-by: Ed Seidl <[email protected]>

Contributor Author

alamb commented May 22, 2024

Updated to clarify that records can't repeat when an OffsetIndex is present

mapleFU approved these changes

View reviewed changes

pitrou reviewed

View reviewed changes

src/main/thrift/parquet.thrift

+                /**
+                 * Number of values, including NULLs, in this data page.
+                 *
+                 * If a OffsetIndex is present, a page must begin at a record

Member

pitrou May 28, 2024

Can we unify the row/record terminology? They seem to mean the same thing.

Contributor Author

alamb May 29, 2024

I agree row/record seem to mean the same thing

I double checked and it appears the rest of the file is inconsistent in the terminology as well

For example the repetition level documentation refers to records

parquet-format/src/main/thrift/parquet.thrift

Lines 180 to 192 in 8d59c7d

    
           /** 
        
            * Representation of Schemas 
        
            */ 
        
           enum FieldRepetitionType { 
        
             /** This field is required (can not be null) and each record has exactly 1 value. */ 
        
             REQUIRED = 0; 
        
             /** The field is optional (can be null) and each record has 0 or 1 values. */ 
        
             OPTIONAL = 1; 
        
             /** The field is repeated and can contain 0 or more values */ 
        
             REPEATED = 2; 
        
           }

but there are several fields that are named num_rows that clearly refer to rows.

parquet-format/src/main/thrift/parquet.thrift

Lines 623 to 631 in 8d59c7d

    
             /** Number of values, including NULLs, in this data page. **/ 
        
             1: required i32 num_values 
        
             /** Number of NULL values, in this data page. 
        
                 Number of non-null = num_values - num_nulls which is also the number of values in the data section **/ 
        
             2: required i32 num_nulls 
        
             /** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/ 
        
             3: required i32 num_rows 
        
             /** Encoding used for data in this page **/ 
        
             4: required Encoding encoding

In this PR I followed the term used in the repetition level documentation as I think it is the most relevant.

I will start a conversation on the mailing list about using consistent terminology

Contributor Author

alamb May 30, 2024

Mailing list conversation: https://lists.apache.org/thread/0gxjk4tphxls0gcrc7lt775pf8s7gtz3

Currently the consensus seems to be to use "row"

I will make a follow on PR to change that terminology after this PR is merged.

I would like to retain the "record" terminology here to remain consistent with the current wording in the repetition levels section

Contributor Author

alamb May 31, 2024

pitrou reviewed

View reviewed changes

src/main/thrift/parquet.thrift

+                /**
+                 * Number of rows in this data page. Every page must begin at a
+                 * record boundary (repetition_level = 0): records must **not** be
+                 * split across page boundaries when using V2 data pages.

Member

pitrou May 28, 2024

So a row is a record?

Contributor Author

alamb May 29, 2024

yes

Contributor Author

alamb commented May 31, 2024

I think this PR is now ready to go @wgtmac . Is there anything else we are waiting on?

wgtmac approved these changes

View reviewed changes

Member

wgtmac left a comment

Sorry for missing this. I'll merge it now.

wgtmac merged commit 38c108c into apache:master

3 checks passed

Contributor Author

alamb commented May 31, 2024

Sorry for missing this. I'll merge it now.

No worries -- thank you!

I'll file a JIRA shortly to improve the spec to use row terminology

alamb deleted the alamb/clarify_reptition_levels branch

May 31, 2024 15:38

Contributor Author

alamb commented May 31, 2024

Update: here is a PR to consistently use the "row" terminology: #256

etseidl added a commit to etseidl/parquet-format that referenced this pull request


          PARQUET-2473: Clarify records can not be split across v2 pages or Pag…

2635efe

…eIndex (apache#244)


Co-authored-by: Ed Seidl <[email protected]>

asfimport mentioned this pull request

Clarify parquet-format with respect to repeated fields across boundaries apache/parquet-site#67

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

pitrou pitrou left review comments

tustvold tustvold left review comments

etseidl etseidl left review comments

wgtmac wgtmac approved these changes

mapleFU mapleFU approved these changes

Labels

None yet