Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2480: Clarify what "page index" means in Parquet.thrift #245

Merged
merged 6 commits into from
May 22, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 19, 2024

https://issues.apache.org/jira/browse/PARQUET-2480

See the proposed update as rendered markdown: https://github.com/alamb/parquet-format/blob/alamb/page-index/PageIndex.md

I have always found it very confusing that people refer to the term parquet "page index", for example this message

However, the term "page index" is not used in the the parquet.thrift file itself, but only appears as the name of the file that describes the ColumnIndex and OffsetIndex, PageIndex.md

This means I can't search for "page index" in the spec and find out what people are talking about

Proposed Clarifications

  1. Update the introductory paragraph of PageIndex.md to clarify use the term "page index" and explain that it is encoded as ColumnIndex and OffsetIndex
  2. Update the description of ColumnIndex and OffsetIndex to include the term "page index" and clarify what those structures are used for.

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

This PR has no spec changes, only clarifications

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

PageIndex.md Outdated Show resolved Hide resolved
@mapleFU
Copy link
Member

mapleFU commented May 20, 2024

I also think that currently PageIndex means "offset index and column index". They're all page-level index

*/
struct SortingColumn {
/** The column index (in this row group) **/
/** The ordinal position of the column (in this row group) **/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this to avoid potential confusion with ColumnIndex which is a different structure

PageIndex.md Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor Author

alamb commented May 20, 2024

Thank you for the comments @mapleFU @tustvold and @wgtmac - I believe I have implemented your suggestions and I think the PR is much clearer because of it.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM

pages when scanning data in ordered and unordered columns.
In Parquet, a *page index* is optional metadata for a
ColumnChunk, containing statistics for DataPages that can be used
to skip those pages when scanning in ordered and unordered columns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With statistics enabled, only OffsetIndex is presented, with can be used to planning IO and get first_row_ordinal for ColumnChunk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry @mapleFU -- I don't quite follow this suggestion

Do you mean that some writers may write only OffsetIndex and that both structures may not be present?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ColumnIndex will not be created when there is no valid min/max statistics. In parquet-cpp, we can also disable creating ColumnIndex but enabling OffsetIndex.

Copy link
Member

@mapleFU mapleFU May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess OffsetIndex might be used separately ( But personally I don't think Column Index might be used like that, lol)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added

--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -1005,6 +1005,8 @@ struct PageLocation {
  * Optional offsets for each data page in a ColumnChunk.
  *
  * Forms part of the page index, along with ColumnIndex.
+ *
+ * OffsetIndex may be present even if ColumnIndex is not.
  */
 struct OffsetIndex {
   /**
@@ -1026,6 +1028,8 @@ struct OffsetIndex {
  *
  * Forms part the page index, along with OffsetIndex.
  *
+ * If this structure is present, OffsetIndex must also be present.
+ *
  * For each field in this structure, <field>[i] refers to the page at
  * OffsetIndex.page_locations[i]
  */

in 8d59c7d to try and clarify this

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@wgtmac
Copy link
Member

wgtmac commented May 21, 2024

@gszadovszky @pitrou @emkornfield @julienledem Would you like to take a look?

@wgtmac wgtmac merged commit be0478a into apache:master May 22, 2024
3 checks passed
@alamb alamb deleted the alamb/page-index branch May 28, 2024 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants