[FEA] Support V2 encodings in Parquet reader and writer #13501

Closed
GregoryKimball opened this issue Jun 2, 2023 · 2 comments
Labels
2 - In Progress Currently a work in progress cuIO cuIO issue feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

GregoryKimball commented Jun 2, 2023

The Parquet V1 format supports three types of page encodings: PLAIN, DICTIONARY, and RLE (run-length encoding) (reference from Spark Jira). The newer and still-evolving Parquet V2 specification adds several encodings, including DELTA_BINARY_PACKED for the INT32 and INT64 types, and DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY for the string logical type.

In the Parquet reader and writer, libcudf should support V2 metadata as well as the three variants of DELTA encoding.

| Feature | Status | Notes |
| --- | --- | --- |
| Add V2 reader support | #11778 | |
| Multi-warp decode of Dremel data streams | #13203 | |
| Use efficient strings column factory in decoder | #13302 | |
| Implement DELTA_BINARY_PACKED decoding | #13637 | see #12948 for reference |
| Implement DELTA_BYTE_ARRAY decoding | #14101 | see #12948 for reference |
| Add V2 writer support | #13751 | |
| Implement DELTA_BINARY_PACKED encoding | #14100 | |
| Add python bindings for V2 header and options | #14316 | |
| Implement DELTA_BYTE_ARRAY encoding | #15239 | some outdated reviews in #14938 |
| Implement DELTA_LENGTH_BYTE_ARRAY encoding and decoding for unsorted data | #14590 | |
| Add C++ API support for specifying encodings | #15081 | |
| Add cuDF-python API support for specifying encodings | #15613 | |
| Add BYTE_STREAM_SPLIT encoding and decoding | #15311 | see issue #15226 and parquet reference |

@GregoryKimball GregoryKimball added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function labels Jun 2, 2023
rapids-bot bot pushed a commit that referenced this issue Aug 15, 2023
Part of #13501. This adds the ability to write V2 page headers to the Parquet writer.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13751
rapids-bot bot pushed a commit that referenced this issue Aug 17, 2023
While working on #13707 it was noticed that RLE encoding of booleans had been implemented and then disabled (see [this comment](#13707 (comment)) for details). This PR re-enables RLE encoding for booleans, but only when V2 headers are being used.

Part of #13501.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #13886
rapids-bot bot pushed a commit that referenced this issue Aug 23, 2023
Part of #13501. This adds support for decoding Parquet pages that are DELTA_BINARY_PACKED.

In addition to adding delta support, this PR incorporates changes introduced in #13622, such as using a mask to determine which decoding kernels to run, and adding parameters to the `page_state_buffers_s` struct to reduce the amount of shared memory used.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #13637
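
For readers unfamiliar with DELTA_BINARY_PACKED, the idea can be sketched in plain Python. This is a simplified illustration, not the libcudf kernel and not byte-exact with the Parquet format: it skips the varint header and the actual bit packing, and the function names and the tiny `block_size` are assumptions chosen for readability. It stores the first value, then for each block the minimum delta and the deltas relative to that minimum, which is what lets slowly changing integers fit in very few bits.

```python
def delta_encode(values, block_size=4):
    """Return (first_value, blocks); each block is (min_delta, bit_width, adjusted)."""
    if not values:
        return None, []
    # Differences between consecutive values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    blocks = []
    for start in range(0, len(deltas), block_size):
        chunk = deltas[start:start + block_size]
        min_delta = min(chunk)
        # Subtracting the block minimum makes every stored value non-negative.
        adjusted = [d - min_delta for d in chunk]
        # Bits a real encoder would need per value in this block.
        bit_width = max(adjusted).bit_length()
        blocks.append((min_delta, bit_width, adjusted))
    return values[0], blocks


def delta_decode(first_value, blocks):
    """Invert delta_encode by accumulating the restored deltas."""
    if first_value is None:
        return []
    out = [first_value]
    for min_delta, _bit_width, adjusted in blocks:
        for a in adjusted:
            out.append(out[-1] + a + min_delta)
    return out


timestamps = [1000, 1010, 1020, 1031, 1040, 1050]
first, blocks = delta_encode(timestamps)
restored = delta_decode(first, blocks)
```

In this example the first block needs only 2 bits per value instead of a full 32 or 64, which is the kind of saving the encoding targets for sorted or slowly varying INT32 and INT64 columns.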

etseidl commented Sep 14, 2023

@GregoryKimball should there be an entry for adding python bindings for the V2 options?

rapids-bot bot pushed a commit that referenced this issue Oct 20, 2023
Part of #13501. Adds ability to fall back on DELTA_BINARY_PACKED encoding when V2 page headers are selected and dictionary encoding is not possible.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14100
rapids-bot bot pushed a commit that referenced this issue Nov 16, 2023
Part of #13501. Adds ability to decode DELTA_BYTE_ARRAY encoded pages.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #14101
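
As a rough illustration of what a DELTA_BYTE_ARRAY page represents, the sketch below shows the prefix/suffix idea in plain Python. It is a conceptual model only: in the Parquet format the prefix and suffix lengths are themselves DELTA_BINARY_PACKED and the suffixes are concatenated, whereas here they are kept as ordinary Python lists. The function names are hypothetical and do not come from libcudf.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def delta_byte_array_encode(strings):
    """Store each value as (bytes shared with the previous value, remaining suffix)."""
    prefix_lengths, suffixes = [], []
    previous = b""
    for s in strings:
        data = s.encode("utf-8")
        n = common_prefix_len(previous, data)
        prefix_lengths.append(n)
        suffixes.append(data[n:])
        previous = data
    return prefix_lengths, suffixes


def delta_byte_array_decode(prefix_lengths, suffixes):
    """Rebuild each value from the previous value's prefix plus its own suffix."""
    out, previous = [], b""
    for n, suffix in zip(prefix_lengths, suffixes):
        data = previous[:n] + suffix
        out.append(data.decode("utf-8"))
        previous = data
    return out


words = ["apple", "application", "apply", "banana"]
prefix_lengths, suffixes = delta_byte_array_encode(words)
decoded_words = delta_byte_array_decode(prefix_lengths, suffixes)
```

Each value depends on the one before it, so decoding is inherently sequential within a page; that dependency is presumably part of what makes a GPU decoder for this encoding non-trivial.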
rapids-bot bot pushed a commit that referenced this issue Dec 20, 2023
Part of #13501. This adds the ability to read and write Parquet pages with DELTA_LENGTH_BYTE_ARRAY encoding.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Michael Wang (https://github.com/isVoid)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14590
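
For comparison with DELTA_BYTE_ARRAY, a minimal sketch of the DELTA_LENGTH_BYTE_ARRAY layout follows. It assumes a simplified model in which the lengths are kept as a plain Python list (the Parquet format stores them as DELTA_BINARY_PACKED) followed by all string bytes concatenated; the helper names are hypothetical.

```python
def dlba_encode(strings):
    """Split strings into a list of byte lengths and one concatenated byte buffer."""
    encoded = [s.encode("utf-8") for s in strings]
    lengths = [len(b) for b in encoded]
    return lengths, b"".join(encoded)


def dlba_decode(lengths, data):
    """Walk the buffer, slicing out one string per recorded length."""
    out, offset = [], 0
    for n in lengths:
        out.append(data[offset:offset + n].decode("utf-8"))
        offset += n
    return out


lengths, data = dlba_encode(["Hello", "World", "Foobar", "ABCDEF"])
strings = dlba_decode(lengths, data)
```

Because no value shares bytes with its neighbor, this layout does not rely on the data being sorted, and the string offsets can be recovered with a prefix sum over the lengths.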
@GregoryKimball GregoryKimball changed the title [FEA] Support DELTA encoding in Parquet reader and writer [FEA] Support V2 encodings in Parquet reader and writer Mar 6, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 8, 2024
Re-submission of #14938. Final (delta) piece of #13501.

Adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY. Python testing will be added as a follow-on when per-column encoding selection is added to the python API (ref this [comment](#15081 (comment))).

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #15239
rapids-bot bot pushed a commit that referenced this issue Apr 24, 2024
Closes #15226. Part of #13501. Adds support for reading and writing `BYTE_STREAM_SPLIT` encoded Parquet data. Includes a "microkernel" version like those introduced by #15159.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15311
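
The `BYTE_STREAM_SPLIT` transform itself is small enough to sketch in plain Python. The example below is only an illustration of the byte layout, assuming little-endian 32-bit floats; the function names and the `fmt` parameter are hypothetical, and it says nothing about how the libcudf kernels are organized. Byte `k` of every value is gathered into stream `k`, and the streams are written back to back.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter byte k of each fixed-width value into stream k, then join the streams."""
    width = struct.calcsize(fmt)
    raw = [struct.pack(fmt, v) for v in values]
    return b"".join(bytes(r[k] for r in raw) for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Gather one byte from each stream to rebuild every value."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    return [
        struct.unpack(fmt, bytes(data[k * count + i] for k in range(width)))[0]
        for i in range(count)
    ]


floats = [1.0, 2.5, -0.75]  # exactly representable as 32-bit floats
split = byte_stream_split_encode(floats)
round_trip = byte_stream_split_decode(split)
```

The transform does not shrink the data on its own. It groups similar bytes (for example, the mostly identical exponent bytes) so that a general-purpose compressor applied afterwards tends to do better.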
rapids-bot bot pushed a commit that referenced this issue May 22, 2024
…PI (#15613)

Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses "Add cuDF-python API support for specifying encodings" task in #13501.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15613
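
The fully qualified naming described above can be illustrated with a small pure-Python sketch. The schema representation here is an assumption made for the example (a dict for a table or struct, a one-element list for `list<T>`), the helper names are hypothetical, and the dotted form for struct children is assumed rather than stated in the thread. The resulting mapping only shows the shape such per-column options might take; the actual cuDF-python keyword is not named in this thread.

```python
def _expand(path, dtype):
    """Return the leaf column paths reachable from one column."""
    if isinstance(dtype, list):
        # list<T>: Parquet's list layout adds 'list' and 'element' levels.
        return _expand(f"{path}.list.element", dtype[0])
    if isinstance(dtype, dict):
        # struct: recurse into the child fields (dotted naming is an assumption).
        return leaf_paths(dtype, prefix=f"{path}.")
    return [path]


def leaf_paths(schema, prefix=""):
    """Fully qualified Parquet-style names for every leaf column in the schema."""
    paths = []
    for name, dtype in schema.items():
        paths.extend(_expand(f"{prefix}{name}", dtype))
    return paths


# Integer column 'a' and list<int32> column 'b', as in the example above.
table_schema = {"a": "int64", "b": ["int32"]}
names = leaf_paths(table_schema)

# Hypothetical per-column mapping keyed by the qualified names.
encodings = {name: "DELTA_BINARY_PACKED" for name in names}
```

Keying options by leaf path rather than by top-level column name is what allows nested columns, such as the elements of a list, to be configured individually.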
GregoryKimball commented

Congratulations @etseidl! Everyone, please stay tuned for a technical blog on this topic! 😄
