
PARQUET-2369: Add new test file concatenated_gzip_members.parquet #41

Merged
2 commits merged into apache:master on Nov 15, 2023

Conversation

@amassalha (Contributor) commented Oct 18, 2023

For testing a Parquet file with multiple gzip members, following the discussion in PARQUET-2369.

@amassalha (Contributor, Author)

@wgtmac

@wgtmac (Member) left a comment

Thanks for adding this! I have left a minor comment with regard to the data type.

data/README.md Outdated
@@ -45,6 +45,7 @@
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
| concatenated_gzip_members.parquet | 513 int64 numbers compressed using 2 gzip members |
@wgtmac (Member):

Suggested change
| concatenated_gzip_members.parquet | 513 int64 numbers compressed using 2 gzip members |
| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 gzip members |

@amassalha (Contributor, Author) commented Oct 18, 2023

Why did you choose [U]INT64?
parquet-mr shows INT64:

          type      encodings count     avg size   nulls   min / max
long_col  INT64     G RB_     513       2.86 B             "1" / "513"

@wgtmac (Member):

The schema shows that its logical type is unsigned.

message root {
  optional int64 long_col (INTEGER(64,false));
}
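
For reference, a quick way to inspect both the physical and logical type is to read the schema with pyarrow (a minimal sketch, assuming pyarrow is installed and the file sits in the working directory):

import pyarrow.parquet as pq

# INT64 with the INTEGER(64,false) logical type surfaces as uint64 in Arrow.
schema = pq.read_schema("concatenated_gzip_members.parquet")
print(schema)  # long_col: uint64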

@amassalha (Contributor, Author):

yes, I missed the logical type. thanks

@wgtmac (Member) left a comment

LGTM +1.

cc @mapleFU @pitrou

@pitrou (Member) commented Oct 18, 2023

What does "2 gzip members" mean? The description is not really understandable as-is. You might expand the entry in the description table, or create a separate paragraph to explain it in detail.

@amassalha (Contributor, Author)

@pitrou more details added

@pitrou (Member) commented Oct 18, 2023

Ok, but are these "gzip members" concatenated in a single data page? Which Parquet writer produces such files?

@pitrou (Member) commented Oct 18, 2023

I'll note that the compression spec doesn't say anything about potentially concatenating multiple GZip streams inside a single data page.

Given that data pages are typically small, there does not seem to be any reason for not compressing an entire data page at once.

@rdblue What do you think?

@amassalha (Contributor, Author) commented Oct 18, 2023

I'll note that the compression spec doesn't say anything about potentially concatenating multiple GZip streams inside a single data page.

Given that data pages are typically small, there does not seem to be any reason for not compressing an entire data page at once.

@rdblue What do you think?

see this:
https://www.zlib.net/manual.html
" inflate() will not automatically decode concatenated gzip members. inflate() will return Z_STREAM_END at the end of the gzip member. The state would need to be reset to continue decoding a subsequent gzip member. This must be done if there is more data after a gzip member, in order for the decompression to be compliant with the gzip standard (RFC 1952)."

and in your link:
the implementation provided by the zlib compression library is authoritative.

It's not about Parquet data pages; it's about gzip-compressed data. Size doesn't matter: you can concatenate members of any size, and then split into Parquet pages if needed.
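
For illustration, a minimal sketch in Python of the reset-and-continue loop the zlib manual describes (the helper name gunzip_members is hypothetical; wbits=16 + MAX_WBITS selects the gzip format):

import zlib

def gunzip_members(buf: bytes) -> bytes:
    """Decompress a buffer holding one or more concatenated gzip members."""
    out = []
    while buf:
        # Each decompressor stops at its member's Z_STREAM_END.
        d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        out.append(d.decompress(buf))
        # Bytes past the member's end land in unused_data; starting a fresh
        # decompressor there is the "state reset" the manual requires.
        buf = d.unused_data
    return b"".join(out)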

@mapleFU (Member) commented Oct 18, 2023

Gang and I tried to parse the file using arrow-rs and parquet-mr (see apache/arrow#38272 (comment)). It seems parquet-mr can read a file like this, while arrow-rs reports a bad page size. But I'm not sure about:

  1. arrow-rs reports that the decompressed size is a bit larger than the size in the page header; I need to recheck the size in the file.
  2. Whether the standard allows a file like this.

But since parquet-mr is able to read it, maybe it's OK to support decompressing it?

@pitrou (Member) commented Oct 18, 2023

It's not about Parquet data pages; it's about gzip-compressed data. Size doesn't matter: you can concatenate members of any size, and then split into Parquet pages if needed.

Well, for now, you cannot. You are asking to complicate implementations to support this use case, so it's reasonable to ask: are there concrete reasons to support it?

@pitrou (Member) commented Oct 18, 2023

Gang and I tried to parse the file using arrow-rs and parquet-mr

How was the file produced?

@amassalha (Contributor, Author)

Ok, but are these "gzip members" concatenated in a single data page? Which Parquet writer produces such files?

To generate this file we use our own writer implementation (which fully adheres to the gzip RFC).

Gang and I tried to parse the file using arrow-rs and parquet-mr

How was the file produced?

It's produced using our internal high-performance gzip implementation, which fully adheres to the gzip RFC.

@amassalha (Contributor, Author)

concrete reasons

The gzip/zlib standard (plus we produce such files, and parquet-mr and spark-shell can read this file).

@pitrou (Member) commented Oct 18, 2023

It's produced using our internal high-performance gzip implementation, which fully adheres to the gzip RFC.

Is it deliberate that you produce several gzip fragments? This will decrease compression efficiency, btw (since compressed fragments are independent of each other).

@amassalha (Contributor, Author)

It's produced using our internal high-performance gzip implementation, which fully adheres to the gzip RFC.

Is it deliberate that you produce several gzip fragments? This will decrease compression efficiency, btw (since compressed fragments are independent of each other).

Yes, we understand that; it's part of our internal implementation.

@pitrou (Member) commented Oct 18, 2023

@tustvold What is your opinion here?

@tustvold commented Oct 18, 2023

So the parquet specification states

A codec based on the GZIP format (not the closely-related "zlib" or "deflate" formats) defined by RFC 1952. If any ambiguity arises when implementing this format, the implementation provided by the zlib compression library is authoritative.

The zlib decompression method states

Unlike the gunzip utility and gzread() (see below), inflate() will not automatically decode concatenated gzip members. inflate() will return Z_STREAM_END at the end of the gzip member. The state would need to be reset to continue decoding a subsequent gzip member.

Given the specification makes no explicit mention of members, and the zlib library explicitly does not support them "out of the box", I would be inclined to read the specification as disallowing multiple members. Certainly it would appear multiple implementations, arrow-cpp and arrow-rs, interpreted the specification this way and regardless of the original author's intent, we should be optimising for maximum compatibility.

I think if we want to alter the specification to support this, it should go through the usual process for a specification modification, including a clear articulation of the benefit to the wider ecosystem to incentivize adoption.

Edit: On further reading, the gzip specification would appear to suggest that the member abstraction exists to encode the notion of compressing multiple source files into a single gzip file, as such it seems very surprising that this would be used to compress multiple parts of a single parquet page...

@amassalha (Contributor, Author)

@tustvold Regarding your edit ("it seems very surprising that this would be used to compress multiple parts of a single parquet page..."): unless you consider parallel compression...

@pitrou (Member) commented Oct 18, 2023

@amassalha That's true, but given the average size of a Parquet data page, wouldn't it be more efficient to parallelize across data pages or columns?

@amassalha (Contributor, Author) commented Oct 18, 2023

@amassalha That's true, but given the average size of a Parquet data page, wouldn't it be more efficient to parallelize across data pages or columns?

I'm not sure I can explain in more detail (it's related to the startup's internal architecture), but we parallelize a lot of the compression work, even inside the same page (and across columns, of course). We can generate Parquet files really fast, so there are not enough pages available for page-level parallelism.

BTW, see this for parallel compression:
https://zlib.net/pigz/
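
For illustration, compressing chunks independently (e.g. on separate threads) and concatenating the results yields exactly such a multi-member stream; a minimal sketch in Python with made-up data and an arbitrary 2048-byte split:

import gzip

data = bytes(range(256)) * 16  # 4096 bytes of sample data

# Two independently compressed halves, concatenated: per RFC 1952 this
# is one valid gzip stream containing two members.
page = gzip.compress(data[:2048]) + gzip.compress(data[2048:])

# Python's gzip module, like the gunzip utility, decodes all members;
# a bare inflate() would stop after the first.
assert gzip.decompress(page) == data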

@pitrou (Member) commented Oct 18, 2023

Yes, I know that parallel GZip compression exists. The fun thing is that you need at least 5 threads for parallel GZip compression to be as fast as single-thread Zstd compression... so, unless you also use HW acceleration for GZip, it still seems a bit wasteful.

@tustvold commented Oct 18, 2023

I don't think anyone disputes that being able to introduce more parallelism would be nice for some workloads, the question is whether this is permissible according to the standard. Unfortunately in places where the standard is vague, in the interests of interoperability we normally have to take a lowest common denominator approach, which in this case would be to not support this without a new addition to the specification explicitly adding support for it.

@pitrou (Member) commented Oct 18, 2023

I would also stress that it's probably beneficial for your implementation to be compatible with previous versions of existing readers, @amassalha. So, unless it's really painful for you, you should probably avoid creating data pages with several concatenated GZip streams.

@amassalha (Contributor, Author) commented Oct 18, 2023

I don't think anyone disputes that being able to introduce more parallelism would be nice for some workloads, the question is whether this is permissible according to the standard. Unfortunately in places where the standard is vague, in the interests of interoperability we normally have to take a lowest common denominator approach, which in this case would be to not support this without a new addition to the specification explicitly adding support for it.

Why do you think it's "vague" when it says:
"Unlike the gunzip utility and gzread() (see below), inflate() will not automatically decode concatenated gzip members. inflate() will return Z_STREAM_END at the end of the gzip member. The state would need to be reset to continue decoding a subsequent gzip member."

I agree they should have said it more "loudly", not just in the documentation of inflateInit2, but they explicitly explain how to deal with this situation, and I think not supporting it amounts to not supporting RFC 1952.

And other readers (like spark-shell) support this.

@pitrou (Member) commented Oct 18, 2023

@amassalha I think you're just overinterpreting the zlib docs and the Parquet specs. Saying "the GZIP codec is based on the GZIP format" does not imply "the GZIP codec should transparently decompress multiple members into a single data page".

@amassalha (Contributor, Author) commented Oct 18, 2023

@amassalha I think you're just overinterpreting the zlib docs and the Parquet specs. Saying "the GZIP codec is based on the GZIP format" does not imply "the GZIP codec should transparently decompress multiple members into a single data page".

See this:
"2.2. File format

  A gzip file consists of a series of "**members**" (compressed data
  sets). The format of each member is specified in the following
  section. The members simply appear one after another in the file,
  with no additional information before, between, or after them."

It's from RFC 1952. It's explained very plainly, without any "overinterpreting". I'm not talking about Parquet pages here; the point is that for libarrow to be aligned with the standard, it should support reading those files...

@tustvold commented:

I think this discussion has gotten a bit circular, as a path forward how about:

  • We update the parquet specification to permit multiple members but explicitly call out that some readers may not support this, it is not recommended, and writers should provide a mechanism to disable this behaviour
  • We update arrow-cpp and arrow-rs to support reading multiple members

@wgtmac (Member) commented Oct 18, 2023

I think this discussion has gotten a bit circular, as a path forward how about:

  • We update the parquet specification to permit multiple members but explicitly call out that some readers may not support this, it is not recommended, and writers should provide a mechanism to disable this behaviour
  • We update arrow-cpp and arrow-rs to support reading multiple members

+1. It would be safe to change the parquet specs explicitly before adding support.

@pitrou (Member) commented Oct 18, 2023

  • We update the parquet specification to permit multiple members but explicitly call out that some readers may not support this, it is not recommended, and writers should provide a mechanism to disable this behaviour

+1, but I feel like there should be a discussion on the ML first (or on the parquet-format repo with interested parties).

@amassalha (Contributor, Author)

  1. How should this be forwarded to a parquet-format discussion? Should I open a pull request against this: https://github.com/apache/parquet-format/blob/master/Compression.md ?

  2. I still can't understand why you don't want to "fully" support this, while it is clearly part of RFC 1952, which Arrow is based on. How do you decide when to "deviate" from RFC 1952? We generate smaller Parquet files very fast to save memory footprint and storage bandwidth; in this case we do not have enough parallelism from the number of pages, so we need sub-page compression.

This scenario is supported by parquet-mr, by spark-shell, and by the gzip spec. Why not by Arrow?

@tustvold commented Oct 19, 2023

why you don't want to "fully" support this

Because there is a substantial install base that won't support it, and users should be alerted to this fact, and be given options other than forcing an upgrade. It takes a long time for updates to percolate through the ecosystem...

how should it be forwarded to parquet-format discussion

I've created apache/parquet-format#218

@tustvold commented:

As a further datapoint, duckdb has a somewhat peculiar behaviour when confronted with this file

>>> duckdb.query("select * from 'concatenated_gzip_members.parquet'")
┌────────────┐
│  long_col  │
│   uint64   │
├────────────┤
│          1 │
│          2 │
│          3 │
│          4 │
│          5 │
│          6 │
│          7 │
│          8 │
│          9 │
│         10 │
│          · │
│          · │
│          · │
│        504 │
│        505 │
│        506 │
│        507 │
│        508 │
│        509 │
│        510 │
│        511 │
│        512 │
│          0 │
├────────────┤
│  513 rows  │
│ (20 shown) │
└────────────┘

It reports the final row as 0, not 513

@amassalha (Contributor, Author)

As a further datapoint, duckdb has a somewhat peculiar behaviour when confronted with this file […] It reports the final row as 0, not 513

It's the same behaviour that libarrow (C++) had before my patch.

@mapleFU (Member) commented Oct 19, 2023

@tustvold I think this is the same as the arrow-cpp logic. I guess it's because the page size isn't checked while decompressing the page; as a result, the tail of the decoding buffer is not initialized.
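
The truncation is easy to reproduce outside Parquet; a minimal sketch with made-up sizes (4096 + 8 bytes standing in for the 513 values):

import gzip, zlib

data = b"x" * 4104
page = gzip.compress(data[:4096]) + gzip.compress(data[4096:])

# A single inflate pass stops at the first member's Z_STREAM_END,
# returning fewer bytes than the page header promises.
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
got = d.decompress(page)
print(len(got))            # 4096, not 4104
print(len(d.unused_data))  # the second member is left unconsumed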

@mapleFU (Member) commented Nov 14, 2023

@pitrou @wgtmac Can we push forward and merge this, now that apache/parquet-format#218 is merged?

@wgtmac (Member) commented Nov 14, 2023

It seems that rebase is required.

@amassalha force-pushed the atheel/add_gzip_members_test_file branch from 7891cc6 to fd9dbf9 on November 14, 2023 at 08:57
@amassalha (Contributor, Author)

It seems that rebase is required.

I just rebased it

@wgtmac (Member) left a comment

Thanks for rebasing.

@pitrou Do you have any comment?

data/README.md (outdated comment; resolved)
@amassalha (Contributor, Author)

@pitrou done

@pitrou (Member) left a comment

LGTM as well.

@pitrou changed the title from "Add new test file concatenated_gzip_members.parquet" to "PARQUET-2369: Add new test file concatenated_gzip_members.parquet" on Nov 15, 2023
@pitrou merged commit 89b685a into apache:master on Nov 15, 2023