
PARQUET-2369: Add new test file concatenated_gzip_members.parquet #41

Merged
2 commits merged into apache:master on Nov 15, 2023

Conversation

@amassalha (Contributor) commented Oct 18, 2023

For testing a Parquet file with multiple gzip members, following the discussion in PARQUET-2369.

@amassalha (Contributor, Author)

@wgtmac

@wgtmac (Member) left a comment

Thanks for adding this! I have left a minor comment with regard to the data type.

data/README.md Outdated
@@ -45,6 +45,7 @@
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
| concatenated_gzip_members.parquet | 513 int64 numbers compressed using 2 gzip members |
@wgtmac (Member):

Suggested change
| concatenated_gzip_members.parquet | 513 int64 numbers compressed using 2 gzip members |
| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 gzip members |

@amassalha (Contributor, Author) commented Oct 18, 2023

Why did you choose [U]INT64?
parquet-mr shows INT64:

          type      encodings count     avg size   nulls   min / max
long_col  INT64     G RB_     513       2.86 B             "1" / "513"

@wgtmac (Member):

The schema shows that its logical type is unsigned.

message root {
  optional int64 long_col (INTEGER(64,false));
}
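
For reference, a quick way to inspect both the physical and logical type is to read the schema with pyarrow (a minimal sketch, assuming pyarrow is installed and the file sits in the working directory):

import pyarrow.parquet as pq

# INT64 with the INTEGER(64,false) logical type surfaces as uint64 in Arrow.
schema = pq.read_schema("concatenated_gzip_members.parquet")
print(schema)  # long_col: uint64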

@amassalha (Contributor, Author):

yes, I missed the logical type. thanks

@wgtmac (Member) left a comment

LGTM +1.

cc @mapleFU @pitrou

@pitrou (Member) commented Oct 18, 2023

What does "2 gzip members" mean? The description is not really understandable as-is. You might expand the entry in the description table, or create a separate paragraph to explain it in detail.

@amassalha (Contributor, Author)

@pitrou more details added

@pitrou (Member) commented Oct 18, 2023

Ok, but are these "gzip members" concatenated in a single data page? Which Parquet writer produces such files?

@pitrou (Member) commented Oct 18, 2023

I'll note that the compression spec doesn't say anything about potentially concatenating multiple GZip streams inside a single data page.

Given that data pages are typically small, there does not seem to be any reason for not compressing an entire data page at once.

@rdblue What do you think?

@amassalha (Contributor, Author) commented Oct 18, 2023

I'll note that the compression spec doesn't say anything about potentially concatenating multiple GZip streams inside a single data page.

Given that data pages are typically small, there does not seem to be any reason for not compressing an entire data page at once.

@rdblue What do you think?

see this:
https://www.zlib.net/manual.html
" inflate() will not automatically decode concatenated gzip members. inflate() will return Z_STREAM_END at the end of the gzip member. The state would need to be reset to continue decoding a subsequent gzip member. This must be done if there is more data after a gzip member, in order for the decompression to be compliant with the gzip standard (RFC 1952)."

and in your link:
the implementation provided by the zlib compression library is authoritative.

It's not about Parquet data pages; it's about gzip-compressed data. Size doesn't matter: you can concatenate members of any size, and then split into Parquet pages if needed.
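
For illustration, a minimal sketch in Python of the reset-and-continue loop the zlib manual describes (the helper name gunzip_members is hypothetical; wbits=16 + MAX_WBITS selects the gzip format):

import zlib

def gunzip_members(buf: bytes) -> bytes:
    """Decompress a buffer holding one or more concatenated gzip members."""
    out = []
    while buf:
        # Each decompressor stops at its member's Z_STREAM_END.
        d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        out.append(d.decompress(buf))
        # Bytes past the member's end land in unused_data; starting a fresh
        # decompressor there is the "state reset" the manual requires.
        buf = d.unused_data
    return b"".join(out)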

@mapleFU (Member) commented Oct 18, 2023

Gang and I tried to parse the file using arrow-rs and parquet-mr (see apache/arrow#38272 (comment)). It seems parquet-mr can read a file like this, while arrow-rs reports a bad page size. But I'm not sure about:

  1. arrow-rs reports that the decompressed size is a bit larger than the size in the page header; I need to recheck the size in the file.
  2. Whether the standard allows a file like this.

But since parquet-mr is able to read it, maybe it's OK to support decompressing it?

@pitrou (Member) commented Oct 18, 2023

It's not about Parquet data pages; it's about gzip-compressed data. Size doesn't matter: you can concatenate members of any size, and then split into Parquet pages if needed.

Well, for now, you cannot. You are asking to complicate implementations to support this use case, so it's reasonable to ask: are there concrete reasons to support it?

@pitrou (Member) commented Oct 18, 2023

Gang and I tried to parse the file using arrow-rs and parquet-mr

How was the file produced?

@amassalha (Contributor, Author)

Ok, but are these "gzip members" concatenated in a single data page? Which Parquet writer produces such files?

To generate this file we use our own writer implementation (which fully adheres to the gzip RFC).

Gang and I tried to parse the file using arrow-rs and parquet-mr

How was the file produced?

It's produced using our internal high-performance gzip implementation, which fully adheres to the gzip RFC.

@amassalha (Contributor, Author)

concrete reasons

The gzip/zlib standard (plus we produce such files, and parquet-mr and spark-shell can read this file).

@pitrou (Member) commented Oct 18, 2023

It's produced using our internal high-performance gzip implementation, which fully adheres to the gzip RFC.

Is it deliberate that you produce several gzip fragments? This will decrease compression efficiency, btw (since compressed fragments are independent of each other).

@amassalha (Contributor, Author)

It's produced using our internal high-performance gzip implementation, which fully adheres to the gzip RFC.

Is it deliberate that you produce several gzip fragments? This will decrease compression efficiency, btw (since compressed fragments are independent of each other).

Yes, we understand that; it's part of our internal implementation.

@pitrou (Member) commented Oct 18, 2023

@tustvold What is your opinion here?

@tustvold commented Oct 18, 2023

So the parquet specification states

A codec based on the GZIP format (not the closely-related "zlib" or "deflate" formats) defined by RFC 1952. If any ambiguity arises when implementing this format, the implementation provided by the zlib compression library is authoritative.

The zlib decompression method states

Unlike the gunzip utility and gzread() (see below), inflate() will not automatically decode concatenated gzip members. inflate() will return Z_STREAM_END at the end of the gzip member. The state would need to be reset to continue decoding a subsequent gzip member.

Given the specification makes no explicit mention of members, and the zlib library explicitly does not support them "out of the box", I would be inclined to read the specification as disallowing multiple members. Certainly it would appear multiple implementations, arrow-cpp and arrow-rs, interpreted the specification this way and regardless of the original author's intent, we should be optimising for maximum compatibility.

I think if we want to alter the specification to support this, it should go through the usual process for a specification modification, including a clear articulation of the benefit to the wider ecosystem to incentivize adoption.

Edit: On further reading, the gzip specification would appear to suggest that the member abstraction exists to encode the notion of compressing multiple source files into a single gzip file, as such it seems very surprising that this would be used to compress multiple parts of a single parquet page...

@amassalha (Contributor, Author)

@tustvold Regarding your edit ("it seems very surprising that this would be used to compress multiple parts of a single parquet page..."): unless you consider parallel compression...

@pitrou (Member) commented Oct 18, 2023

@amassalha That's true, but given the average size of a Parquet data page, wouldn't it be more efficient to parallelize across data pages or columns?

@amassalha (Contributor, Author) commented Oct 18, 2023

@amassalha That's true, but given the average size of a Parquet data page, wouldn't it be more efficient to parallelize across data pages or columns?

I'm not sure I can explain in more detail (it's related to the startup's internal architecture), but we parallelize a lot of the compression work, even inside the same page (and across columns, of course). We can generate Parquet files really fast, so there are not enough pages available for page-level parallelism.

BTW, see this for parallel compression:
https://zlib.net/pigz/
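
For illustration, compressing chunks independently (e.g. on separate threads) and concatenating the results yields exactly such a multi-member stream; a minimal sketch in Python with made-up data and an arbitrary 2048-byte split:

import gzip

data = bytes(range(256)) * 16  # 4096 bytes of sample data

# Two independently compressed halves, concatenated: per RFC 1952 this
# is one valid gzip stream containing two members.
page = gzip.compress(data[:2048]) + gzip.compress(data[2048:])

# Python's gzip module, like the gunzip utility, decodes all members;
# a bare inflate() would stop after the first.
assert gzip.decompress(page) == data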

@pitrou (Member) commented Oct 18, 2023

Yes, I know that parallel GZip compression exists. The fun thing is that you need at least 5 threads for parallel GZip compression to be as fast as single-thread Zstd compression... so, unless you also use HW acceleration for GZip, it still seems a bit wasteful.

@tustvold commented Oct 18, 2023

I don't think anyone disputes that being able to introduce more parallelism would be nice for some workloads, the question is whether this is permissible according to the standard. Unfortunately in places where the standard is vague, in the interests of interoperability we normally have to take a lowest common denominator approach, which in this case would be to not support this without a new addition to the specification explicitly adding support for it.

@pitrou (Member) commented Oct 18, 2023

I would also stress that it's probably beneficial for your implementation to be compatible with previous versions of existing readers, @amassalha. So, unless it's really painful for you, you should probably avoid creating data pages with several concatenated GZip streams.

@amassalha (Contributor, Author) commented Oct 18, 2023

I don't think anyone disputes that being able to introduce more parallelism would be nice for some workloads, the question is whether this is permissible according to the standard. Unfortunately in places where the standard is vague, in the interests of interoperability we normally have to take a lowest common denominator approach, which in this case would be to not support this without a new addition to the specification explicitly adding support for it.

Why do you think it's "vague" when it says:
"Unlike the gunzip utility and gzread() (see below), inflate() will not automatically decode concatenated gzip members. inflate() will return Z_STREAM_END at the end of the gzip member. The state would need to be reset to continue decoding a subsequent gzip member."

I agree they should have said it more "loudly", not just in the documentation of inflateInit2, but they explicitly explain how to deal with this situation, and I think not supporting it amounts to not supporting RFC 1952.

And other readers (like spark-shell) support this.

@pitrou (Member) commented Oct 18, 2023

@amassalha I think you're just overinterpreting the zlib docs and the Parquet specs. Saying "the GZIP codec is based on the GZIP format" does not imply "the GZIP codec should transparently decompress multiple members into a single data page".

@amassalha (Contributor, Author) commented Oct 18, 2023

@amassalha I think you're just overinterpreting the zlib docs and the Parquet specs. Saying "the GZIP codec is based on the GZIP format" does not imply "the GZIP codec should transparently decompress multiple members into a single data page".

See this:
"2.2. File format

  A gzip file consists of a series of "**members**" (compressed data
  sets). The format of each member is specified in the following
  section. The members simply appear one after another in the file,
  with no additional information before, between, or after them."

It's from RFC 1952. It's explained very plainly, without any "overinterpreting". I'm not talking about Parquet pages here; the point is that for libarrow to be aligned with the standard, it should support reading those files...

@tustvold commented:

I think this discussion has gotten a bit circular, as a path forward how about:

  • We update the parquet specification to permit multiple members but explicitly call out that some readers may not support this, it is not recommended, and writers should provide a mechanism to disable this behaviour
  • We update arrow-cpp and arrow-rs to support reading multiple members

@wgtmac (Member) commented Oct 18, 2023

I think this discussion has gotten a bit circular, as a path forward how about:

  • We update the parquet specification to permit multiple members but explicitly call out that some readers may not support this, it is not recommended, and writers should provide a mechanism to disable this behaviour
  • We update arrow-cpp and arrow-rs to support reading multiple members

+1. It would be safe to change the parquet specs explicitly before adding support.

@pitrou (Member) commented Oct 18, 2023

  • We update the parquet specification to permit multiple members but explicitly call out that some readers may not support this, it is not recommended, and writers should provide a mechanism to disable this behaviour

+1, but I feel like there should be a discussion on the ML first (or on the parquet-format repo with interested parties).

@amassalha (Contributor, Author)

  1. How should this be forwarded to a parquet-format discussion? Should I open a pull request against this: https://github.com/apache/parquet-format/blob/master/Compression.md ?

  2. I still can't understand why you don't want to "fully" support this, while it is clearly part of RFC 1952, which Arrow is based on. How do you decide when to "deviate" from RFC 1952? We generate smaller Parquet files very fast to save memory footprint and storage bandwidth; in this case we do not have enough parallelism from the number of pages, so we need sub-page compression.

This scenario is supported by parquet-mr, by spark-shell, and by the gzip spec. Why not by Arrow?

@tustvold commented Oct 19, 2023

why you don't want to "fully" support this

Because there is a substantial install base that won't support it, and users should be alerted to this fact, and be given options other than forcing an upgrade. It takes a long time for updates to percolate through the ecosystem...

how should it be forwarded to parquet-format discussion

I've created apache/parquet-format#218

@tustvold commented:

As a further datapoint, duckdb has a somewhat peculiar behaviour when confronted with this file

>>> duckdb.query("select * from 'concatenated_gzip_members.parquet'")
┌────────────┐
│  long_col  │
│   uint64   │
├────────────┤
│          1 │
│          2 │
│          3 │
│          4 │
│          5 │
│          6 │
│          7 │
│          8 │
│          9 │
│         10 │
│          · │
│          · │
│          · │
│        504 │
│        505 │
│        506 │
│        507 │
│        508 │
│        509 │
│        510 │
│        511 │
│        512 │
│          0 │
├────────────┤
│  513 rows  │
│ (20 shown) │
└────────────┘

It reports the final row as 0, not 513

@amassalha (Contributor, Author)

As a further datapoint, duckdb has a somewhat peculiar behaviour when confronted with this file […] It reports the final row as 0, not 513

It's the same behaviour that libarrow (C++) had before my patch.

@mapleFU (Member) commented Oct 19, 2023

@tustvold I think this is the same as the arrow-cpp logic. I guess it's because the page size isn't checked while decompressing the page; as a result, the tail of the decoding buffer is not initialized.
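
The truncation is easy to reproduce outside Parquet; a minimal sketch with made-up sizes (4096 + 8 bytes standing in for the 513 values):

import gzip, zlib

data = b"x" * 4104
page = gzip.compress(data[:4096]) + gzip.compress(data[4096:])

# A single inflate pass stops at the first member's Z_STREAM_END,
# returning fewer bytes than the page header promises.
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
got = d.decompress(page)
print(len(got))            # 4096, not 4104
print(len(d.unused_data))  # the second member is left unconsumed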

@mapleFU (Member) commented Nov 14, 2023

@pitrou @wgtmac Can we push forward and merge this, now that apache/parquet-format#218 is merged?

@wgtmac (Member) commented Nov 14, 2023

It seems that rebase is required.

@amassalha force-pushed the atheel/add_gzip_members_test_file branch from 7891cc6 to fd9dbf9 on November 14, 2023 at 08:57
@amassalha (Contributor, Author)

It seems that rebase is required.

I just rebased it

@wgtmac (Member) left a comment

Thanks for rebasing.

@pitrou Do you have any comment?

data/README.md (outdated comment; resolved)
@amassalha (Contributor, Author)

@pitrou done

@pitrou (Member) left a comment

LGTM as well.

@pitrou changed the title from "Add new test file concatenated_gzip_members.parquet" to "PARQUET-2369: Add new test file concatenated_gzip_members.parquet" on Nov 15, 2023
@pitrou merged commit 89b685a into apache:master on Nov 15, 2023