Thoughts about a first-class GEOMETRY data type in Parquet? #222
Sorry I missed the last meeting! Got called in for jury duty. But this all sounds like amazing progress - to me the biggest goal of GeoParquet was to bring great geo support to Parquet, and if Parquet were to have top-level support for geometries then we could just 'dissolve' GeoParquet.
+1 - that seems to me the ideal thing: bring all the debate and choices we made into that PR so the core of Parquet works well for the most advanced geospatial use cases while being easy to use for anyone just dumping in some geo data. And +1 to all your suggestions, they sound good, though there's one I'm unsure about:
I am curious about this one. I would have guessed that the ideal parquet native geometry type would not be WKB, but something more columnar? Is the argument that using WKB in parquet will make it so readers are easier to write, and that it can use …
I fail to see how a non-geoaware Parquet writer could generate column statistics from a WKB-encoded geometry column without having access to a WKB decoder (at least a partial one, able to recognize where coordinate tuples are located so it can compute the bounding box), given there's no easily accessible geometry envelope in regular WKB (contrary to PostGIS enhanced WKB or the GeoPackage blob, which incorporate one in the header preceding the WKB). On the other hand, WKB can express complex geometries (curves, collections, polyhedral surfaces) that may be difficult to express using "native" array constructs (everything is possible, but that might involve multiple cascaded arrays pointing at each other).
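For illustration, here is a minimal sketch of what such a partial decoder has to do, covering only little-endian 2D points and linestrings (names and structure are mine; real WKB also has big-endian byte order, Z/M dimensions, polygons, and nested collections):

```python
import struct

WKB_POINT, WKB_LINESTRING = 1, 2

def wkb_bbox(buf: bytes):
    """Compute (xmin, ymin, xmax, ymax) by walking the coordinate payload."""
    assert buf[0] == 1, "sketch handles little-endian WKB only"
    (geom_type,) = struct.unpack_from("<I", buf, 1)
    if geom_type == WKB_POINT:
        coords = [struct.unpack_from("<2d", buf, 5)]
    elif geom_type == WKB_LINESTRING:
        (num_points,) = struct.unpack_from("<I", buf, 5)
        coords = [struct.unpack_from("<2d", buf, 9 + 16 * i)
                  for i in range(num_points)]
    else:
        raise NotImplementedError("only POINT/LINESTRING in this sketch")
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)
```

The point is simply that the writer has to walk the coordinate payload itself; unlike PostGIS enhanced WKB or the GeoPackage blob, there is no envelope to read off a header.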
I was thinking something more like the envelope would just be stored with it. Like instead of 'bbox' as a separate column it's just somehow embedded in the 'geometry' data structure. I don't know enough about Parquet internals to know if that's possible, and I'm just shooting in the dark - having a native columnar geometry structure in Parquet would make more sense to me.
A byte_array/blob column (e.g., simple storage but requires parsing) and a nested list are both complicated in different ways (e.g., not all Parquet readers support nested things). It seems like WKB is something that is easy to agree on as an initial encoding (there is not as much precedent for the columnar encodings in practice).
I think it might be possible to get a WKB reader into parquet-cpp that generates the bounding box statistics on write (I'm happy to take a stab at it); however, I am not sure we have access to a JSON parser at that point in the code, so its ability to be CRS-aware (e.g., deal with the antimeridian) might be limited. It may also be that this (plus other proposed changes in response to some other formats announced recently) is the push that parquet-cpp needs to inject some extensibility into the read/write process or reclaim its independence from Arrow C++.
I am not sure that there is any intention to ever support anything other than the geometry types POINT through GEOMETRYCOLLECTION.
https://github.com/OSGeo/gdal/blob/12dab86ca2d8b1a65c4c085e137c62294682ac1d/ogr/ogr_wkb.cpp#L590 could serve as a starting point as well
I think if Parquet adds the bbox info as its native statistics and we use WKB as the encoding, this could be a time saver for GeoParquet and might increase its adoption.
I also support adding S2/H3 to Parquet, but I'm not sure this will actually make it in.
If the WKB encoding is embedded at the same level as (e.g.) the compression, there is a chance that the reader (or a custom geo-specific one) can decode this directly into the columnar representation. Many types in parquet are encoded differently than their Arrow representation (e.g., using run-length encoding).
I had forgotten about column metadata, which is arbitrary key/value storage at the row group -> column level. I think that might be a better place for S2/H3 coverings. We'd have to expose it from Parquet C++ (I don't know if it's exposed from Java or Rust).
Sorry for chiming in late. From all the comments here and from apache/parquet-format#240, I think I need to clarify what parquet can and cannot do with a new geometry type.
Yes, that's a good suggestion. The current limitation of parquet is that each column can have only one type of column statistics. This means we can choose only one (via writer options) among bbox, S2, or H3, not all. I agree that …
In terms of column metadata, GeoParquet currently stores all of it in the key-value metadata in the parquet footer. I'm not sure whether it would be good to add key-value metadata to the geometry logical type to offload some of it there. This might look like below:
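(A rough illustration of the idea; the field names here are mine, not from the actual proposal:)

```python
# Illustrative sketch: a GEOMETRY logical type with a couple of fixed
# attributes plus an open key/value map into which per-column
# GeoParquet metadata could be offloaded.
geometry_logical_type = {
    "encoding": "WKB",
    "crs": "<PROJJSON string>",
    "key_value_metadata": {
        "edges": "planar",
        "orientation": "counterclockwise",
        "epoch": "2015.0",
    },
}
```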
I'd lean towards putting as much of the geoparquet column-level metadata as we can into the geometry logical type. I think it would help compatibility between iceberg geospatial support and good geospatial parquet support, as it sounds like the core iceberg committers are not into the idea of trying to write 'special' metadata for geo just for parquet. They prefer to just store it at the table format level, but the downside is that this would require readers to go 'through' iceberg for geo, and they would not be able to read the geo information from the lower level. I fully sympathize with their reasoning, but I think we as a larger community can do better. One of the main reasons is that they also need to write to ORC and Avro, which also don't support geospatial. @wgtmac - it'd be great to chat in realtime about your work and figure out how our GeoParquet community can help and support you. We have bi-weekly meetings on Mondays - the next one is June 3rd at 10am Pacific time - or I think we'd be happy to find another time.
I am actually not sure this is possible (at the file level), nor is it relevant for WKB (which can store XY geometries alongside XYZ geometries). The dimensions (and/or geometry types, and/or the unique combinations of both) are more like column statistics (i.e., which ones are present). It is, of course, useful to know at the type level whether the geometry type and/or dimensions are constant. The bbox could always be 8 numbers (min, max / x, y, z, m) and set to …
I agree here... column metadata (as I understand it) is duplicated for each row group and not all readers implement it (again, I could be mistaken here). To the extent that we want to keep file-level metadata around, I think we just want it to help readers that don't support the GEOMETRY logical type or that care deeply about the primary geometry column name.
I think this is a battle for another day 🙂
Yeah, that primary column attribute is the only thing that's maybe useful. And afaik it's really just a concern for geospatial readers that need to know which geometry to render if there are multiple columns. So it doesn't seem worth trying to get that metadata into Parquet, unless it fits nicely somewhere. It still has some value - it would be pretty funny if we ended up keeping the geoparquet spec as literally one file-level field just to communicate the primary column name, but it doesn't seem crazy.
Sounds good! But I'd like to mention that the parquet logical type is only suitable for storing static attributes like …
I can try my best to join but it seems too late in UTC+8.
Thanks for the suggestion! I thought the dimension should be constant across values in the same column.
On second thought, S2 and H3 coverings do not fit the model of statistics in parquet, which requires min and max pairs. We might need to extend the parquet ColumnChunk thrift object with a geo-specific stats field to support them. Again, thanks for all the feedback! I will modify the PR to adopt all the good suggestions.
On a historical note, since this question came up in the community meeting last week about why we didn't take the path of a first-class geometry type in the past: apart from some (potentially misguided) reasons about thinking the Parquet community might not be interested, the main reason was that it was the easier path for adoption and usage. Now that there is a community around GeoParquet, it makes sense to re-evaluate that path, and I find the idea of a native geometry type definitely interesting and exciting. But I do want to point out it is still a trade-off; there are still downsides to this as well. There are of course also very clear advantages, as have already been discussed above. In addition, a native type will also be a good motivation to actually push some of the capabilities into the lower-level parquet implementations, avoiding some headaches with having to add metadata after the fact.
Another +1 to that. I think ideally we would put everything from the geoparquet spec that applies to a single column and is not data-dependent (like bbox) into the logical type information. And I think as much as possible of this information can go into a single "metadata" field (which would make it easier to still evolve this somewhat, like adding a new key).
Update: I have simplified the proposal: apache/parquet-format#240. Please let me know what you think. Thanks!
For what it's worth, this is my main worry, but maybe the long-term gains are worth the near-term incompatibility hurdles?
If we are able to get implementations out relatively quickly I think we can probably avoid this (I'm certainly game to help with the C++ implementation). Some of the relevant database vendors might be pinning old versions of Parquet readers/Arrow implementations; however, I am not sure if they were the ones hopping on the bandwagon to support file-level metadata either.
From #222 (comment)
We could also turn this argument around: let's focus on finding consensus around a native encoding for mixed geometries, and then this type can simply use the existing numeric min/max statistics, without the need to add a new …
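To make that concrete, here's a sketch using pyarrow (the nested layout is GeoArrow-ish and purely illustrative, not a settled encoding) of how a list-of-coordinate-structs geometry column gets bounding-box information "for free" from the ordinary per-leaf statistics:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A GeoArrow-ish multipoint layout: one list of (x, y) structs per feature.
coord = pa.struct([("x", pa.float64()), ("y", pa.float64())])
table = pa.table({
    "geometry": pa.array(
        [[{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}],
         [{"x": -1.0, "y": 0.5}]],
        type=pa.list_(coord),
    )
})
pq.write_table(table, "points.parquet")

# The x and y leaves are plain double columns, so Parquet's ordinary
# min/max statistics on them already amount to a per-row-group bbox.
rg = pq.ParquetFile("points.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.statistics.min, col.statistics.max)
```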
I agree that we should find a good way to represent mixed geometries in a geoarrow-ish way (and use it in Parquet if it makes sense to!). I also think that the effort required to do that - and to spread an implementation of it into the wider geospatial world so that encoding/decoding is supported everywhere - is much greater than the effort required to implement a WKB parser in a few Parquet readers. (It is common for other types like integers or floating point numbers to have alternative encodings in the Parquet file that are parsed into other structures on the fly, too.)
@jorisvandenbossche @paleolimbot I agree that finding a good way to represent mixed geometries natively is a good idea. But I don't think this means we can get away without a WKB parser in the parquet implementations. The WKB-encoded geometry type has been widely used in many projects and would be straightforward for them to adopt. Also, we want to support Apache Iceberg, which encodes geometry using exactly WKB. Any parquet implementation could simply disable …
It looks like at least one contributor to the Parquet standard is interested in adding a first-class logical GEOMETRY type (apache/parquet-format#240)! There's quite a bit of conversation on the PR (@jiayuasu has also been reviewing), but it seems like the motivation is to make it easier to implement geospatial support in iceberg (https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI) whilst retaining the ability to represent GeoParquet. One excerpt from the early comments:
The advantage of first-class data type support in Parquet is mostly that we can embed geo-specific column statistics at the page/rowgroup level: currently we use a combination of the optional bounding box column (and its embedded statistics) and file-level metadata (e.g., bbox, geometry types) to encode data-dependent information readers can use to skip reading. The "geometry is just another column with just another data type" model is also possibly easier to reconcile with non-spatial Parquet readers/writers (that aren't designed to look for data-dependent statistics and/or data types in file-level metadata).
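For reference, the file-level metadata we rely on today looks roughly like this (abbreviated and illustrative; see the GeoParquet spec for the exact schema):

```python
# Abbreviated sketch of the "geo" key-value metadata a GeoParquet
# writer stores in the Parquet file footer today.
geo_file_metadata = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Point", "Polygon"],  # data-dependent
            "crs": None,  # a PROJJSON object; null means OGC:CRS84
            "bbox": [-180.0, -90.0, 180.0, 90.0],    # data-dependent
        }
    },
}
```

The data-dependent pieces (bbox, geometry_types) are exactly the ones that would move into native column statistics under the proposal.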
The disadvantage is, of course, that we've spent several years negotiating the language here and have thrown our collective capital (fiscal and social) into promoting this way of representing geospatial data in Parquet, including support in major data warehouses, data engines, data producers, mainstream geospatial tools, and the OGC.
I am slightly concerned that the proposal in the Parquet repo will end up dumbed down to an extent that it would not be usable for us; however, I think that if we have somewhat unified feedback on the PR we can probably avoid that. There are some Parquet-specific encoding details that the Parquet community is better suited to handling, but at a high level we might be able to achieve something like:
"crs"
,"edges"
,"orientation"
, and"epoch"
encoded as JSON as part of the logical type (preferably exactly as we've defined them here)"geometry_types"
and"bbox"
as column statistics (I think these would be available at the page and row group levels), or in the worst case at theColumnChunk
level (== column within a rowgroup, which has arbitrary key/value metadata as well).I'd also personally like to see S2 and/or H3 coverings as an optional column statistic because they more elegantly solve the antimeridian/polar bbox problem and because adding them later might be hard (but totally optional).
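To make the split concrete, a purely illustrative sketch of what would live where (the values follow the GeoParquet definitions; nothing here is an implemented API):

```python
# Static, schema-level: JSON parameters on the GEOMETRY logical type.
logical_type_json = {
    "crs": "<PROJJSON object>",
    "edges": "spherical",
    "orientation": "counterclockwise",
    "epoch": 2015.0,
}

# Data-dependent, per page / row group: geometry column statistics.
geometry_column_statistics = {
    "bbox": [-180.0, -90.0, 180.0, 90.0],       # xmin, ymin, xmax, ymax
    "geometry_types": ["Point", "MultiPoint"],  # types actually present
}
```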
We discussed this at the community meeting yesterday at length, but I think an issue thread is probably a better place to collect our thoughts on this!