
PARQUET-2471: Add geometry logical type #240

Open · wants to merge 22 commits into base: master
Changes from 21 commits
209 changes: 209 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -237,6 +237,135 @@ struct SizeStatistics {
3: optional list<i64> definition_level_histogram;
}

/**
* Physical type and encoding for the geometry type.
*/
enum GeometryEncoding {
Contributor:

Why is this an enum rather than being a requirement? This seems like we're being too generic to me.

Member:

I believe this is here to ensure we do not have to break compatibility if a newer or better encoding achieves the same ubiquity as WKB. The GeoParquet spec already has more than one encoding that we are experimenting with, so it is not unreasonable that there may be reasons to evolve in the future (even if there are no plans to do so right now).

Contributor:

I don't think that anything should be here unless it has a clear and necessary use right now. We can always add new enums later, and we can also add new optional fields in the logical type struct. I would cut this to keep the spec simple.

Member Author:

GeometryEncoding does not have to be set for now, as we have only one option. Perhaps we can simply add a comment to the GEOMETRY type to enforce WKB. Later, an optional GeometryEncoding enum can be added once a new encoding has been proposed officially.

Member:

I think this is OK as long as the next encoding doesn't use a ByteArray as storage (otherwise an older reader would have no way to error when it encounters the new encoding that it does not understand, if I'm understanding it correctly). I am not sure that is likely but such encodings do exist.

Member Author:

The next encoding can still use ByteArray. We can simply apply the following rules once WKB is not the only encoding:

  • If GeometryEncoding is set, use the set encoding to interpret the binary data.
  • Otherwise, GeometryEncoding.WKB should be the only option to interpret the binary data.

Member:

Readers not checking if a GeometryEncoding is present will then still read wrong data in that scenario.

To me, it feels safer to simply keep this enum.

Member Author:

To work around legacy readers, we would have to add a separate geometry type whenever a new encoding is introduced. From the discussion so far, it seems we'd better keep the GeometryEncoding enum.

/**
* Allowed for physical type: BYTE_ARRAY.
*
* Well-known binary (WKB) representations of geometries.
*
* To be clear, we follow the same rules for WKB and coordinate axis order
* as GeoParquet [1][2]. Geometries SHOULD be encoded as ISO WKB [3][4],
* supporting XY, XYZ, XYM, XYZM and the standard geometry types
* (Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon,
* and GeometryCollection). Coordinate order is always (x, y) where x is
* easting or longitude and y is northing or latitude. This ordering explicitly
* overrides the axis order specified in the CRS, following the GeoPackage
* specification [5].
*
* This is the preferred encoding for maximum portability. It also allows
* GeometryStatistics to be set in the column chunk and page index.
*
* [1] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
* [2] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
* [3] https://portal.ogc.org/files/?artifact_id=18241
* [4] https://www.iso.org/standard/60343.html
* [5] https://www.geopackage.org/spec130/#gpb_spec
*/
WKB = 0;
}
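As an aside for readers unfamiliar with the ISO WKB layout referenced above: the geometry type code and dimensionality can be decoded from the first five bytes of a value. A minimal sketch (illustrative only; `parse_wkb_header` is my own name, not part of the spec):

```python
import struct

def parse_wkb_header(buf: bytes):
    """Decode the byte order and ISO geometry type code from a WKB value.

    Returns (geometry_type, dimensions) where geometry_type is the base
    code (1=Point .. 7=GeometryCollection) and dimensions is one of
    "XY", "XYZ", "XYM", "XYZM" per the ISO WKB convention.
    """
    # Byte 0: 0 = big-endian, 1 = little-endian
    endian = "<" if buf[0] == 1 else ">"
    (code,) = struct.unpack_from(endian + "I", buf, 1)
    # ISO WKB adds 1000/2000/3000 for the Z, M, and ZM variants
    dims = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}[code // 1000]
    return code % 1000, dims

# Little-endian XY Point (code 1): header only, coordinates omitted
header = bytes([1]) + struct.pack("<I", 1)
print(parse_wkb_header(header))  # (1, 'XY')
```

The 1000/2000/3000 offsets here are the same convention used by the geometry type code table later in this diff.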

/**
* Interpretation for edges of elements of a GEOMETRY logical type. In other
* words, whether a point between two vertices should be interpolated in
* its XY dimensions as if it were a Cartesian line connecting the two
* vertices (planar) or the shortest spherical arc between the longitude


Suggested change:
- * vertices (planar) or the shortest spherical arc between the longitude
+ * vertices (planar) or the shortest geodesic arc between the longitude

Member:

Whatever wording we choose here must make clear that the interpolation assumes a sphere (not an ellipsoid). Even though an ellipsoid is a better approximation of the surface, there are no geometry engines we know about that are interested in storing Parquet files that define edges in this way. Notably, BigQuery geography and Snowflake geography types define edges in this way.

* and latitude represented by the two vertices (spherical). This value


Suggested change:
- * and latitude represented by the two vertices (spherical). This value
+ * and latitude represented by the two vertices (spherical or spheroidal). This value

Member:

Again, we very specifically are approximating this interpolation using a sphere (perhaps there is more clear language to clarify that!)

* applies to all non-point geometry objects and is independent of the
* coordinate reference system.
*
* Because most systems currently assume planar edges and do not support
* spherical edges, planar should be used as the default value.


Suggested change:
- * spherical edges, planar should be used as the default value.
+ * spherical or spheroidal edges, planar should be used as the default value.

*/
enum EdgeInterpolation {
PLANAR = 0;
SPHERICAL = 1;
Comment:

What does this really mean? That we only allow either Cartesian geometries, or geometries embedded on a sphere? How would one indicate geometries embedded on a spheroid (like WGS84)?

Member:

@mkaravel

If one has a column for cartesian geometries, use Edges.Planar. If one has a column for a geometry on sphere (aka Geography type), use Edges.Spherical.

Comment:

What about geodesic edges? Do we disallow them? How would you represent a geometry in WGS84?

Comment:

Isn't this duplicating the information provided by the CRS?

Member:

What about geodesic edges?

If producing and/or storing data with geodesic edges becomes a widespread use case, another enum value could be added. Until that is the case, spherical edges tessellated with enough vertices to keep the error at an acceptable level are almost certainly sufficient for any practical use case.

Isn't this duplicating the information provided by the CRS?

The coordinate system (i.e., context for individual coordinate tuples) and how the dataset author intended a line to be drawn between any two coordinates are orthogonal concepts: GIS systems, for better or worse, frequently are used to "simplify" datasets and can do things like model the very long segment of the US--Canada border along the 49th parallel as a single line segment. Because there is quite a lot of data that was authored in this way (and few ways to produce data with spherical edges, even though this makes more sense in a global context), we need a way to express that intent here.

Comment:

The coordinate system (i.e., context for individual coordinate tuples) and how the dataset author intended a line to be drawn between any two coordinates are orthogonal concepts

Then the intent is to express the interpolation method between points? Should the enumeration have "interpolation" or something similar in its name to make that intent clearer?

Member:

I think the word interpolation would be great to include in the documentation! EdgeInterpolation is a bit of a mouthful and I prefer Edges (but with no strong feelings either way 🙂 )

Member Author:

I also prefer adding documentation about interpolation to Edges while keeping the name simple.

Comment:

Suggested change:
- SPHERICAL = 1;
+ SPHERICAL = 1;
+ SPHEROIDAL = 2;

Comment:

To be honest I am not sure how this enum is useful. As @desruisseaux mentioned, this information can be deciphered from the CRS. Why do we have it?

Member:

There are several threads here that discuss this, but briefly, we need to ensure that the intent of mainstream producers that (1) have a geometry type and (2) export Parquet files can be expressed. BigQuery and Snowflake geography types approximate connecting two vertices using an arc assuming a sphere; all other geometry types connect vertices with a straight line regardless of the CRS.

Member Author:

My understanding is that PLANAR and SPHERICAL are already adopted by engines (e.g. BigQuery and Snowflake mentioned above) and standards (e.g. GeoParquet) which support Parquet file format. It is clear that we should support PLANAR and SPHERICAL at the moment. Since it may take some time to discuss topics on SPHEROIDAL, perhaps we can add it as a follow-up work item?

Comment:

I think that the confusion may come from the names of the enumeration values. If I'm understanding right, the intent of PLANAR is "shortest path between two points on a plane". Problems:

  • It is not the only way to connect two points on a plane. We may also want Bézier curves, splines, etc.
  • PLANAR is a bit restrictive as it seems to exclude 3D space.

What about renaming PLANAR to STRAIGHT_LINE? It leaves room for other interpolation methods on the plane in the future if desired. It also adds real value compared to just saying that the coordinate system is Cartesian, which can already be inferred from the CRS.

Likewise, SPHERICAL and SPHEROIDAL could be renamed GEODESIC_LINE. It leaves room for different interpolation methods in the future, e.g. with smooth changes of heading instead of the shortest path. This is equivalent, in the plane, to spline curves featuring smooth changes of derivatives instead of sudden changes of direction at every point.

I saw the discussion about SPHERICAL meaning the use of spherical formulas even if the CRS is ellipsoidal. But I'm not sure how it would work in practice. Which sphere radius should the computation use? The authalic sphere? Something else?

}
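For illustration, the planar-versus-spherical distinction this enum encodes changes where points between two vertices fall. A sketch comparing midpoints under both interpretations (helper names are mine; the spherical case assumes a unit sphere, not an ellipsoid):

```python
import math

def midpoint_planar(p, q):
    """Midpoint of a planar (Cartesian) edge between two (lon, lat) points."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

def midpoint_spherical(p, q):
    """Midpoint of the great-circle arc between two (lon, lat) points in
    degrees, computed on a unit sphere (not an ellipsoid)."""
    def to_xyz(lon, lat):
        lon, lat = math.radians(lon), math.radians(lat)
        return (math.cos(lat) * math.cos(lon),
                math.cos(lat) * math.sin(lon),
                math.sin(lat))
    ax, ay, az = to_xyz(*p)
    bx, by, bz = to_xyz(*q)
    # Chord midpoint, projected back onto the sphere by normalizing.
    mx, my, mz = ax + bx, ay + by, az + bz
    n = math.hypot(mx, my, mz)
    return (math.degrees(math.atan2(my, mx)),
            math.degrees(math.asin(mz / n)))

# The "49th parallel" example from the discussion: planar edges keep the
# latitude constant, spherical edges bulge poleward between the vertices.
p, q = (-123.0, 49.0), (-95.0, 49.0)
print(midpoint_planar(p, q))     # (-109.0, 49.0)
print(midpoint_spherical(p, q))  # lon -109.0, lat slightly above 49
```

The same two vertices thus describe different curves under the two enum values, which is why the interpretation must travel with the data.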

/**
* A custom binary-encoded polygon or multi-polygon to represent a covering of
* geometries. For example, it may be a bounding box or an envelope of geometries
Comment on lines +288 to +289


I am not sure I fully understand this statement.

Let's say we have a table with a single row, and all it contains is LINESTRING(5 10,15 10) in WGS84. What would be the bounding box for this? What would be the polygon?

If we think of the bounding box as a Cartesian product of longitude x latitude values, then the box should be (approximately) [5,15] x [10,10.03769]. Now if you represent it as a polygon this would be something like (assuming CW orientation) POLYGON((5 10,15 10,15 10.03769,5 10.03769)). Is this to be understood as a Cartesian polygon or as a polygon in WGS84? If the latter, it refers to something completely different. If the intent is to represent a polygon in WGS84, the only thing that comes to mind is POLYGON((5 10,15 10,5 10)), which I am not sure is a valid representation of a polygon. The other option would be POLYGON((5 10,15 10,x 10.03769,5 10)) where x is the longitude value for which the latitude on the single segment of the linestring is 10.03769.

I understand that the keyword here is "custom", which is what leaves a lot of room for whatever one would like to implement. Maybe the only thing to change here is the phrasing. My main point is that a bounding box and a polygon are not the same thing in geographic coordinate systems.

Member:

Great point that there is room for improvement on the phrasing here!

I believe the intent is that this can be any polygon that completely covers the values it is representing, such that st_intersects(arbitrary_geometry, covering) is guaranteed to be true whenever st_intersects(arbitrary_geometry, value) is true. One easy way to generate this is to take the bounding box (as defined here) and return its vertices as a polygon. Your example is a horizontal line (in Cartesian space, which it would be if EdgeInterpolation were set to PLANAR), and so this would be a degenerate polygon (but could still be defined). For spherical edges, one could compute a discrete global grid covering (e.g., S2 or H3) and convert the boundary of that to a polygon.
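That intersection guarantee is what makes pruning with a covering sound. A toy sketch with axis-aligned boxes standing in for real geometry predicates (helper names are mine, not part of the spec):

```python
def boxes_intersect(a, b):
    """a and b are (xmin, ymin, xmax, ymax) rectangles."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def row_group_may_match(covering_box, query_box):
    # Sound pruning: skip a row group only when its covering cannot
    # intersect the query region. A false "maybe" costs a little I/O;
    # a false "no" would silently drop matching rows.
    return boxes_intersect(covering_box, query_box)

row_groups = {"rg0": (0, 0, 10, 10), "rg1": (20, 20, 30, 30)}
query = (5, 5, 12, 12)
survivors = [rg for rg, cov in row_groups.items()
             if row_group_may_match(cov, query)]
print(survivors)  # ['rg0']
```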

* when a bounding box cannot be built (e.g. a geometry has spherical edges, or if
* an edge of geographic coordinates crosses the antimeridian). It may be
* extended in future versions to provide vendor-agnostic coverings like
* vectors of cells on a discrete global grid (e.g., S2 or H3 cells).
*/
struct Covering {
/**
* A type of covering. Currently accepted values: "WKB".
*/
1: required string kind;
Comment:

Is 'kind' a typical thing in Parquet? If not I'd call this 'type'; 'kind' sounds weird to me, but I'm not sure if 'type' has a very specific meaning for Parquet. This is totally just a stylistic thing, so feel free to ignore. But note in the description we say 'a type of covering', not 'the kind of covering'. 'Kind' seems fuzzier, as if it's WKB or close to it, while 'type' seems to indicate exactly one to me.

Member Author:

I think there was a discussion on it but I cannot find it. I'm happy to switch kind to type if it looks better. @jiayuasu @paleolimbot WDYT?

Member:

I am fine with that but note we also need to change crs kind to be consistent.

Contributor:

Is this necessary? It is not in the Iceberg version so I'm wondering if it is something that is required for the initial release.

Member:

Uh, no, please don't. Parquet already has types (physical and logical). There is a reason for the use of another term here.

Member Author:

Is this necessary? It is not in the Iceberg version so I'm wondering if it is something that is required for the initial release.

Yes, it is not used by Iceberg. Covering was discussed in #240 (comment) to make it flexible enough to add more kinds of geospatial indexes without changing the Parquet spec in the future.

Parquet already has types (physical and logical). There is a reason for the use of another term here.

Sorry that I cannot find the original discussion. IIRC, Covering/type was proposed in the beginning and then switched to Covering/kind for similar reasons. So I'll keep using kind unless there is a better idea.

Member:

@rdblue For the WKB geometry encoding, we might introduce a vectorized encoding for geometry in the future and allow both WKB and the vectorized encoding to co-exist. So we want to leave the door open.

Member:

@jiayuasu I think your last comment was meant for the thread at #240 (comment)?

Comment:

So I'll keep using kind unless there is a better idea.

Cool - sorry for the noise, I missed the original discussion that moved it to kind.

Comment:

I am fine with that but note we also need to change crs kind to be consistent.

If this is referring to the encoding (WKT versus JSON, etc.), I really think that it needs to be renamed "CRS encoding". "CRS kind" means something else (geographic versus projected versus engineering versus parametric, etc.).

/**
* A payload specific to kind. Below are the supported values:
* - WKB: well-known binary of a POLYGON or MULTIPOLYGON that completely
* covers the contents. This will be interpreted according to the same CRS
* and edges defined by the logical type.
Comment on lines +303 to +304


Okay, this answers one of my questions above, whether the polygon or multipolygon is in the CRS as the input geometry. See my other comment. I do not think it is as simple as it looks.

*/
2: required binary value;
}

/**
* Bounding box of geometries, represented as min/max value pairs of
* coordinates from each axis. Values of Z and M are omitted for 2D geometries.
* Filter pushdown on geometries using this is only safe for planar spatial
* filters.
*/
struct BoundingBox {
Comment:

For spherical edges, are there any restrictions as to what is considered a valid bounding box?

I would even ask the same for Cartesian edges, although it is less ambiguous here. For example, what happens if xmax is smaller than xmin? Is it up to the reader to decide what to do?

Member:

@mkaravel

The bounding box statistics will always be calculated by taking the min/max of the X and Y values. For Cartesian geometries this is not a problem at all, because there is no antimeridian or orientation issue in this case.

For spherical geometries this is a problem, which is why the spec says it is not safe to use the bounding box for spherical geometries.

In this proposal, (1) there is another type of statistics called covering that attempts to solve this issue, and (2) there is another string metadata field which can be used to indicate the orientation (clockwise / counterclockwise), as in the GeoParquet metadata: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46

However, in this proposal the main focus is geometry, not geography. The geography problem (aka spherical geometries) will be the focus of the next phase.

Comment:

The bounding box statistics will always be calculated by taking the min/max of the X and Y values.

For geometries on a sphere or a spheroid this may no longer produce a bounding box if you restrict to just the vertices of the geometries.

Comment:

However, in this proposal the main focus is geometry, not geography. The geography problem (aka spherical geometries) will be the focus of the next phase.

Is GEOGRAPHY going to be a different logical type? Maybe this is not such a bad idea, in which case the current proposal may want to focus just on the Cartesian space and nothing else.

Comment:

While PostGIS uses a distinct type for geography, this is not necessarily a good idea for other formats to reproduce (I think it was due to historical constraints). Any geometry associated with a geographic CRS may be considered "geography". And as mkaravel said, naive computation of min/max values will not work in the latter case.

I suggest removing the sentence saying "it is recommended that the writer always generates bounding box statistics, regardless of whether the geometries are planar or spherical". I would even avoid saying that the values are min/max, or say that this is the case in a Cartesian coordinate system but not necessarily in other coordinate systems.

In particular I suggest keeping open the possibility that "min" > "max". This is the only approach I know of that works with the antimeridian. Bounding boxes in the EPSG database are already expressed that way. Some standards such as GGXF also mandate it. It makes calculation of union and intersection more complicated, but libraries only need to define union and intersection methods in one place and use them consistently.

I'm not saying that the specification should support "min" > "max" now, just suggesting to keep this door open.

Comment:

BigQuery (that only supports spherical geographies) supports ST_IntersectsBox(geo, lng1, lat1, lng2, lat2) function, and uses it for partition pruning. The conventions match ones described above: lng1 is assumed to be westmost point, and lng2 to be eastmost point, rather than lng1 being less than lng2. Thus it allows describing boxes that cross anti-meridian, e.g. ST_IntersectsBox(ST_GeogPoint(180, 0), 170, -10, -170, 10) can also be written as ST_IntersectsBox(ST_GeogPoint(180, 0), 170, -10, 190, 10) and returns TRUE.

I also blogged about our semantics: https://medium.com/geospatial-analytics/squaring-the-sphere-974d1e4a875e
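The wrapped west-east semantics described above imply a membership test like the following sketch (my own helper, not spec text; degrees assumed):

```python
def lon_in_range(lon, west, east):
    """True if longitude lon lies in the west-to-east interval (degrees).

    Per the convention above, west > east denotes an interval crossing
    the antimeridian: (170, -170) covers 180 but not 0.
    """
    if west <= east:
        return west <= lon <= east
    return lon >= west or lon <= east

print(lon_in_range(180, 170, -170))  # True
print(lon_in_range(0, 170, -170))    # False
```

This mirrors the ST_IntersectsBox example: the point at longitude 180 falls inside the (170, -170) box even though 170 > -170 numerically.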

Comment:

As for covering, we discussed using S2 cell list for covering, and there is a proposal how to do partitioning transform using S2: https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit?usp=sharing but this is probably longer-term solution.

Member:

Both of these types of coverings could be encoded as a WKB Covering (e.g., polygon boundary of an S2CellUnion or S2LatLngRect) with the current wording of the specification. Using min/max values with wrapping semantics or a vector of int64s is more compact, but for the initial version perhaps this would suffice?

Perhaps "leaving the door open" in this case could mean saying that implementations MUST omit min/max statistics for spherical edges (which would let future implementations write wrapped bounding boxes for spherical edges).

Comment:

Another consideration here is that a Covering is more computationally expensive to compute than a bounding box. I can imagine that in some write-performance-sensitive scenarios, simpler bbox stats would be the preferred approach. Are there any concerns with adopting @mentin's west-east convention for spherical edges?

@mentin thanks for sharing the S2 transform proposal - will have a look.

Member:

A west-east convention for spherical edges works for me (provided that it is made very clear, so that Cartesian bounding boxes are not written for spherical-edge data by accident!). I also don't think there's much overhead to the WKB covering (basically 12 memcpys to convert a bounding box to a WKB polygon).
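The "12 memcpys" remark can be made concrete: converting a bounding box into a WKB polygon is a fixed-layout serialization. A sketch assuming the standard little-endian ISO WKB Polygon layout (the helper name is mine):

```python
import struct

def bbox_to_wkb_polygon(xmin, ymin, xmax, ymax):
    """Encode a bounding box as a little-endian ISO WKB Polygon.

    Layout: byte order (1 byte) | type=3 | ring count=1 | point count=5 |
    five (x, y) doubles tracing the closed ring.
    """
    ring = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    out = struct.pack("<BIII", 1, 3, 1, len(ring))
    for x, y in ring:
        out += struct.pack("<dd", x, y)
    return out

wkb = bbox_to_wkb_polygon(0.0, 0.0, 10.0, 10.0)
print(len(wkb))  # 1 + 4 + 4 + 4 + 5 * 16 = 93 bytes
```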

1: required double xmin;
2: required double xmax;
3: required double ymin;
4: required double ymax;
5: optional double zmin;
6: optional double zmax;
7: optional double mmin;
8: optional double mmax;
}

/** Statistics specific to GEOMETRY logical type */
struct GeometryStatistics {
/** A bounding box of geometries */
1: optional BoundingBox bbox;

/**
* A list of coverings of geometries.
* Note that it is allowed to have more than one covering of the same kind,
* and implementations are free to use any of them. It is recommended to have
* at most one covering of each kind.
*/
2: optional list<Covering> coverings;

/**
* The geometry types of all geometries, or an empty array if they are not
* known. This is borrowed from the `geometry_types` column metadata of GeoParquet [1],
* except that values in the list are WKB (ISO variant) integer codes [2]. The table
* below shows the most common geometry types and their codes:
*
* | Type | XY | XYZ | XYM | XYZM |
* | :----------------- | :--- | :--- | :--- | :--: |
* | Point | 0001 | 1001 | 2001 | 3001 |
* | LineString | 0002 | 1002 | 2002 | 3002 |
* | Polygon | 0003 | 1003 | 2003 | 3003 |
* | MultiPoint | 0004 | 1004 | 2004 | 3004 |
* | MultiLineString | 0005 | 1005 | 2005 | 3005 |
* | MultiPolygon | 0006 | 1006 | 2006 | 3006 |
* | GeometryCollection | 0007 | 1007 | 2007 | 3007 |
*
* In addition, the following rules are used:
* - A list of multiple values indicates that multiple geometry types are
* present (e.g. `[0003, 0006]`).
* - An empty array explicitly signals that the geometry types are not known.
* - The geometry types in the list must be unique (e.g. `[0001, 0001]`
* is not valid).
*
* Please refer to the links below for more detail:
* [1] https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
* [2] https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
*/
3: optional list<i32> geometry_types;
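A writer could derive this statistic directly from the WKB headers of a column's values. A sketch (my own helper name, not part of the spec):

```python
import struct

def collect_geometry_types(wkb_values):
    """Unique ISO WKB type codes across a column's values (sketch only;
    mirrors the geometry_types statistic described above)."""
    codes = set()
    for buf in wkb_values:
        # Byte 0 gives the byte order; bytes 1-4 give the ISO type code.
        endian = "<" if buf[0] == 1 else ">"
        (code,) = struct.unpack_from(endian + "I", buf, 1)
        codes.add(code)
    return sorted(codes)

# Two XY Points (code 1) and one XYZ Polygon (code 1003); headers only,
# coordinate payloads elided for brevity.
vals = [bytes([1]) + struct.pack("<I", c) for c in (1, 1, 1003)]
print(collect_geometry_types(vals))  # [1, 1003]
```

Note the duplicate Point values collapse to a single entry, matching the rule above that codes in the list must be unique.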
Contributor:

What is this used for? Is it for some type of pushdown?

Member:

If you know in advance what all of the geometry types are (notably, if they are all the same, which is common), you can often choose a simpler or more performant code path, or provide a more informative error sooner. The declaration of a geometry type at a metadata level is very common when describing geospatial datasets.

Contributor:

Okay, can you give an example? The critical information is whether this needs to be in the spec for a purpose. Specs that give lots of freedom without specific guidance and requirements can create long-lasting problems.

A performance improvement is a good reason to have this kind of thing, but I'd like to understand more to make sure the requirements here are clearly stated.

Member:

A concrete example of performance would be reading a Parquet geometry column containing only points. A generic geometry column is generally read into memory as something like a list of abstract "geometry" objects, which can be a geometry of any type (e.g., a vector of JTS geometries or a GeoPandas GeoSeries). This has a high memory requirement and is inefficient for a number of things you might want to do with a huge number of points like build an index. A reader that knows in advance that there are only points can choose to decode the column into a vector of x values and y values.

For geospatial datasets this is almost always available to a caller inspecting a dataset...for geospatial practitioners, not knowing a geometry type is a little like not knowing if an integer is signed or unsigned (i.e., a very basic piece of information we need to know).

Member Author:

IMHO, the geometry_types attribute is also a trade-off to avoid adding a set of explicit subtypes like POINT, LINESTRING, POLYGON, etc. Another possible use case is that an application can quickly detect unexpected data (e.g. any non-polygon geometry) by checking geometry_types and decide whether a function like ST_CONTAINS can be applied safely.

}

/**
* Statistics per row group and per page
* All fields are optional.
@@ -279,6 +408,9 @@ struct Statistics {
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;

/** statistics specific to geometry logical type */
9: optional GeometryStatistics geometry_stats;
}

/** Empty structs to use as logical type annotations */
@@ -373,6 +505,78 @@ struct JsonType {
struct BsonType {
}

/**
* Geometry logical type annotation (added in 2.11.0)
*/
struct GeometryType {
/**
* Physical type and encoding for the geometry type.
* Please refer to the definition of GeometryEncoding for more detail.
*/
1: required GeometryEncoding encoding;
/**
   * Interpretation for edges of elements of a GEOMETRY logical type, i.e. whether
   * the interpolation between points along an edge represents a straight Cartesian
   * line or the shortest line on the sphere.
   * Please refer to the definition of EdgeInterpolation for more detail.
*/
2: required EdgeInterpolation edges;
/**
* Coordinate Reference System, i.e. mapping of how coordinates refer to
* precise locations on earth. Writers are not required to set this field.
   * If crs is set, the crs_encoding field below MUST be set as well.
* For example, "OGC:CRS84" can be set in the form of PROJJSON as below:
* {
* "$schema": "https://proj.org/schemas/v0.5/projjson.schema.json",
* "type": "GeographicCRS",
* "name": "WGS 84 longitude-latitude",
* "datum": {
* "type": "GeodeticReferenceFrame",
* "name": "World Geodetic System 1984",
* "ellipsoid": {
* "name": "WGS 84",
* "semi_major_axis": 6378137,
* "inverse_flattening": 298.257223563
* }
* },
* "coordinate_system": {
* "subtype": "ellipsoidal",
* "axis": [
* {
* "name": "Geodetic longitude",
* "abbreviation": "Lon",
* "direction": "east",
* "unit": "degree"
* },
* {
* "name": "Geodetic latitude",
* "abbreviation": "Lat",
* "direction": "north",
* "unit": "degree"
* }
* ]
* },
* "id": {
* "authority": "OGC",
* "code": "CRS84"
* }
* }
*/
3: optional string crs;
/**
   * Encoding used in the above crs field. It MUST be set if the crs field is set.
* Currently the only allowed value is "PROJJSON".
*/
4: optional string crs_encoding;
Contributor

In terms of defining the type, I don't think that this encoding is relevant. The type should reference a CRS, but it is not the type's job to pass around the CRS definition or to support multiple ways to encode it (or in this case, contemplate that there could be other ways to encode it).

I would prefer passing CRS as an identifier string (which is what we're mostly agreed on in Iceberg) and adding ways to pass the CRS definition either in file metadata or other ways.

Member @paleolimbot Sep 24, 2024

The ability to include a parameterized CRS is absolutely essential for the GEOMETRY type in Parquet to be useful: not all CRSes have been catalogued, and many can't be because they're too specific (e.g., a CRS optimized for a small locality or specific project, or the view of a satellite orbiting a planet) or too old (e.g., one of my projects with the Canadian government digitizing several decades of sea ice coverage where the first four decades were in a CRS that had never been catalogued but could be expressed in PROJJSON).

The crs_encoding piece is to make the crs string unambiguous. I happen to think this is an improvement over many existing systems that just provide a string and force the reader to guess the intent; however, it is not strictly necessary (e.g., we could just define the CRS as a string).

Iceberg has a different set of use cases to Parquet...Parquet is useful to geospatial practitioners operating at a smaller scale that need to deal with these issues and want to use Parquet to do so. It may be that an identifier-based format may fit the Iceberg use case well.


The one catch here is that if you have a longitude start value of 60° and a longitude end value of -60°, then the "going to +infinity, then wrapping to -infinity" approach produces a longitude range that is more than 180°.

Right, in my discussion about allowing "min" > "max" in a bounding box, I forgot to specify that doing a wraparound at infinity works for intersection and union calculations, or generally for everything that involves the <, > and = operations. It does not work for arithmetic. But if the purpose of the bounding boxes is fast searches (indexing), unions and intersections are all we need, aren't they?
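A minimal sketch of the wraparound comparison discussed above, assuming longitude bounds in [-180, 180] where min > max means the interval crosses the antimeridian; the function names are illustrative, not from the spec.

```python
def _segments(lo, hi):
    # A wraparound interval (lo > hi) splits into two ordinary intervals.
    return [(lo, hi)] if lo <= hi else [(lo, 180.0), (-180.0, hi)]

def lon_intersects(a_min, a_max, b_min, b_max):
    """Intersection test that also handles wraparound (min > max) ranges."""
    return any(
        s1 <= e2 and s2 <= e1
        for s1, e1 in _segments(a_min, a_max)
        for s2, e2 in _segments(b_min, b_max)
    )
```

Note this only supports the <, > and = style operations mentioned above; arithmetic on the raw min/max values (e.g. computing a width) would still need special handling.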

/**
   * Additional informative metadata as a list of UTF-8 key-value pairs.
   * It is not strictly required by the low-level Parquet implementation for
   * features like statistics or filter pushdown. Using a list of key-value
   * pairs provides maximum flexibility for adding future informative metadata.
*/
5: optional list<KeyValue> key_value_metadata;
Contributor

What is the use case for this?

Member

The GeoParquet specification contains some concepts not covered here like orientation (describing whether polygons can be assumed to be correctly wound) and epoch (to better contextualize coordinates in something like WGS84, where continental movements might affect locations over time). The KeyValue list here lets the specification evolve without a change to Thrift (which would necessitate a new version of an implementation in most cases).
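As a hedged sketch of how a writer might use that escape hatch: the orientation and epoch concepts mentioned above could be packed into the proposed key_value_metadata list without another Thrift change. Representing KeyValue entries as plain dicts here is an illustration only.

```python
def to_key_value_metadata(extras):
    """Pack GeoParquet-style extras into a KeyValue-shaped list.

    Sorting keeps the output deterministic across writer runs.
    """
    return [{"key": k, "value": v} for k, v in sorted(extras.items())]

kv = to_key_value_metadata({"orientation": "counterclockwise", "epoch": "2015.0"})
```

A reader that does not understand a given key can simply ignore it, which is what allows the GeoParquet spec to evolve independently.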

Contributor

Why is this necessary for the definition of the logical type? Can these undocumented properties be part of the file's key/value metadata instead? And if these properties are important to the type definition, why are they undocumented key/value properties rather than defined in the spec?

Member

Earlier review pushed the definition of the geometry type towards something that would be able to fully accommodate the GeoParquet specification without further modification of Thrift. We could include the definition for epoch and orientation (which have been in two released versions of the GeoParquet specification, which is on its way to becoming an official OGC specification), or we could omit this for now.

Member

why are they undocumented key/value properties rather than defined in the spec?

They are documented, but in the GeoParquet spec (until the last commit that changed this from a string to a list<KeyValue> attribute, the comment linked to https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46); IMO we should add back the guidance that those key_value_metadata entries are meant for that.

In general there is a tension between defining everything here (getting too far into geospatial-specific details for Parquet) and just referring to the GeoParquet spec for details and additional (optional) metadata (making the Parquet spec less complete and self-describing).

Throughout the discussion, we have gone back and forth on this. Initially, everything was included here, but there was a desire to remove as much of the geo-specific metadata as possible. Later, only the most essential pieces of metadata were added back (encoding, crs, geometry_types, edges).

Member Author @wgtmac Sep 24, 2024

There are metadata which may be unimportant to Parquet but important to GeoParquet. Since GeoParquet is still fast-evolving, we do not want to modify the Thrift definition frequently to adopt every new proposal from GeoParquet. Metadata fields like crs and edges are required to interpret the data, so we have added them. We can of course add more explicit metadata in the future when it is required to interpret the data.

Member

Just a note that edges probably has to be available to the Parquet implementation because it affects how (or whether) to push down a spatial filter. I would prefer to keep the CRS in Thrift because otherwise something like the C++ implementation would have to parse JSON to pass on the CRS to the Arrow type (which is possible, but ugly).

Member Author

Sorry for not making it clear. Just edited my previous response.

Member Author

They are documented, but in the GeoParquet spec (until the last commit that changed this from a string to a list<KeyValue> attribute, the comment linked to https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46, IMO we should add back that guidance that those key_value_metadata entries are meant for that)

Good suggestion! I've added back the comment.

}

/**
* LogicalType annotations to replace ConvertedType.
*
@@ -403,6 +607,7 @@ union LogicalType {
13: BsonType BSON // use ConvertedType BSON
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: GeometryType GEOMETRY // no compatible ConvertedType
}

/**
@@ -954,6 +1159,7 @@ union ColumnOrder {
* ENUM - unsigned byte-wise comparison
* LIST - undefined
* MAP - undefined
* GEOMETRY - undefined, use GeometryStatistics instead.
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true
@@ -1084,6 +1290,9 @@ struct ColumnIndex {
* Same as repetition_level_histograms except for definitions levels.
**/
7: optional list<i64> definition_level_histograms;

/** A list containing statistics of GEOMETRY logical type for each page */
8: optional list<GeometryStatistics> geometry_stats;
Contributor

Why are there stats for each page? Each bbox is up to 64 bytes, which seems like a lot of overhead at the page level, especially given that WKB objects are also considerably larger than most values stored in a Parquet page.

Member Author

All fields of GeometryStatistics are optional, and the geometry_stats field itself is optional. Isn't it better to provide freedom to writer implementations to turn on the features they need?

Contributor

I think that it is better to give guidance and keep the spec small. I think we should only add things that have a clear use right now.

Member

I can't speak to data pages since I am not familiar with that level of the specification; however, these are absolutely essential at the column chunk level. I will say that even for very small objects, knowing the bounding box is typically worth it (e.g., nearly all spatial formats cache this information for every single geometry object). This is because many geometry operations, particularly with polygons, are incredibly expensive and can often be skipped for features that don't intersect.

Member Author

Page-level stats are super useful in a needle-in-the-haystack search. Computation on the geometry type can be very slow due to its mathematical complexity. Page-level stats such as a bounding box can help filter out unnecessary pages because computation on bounding boxes is orders of magnitude faster than on complex polygons.
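To illustrate the pruning described above, here is a sketch (under the assumption that each page's GeometryStatistics exposes a (xmin, ymin, xmax, ymax) bounding box, or None when the writer recorded no stats); the function name is hypothetical.

```python
def pages_to_scan(page_bboxes, query):
    """Return indices of pages whose bbox may intersect the query bbox.

    page_bboxes: one (xmin, ymin, xmax, ymax) tuple per page, or None when
    a writer did not record stats for that page (it must then be scanned).
    """
    qxmin, qymin, qxmax, qymax = query
    keep = []
    for i, bbox in enumerate(page_bboxes):
        if bbox is None:
            keep.append(i)  # no stats: cannot prune this page
            continue
        xmin, ymin, xmax, ymax = bbox
        # Standard axis-aligned rectangle intersection test.
        if xmin <= qxmax and qxmin <= xmax and ymin <= qymax and qymin <= ymax:
            keep.append(i)
    return keep
```

Only the surviving pages need the expensive exact geometry predicate, which is where the order-of-magnitude saving comes from.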

}

struct AesGcmV1 {