Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2471: Add geometry logical type #240

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
125 changes: 125 additions & 0 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -237,6 +237,78 @@ struct SizeStatistics {
3: optional list<i64> definition_level_histogram;
}

/**
* Interpretation for edges of GEOMETRY logical type, i.e. whether the edge
* between points represent a straight cartesian line or the shortest line on
* the sphere.
*/
enum Edges {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we avoid the plural? Perhaps make it EdgeShape or EdgeKind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the same definition from GeoParquet (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md) and we try to be consistent with GeoParquet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pity. If you want to store a collection of Edges, how do you name it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I understand it's probably not a practical concern :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm.. Why would someone want to store a collection of edges? For each geometry column, the Edges field has only 1 possible value.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say EdgeInterpolation is the better name than EdgeKind. +1 for keeping the attribute of the type the same as GeoParquet for consistency and simplicity, but makes sense to me that the enum name could be more descriptive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed enum Edges to EdgeInterpolation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the term EdgeInterpolation quite misleading. What we seem to want to convey here is how the edges between points are interpreted. How they are interpreted has to do with the underlying space in which the geometry is embedded. For example for WGS84 the edges are by definition spheroidal edges (not spherical, which is why I have made other comments about the use of the word SPHERICAL). If you have SRID 3857, then it is a planar coordinate system. I guess the word "interpolation" could mean what you use to interpolate between vertices, but this feels like an awkward use of the word.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkaravel can you suggest a better name here, or can we fall back to a generic name like Edges and put the explanation in the comments, which will leave the door open in the future without breaking the existing implementation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that the "edge interpolation" depends on the underlying space, but not only. For example on a plane, we can connect two points by the shortest path (a straight line), or we can try to keep the first derivatives continuous (splines). Likewise on a sphere, we can connect with the shorted paths (geodesic lines), or try to keep the headings continuous at each point.

So it is a little bit more than only the underlying space. If we interpret "interpolation" as "the mathematical formulas that we choose to use for computing intermediate points", than "edge interpolation" seems appropriate to me.

PLANAR = 0;
SPHERICAL = 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this really mean? That we only allow either Cartesian geometries, or geometries embedded on a sphere? How would one indicate geometries embedded on a spheroid (like WGS84)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkaravel

If one has a column for cartesian geometries, use Edges.Planar. If one has a column for a geometry on sphere (aka Geography type), use Edges.Spherical.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about geodesic edges? Do we disallow them? How would you represent a geometry in WGS84?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't duplicating the information provided by the CRS?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about geodesic edges?

If producing and/or storing data with geodesic edges becomes a widespread use-case, another enum value could be added. Until that is the case, spherical edges tesselated with enough vertices to minimize that error to an acceptable level are almost certainly sufficient for any practical use case.

Isn't duplicating the information provided by the CRS?

The coordinate system (i.e., context for individual coordinate tuples) and how the dataset author intended a line to be drawn between any two coordinates are orthogonal concepts: GIS systems, for better or worse, frequently are used to "simplify" datasets and can do things like model the very long segment of the US--Canada border along the 49th parallel as a single line segment. Because there is quite a lot of data that was authored in this way (and few ways to produce data with spherical edges, even though this makes more sense in a global context), we need a way to express that intent here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The coordinate system (i.e., context for individual coordinate tuples) and how the dataset author intended a line to be drawn between any two coordinates are orthogonal concepts

Then the intend is to express the interpolation method between points? Should the enumeration have "interpolation" or something similar in its name for making that intend clearer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the word interpolation would be great to include in the documentation! EdgeInterpolation is a bit of a mouthful and I prefer Edges (but with no strong feelings either way 🙂 )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer adding documentation to Edges for interpolation but keep the name simple.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SPHERICAL = 1;
SPHERICAL = 1;
SPHEROIDAL = 2;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I am not sure how this enum is useful. As @desruisseaux mentioned, this information can be deciphered from the CRS. Why do we have it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several threads here that discuss this, but briefly, we need to ensure that the intent of mainstream producers that (1) have a geometry type and (2) export Parquet files can be expressed. BigQuery and Snowflake geography types approximate connecting two vertices using an arc assuming a sphere; all other geometry types connect vertices with a straight line regardless of the CRS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that PLANAR and SPHERICAL are already adopted by engines (e.g. BigQuery and Snowflake mentioned above) and standards (e.g. GeoParquet) which support Parquet file format. It is clear that we should support PLANAR and SPHERICAL at the moment. Since it may take some time to discuss topics on SPHEROIDAL, perhaps we can add it as a follow-up work item?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the confusion may come from the name of the enumeration values. If I'm understanding right, the intend of PLANAR is "shorted path between two points on a plane". Problems:

  • It is not the only way to connect two points on a plane. We may also want Bézier curves, splines, etc.
  • PLANE is a bit restrictive as it seems to exclude the 3D space.

What about renaming PLANE as STRAIGHT_LINE? It leaves room for other interpolation methods on the plane in the future if desired. It also adds real value compared to just saying that the coordinate system is Cartesian, which can already be inferred from the CRS.

Likewise, SPHERICAL and SPHEROIDAL could be renamed as GEODESIC_LINE. It leaves room for different interpolation methods in the future, e.g. with smooth changes of headings instead of using the shortest paths. This is equivalent, in the plane, to Spline curves featuring smooth changes of derivatives instead of sudden changes of direction at every points.

I saw the discussion about SPHERICAL meaning the use of spherical formulas even if the CRS is ellipsoidal. But I'm not sure how it would work in practice. Which sphere radius should the computation use? The authalic sphere? Something else?

}

/**
* A custom WKB-encoded geometry data to be used in geometry statistics.
* The geometry may be a polygon to encode an s2 or h3 covering to provide
* vendor-agnostic coverings, or an evelope of geometries when a bounding
* box cannot be built (e.g. a geometry has spherical edges, or if an edge
* of geographic coordinates crosses the antimeridian).
*/
struct Geometry {
/** Bytes of a WKB-encoded geometry */
1: required binary geometry;
/**
* Edges of the geometry if it is a polygon. It may be different to the
* edges attribute from the GEOMETRY logical type.
*/
2: optional Edges edges;
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
}

/**
* Bounding box of geometries in the representation of min/max value pair of
* coordinates from each axis. Values of Z and M are omitted for 2D geometries.
*/
struct BoundingBox {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For spherical edges, are there any restrictions as to what is considered a valid bounding box?

I would even ask the same for Cartesian edges, although it is less ambiguous here. For example, what happens if xmax is smaller than xmin? Is it up to the reader to decide what to do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkaravel

The bounding box statistics will be always calculated by getting the min/max of X and Y values. For cartesian geometries, this is not a problem at all because there is no antimeridian and orientation issue at all in this case.

For spherical geometries, this is a problem so that's why the spec says that it is not safe to use it for spherical geometries.

In this proposal, (1) there is another type of statistics called covering that attempts to solve this issue. (2) there is another field called string metadata which can be used to indicate orientation (clockwise / counter clockwise) of the geoparquet metadata: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L46

However, in this proposal, the main focus is the geometry not geography. The Geography problem (aka spherical geometries) will be the focus of the next phase.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bounding box statistics will be always calculated by getting the min/max of X and Y values.

For geometries on a sphere or a spheroid this may no longer produce a bounding box if you restrict to just the vertices of the geometries.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, in this proposal, the main focus is the geometry not geography. The Geography problem (aka spherical geometries) will be the focus of the next phase.

If a GEOGRAPHY going to be a different logical type? Maybe this not such a bad idea, in which case the current proposal may want to focus just on the Cartesian space and nothing else.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While PostGIS uses a distinct type for geography, this is not necessarily a good idea that other formats should reproduce (I think it was due to historical constraints). Any geometry associated to a geographic CRS may be considered "geography". And as said by mkaravel, naive computation of min/max values will not work in the latter case.

I suggest to remove the sentence saying "it is recommended that the writer always generates bounding box statistics, regardless of whether the geometries are planar or spherical". I suggest to even avoid saying that the values are min/max, or said that it is the case in Cartesian coordinate system but not necessarily in other coordinate system.

In particular I suggest to keep open the possibility that "min" > "max". This is the only approach that I know which works with anti-meridian. Bounding boxes in the EPSG database are already expressed that way. Some standards such as GGXF also mandate it. It makes calculation of union and intersection more complicated, but libraries only need to define union and intersection methods in one place and use it consistently.

I'm not saying that the specification should support "min" > "max" now, just suggesting to keep this door open.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BigQuery (that only supports spherical geographies) supports ST_IntersectsBox(geo, lng1, lat1, lng2, lat2) function, and uses it for partition pruning. The conventions match ones described above: lng1 is assumed to be westmost point, and lng2 to be eastmost point, rather than lng1 being less than lng2. Thus it allows describing boxes that cross anti-meridian, e.g. ST_IntersectsBox(ST_GeogPoint(180, 0), 170, -10, -170, 10) can also be written as ST_IntersectsBox(ST_GeogPoint(180, 0), 170, -10, 190, 10) and returns TRUE.

I also blogged about our semantics: https://medium.com/geospatial-analytics/squaring-the-sphere-974d1e4a875e

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for covering, we discussed using S2 cell list for covering, and there is a proposal how to do partitioning transform using S2: https://docs.google.com/document/d/1tG13UpdNH3i0bVkjFLsE2kXEXCuw1XRpAC2L2qCUox0/edit?usp=sharing but this is probably longer-term solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these types of coverings could be encoded as a WKB Covering (e.g., polygon boundary of an S2CellUnion or S2LatLngRect) with the current wording of the specification. Using min/max values with wrapping semantics or a vector of int64s is more compact, but for the initial version perhaps this would suffice?

Perhaps "leaving the door open" in this case could mean saying that implementations MUST omit min/max statistics for spherical edges (which would let future implementations write wrapped bounding boxes for spherical edges).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another consideration here is that Covering is more computationally expensive to compute, compared to a bounding box. I can imagine that in some write-performance-sensitive scenarios, simpler bbox stats would be the preferred approach. Are there any concerns with adopting @mentin's west-east convention for spherical edges?

@mentin thanks for sharing the S2 transform proposal - will have a look.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A west-east convention for spherical edges works for me (provided that it is very clear such that Cartesian bounding boxes are not written for spherical edge data by accident!). I also don't think there's much overhead to the WKB covering (basically 12 memcpys to convert a bounding box to a WKB polygon).

1: required double xmin;
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
2: required double xmax;
3: required double ymin;
4: required double ymax;
5: optional double zmin;
6: optional double zmax;
7: optional double mmin;
8: optional double mmax;
}

struct Covering {
optional BoundingBox bbox // A bounding box of geometries if it can be built.
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
optional Geometry covering // A covering polygon of geometries if bbox is unavailable.
}

/** Statistics specific to GEOMETRY logical type */
struct GeometryStatistics {
/** Covering of geometries */
1: optional Covering covering;

/**
* The geometry types of all geometries, or an empty array if they are not
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
* known. It follows the same rule of `geometry_types` column metadata of
* GeoParquet. Accepted geometry types are: "Point", "LineString", "Polygon",
* "MultiPoint", "MultiLineString", "MultiPolygon", "GeometryCollection".
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
*
* In addition, the following rules are used:
* - In case of 3D geometries, a `" Z"` suffix gets added (e.g. `["Point Z"]`).
* - A list of multiple values indicates that multiple geometry types are
* present (e.g. `["Polygon", "MultiPolygon"]`).
* - An empty array explicitly signals that the geometry types are not known.
* - The geometry types in the list must be unique (e.g. `["Point", "Point"]`
* is not valid).
*
* Please refer to link below for more detail:
* https://github.com/opengeospatial/geoparquet/blob/v1.0.0/format-specs/geoparquet.md?plain=1#L91
*/
2: optional list<string> geometry_types;
}

/**
* Statistics per row group and per page
* All fields are optional.
Expand Down Expand Up @@ -279,6 +351,9 @@ struct Statistics {
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;

/** statistics specific to geometry logical type */
9: optional GeometryStatistics geometry_stats;
}

/** Empty structs to use as logical type annotations */
Expand Down Expand Up @@ -373,6 +448,51 @@ struct JsonType {
struct BsonType {
}

/**
* Phyiscal type and encoding for the geometry type.
*/
enum GeometryEncoding {
/**
* Allowed for phyiscal type: BYTE_ARRAY.
*
* Well-known binary (WKB) representations of geometries. It supports 2D or
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
* 3D geometries of the standard geometry types (Point, LineString, Polygon,
* MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). This
* is the preferred option for maximum portability.
*
* This encoding enables GeometryStatistics to be set in the column chunk
* and page index.
*/
WKB = 0;
wgtmac marked this conversation as resolved.
Show resolved Hide resolved

// TODO: add native encoding from GeoParquet/GeoArrow
}

/**
* Geometry logical type annotation (added in 2.11.0)
*/
struct GeometryType {
/**
* Phyiscal type and encoding for the geometry type. Please refer to the
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
* definition of GeometryEncoding for more detail.
*/
1: required GeometryEncoding encoding;
/**
* Coordinate Reference System, i.e. mapping of how coordinates refer to
* precise locations on earth, e.g. OGC:CRS84
*/
2: optional string crs;
/**
* Edges of polygon.
*/
3: optional Edges edges;
/**
* Additional informative metadata.
* It can be used by GeoParquet to offload some of the column metadata.
*/
4: optional string metadata;
Copy link
Contributor

@emkornfield emkornfield Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be string or binary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, binary sounds better which allows custom encoding.

}

/**
* LogicalType annotations to replace ConvertedType.
*
Expand Down Expand Up @@ -403,6 +523,7 @@ union LogicalType {
13: BsonType BSON // use ConvertedType BSON
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: GeometryType GEOMETRY // no compatible ConvertedType
}

/**
Expand Down Expand Up @@ -954,6 +1075,7 @@ union ColumnOrder {
* ENUM - unsigned byte-wise comparison
* LIST - undefined
* MAP - undefined
* GEOMETRY - undefined, use GeometryStatistics instead.
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true
Expand Down Expand Up @@ -1084,6 +1206,9 @@ struct ColumnIndex {
* Same as repetition_level_histograms except for definitions levels.
**/
7: optional list<i64> definition_level_histograms;

/** A list containing statistics of GEOMETRY logical type for each page */
8: optional list<GeometryStatistics> geometry_stats;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there stats for each page? Each bbox is up to 64 bytes, which seems like a lot of overhead at the page level, especially given that WKB objects are also considerably larger than most values stored in a Parquet page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All fields of GeometryStatistics are optional and the geometry_stats field itself is optional. Isn't it better to provide freedom to writer implementation to turn on features they need?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it is better to give guidance and keep the spec small. I think we should only add things that have a clear use right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't speak to data pages since I am not familiar with that level of the specification; however, these are absolutely essential at the column chunk level. I will say that even for very small objects, knowing the bounding box is typically worth it (e.g., nearly all spatial formats cache this information for every single geometry object). This is because many geometry operations, particularly with polygons, are incredibly expensive and can often be skipped for features that don't intersect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Page level stats are super useful in a needle-in-the-haystack search. Computation on geometry type can be very slow due to its mathematical complexity. Page-level stats such as bounding box can help filtering out unnecessary pages because computation on bounding box is faster in order of magnitude than on complex polygons.

}

struct AesGcmV1 {
Expand Down