diff --git a/LogicalTypes.md b/LogicalTypes.md
index 3aa5ceb9..0f3fad84 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -803,6 +803,188 @@ optional group my_map (MAP_KEY_VALUE) {
 }
 ```
 
+## Geospatial Types
+
+### GEOMETRY
+
+`GEOMETRY` is used for geometry features from [OGC – Simple feature access][simple-feature-access].
+See [Geospatial Notes](#geospatial-notes).
+
+The type has three type parameters:
+- `encoding`: A required enum value for the annotated physical type and encoding
+  of the `GEOMETRY` type. See [Geometry Encoding](#geometry-encoding).
+- `edges`: A required enum value for the interpretation of edges of elements of the
+  `GEOMETRY` type, i.e. whether the interpolation between points along
+  an edge represents a straight Cartesian line or the shortest line on
+  the sphere. See [Edges](#edges).
+- `crs`: An optional string value for the CRS (coordinate reference system), which
+  is a mapping of how coordinates refer to precise locations on earth.
+  See [Coordinate Reference System](#coordinate-reference-system).
+
+The sort order used for `GEOMETRY` is undefined. When writing data, no min/max
+statistics should be saved for this type, and if such non-compliant statistics
+are found during reading, they must be ignored. Instead, [GeometryStatistics](#geometry-statistics)
+is introduced for the `GEOMETRY` type.
+
+#### Geometry Encoding
+
+Physical type and encoding for the `GEOMETRY` type. Supported values:
+- `WKB`: `GEOMETRY` type with `WKB` encoding can only be used to annotate the
+  `BYTE_ARRAY` primitive type. See [WKB](#well-known-binary-wkb).
+
+Note that geometry encoding is required for the `GEOMETRY` type. In order to correctly
+interpret geometry data, writer implementations SHOULD always set this field, and
+reader implementations SHOULD fail for an unknown geometry encoding value.
+
+##### Well-known binary (WKB)
+
+Well-known binary (WKB) representations of geometries, see [Geospatial Notes](#geospatial-notes).
+
+To be clear, we follow the same definitions as GeoParquet for [WKB][geoparquet-wkb]
+and [coordinate axis order][coordinate-axis-order]:
+- Geometries SHOULD be encoded as ISO WKB supporting XY, XYZ, XYM, XYZM. Supported
+standard geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString,
+MultiPolygon, and GeometryCollection.
+- Coordinate axis order is always (x, y) where x is easting or longitude, and
+y is northing or latitude. This ordering explicitly overrides the axis order
+as specified in the CRS following the [GeoPackage specification][geopackage-spec].
+
+This is the preferred encoding for maximum portability.
+
+[geoparquet-wkb]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
+[coordinate-axis-order]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
+[geopackage-spec]: https://www.geopackage.org/spec130/#gpb_spec
+
+#### Edges
+
+Interpretation for edges of elements of the `GEOMETRY` type. In other words, it
+specifies how a point between two vertices should be interpolated in its XY
+dimensions. Supported values and corresponding interpolation approaches are:
+- `PLANAR`: a Cartesian line connecting the two vertices.
+- `SPHERICAL`: the shortest spherical arc between the longitude and latitude
+  represented by the two vertices.
+
+This value applies to all non-point geometry objects and is independent of the
+[Coordinate Reference System](#coordinate-reference-system).
+
+Because most systems currently assume planar edges and do not support spherical
+edges, `PLANAR` should be used as the default value.
+
+Note that edges is required for the `GEOMETRY` type. In order to correctly
+interpret geometry data, writer implementations SHOULD always set this field,
+and reader implementations SHOULD fail for an unknown edges value.
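The difference between the two edge interpretations can be illustrated with a short Python sketch. It is not part of the specification: the function names are hypothetical, coordinates are assumed to be (longitude, latitude) in degrees, and the spherical case assumes a perfect sphere and uses the standard great-circle (slerp) formula.

```python
import math


def interpolate_planar(p0, p1, t):
    """Point at fraction t along a straight Cartesian line (edges = PLANAR)."""
    (x0, y0), (x1, y1) = p0, p1
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))


def _to_unit_vector(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to a 3D unit vector."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def interpolate_spherical(p0, p1, t):
    """Point at fraction t along the shortest great-circle arc (edges = SPHERICAL)."""
    a, b = _to_unit_vector(*p0), _to_unit_vector(*p1)
    # Clamp to guard against rounding slightly outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(u * v for u, v in zip(a, b))))
    omega = math.acos(dot)  # angle between the two vertices
    if omega < 1e-12:
        return p0  # vertices coincide
    if math.pi - omega < 1e-12:
        raise ValueError("antipodal vertices: the shortest arc is not unique")
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    x, y, z = (wa * u + wb * v for u, v in zip(a, b))
    lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    lon = math.degrees(math.atan2(y, x))
    return (lon, lat)


start, end = (0.0, 45.0), (90.0, 45.0)
planar_mid = interpolate_planar(start, end, 0.5)
spherical_mid = interpolate_spherical(start, end, 0.5)
```

For these two vertices the planar midpoint stays at latitude 45, while the spherical midpoint bulges toward the pole (latitude of roughly 54.7). This is why a reader cannot safely treat one interpretation as the other.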
+
+#### Coordinate Reference System
+
+CRS (coordinate reference system) is a mapping of how coordinates refer to
+precise locations on earth. A CRS is specified by a key-value entry in the
+`key_value_metadata` field of `FileMetaData` whose key is a short name for
+the CRS and whose value is the CRS representation. An additional entry in the
+`key_value_metadata` field, keyed by the same name with the suffix ".type", is
+required to describe the encoding of this CRS representation.
+
+For example, if a geometry column (e.g., "geom1") uses the CRS "OGC:CRS84", the
+writer may write two entries to the `key_value_metadata` field of `FileMetaData` as
+below, and set the `crs` field of the `GEOMETRY` type to "geom1_crs":
+```
+    "geom1_crs": a UTF-8 encoded PROJJSON representation of OGC:CRS84
+    "geom1_crs.type": "PROJJSON"
+```
+
+The PROJJSON representation of OGC:CRS84 can be seen at [OGC:CRS84][ogc-crs84].
+Multiple geometry columns can refer to the same CRS metadata field
+(e.g., "geom1_crs") if they share the same CRS.
+
+[ogc-crs84]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details
+
+#### Geometry Statistics
+
+`GeometryStatistics` is a struct to store geometry statistics of a column chunk
+of `GEOMETRY` type. It is an optional field of `ColumnMetaData` and contains
+[Bounding Box](#bounding-box) and [Geometry Types](#geometry-types).
+
+##### Bounding Box
+
+A geometry has at least two coordinate dimensions: X and Y for 2D coordinates
+of each point. A geometry can optionally have Z and / or M values associated
+with each point in the geometry.
+
+The Z values introduce the third dimension coordinate. Usually they are used
+to indicate the height, or elevation.
+
+M values are an opportunity for a geometry to express a fourth dimension as
+a coordinate value. These values can be used as a linear reference value
+(e.g., highway milepost value), a timestamp, or some other value as defined
+by the CRS.
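As a rough illustration of how the mandatory X and Y dimensions and the optional Z and M dimensions feed into per-axis minimum and maximum values, consider the Python sketch below. It is an assumption-laden example, not part of the specification: the function name and the dict-based point representation are hypothetical, only `PLANAR` edges are handled, and NaN coordinates are assumed to be absent.

```python
def planar_bounding_box(points):
    """Collect per-axis min/max values for points given as dicts.

    Each point must have 'x' and 'y' and may have 'z' and/or 'm'.
    Keys of the result mirror the BoundingBox field names; Z and M
    entries are produced only if at least one point carries that axis.
    """
    points = list(points)
    if not points:
        raise ValueError("at least one point is required")
    bbox = {}
    for axis in ("x", "y"):  # always present
        values = [p[axis] for p in points]
        bbox[axis + "min"] = min(values)
        bbox[axis + "max"] = max(values)
    for axis in ("z", "m"):  # optional dimensions
        values = [p[axis] for p in points if axis in p]
        if values:
            bbox[axis + "min"] = min(values)
            bbox[axis + "max"] = max(values)
    return bbox


example_bbox = planar_bounding_box([
    {"x": 1.0, "y": 2.0},
    {"x": -3.0, "y": 5.0, "z": 10.0},
])
```

Using explicit axis names avoids the ambiguity of a bare 3-tuple, which could be read as either XYZ or XYM.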
+
+The bounding box is defined by the Thrift struct below as a min/max value pair
+for the coordinates on each axis. Note that X and Y values are always present.
+Z and M are omitted for 2D geometries.
+
+```thrift
+struct BoundingBox {
+  /** Min X value when edges = PLANAR, westmost value if edges = SPHERICAL */
+  1: required double xmin;
+  /** Max X value when edges = PLANAR, eastmost value if edges = SPHERICAL */
+  2: required double xmax;
+  /** Min Y value when edges = PLANAR, southmost value if edges = SPHERICAL */
+  3: required double ymin;
+  /** Max Y value when edges = PLANAR, northmost value if edges = SPHERICAL */
+  4: required double ymax;
+  /** Min Z value if the axis exists */
+  5: optional double zmin;
+  /** Max Z value if the axis exists */
+  6: optional double zmax;
+  /** Min M value if the axis exists */
+  7: optional double mmin;
+  /** Max M value if the axis exists */
+  8: optional double mmax;
+}
+```
+
+The meaning of each value depends on the `Edges` attribute of the `GEOMETRY` type:
+- If Edges is `PLANAR`, the values are the actual min/max values on each axis.
+- If Edges is `SPHERICAL`, the values for X and Y are `[westmost, eastmost, southmost, northmost]`,
+  with min/max values for Z and M where those axes exist.
+
+##### Geometry Types
+
+A list of geometry types from all geometries in the `GEOMETRY` column, or an
+empty list if they are not known.
+
+This is borrowed from [geometry_types of GeoParquet][geometry-types]
+except that values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
+The table below shows the most common geometry types and their codes:
+
+| Type               | XY   | XYZ  | XYM  | XYZM |
+| :----------------- | :--- | :--- | :--- | :--- |
+| Point              | 0001 | 1001 | 2001 | 3001 |
+| LineString         | 0002 | 1002 | 2002 | 3002 |
+| Polygon            | 0003 | 1003 | 2003 | 3003 |
+| MultiPoint         | 0004 | 1004 | 2004 | 3004 |
+| MultiLineString    | 0005 | 1005 | 2005 | 3005 |
+| MultiPolygon       | 0006 | 1006 | 2006 | 3006 |
+| GeometryCollection | 0007 | 1007 | 2007 | 3007 |
+
+In addition, the following rules are applied:
+- A list of multiple values indicates that multiple geometry types are present (e.g. `[0003, 0006]`).
+- An empty array explicitly signals that the geometry types are not known.
+- The geometry types in the list must be unique (e.g. `[0001, 0001]` is not valid).
+
+[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
+[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
+
+#### Geospatial Notes
+
+The Geometry class hierarchy and its WKT and WKB serializations (ISO supporting
+XY, XYZ, XYM, XYZM) are defined by [OpenGIS Implementation Specification for
+Geographic information – Simple feature access – Part 1: Common architecture](
+https://portal.ogc.org/files/?artifact_id=25355), from [OGC (Open Geospatial
+Consortium)](https://www.ogc.org/standard/sfa/).
+
+The version of the OGC standard first used here is 1.2.1, but future versions
+may also be used if the WKB representation remains wire-compatible.
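To make the ISO WKB integer codes from the Geometry Types table concrete, the following Python sketch reads the type code from a WKB header and collects the unique codes for a set of values. The function names are hypothetical and the sorting of the result is a choice made here for readability; the text above only requires the values to be unique.

```python
import struct

BASE_TYPES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}
DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def iso_wkb_type_code(wkb: bytes) -> int:
    """Read the ISO WKB geometry type code from the 5-byte header."""
    byte_order = wkb[0]  # 0 = big endian, 1 = little endian
    if byte_order not in (0, 1):
        raise ValueError("invalid WKB byte order marker")
    fmt = "<I" if byte_order == 1 else ">I"
    return struct.unpack_from(fmt, wkb, 1)[0]


def describe_type_code(code: int):
    """Split an ISO code such as 1001 into ('Point', 'XYZ')."""
    return BASE_TYPES[code % 1000], DIMENSIONS[code // 1000]


def collect_geometry_types(wkb_values):
    """Unique type codes across all values, sorted for stable output."""
    return sorted({iso_wkb_type_code(v) for v in wkb_values})


# A little-endian XY point and a big-endian XYZ point, built by hand.
point_xy = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
point_xyz = struct.pack(">BIddd", 0, 1001, 1.0, 2.0, 3.0)
geometry_types = collect_geometry_types([point_xyz, point_xy, point_xy])
```

In this example `geometry_types` contains the codes for an XY point and an XYZ point exactly once each, even though the XY point appears twice in the input.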
+
 ## UNKNOWN (always null)
 
 Sometimes, when discovering the schema of existing data, values are always null
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 5d4431d9..6378a58e 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -237,6 +237,37 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Bounding box of geometries in the representation of min/max value pair of
+ * coordinates from each axis.
+ */
+struct BoundingBox {
+  /** Min X value when edges = PLANAR, westmost value if edges = SPHERICAL */
+  1: required double xmin;
+  /** Max X value when edges = PLANAR, eastmost value if edges = SPHERICAL */
+  2: required double xmax;
+  /** Min Y value when edges = PLANAR, southmost value if edges = SPHERICAL */
+  3: required double ymin;
+  /** Max Y value when edges = PLANAR, northmost value if edges = SPHERICAL */
+  4: required double ymax;
+  /** Min Z value if the axis exists */
+  5: optional double zmin;
+  /** Max Z value if the axis exists */
+  6: optional double zmax;
+  /** Min M value if the axis exists */
+  7: optional double mmin;
+  /** Max M value if the axis exists */
+  8: optional double mmax;
+}
+
+/** Statistics specific to GEOMETRY logical type */
+struct GeometryStatistics {
+  /** A bounding box of geometries */
+  1: optional BoundingBox bbox;
+  /** Geometry type codes of all geometries, or an empty list if not known */
+  2: optional list<i32> geometry_types;
+}
+
 /**
  * Statistics per row group and per page
  * All fields are optional.
@@ -380,6 +411,40 @@ struct JsonType {
 struct BsonType {
 }
 
+/** Physical type and encoding for the geometry type */
+enum GeometryEncoding {
+  /**
+   * Allowed for physical type: BYTE_ARRAY.
+   *
+   * Well-known binary (WKB) representations of geometries.
+   */
+  WKB = 0;
+}
+
+/** Interpretation for edges of elements of a GEOMETRY type */
+enum Edges {
+  PLANAR = 0;
+  SPHERICAL = 1;
+}
+
+/**
+ * GEOMETRY logical type annotation (added in 2.11.0)
+ *
+ * GeometryEncoding and Edges are required. In order to correctly interpret
+ * geometry data, writer implementations SHOULD always set them, and reader
+ * implementations SHOULD fail for unknown values.
+ *
+ * CRS is optional. If CRS is set, it MUST be a key to an entry in the
+ * `key_value_metadata` field of `FileMetaData`.
+ *
+ * See LogicalTypes.md for details.
+ */
+struct GeometryType {
+  1: required GeometryEncoding encoding;
+  2: required Edges edges;
+  3: optional string crs;
+}
+
 /**
  * Embedded Variant logical type annotation
  */
@@ -417,6 +482,7 @@ union LogicalType {
   14: UUIDType UUID // no compatible ConvertedType
   15: Float16Type FLOAT16 // no compatible ConvertedType
   16: VariantType VARIANT // no compatible ConvertedType
+  17: GeometryType GEOMETRY // no compatible ConvertedType
 }
 
 /**
@@ -857,6 +923,9 @@ struct ColumnMetaData {
    * filter pushdown.
    */
   16: optional SizeStatistics size_statistics;
+
+  /** Optional statistics specific to GEOMETRY logical type */
+  17: optional GeometryStatistics geometry_stats;
 }
 
 struct EncryptionWithFooterKey {
@@ -988,6 +1057,7 @@ union ColumnOrder {
    *   LIST - undefined
    *   MAP - undefined
    *   VARIANT - undefined
+   *   GEOMETRY - undefined
    *
    * In the absence of logical types, the sort order is determined by the physical type:
    *   BOOLEAN - false, true