Add raster type & functions to manage rasters in the spatial extension #298

ahuarte47 · 2024-04-02T23:36:04Z

This PR adds support for managing rasters (a wrapper around GDALDataset) in the extension.

It provides a new type RASTER, and a set of functions to load, process and save this data type.

Description of this code here.

Maxxen · 2024-04-03T08:45:36Z

Hi! Thanks for this PR, it's really cool! I'm currently on vacation so I've only skimmed this quickly, but I have some concerns:

* Using a POINTER for the RASTER type doesn't seem to actually store anything in the database. What happens if I load some rasters into a table and restart DuckDB? IIRC PostGIS has the option to either embed the raster into a blob (which limits you to 4 GB in DuckDB's case) or store a file path to the raster on disk, reopening it when needed.
* The reason I have been hesitant to add raster support to spatial is that I don't think DuckDB's vectorized execution engine is a good fit for raster processing. DuckDB's vector engine operates on batches of 2048 elements at a time, which is great when all elements are small-ish (the full vector fits into memory), but as rasters are generally much bigger this boon quickly becomes a disadvantage: e.g. a single vector of 512x512x4 tiles is already 2 GB of RAM (external RAM held by GDAL in this case, which DuckDB is unaware of and can't manage through its buffer manager). Additionally, DuckDB vectors need to be immutable, so if you chain a bunch of operations, every intermediate vector will be held in memory too for the duration of the pipeline. This whole approach is in stark contrast to e.g. GDAL's VRTs, which are designed to avoid holding anything in RAM at all, or PostGIS/SpatiaLite, where only a single row is processed at a time.
* A big part of PostGIS raster processing is setting up constraints to make sure mosaics don't have seams, but DuckDB is currently pretty limited when it comes to constraints in general.
* GDAL has turned out to be a very difficult dependency to wrangle in practice (I think about 1/3 of all open issues here are related to GDAL filesystem issues and we still have problems in WASM) and I've been trying to slowly lessen our dependency on it by implementing native readers for common vector formats, but it's OK now because we strip a lot of the extra drivers and have a build setup that somewhat works even if it's very hacky. I suspect with this PR we would have to add back a lot of the raster drivers (and their dependencies in turn).
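The 2 GB figure in the second point can be checked with a quick calculation. This is only a sketch: it assumes 4 bytes per pixel (either four 8-bit bands or one 32-bit band), and the `pipeline_steps` parameter is a hypothetical way to model the intermediate vectors that stay alive while operations are chained.

```python
VECTOR_SIZE = 2048      # rows per DuckDB vector, as stated above
TILE_WIDTH = 512
TILE_HEIGHT = 512
BYTES_PER_PIXEL = 4     # assumption: 4 x 8-bit bands or 1 x 32-bit band


def vector_bytes(rows: int = VECTOR_SIZE) -> int:
    """Memory needed to hold one vector in which every row is a full tile."""
    return rows * TILE_WIDTH * TILE_HEIGHT * BYTES_PER_PIXEL


def pipeline_bytes(pipeline_steps: int) -> int:
    """Memory held if each chained operation keeps its own intermediate vector."""
    return pipeline_steps * vector_bytes()


GIB = 2 ** 30
one_vector_gib = vector_bytes() / GIB
three_step_gib = pipeline_bytes(3) / GIB
print(f"one vector: {one_vector_gib} GiB, three chained steps: {three_step_gib} GiB")
```

Under these assumptions a three-step chain would hold about 6 GiB outside the buffer manager's control.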

That said, I'm not entirely against merging this, I just need some time to look it over once I get back to work. As I mentioned, I've been toying with the idea of splitting all the GDAL functionality into a separate extension with additional drivers (and caveats) relying on VCPKG to build the dependencies, in which case I think this would fit in great.

Maxxen · 2024-04-03T08:51:55Z

Also, you need to either update the DuckDB submodule, as the main CI always builds for the latest DuckDB main, or switch the target branch to v0.10.1, as that has the latest changes (a big documentation rework) and is pinned to DuckDB v0.10.1.

ahuarte47 · 2024-04-03T14:37:32Z

Thanks @Maxxen for your comments,

* Using a POINTER for the RASTER type doesn't seem like it actually stores anything in the database? What happens if I load some rasters into a table and restart DuckDB? IIRC PostGIS has the option to either embed the raster into a blob (which limits you to 4gb in duckdbs case) or store a file path to the raster on disk, reopening it when needed.

Yes, nothing is stored in the database. I am managing POINTERs (temporary objects) to avoid loading objects as big as rasters into RAM. I share your concerns. Maybe this implementation does not fit all the use cases you have in mind: I see a raster as a temporary object that is loaded from an external data source, used in a query and finally saved to another external store if necessary. Load and save methods from/to BLOB could exist to read from and write to the database, but I do not know if this is the way to go.

* GDAL has turned out to be a very difficult dependency to wrangle in practice (I think like a 1/3 of all open issues here are related to GDAL filesystem issues and we still have problems in WASM) and I've been trying to slowly lessen our dependency on it by implementing native readers for common vector formats, but its OK now because we strip a lot of the extra drivers and have a build setup that somewhat works even if its very hacky. I suspect with the PR we would have to add back a lot of the raster drivers (and their dependencies in turn).

Yes, you are right, this PR adds more dependencies on GDAL. But, studying the PostGIS source code, many operations there are implemented using GDAL as well, so I thought this was a valid solution to avoid reinventing the wheel. Anyway, we could mitigate this dependency by moving the raster support into a new, separate extension; the spatial extension could then drop its GDAL dependency when that becomes possible.

Anyway, of course, feel free to decide about this MR. This is a PoC, and if you feel it raises unacceptable issues with fitting into the DuckDB engine model, I am comfortable with that.

Maxxen · 2024-04-03T15:24:10Z

Alright, let me have another look when I get back.

I think the primary blocker right now is just how to deal with persistence. I agree that it might be a good idea to avoid copying the raster data itself into DuckDB as blobs, but I think there should be some way to "restore" a set of rasters from disk when opening a duckdb database with a raster table (if they are available on disk - otherwise throw an error or return NULL or something, we can see how PostGIS handles it). Maybe by storing a file path or an ID to some other lookup structure instead of the raw pointers.
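A minimal sketch of the "file path or ID to a lookup structure" idea, in plain Python rather than extension code. All names here (`RasterRef`, `OpenRaster`, `RasterRegistry`) are hypothetical, and `OpenRaster` is only a stand-in for a reopened dataset handle; a real implementation would reopen the file through GDAL. Returning `None` for a missing file is one of the two options mentioned above, the other being to raise an error.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RasterRef:
    """Hypothetical persisted value: an ID plus the raster's path on disk."""
    raster_id: int
    path: str


@dataclass
class OpenRaster:
    """Stand-in for an open dataset handle (a real one would wrap a GDAL dataset)."""
    path: str
    size_bytes: int


class RasterRegistry:
    """In-memory lookup from raster ID to a lazily reopened handle."""

    def __init__(self) -> None:
        self._open: Dict[int, OpenRaster] = {}

    def resolve(self, ref: RasterRef) -> Optional[OpenRaster]:
        # Reuse the handle if this raster was already reopened in this session.
        cached = self._open.get(ref.raster_id)
        if cached is not None:
            return cached
        # The file may have been moved or deleted since the table was written.
        if not os.path.isfile(ref.path):
            return None
        handle = OpenRaster(ref.path, os.path.getsize(ref.path))
        self._open[ref.raster_id] = handle
        return handle


# Usage: one reference whose file exists, one whose file is missing.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "dem.tif")
    with open(existing, "wb") as f:
        f.write(b"\x00" * 16)  # placeholder bytes, not a real GeoTIFF

    registry = RasterRegistry()
    found = registry.resolve(RasterRef(1, existing))
    missing = registry.resolve(RasterRef(2, os.path.join(tmp, "gone.tif")))
```

Only the small `RasterRef` values would need to live in the table; the handles are rebuilt on demand after a restart.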

Regarding GDAL - let's maybe not worry about this too much right now. We already have it as a dependency, and it's probably going to stay for a while. If we just use the built-in raster drivers for now it won't add much complexity to the build.

Thanks again for your work!

armaanv · 2024-07-18T23:12:41Z

Is merging this on the roadmap?

ahuarte47 · 2024-10-23T13:30:50Z

Hi everyone. I have been thinking about this old PR. Maybe it is better to manage a raster like any other tabular DataFrame? Something similar to how RasterFrames manages them?

A raster could be serialized as a table of chunks or tiles: a Parquet file (Rasquet :-)) with several metadata columns (CRS, BBOX, width, height, datatype, ...) and 1-N band columns holding the binary data for each raster band.
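To make the table-of-tiles layout concrete, here is a rough sketch using only the Python standard library. The function name and the column names (`x_off`, `y_off`, `band_1`, ...) are assumptions for illustration, pixel offsets stand in for a real BBOX, and pixels are assumed to be 8-bit values; writing the rows to Parquet is left out.

```python
from typing import Any, Dict, Iterator, List


def raster_to_tile_rows(
    bands: List[List[int]],  # one flat, row-major list of 0..255 values per band
    width: int,
    height: int,
    tile_size: int,
    crs: str,
) -> Iterator[Dict[str, Any]]:
    """Yield one table row per tile: metadata columns plus one blob per band."""
    for y_off in range(0, height, tile_size):
        for x_off in range(0, width, tile_size):
            # Edge tiles may be smaller than tile_size.
            tile_w = min(tile_size, width - x_off)
            tile_h = min(tile_size, height - y_off)
            row: Dict[str, Any] = {
                "crs": crs,
                "x_off": x_off,
                "y_off": y_off,
                "width": tile_w,
                "height": tile_h,
                "datatype": "uint8",
            }
            for index, band in enumerate(bands, start=1):
                pixels = bytearray()
                for y in range(y_off, y_off + tile_h):
                    start = y * width + x_off
                    pixels.extend(band[start:start + tile_w])
                row[f"band_{index}"] = bytes(pixels)
            yield row


# Usage: a 4x4 single-band raster split into 2x2 tiles gives four rows.
rows = list(raster_to_tile_rows([list(range(16))], 4, 4, 2, "EPSG:4326"))
```

Each row is small and self-describing, which is what would let a batch of rows fit into a single vector.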

frafra · 2024-11-13T11:29:56Z

Hi everyone. I am thinking about this old PR. Maybe is it better to manage a raster as other tabular Dataframe? Something similar how RasterFrames is managing them?

I really appreciate your effort, and I haven't replied before because it is challenging for me to understand how such a decision would affect how the data would be queried. Could you provide an example?

A common need that I have when dealing with vector/tabular data is to add some information to them based on some rasters, such as a DTM. Would the new approach make it radically different? If that is not the case, using a more common/popular way to handle rasters might be better, since people would already be familiar with that way of working.

Thanks again for this great PR :) We really need to have a way to connect raster data to duckdb.

ahuarte47 · 2024-11-15T20:51:02Z

Hi, the idea is to split the input raster, when loading it, into a collection of small tiles distributed by overlays, and manage them as a table of rows with several metadata columns (size, crs, pixel_type...) and a binary column for each existing band in the original raster.

This approach is radically different from using the original input raster as a single source. Managing rasters as a collection of tiles can be better for the internal vector engine that DuckDB implements, but I think it makes it much more difficult to manage the tiles when you want to perform raster algebra on them.
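As an illustration of why raster algebra gets harder on tiles, the sketch below adds two tiled rasters pixel by pixel. It assumes each tile is a dict with hypothetical `x_off`, `y_off`, `width`, `height` and `band_1` (8-bit pixel bytes) keys, and that both rasters were cut on the same grid. If the grids differ, the tiles would first have to be re-cut or resampled, which this sketch simply rejects.

```python
from typing import Any, Dict, List

TileRow = Dict[str, Any]


def add_tiled(a_rows: List[TileRow], b_rows: List[TileRow]) -> List[TileRow]:
    """Pixel-wise sum of band_1 for two tiled rasters, saturating at 255."""
    # Index the second raster by tile offset so matching tiles can be paired.
    b_index = {(row["x_off"], row["y_off"]): row for row in b_rows}
    result: List[TileRow] = []
    for a in a_rows:
        key = (a["x_off"], a["y_off"])
        b = b_index.get(key)
        if b is None or (a["width"], a["height"]) != (b["width"], b["height"]):
            raise ValueError(f"tile grids are not aligned at offset {key}")
        summed = bytes(min(255, p + q) for p, q in zip(a["band_1"], b["band_1"]))
        result.append({**a, "band_1": summed})
    return result


# Usage: two single-tile rasters on the same grid.
tile_a = {"x_off": 0, "y_off": 0, "width": 2, "height": 1, "band_1": bytes([10, 250])}
tile_b = {"x_off": 0, "y_off": 0, "width": 2, "height": 1, "band_1": bytes([5, 10])}
total = add_tiled([tile_a], [tile_b])
```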

I think the current approach is simpler, given that DuckDB cannot use its vector engine to parallelize the computation of queries on rasters.

Anyway, it would be great to have a debate about which approach seems better.

ahuarte47 added 21 commits April 3, 2024 22:52

Load & show GDAL Drivers that support raster

fb953e4

Add ST_ReadRaster function

ac99dbe

Add ST_ReadRaster_Meta table function

2960227

Add ST_SRID function

9393387

Add ST_GetGeometry function

2c7ad12

Add methods to get metadata from a raster

7f79e36

Add methods to get metadata from a raster band

64a25b6

Add ST_WorldToRasterCoord[XY] & ST_RasterToWorldCoord[XY] functions

1846161

Add ST_Value function

bfdc07f

Add ST_RasterFromFile function

d932a7c

Add ST_RasterAsFile function

0cc5aad

Add aggregate functions (ST_RasterMosaic_Agg, ST_RasterUnion_Agg)

3bb40c6

Add ST_RasterWarp function

a22a85c

Add ST_RasterClip function

ef7f2d3

Add ResampleAlg enum

8c93e3a

Add CopyTo raster function

397c71e

Add tests & sample data for testing

e2c37b3

Add docs of functions

faec181

Add missing registration of RASTER_COORD

3ebb335

Changes for v0.10.1

fef62bf

Remove docs in old style, they will be added later in current style

03fa25b

ahuarte47 force-pushed the main_raster-type branch from 9f701db to 03fa25b Compare April 4, 2024 00:38

ahuarte47 changed the base branch from main to v0.10.1 April 4, 2024 00:40

ahuarte47 added 3 commits April 7, 2024 19:29

Add documentation of raster functions

14a5ff7

Fix format

9aef365

Fix undefined symbols for MacOS

9ca99bb

ahuarte47 force-pushed the main_raster-type branch from f5dbaaf to 9ca99bb Compare April 8, 2024 22:12

ahuarte47 changed the base branch from v0.10.1 to v0.10.2 April 29, 2024 12:54

ahuarte47 added 3 commits May 1, 2024 08:20

Merge remote-tracking branch 'upstream/v0.10.2' into main_raster-type

529c5e2

Changes for updating to v0.10.2

678b4de

Temporal fix when creating a Geometry from a BBOX

7a8862e

ahuarte47 changed the base branch from v0.10.2 to v1.0.0 June 9, 2024 17:29

ahuarte47 added 2 commits June 9, 2024 20:56

Merge remote-tracking branch 'upstream/v1.0.0' into main_raster-type

3233443

Changes for updating to v1.0.0

029ff33
