Schema question when performing an external materialization and registering with Glue #422

firewall413 · 2024-08-02T12:44:14Z

I'm trying to understand how schema registration works using the Glue.py plugin

You run your DBT logic e.g.
{{ config(materialized='external',location='s3://mybucket/hello, glue_register=true, ... )}}

select 1,2,3 from source

After which it it looks like the table is being materialized -> parquet file is written to s3 -> a view is built on top of this location using select * from s3://mybucket/*/*.parquet -> columns are extracted from this view -> this schema is registered in Glue.

This works neatly when all files have the same columns.

However, when adding new columns to your materialized parquet files (which Glue/Athena supports) and save those in a new partition, the next time the you run this model, it will still register the old schema (likely because of the */*.parquet of the s3 location), and seems to ignore the new schema.

Wouldn't it be better to register the schema of your last-run model? Is this a matter of reordering/adapting the macros in materializations/external.sql? Or would this be undesirable?

The text was updated successfully, but these errors were encountered:

jwills · 2024-08-02T16:58:42Z

I thought the intended behavior was to update the schema if the columns change (viz. https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/glue.py#L328 )-- so if that isn't happening sometimes, it seems like a bug?

firewall413 · 2024-10-14T14:58:54Z

Indeed, but the way that a view is created on top of the externally written file is a bit flawed I feel.

when duckdbt writes an external file to s3://mybucket/year=2024/month=10/day=14/hello.parquet using config:

{{ config(materialized='external',location='s3://mybucket, glue_register=true, partition_by...)}}

The external file will be written to the proper location, with the proper latest schema. However, after writing the file it will
create a view where read_location is used (in our case: s3://mybucket/ * / * / * /hello.parquet) and pick up the first (I think, but certainly not the last) file in the hive structure, in our case an old partition of 2018. Based off this view, the glue table will be registered with an old 2018 schema.

jwills · 2024-10-16T21:41:24Z

gotcha, that makes sense + seems worthwhile to fix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Schema question when performing an external materialization and registering with Glue #422

Schema question when performing an external materialization and registering with Glue #422

firewall413 commented Aug 2, 2024 •

edited

Loading

jwills commented Aug 2, 2024

firewall413 commented Oct 14, 2024 •

edited

Loading

jwills commented Oct 16, 2024

Schema question when performing an external materialization and registering with Glue #422

Schema question when performing an external materialization and registering with Glue #422

Comments

firewall413 commented Aug 2, 2024 • edited Loading

jwills commented Aug 2, 2024

firewall413 commented Oct 14, 2024 • edited Loading

jwills commented Oct 16, 2024

firewall413 commented Aug 2, 2024 •

edited

Loading

firewall413 commented Oct 14, 2024 •

edited

Loading