feat(api): avoid caching physical tables #9976
Fixes #6195.
This makes `table_expr.cache()` a no-op for physical tables and other expressions not worth caching (defined here as simple column subsets of physical tables).
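For illustration, a minimal sketch of the new behavior (the backend, table, and column names here are assumptions, not from the diff):

```python
import ibis

con = ibis.duckdb.connect()  # backend chosen just for illustration
t = con.table("events")      # expression backed by an existing physical table

a = t.cache()                   # now a no-op: t is already physical
b = t.select("x", "y").cache()  # also a no-op: a simple column subset
c = t.filter(t.x > 0).cache()   # still cached: an actual computation
```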
In the original issue (#6195) I wanted to only avoid caching things that were backed by an existing physical table (don't cache tables, do cache views, do cache expressions, ...). In practice, writing `_is_this_a_real_database_table` for every backend would be annoying (maybe less annoying once the `.ddl` accessor lands). I'm also not sure if avoiding caching views is that bad.
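To give a sense of what that per-backend check would involve, here's a hypothetical sketch for the `duckdb` backend alone (the implementation is entirely assumed, not from this PR):

```python
def _is_this_a_real_database_table(con, name: str) -> bool:
    """Hypothetical duckdb-only check: is `name` a base table (not a view)?

    Every backend would need its own version of this.
    """
    row = con.con.execute(  # .con is the raw duckdb connection
        """
        SELECT table_type
        FROM information_schema.tables
        WHERE table_name = ?
        """,
        [name],
    ).fetchone()
    return row is not None and row[0] == "BASE TABLE"
```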
We do make exceptions for two backends though - `duckdb` and `pyspark`. Both of these backends have lazy-loading for file inputs (meaning creating the `ibis.Table` doesn't load the data yet), and loading the data can sometimes be expensive. In both cases the output of `read_parquet` et al. is backed by a `TEMP VIEW`. For these backends alone we will also cache temp views to work around this.
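This means that a pattern like the following will still work for both of these backends (a minimal sketch; the file path is assumed):

```python
# On duckdb/pyspark, read_parquet creates a lazy TEMP VIEW (no data read yet);
# .cache() then materializes it once so later uses don't re-read the file.
t = con.read_parquet("data.parquet").cache()
```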
Other backends where `read_parquet` reads immediately (not as a view) will now have `.cache()` as a no-op (where before it would duplicate the data).

All that said, I'm not happy with how squishy the definition of "do we cache this expression" is. I think the above is pragmatic and easy to do, but it's also hard to explain since there's so much "it depends" going on. Mostly pushing this up for discussion for now (I haven't added tests yet).
A few options that would make this simpler/more uniform/easier to explain:
1. Only skip caching `ops.DatabaseTable` nodes, with no other special cases. This would still cache some expressions that don't strictly need it (e.g. simple column subsets of physical tables), but the uniformity there may be worth it.
2. Only skip caching expressions backed by a `PhysicalTable`. This makes the choice much easier, since it requires no backend introspection. For `duckdb`/`pyspark`/others, add an optional kwarg to control whether the data is loaded immediately or lazily (feat(io): support way to ensure `read_*` apis create physical tables in the backend #9931). I know I closed that one in favor of feat: make `Table.cache()` a no-op for tables that are already concrete in a backend #6195, but now I'm waffling. This moves the decision about how to load data to the caller rather than after the call, which feels better to me (more localized control). Could be a backend-specific kwarg (so not there for all our backends), or something more general. Possible names: `cache=True`, `view=False`, `lazy=False`, idk (see the sketch below).
3. Don't do this at all and keep `.cache()` always caching. My original plan was to call `.cache()` uniformly to ensure I was working on a physical table for efficiency, but now I'm not sure if the complications here are worth it.

Currently leaning towards option 2, but could hear arguments for any of them.
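For concreteness, here's a rough sketch of what the option-2 kwarg could look like (the name `view` and its exact semantics are hypothetical, not part of this PR):

```python
# Hypothetical: ask the backend to materialize a physical table immediately
# instead of creating a lazy TEMP VIEW.
t = con.read_parquet("data.parquet", view=False)

t.cache()  # no-op under option 2: t is already backed by a physical table
```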