feat(pyarrow): support arrow PyCapsule interface in more places #9663

Open
wants to merge 2 commits into main

Conversation

@jcrist (Member) commented Jul 22, 2024

Two changes here:

  • Support __arrow_c_schema__ in ibis.Schema objects
  • Support objects implementing __arrow_c_stream__ as inputs to ibis.memtable

Fixes #9660.
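[Editorial note: a minimal sketch of what the two changes above would enable, assuming the behavior proposed in this PR and a recent pyarrow with PyCapsule support; pa.Table is used only as a stand-in for any __arrow_c_stream__ producer, and pa.schema() consuming __arrow_c_schema__ objects is an assumption about pyarrow, not something this PR adds.]

    import ibis
    import pyarrow as pa

    # __arrow_c_schema__ on ibis.Schema: a PyCapsule-aware consumer can read
    # the schema directly (assumes pa.schema() accepts such objects).
    ibis_schema = ibis.schema({"a": "int64", "b": "string"})
    pa_schema = pa.schema(ibis_schema)

    # __arrow_c_stream__ inputs to ibis.memtable: a pyarrow.Table stands in
    # here for any object exposing that method.
    t = ibis.memtable(pa.table({"a": [1, 2], "b": ["x", "y"]}))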

@jcrist (Member, Author) commented Jul 22, 2024

I'm going to split the schema one out into a new PR, since I'm not sure if the __arrow_c_stream__ input support makes sense (done in #9665).

@jcrist (Member, Author) commented Jul 22, 2024

I'm not sure if supporting __arrow_c_stream__-implementing objects in ibis.memtable makes sense. It certainly is easy to do. But these objects don't necessarily already hold data in memory - there's a reason that "stream" instead of "table" is used here. In particular, RecordBatchReaders and other streaming interfaces also implement this method. To support this in a memtable, we'd necessarily have to load all data into memory, which may be a performance footgun.
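[Editorial note: a small illustration of that point, not part of this PR. In recent pyarrow versions both a fully in-memory Table and a lazy RecordBatchReader expose the same protocol method, so the protocol alone doesn't tell a consumer whether materializing the stream is cheap.]

    import pyarrow as pa

    table = pa.table({"x": [1, 2, 3]})  # data already in memory

    def gen_batches():
        # stand-in for batches streamed from a file, socket, etc.
        yield pa.record_batch({"x": [1, 2, 3]})

    reader = pa.RecordBatchReader.from_batches(table.schema, gen_batches())

    print(hasattr(table, "__arrow_c_stream__"))   # True
    print(hasattr(reader, "__arrow_c_stream__"))  # True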

Indeed, this change originally ran into test failures because supporting this implicitly added support for con.create_table("new", record_batch_reader) to multiple new backends. We could fix this by explicitly erroring for RecordBatchReader types (as done here), asking the user to write the code needed to load everything into memory themselves.

I'm tempted to do the same here for all objects exposing __arrow_c_stream__. This would mean that the only code change here would be an error message for these objects telling users how to fix their code, but I'd rather the user fully understand the performance impact than accidentally load GBs of data into RAM.

Thoughts @kylebarron?

@jcrist (Member, Author) commented Jul 23, 2024

I'm tempted to do the same here for all objects exposing __arrow_c_stream__.

Thinking more on this, I think this is the correct approach. I really don't want to encourage users to do inefficient things, and implicitly slurping up data from anything exposing a __arrow_c_stream__ method seems risky. Barring comments from others, I plan to add a nice user-facing error that tells users how to explicitly load data when an object like this is passed to ibis.memtable, rather than doing it for them, since they may not have understood the implications. I'll hold off for a bit to allow comment from others (@kylebarron :))

@kylebarron

Managing when to materialize an Arrow stream is indeed tricky. I'd say it's best for that to be a conscious decision by the end user, where they can choose among APIs that do or don't materialize the stream. E.g. this is why both pyarrow.table and pyarrow.RecordBatchReader.from_stream exist for the same C stream input, so that the user can decide whether to materialize it or keep it lazy.
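[Editorial note: a sketch of that choice with pyarrow; RecordBatchReader.from_stream is assumed to be available, which requires a recent pyarrow release.]

    import pyarrow as pa

    def make_stream():
        batches = [pa.record_batch({"x": [1, 2, 3]})]
        return pa.RecordBatchReader.from_batches(batches[0].schema, batches)

    # Materialize the whole stream into memory as a Table...
    materialized = pa.table(make_stream())

    # ...or keep it lazy by wrapping it in a reader without consuming it.
    lazy = pa.RecordBatchReader.from_stream(make_stream())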

In terms of ibis, is it possible for something like a StreamTable to exist? For ibis to hold the stream under the hood as a representation of future data, and then to pass that stream off to whatever backend when it executes the query?

As long as it's documented that an ibis.memtable is always fully materialized in memory, I'd argue that it's still beneficial to accept an __arrow_c_stream__, because it's the standard way for a user to transfer some Arrow table that is already in memory.

@kylebarron

Indeed, this change originally ran into test failures because supporting this implicitly added support for con.create_table("new", record_batch_reader) to multiple new backends. We could fix this by explicitly erroring for RecordBatchReader types (as done here) asking the user to write the code needed to explicitly load everything into memory themselves.

I think this is fair. As I understand it: users can explicitly pass data with __arrow_c_stream__ to ibis.memtable, so it's explicit that data will be fully loaded in memory. If you did have some sort of "streaming table" that you could pass to backends, I think that would be fine to use implicitly within con.create_table.

@jcrist (Member, Author) commented Jul 23, 2024

In terms of ibis, is it possible for something like a StreamTable to exist? For ibis to hold the stream under the hood as a representation of future data, and then to pass that stream off to whatever backend when it executes the query?

Not really, no. Ibis expects that data in an ibis.memtable is reusable across multiple queries. Since a RecordBatchReader (or any other stream) cannot be traversed multiple times, using one in this way isn't viable. Some backends do support registering RecordBatchReaders as sources; in those cases you can only query them once, but that's expected. Likewise we have some streaming backends like flink where a source table is a stream, but that's backend-specific.

TL;DR: ibis.memtable objects need to hold data in memory as part of the op graph, but there are other mechanisms in some backends for wrapping streaming sources.
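[Editorial note: to make the single-pass point concrete (illustrative only): reading a RecordBatchReader consumes it, so the same reader can't back a memtable that multiple queries scan, while the materialized Table can be reused freely.]

    import ibis
    import pyarrow as pa

    batches = [pa.record_batch({"x": [1, 2, 3]})]
    reader = pa.RecordBatchReader.from_batches(batches[0].schema, batches)

    table = reader.read_all()  # first and only pass over the stream
    # The reader is now exhausted and can't be rewound, but the Table can
    # back a memtable that is queried any number of times.
    t = ibis.memtable(table)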


Re: your other comments, I'm having a hard time understanding if you're agreeing or disagreeing with my proposal above :). To clarify, there are two options here:

  1. Implicitly coerce any stream (whether in memory or not) into a fully concrete in-memory source when passed to ibis.memtable. This has performance impacts for things that aren't already in memory, and may not be the "right" thing to do in all cases. Right now we explicitly don't do this for RecordBatchReaders, but if we were to support __arrow_c_stream__-implementing types in ibis.memtable, then we'd want to support those too.

  2. Add a nice user-facing error for when a user calls ibis.memtable on a RecordBatchReader/__arrow_c_stream__-object that tells the user how to explicitly load the data they're asking for. This requires a bit more effort from the user, but avoids letting the user accidentally do the wrong thing. I've done that here for one type already in this PR (ibis/ibis/expr/api.py, lines 558 to 562 in 28aff8a):

     raise TypeError(
         "Creating an `ibis.memtable` from a `pyarrow.RecordBatchReader` would "
         "load _all_ data into memory. If you want to do this, please do so "
         "explicitly like `ibis.memtable(reader.read_all())`"
     )

Right now I'm leaning towards option 2, but can see both sides. Note that if we do decide that option 1 is what we want, this would require a larger lift since currently create_table/insert implicitly use ibis.memtable in some cases where we'd want a more efficient path for streaming sources. Option 2 is really easy for us to do, but that's because it's just adding some nicer error messages and no actual support :).

@kylebarron

If it's well-documented that ibis.memtable is a fully in-memory object (as the name suggests), then I think option 1 is appropriate. If a user passes an object into the ibis.memtable constructor, then it should follow that the source will be fully materialized.

You had previously referenced

con.create_table("new", record_batch_reader)

Does that implicitly use an ibis.memtable object? I wouldn't infer from the name con.create_table that record_batch_reader would be materialized; this is where maybe Arrow C Stream objects shouldn't be automatically used, instead erroring and suggesting that the user call

con.create_table("new", ibis.memtable(record_batch_reader))

Add a nice user-facing error for when a user calls ibis.memtable on a RecordBatchReader/__arrow_c_stream__-object

It happens that pyarrow.RecordBatchReader has a read_all method, but that's a library-specific API, not something inherent to all stream holders. And if the producer isn't pyarrow-based, then the user may not want to pass it through a pyarrow object (maybe this is moot if ibis itself will always depend on pyarrow).
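[Editorial note: one pyarrow-generic way to materialize such an object, sketched under the assumption that a recent pa.table() consumes the PyCapsule stream protocol; ThirdPartyStream is a hypothetical non-pyarrow producer, defined here only to keep the sketch runnable.]

    import pyarrow as pa

    class ThirdPartyStream:
        """Hypothetical producer that only implements the PyCapsule stream protocol."""

        def __init__(self, table: pa.Table):
            self._table = table

        def __arrow_c_stream__(self, requested_schema=None):
            # Delegates to pyarrow purely so this sketch runs; a real
            # third-party library would export its own stream capsule.
            return self._table.__arrow_c_stream__(requested_schema)

    obj = ThirdPartyStream(pa.table({"x": [1, 2, 3]}))

    # No producer-specific .read_all() needed:
    materialized = pa.table(obj)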
