feat(pyarrow): support arrow PyCapsule interface in more places #9663

Open
wants to merge 2 commits into main

Conversation

@jcrist (Member) commented Jul 22, 2024

Two changes here:

  • Support __arrow_c_schema__ in ibis.Schema objects
  • Support objects implementing __arrow_c_stream__ as inputs to ibis.memtable

Fixes #9660.
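[Editorial note: a minimal sketch of what the two changes above would enable, assuming the behavior proposed in this PR and a recent pyarrow with PyCapsule support; pa.Table is used only as a stand-in for any __arrow_c_stream__ producer, and pa.schema() consuming __arrow_c_schema__ objects is an assumption about pyarrow, not something this PR adds.]

    import ibis
    import pyarrow as pa

    # __arrow_c_schema__ on ibis.Schema: a PyCapsule-aware consumer can read
    # the schema directly (assumes pa.schema() accepts such objects).
    ibis_schema = ibis.schema({"a": "int64", "b": "string"})
    pa_schema = pa.schema(ibis_schema)

    # __arrow_c_stream__ inputs to ibis.memtable: a pyarrow.Table stands in
    # here for any object exposing that method.
    t = ibis.memtable(pa.table({"a": [1, 2], "b": ["x", "y"]}))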

@jcrist (Member, Author) commented Jul 22, 2024

I'm going to split the schema one out into a new PR, since I'm not sure if the __arrow_c_stream__ input support makes sense (done in #9665).

@jcrist (Member, Author) commented Jul 22, 2024

I'm not sure if supporting __arrow_c_stream__-implementing objects in ibis.memtable makes sense. It certainly is easy to do. But these objects don't necessarily already hold data in memory - there's a reason that "stream" instead of "table" is used here. In particular, RecordBatchReaders and other streaming interfaces also implement this method. To support this in a memtable, we'd necessarily have to load all data into memory, which may be a performance footgun.
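[Editorial note: a small illustration of that point, not part of this PR. In recent pyarrow versions both a fully in-memory Table and a lazy RecordBatchReader expose the same protocol method, so the protocol alone doesn't tell a consumer whether materializing the stream is cheap.]

    import pyarrow as pa

    table = pa.table({"x": [1, 2, 3]})  # data already in memory

    def gen_batches():
        # stand-in for batches streamed from a file, socket, etc.
        yield pa.record_batch({"x": [1, 2, 3]})

    reader = pa.RecordBatchReader.from_batches(table.schema, gen_batches())

    print(hasattr(table, "__arrow_c_stream__"))   # True
    print(hasattr(reader, "__arrow_c_stream__"))  # True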

Indeed, this change originally ran into test failures because supporting this implicitly added support for con.create_table("new", record_batch_reader) to multiple new backends. We could fix this by explicitly erroring for RecordBatchReader types (as done here), asking the user to write the code needed to load everything into memory themselves.

I'm tempted to do the same here for all objects exposing __arrow_c_stream__. This would mean that the only code change here would be an error message for these objects telling users how to fix their code, but I'd rather the user fully understand the performance impact than accidentally load GBs of data into RAM.

Thoughts @kylebarron?

@jcrist (Member, Author) commented Jul 23, 2024

I'm tempted to do the same here for all objects exposing __arrow_c_stream__.

Thinking more on this, I think this is the correct approach. I really don't want to encourage users to do inefficient things, and implicitly slurping up data from anything exposing a __arrow_c_stream__ method seems risky. Barring comments from others, I plan to add a nice user-facing error that tells users how to explicitly load data when an object like this is passed to ibis.memtable, rather than doing it for them, since they may not have understood the implications. I'll hold off for a bit to allow comment from others (@kylebarron :))

@kylebarron

Managing when to materialize an Arrow stream is indeed tricky. I'd say it's best for that to be a conscious decision by the end user, where they can choose among APIs that do or don't materialize the stream. E.g. this is why both pyarrow.table and pyarrow.RecordBatchReader.from_stream exist for the same C stream input, so that the user can decide whether to materialize it or keep it lazy.
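[Editorial note: a sketch of that choice with pyarrow; RecordBatchReader.from_stream is assumed to be available, which requires a recent pyarrow release.]

    import pyarrow as pa

    def make_stream():
        batches = [pa.record_batch({"x": [1, 2, 3]})]
        return pa.RecordBatchReader.from_batches(batches[0].schema, batches)

    # Materialize the whole stream into memory as a Table...
    materialized = pa.table(make_stream())

    # ...or keep it lazy by wrapping it in a reader without consuming it.
    lazy = pa.RecordBatchReader.from_stream(make_stream())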

In terms of ibis, is it possible for something like a StreamTable to exist? For ibis to hold the stream under the hood as a representation of future data, and then to pass that stream off to whatever backend when it executes the query?

As long as it's documented that an ibis.memtable is always fully materialized in memory, I'd argue that it's still beneficial to accept an __arrow_c_stream__, because it's the standard way for a user to transfer some Arrow table that is already in memory.

@kylebarron

Indeed, this change originally ran into test failures because supporting this implicitly added support for con.create_table("new", record_batch_reader) to multiple new backends. We could fix this by explicitly erroring for RecordBatchReader types (as done here) asking the user to write the code needed to explicitly load everything into memory themselves.

I think this is fair. As I understand it: users can explicitly pass data with __arrow_c_stream__ to ibis.memtable, so it's explicit that data will be fully loaded in memory. If you did have some sort of "streaming table" that you could pass to backends, I think that would be fine to use implicitly within con.create_table.

@jcrist (Member, Author) commented Jul 23, 2024

In terms of ibis, is it possible for something like a StreamTable to exist? For ibis to hold the stream under the hood as a representation of future data, and then to pass that stream off to whatever backend when it executes the query?

Not really, no. Ibis expects that data in an ibis.memtable is reusable across multiple queries. Since a RecordBatchReader (or any other stream) cannot be traversed multiple times, using one in this way isn't viable. Some backends do support registering RecordBatchReaders as sources; in those cases you can only query them once, but that's expected. Likewise we have some streaming backends like flink where a source table is a stream, but that's backend-specific.

TL;DR: ibis.memtable objects need to hold data in memory as part of the op graph, but there are other mechanisms in some backends for wrapping streaming sources.
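[Editorial note: to make the single-pass point concrete (illustrative only): reading a RecordBatchReader consumes it, so the same reader can't back a memtable that multiple queries scan, while the materialized Table can be reused freely.]

    import ibis
    import pyarrow as pa

    batches = [pa.record_batch({"x": [1, 2, 3]})]
    reader = pa.RecordBatchReader.from_batches(batches[0].schema, batches)

    table = reader.read_all()  # first and only pass over the stream
    # The reader is now exhausted and can't be rewound, but the Table can
    # back a memtable that is queried any number of times.
    t = ibis.memtable(table)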


Re: your other comments, I'm having a hard time understanding if you're agreeing or disagreeing with my proposal above :). To clarify, there are two options here:

  1. Implicitly coerce any stream (whether in memory or not) into a fully concrete in-memory source when passed to ibis.memtable. This has performance impacts for things that aren't already in memory, and may not be the "right" thing to do in all cases. Right now we explicitly don't do this for RecordBatchReaders, but if we were to support __arrow_c_stream__-implementing types in ibis.memtable, then we'd want to support those too.

  2. Add a nice user-facing error for when a user calls ibis.memtable on a RecordBatchReader/__arrow_c_stream__-object that tells the user how to explicitly load the data they're asking for. This requires a bit more effort from the user, but avoids letting the user accidentally do the wrong thing. I've done that here for one type already in this PR (ibis/ibis/expr/api.py, lines 558 to 562 in 28aff8a):

     raise TypeError(
         "Creating an `ibis.memtable` from a `pyarrow.RecordBatchReader` would "
         "load _all_ data into memory. If you want to do this, please do so "
         "explicitly like `ibis.memtable(reader.read_all())`"
     )

Right now I'm leaning towards option 2, but can see both sides. Note that if we do decide that option 1 is what we want, this would require a larger lift since currently create_table/insert implicitly use ibis.memtable in some cases where we'd want a more efficient path for streaming sources. Option 2 is really easy for us to do, but that's because it's just adding some nicer error messages and no actual support :).

@kylebarron

If it's well-documented that ibis.memtable is a fully in-memory object (as the name suggests), then I think option 1 is appropriate. If a user passes an object into the ibis.memtable constructor, then it should follow that the source will be fully materialized.

You had previously referenced

con.create_table("new", record_batch_reader)

Does that implicitly use an ibis.memtable object? I wouldn't infer from the name con.create_table that record_batch_reader would be materialized; this is where maybe Arrow C Stream objects shouldn't be automatically used, instead erroring and suggesting that the user call

con.create_table("new", ibis.memtable(record_batch_reader))

Add a nice user-facing error for when a user calls ibis.memtable on a RecordBatchReader/__arrow_c_stream__-object

It happens that pyarrow.RecordBatchReader has a read_all method, but that's a library-specific API, not something inherent to all stream holders. And if the producer isn't pyarrow-based, then the user may not want to pass it through a pyarrow object (maybe this is moot if ibis itself will always depend on pyarrow).
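[Editorial note: one pyarrow-generic way to materialize such an object, sketched under the assumption that a recent pa.table() consumes the PyCapsule stream protocol; ThirdPartyStream is a hypothetical non-pyarrow producer, defined here only to keep the sketch runnable.]

    import pyarrow as pa

    class ThirdPartyStream:
        """Hypothetical producer that only implements the PyCapsule stream protocol."""

        def __init__(self, table: pa.Table):
            self._table = table

        def __arrow_c_stream__(self, requested_schema=None):
            # Delegates to pyarrow purely so this sketch runs; a real
            # third-party library would export its own stream capsule.
            return self._table.__arrow_c_stream__(requested_schema)

    obj = ThirdPartyStream(pa.table({"x": [1, 2, 3]}))

    # No producer-specific .read_all() needed:
    materialized = pa.table(obj)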
