feat(duckdb/pyspark): load data from duckdb into pyspark #8440
Comments
We don't support this kind of cross-backend data loading, not even with two different instances of the same backend. Don't we already have an issue for this?
If you can add a use case, that would also be helpful. It's not trivial to make this work, so having some rationale might help justify the effort.
I just ran into this while getting demo data for #8090; I don't consider that worth this effort. The overall issue is #8115, which I do think is worth prioritizing. Similar to #8426, I think it adds a lot of value to be able to easily (and ideally efficiently) move data across all the systems Ibis supports for a few use cases.
For this issue, feel free to just close in favor of #8115, though I also think it'd be good to have a better error message here ("Error: cannot transfer data between DuckDB and PySpark backends") -- I'm not sure how difficult detecting and adding that is. For the purposes of the tutorial I'll just write to CSV/Parquet and read from that.
Converting to a feature request given the above.
I agree. And also, this is a monstrous problem for anything that doesn't have native Arrow support. If we plan to try to recreate …
limiting to backends that have native Arrow support seems fine to me. perhaps w/ exceptions for postgres and sqlite given how common they are as sources of data for analytics (idk if they support Arrow but I assume not)
they do not. we could accomplish this in the short term by using duckdb's …
oh yeah I like using DuckDB for that -- would vote for optional dependency |
I think this issue is a duplicate of #4800? Implementation-wise, the common case would be iterating over … I'd vote to close this in favor of #4800, but with a focus on designing #4800 so we can handle the fast-path support in …
What happened?
trying to create example data in a PySpark connection and running into errors
repro:
try with `to_pyarrow()`:

What version of ibis are you using?
main
What backend(s) are you using, if any?
duckdb + pyspark
Relevant log output