What is the best way to read a pin from Posit Connect directly into polars? #233

Open
SamEdwardes opened this issue Jun 12, 2024 · 7 comments
Labels
.enhancement New feature or request

Comments

@SamEdwardes

What is the best way to read a pin from Posit Connect directly into polars?

I have found a few options that work, but none of them are exactly what I am looking for.

(1) Use pin_download

This method is my favourite, but it requires one extra step because I can't get a polars DataFrame directly from pin_read.

import pins
import polars as pl

board = pins.board_connect()
paths = board.pin_download("sam.edwardes/vessel_verbose_raw")
pl.read_parquet(paths)

(2) Use pandas

This works, but it requires pandas to be installed. I also assume there is some kind of performance hit, because the data is first read into pandas and then converted to polars?

import pins
import pandas as pd
import polars as pl

board = pins.board_connect()
pl.DataFrame(board.pin_read("sam.edwardes/vessel_verbose_raw"))

(3) Use fsspec

I have not figured out how to implement this yet, but I found this example for duckdb:

#193

Is there a way to do this with polars?

Related issues

Suggestions

It would be nice if pins gave me a choice about what DataFrame library is used with pin_read. For example:

import pins
import polars as pl

board = pins.board_connect()
board.pin_read("sam.edwardes/vessel_verbose_raw", df_library='polars')
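Until something like this exists, a small wrapper can approximate it. This is only a minimal sketch: `read_pin_as` and its `df_library` parameter are hypothetical and not part of pins. It assumes `board.pin_read` returns a pandas DataFrame for the pin and converts it with `pl.from_pandas`.

```python
def read_pin_as(board, name, df_library=None):
    """Hypothetical helper (not part of pins): read a pin, then convert it.

    Assumes board.pin_read(name) returns a pandas DataFrame.
    """
    df = board.pin_read(name)
    if df_library in (None, "pandas"):
        # None is treated as "use pandas", matching today's behaviour.
        return df
    if df_library == "polars":
        import polars as pl  # imported lazily so polars stays optional

        return pl.from_pandas(df)
    raise ValueError(f"Unsupported df_library: {df_library!r}")
```

With a real board, usage would look like `read_pin_as(board, "sam.edwardes/vessel_verbose_raw", df_library="polars")`.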
@isabelizimm
Collaborator

Oh this is a great question! There is currently no official way to read pins directly into polars. I would likely lean towards your option 2. pandas is a dependency of pins, so it will already be installed; you don't have to import it or do a special extra install. That way, you're able to keep the standard pin_read <> pin_write workflow. This preference is partially due to the fact that pins usually store small to medium-sized data sets (the general rule of thumb is 500MB or less), so the performance difference is less important than in a big data environment.

I do see many packages moving to a dataframe library agnostic approach and am definitely interested in what that would look like for pins 👀

@SamEdwardes
Author

Thanks for your help @isabelizimm! I am working on some examples for posit::conf and will update them to use option 2. I did not realize pandas was already a dependency!

@isabelizimm isabelizimm added the .documentation Improvements or additions to documentation label Jul 19, 2024
@nathanjmcdougall
Contributor

nathanjmcdougall commented Jul 20, 2024

Some thoughts about a design for supporting multiple DataFrame families in pin_read:

I agree with @SamEdwardes that the best thing would be a new string argument like df_library, defaulting to None. To begin with, None might be interpreted to mean "use pandas".

In the longer run, something global might change the behaviour of None:

import pins
pins.set_default_df_provider("polars")

Perhaps this would be at the board level rather than completely globally.

Also, the python_df_library could be stored as metadata in the pin itself, and by default pin_read would use whatever DataFrame type was used in pin_write (if available).

On the other hand, it's not hard to run pl.DataFrame(board.pin_read("mypin")) to do the conversion, and board.pin_read("mypin", df_library="polars") would not be much more readable. So I think this is fairly low priority to add.

Lastly, I think that geopandas (#249 / #254) has a similar issue where a user might want a standard .parquet file loaded as a GeoDataFrame, rather than pandas.DataFrame (although supporting this adds very little value).

@machow
Collaborator

machow commented Jul 23, 2024

Hey, just to add 2 cents here -- since pins has Board classes, there could be an option for the DataFrame class constructor to use?

Something like...

board = BaseBoard(..., frame_cls=pd.DataFrame)

# or in the board constructors
board_s3(..., frame_cls=pd.DataFrame)

This way...

  • people could set whatever frame constructor they want
  • we could still default to pandas (e.g. frame_cls=None by default uses pandas)
  • there is a direct connection between the option and Board
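A minimal sketch of the `frame_cls` idea, under stated assumptions: `SketchBoard` is a hypothetical stand-in rather than the real `BaseBoard`, and `loader` stands in for whatever the board does today to produce its default (pandas) result.

```python
class SketchBoard:
    """Hypothetical stand-in for a pins board; not the real BaseBoard."""

    def __init__(self, loader, frame_cls=None):
        # loader: callable taking a pin name and returning the default frame.
        self._loader = loader
        # frame_cls: optional constructor applied to the loaded data.
        self.frame_cls = frame_cls

    def pin_read(self, name):
        data = self._loader(name)
        if self.frame_cls is None:
            # No constructor configured: keep the default (pandas) result.
            return data
        return self.frame_cls(data)
```

With polars installed, passing `frame_cls=pl.DataFrame` would wrap the pandas result in the same way as option 2 above.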

(Alternatively, I def think some kind of global option like the one @nathanjmcdougall suggested is super reasonable!)

Avoiding recoding option docstrings over and over

One downside to setting it as a parameter on things like BaseBoard and constructors is you'd have to document parameters like frame_cls repeatedly.

One option for avoiding that could be something like an Options class:

BaseBoard(..., Options(frame_cls=pl.DataFrame))

This way, you'd just have to document a set of pins.Options in one place.

@isabelizimm isabelizimm added .enhancement New feature or request and removed .documentation Improvements or additions to documentation labels Jul 24, 2024
@isabelizimm
Collaborator

What do people feel about some blend of a global option (with the default being pandas) + Options(frame_cls) at the board level? The behavior would be: if the board-level option is specified, use that; otherwise, fall back to the global option. Then people would be able to have multiple boards with different dataframe types if that's their style, or set it once if they only use pandas or only use polars.
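The fallback rule could be sketched roughly as follows. Everything here is hypothetical: `Options`, `set_default_df_provider`, and `resolve_df_library` are illustrative names, and the sketch stores library names as strings (rather than a `frame_cls` constructor) so that it runs without pandas or polars installed.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical global default; pandas matches current pins behaviour.
_GLOBAL_DF_LIBRARY = "pandas"


def set_default_df_provider(name: str) -> None:
    """Change the global default library name."""
    global _GLOBAL_DF_LIBRARY
    _GLOBAL_DF_LIBRARY = name


@dataclass(frozen=True)
class Options:
    # None means "defer to the global default".
    df_library: Optional[str] = None


def resolve_df_library(board_options: Optional[Options] = None) -> str:
    """Board-level option wins; otherwise fall back to the global option."""
    if board_options is not None and board_options.df_library is not None:
        return board_options.df_library
    return _GLOBAL_DF_LIBRARY
```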

@nathanjmcdougall
Contributor

nathanjmcdougall commented Jul 24, 2024

I think that's a good balance, and it has the advantage of not adding an Options class argument to pin_read, which I think would be a bit clunky.

One question to answer would be: what if the file type is supported by multiple df libraries, but none of them is the global default or the board default? There's still an ambiguity in that case. So maybe the API needs to allow you to set an ordered prioritization between libraries. Alternatively, the user could just temporarily change the global default... that's a bit hacky though.
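One way to picture the ordered prioritization, as a sketch only: `choose_df_library` is a hypothetical helper that takes the libraries able to load a given file type and the user's preference order, then picks the first match.

```python
def choose_df_library(supported, preferences):
    """Hypothetical helper: return the first preferred library that can
    load the pin's file type.

    supported:   library names able to read this file type
    preferences: library names in the user's priority order
    """
    for name in preferences:
        if name in supported:
            return name
    # No preferred library applies, so the ambiguity has to be surfaced.
    raise LookupError(
        f"None of {list(preferences)} can load this pin; "
        f"supported libraries: {sorted(supported)}"
    )
```

For example, `choose_df_library({"pandas", "polars"}, ["geopandas", "polars", "pandas"])` would select polars, because geopandas is not in the supported set.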

Also, what if a user does want to deal with multiple df libraries per board? One way to handle this might be to have a context manager implementation like this, which would over-ride any other board or global config:

with pins.force_df_provider("polars"):
    df = pin_read(...)
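Such a context manager could be built on the standard library's `contextvars`. This is a sketch under assumptions: `force_df_provider` and `current_df_provider` are hypothetical names, and a real `pin_read` would need to consult `current_df_provider` before any board or global setting.

```python
import contextlib
import contextvars

# Holds the forced library name, or None when nothing is being forced.
_forced_provider = contextvars.ContextVar("forced_df_provider", default=None)


@contextlib.contextmanager
def force_df_provider(name):
    """Temporarily override any board-level or global configuration."""
    token = _forced_provider.set(name)
    try:
        yield
    finally:
        # Restore the previous value, even if the body raised.
        _forced_provider.reset(token)


def current_df_provider(default="pandas"):
    """Return the forced provider if one is active, else the default."""
    forced = _forced_provider.get()
    return forced if forced is not None else default
```

Using a `ContextVar` rather than a plain module-level variable keeps the override local to the current thread or async task, and nested `with` blocks restore the outer value on exit.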

@SamEdwardes
Author

I don't have a strong preference on the implementation, but one outcome that would be nice is for things like IntelliSense to work. For example, VS Code should know whether it is getting a pandas DataFrame or a polars DataFrame back so that it can show me the correct auto-completion.
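For an explicit string argument, `typing.overload` with `Literal` types is one way to give editors that information. The sketch below is hypothetical (`pin_read_typed` is not a pins function) and assumes `board.pin_read` returns a pandas DataFrame. Note that a global or board-level default is harder to type statically, since the return type would then depend on runtime state.

```python
from typing import TYPE_CHECKING, Any, Literal, overload

if TYPE_CHECKING:
    # Only needed by type checkers; nothing is imported at runtime here.
    import pandas as pd
    import polars as pl


@overload
def pin_read_typed(
    board: Any, name: str, df_library: Literal["pandas"] = ...
) -> "pd.DataFrame": ...
@overload
def pin_read_typed(
    board: Any, name: str, df_library: Literal["polars"]
) -> "pl.DataFrame": ...
def pin_read_typed(board: Any, name: str, df_library: str = "pandas") -> Any:
    """Hypothetical wrapper whose return type follows df_library."""
    df = board.pin_read(name)  # assumed to be a pandas DataFrame
    if df_library == "pandas":
        return df
    if df_library == "polars":
        import polars  # imported lazily so polars stays optional

        return polars.from_pandas(df)
    raise ValueError(f"Unsupported df_library: {df_library!r}")
```

A type checker would then infer `pl.DataFrame` for `pin_read_typed(board, "mypin", df_library="polars")` and `pd.DataFrame` when the argument is omitted.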
