What is the best way to read a pin from Posit Connect directly into polars? #233

SamEdwardes · 2024-06-12T18:06:00Z

What is the best way to read a pin from Posit Connect directly into polars?

I have found a few options that work, but none of them are exactly what I am looking for.

(1) Use pin_download

This method is my favourite, but it requires one extra step: I can't get a polars DataFrame directly from pin_read.

import pins
import polars as pl

board = pins.board_connect()
# pin_download returns the local paths of the pin's cached files
paths = board.pin_download("sam.edwardes/vessel_verbose_raw")
pl.read_parquet(paths)

(2) Use pandas

This works, but requires me to have pandas installed. I also assume there is some kind of performance hit because the data is first read into pandas and then converted to polars?

import pins
import pandas as pd
import polars as pl

board = pins.board_connect()
pl.DataFrame(board.pin_read("sam.edwardes/vessel_verbose_raw"))

(3) Use fsspec

I have not figured out how to implement this yet, but I found this example for duckdb:

#193

Is there a way to do this with polars?

Related issues

Suggestions

It would be nice if pins gave me a choice about what DataFrame library is used with pin_read. For example:

import pins
import polars as pl

board = pins.board_connect()
board.pin_read("sam.edwardes/vessel_verbose_raw", df_library='polars')
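
Until something like this exists, option 1 can be wrapped in a small helper. The following is only a minimal sketch: `read_pin_as` and its `df_library` argument are hypothetical names, not part of pins, and it assumes the pin is stored as a single Parquet file.

```python
import importlib


def read_pin_as(board, name, df_library="pandas"):
    """Download a pin and read it with the requested DataFrame library.

    Hypothetical helper, not part of pins. Assumes the pin is a Parquet
    file and that `df_library` names an installed module exposing
    `read_parquet` (true for both pandas and polars).
    """
    if df_library not in ("pandas", "polars"):
        raise ValueError(f"unsupported df_library: {df_library!r}")
    # Import lazily so only the requested library has to be installed.
    module = importlib.import_module(df_library)
    # pin_download returns a list of local file paths for the pin.
    paths = board.pin_download(name)
    return module.read_parquet(paths[0])
```

With a real board this would be called as `read_pin_as(board, "sam.edwardes/vessel_verbose_raw", df_library="polars")`.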

isabelizimm · 2024-06-12T21:18:17Z

Oh this is a great question! There is currently no official way to read pins directly into polars. I would likely lean towards your option 2. pandas is a dependency of pins, so pandas will already be installed; you don't have to import it or have a special extra install. That way, you're able to keep the standard pin_read <> pin_write workflow. This preference is partially due to the fact that pins are usually storing small/medium sized data sets (the general rule of thumb is 500MB or less), so the performance difference is less important than in a big data environment.

I do see many packages moving to a dataframe library agnostic approach and am definitely interested in what that would look like for pins 👀

SamEdwardes · 2024-06-13T12:57:48Z

Thanks for your help @isabelizimm! I am working on some examples for posit::conf and will update them to use option 2. I did not realize pandas was already a dependency!

nathanjmcdougall · 2024-07-20T01:46:24Z

Some thoughts about a design for supporting multiple DataFrame families in pin_read:

I agree with @SamEdwardes that the best thing would be a new string argument like df_library, defaulting to None. To begin with, None might be interpreted to mean "use pandas".

In the longer run, something global might change the behaviour of None:

import pins
pins.set_default_df_provider("polars")

Perhaps this would be at the board level rather than completely globally.

Also, the python_df_library could be stored as metadata in the pin itself, and by default pin_read would use whatever DataFrame type was used in pin_write (if available).

On the other hand, it's not too hard to run pl.DataFrame(board.pin_read("mypin")) to do the conversion, and board.pin_read("mypin", df_library="polars") would not be much more readable. So I think this is fairly low priority to add.

Lastly, I think that geopandas (#249 / #254) has a similar issue where a user might want a standard .parquet file loaded as a GeoDataFrame, rather than pandas.DataFrame (although supporting this adds very little value).

machow · 2024-07-23T13:17:40Z

Hey, just to add 2 cents here -- since pins has Board classes, there could be an option for the DataFrame class constructor to use?

Something like...

board = BaseBoard(..., frame_cls=pd.DataFrame)

# or in the board constructors
board_s3(..., frame_cls=pd.DataFrame)

This way...

- people could set whatever frame constructor they want
- we could still default to pandas (e.g. frame_cls=None by default uses pandas)
- there is a direct connection between the option and Board

(Alternatively, I def think some kind of global option like the one @nathanjmcdougall suggested is super reasonable!)

Avoiding recoding option docstrings over and over

One downside to setting it as a parameter on things like BaseBoard and constructors is you'd have to document parameters like frame_cls repeatedly.

One option for avoiding that could be something like an Options class:

BaseBoard(..., Options(frame_cls=pl.DataFrame))

This way, you'd just have to document a set of pins.Options in one place.

isabelizimm · 2024-07-24T22:41:58Z

What do people feel about some blend of a global option (with the default being pandas) + Options(frame_cls) at the board level? The behavior would be: if the board-level option is specified, use that; otherwise, fall back to the global option. Then people would be able to have multiple boards with different dataframe types if that's their style, or set it once if they're solely pandas or polars users.
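
The fallback rule described above can be sketched roughly as follows. Everything here is hypothetical and not part of pins: `Board` is a stand-in class rather than a real pins board, and a plain string `df_library` takes the place of `frame_cls` so the sketch runs without pandas or polars installed.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical global default; pandas matches today's behaviour.
_global_df_library = "pandas"


def set_default_df_provider(name: str) -> None:
    """Change the process-wide default DataFrame library."""
    global _global_df_library
    _global_df_library = name


@dataclass
class Options:
    # None means "not set at the board level; defer to the global default".
    df_library: Optional[str] = None


@dataclass
class Board:
    """Stand-in for a pins board; it only carries its options."""

    options: Options = field(default_factory=Options)

    def resolve_df_library(self) -> str:
        # A board-level choice wins; otherwise fall back to the global one.
        if self.options.df_library is not None:
            return self.options.df_library
        return _global_df_library
```

Under this sketch, `Board(Options(df_library="polars"))` always resolves to polars, while `Board()` follows whatever `set_default_df_provider` was last called with.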

nathanjmcdougall · 2024-07-24T22:54:00Z

I think that's a good balance, and it has the advantage of avoiding an Options class argument on pin_read, which I think would be a bit clunky.

One question to answer would be: what if the file type has multiple df libraries supported, but none of them are the global default or board default? There's still an ambiguity in that case. So maybe the API needs to allow you to set an ordered prioritization between libraries. Alternatively, the user could just temporarily change the global default... that's a bit hacky though.
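
One way the ordered prioritization could work is to walk the user's preference list and take the first library that supports the file type. A minimal sketch, where `pick_df_library` and both of its arguments are hypothetical names:

```python
from typing import Iterable, Sequence


def pick_df_library(supported: Iterable[str], preference: Sequence[str]) -> str:
    """Return the first preferred library that can read this file type."""
    supported_set = set(supported)
    for name in preference:
        if name in supported_set:
            return name
    # Fail loudly rather than silently picking an arbitrary library.
    raise ValueError(
        f"none of the preferred libraries {list(preference)} "
        f"can read this pin; supported: {sorted(supported_set)}"
    )
```

Raising when nothing matches keeps the ambiguity visible to the user instead of resolving it implicitly.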

Also, what if a user does want to deal with multiple df libraries per board? One way to handle this might be to have a context manager implementation like this, which would over-ride any other board or global config:

with pins.force_df_provider("polars"):
    df = board.pin_read(...)
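
Such a context manager could be built on the standard library's contextvars module. This is only a sketch of the idea: `force_df_provider` mirrors the name proposed above, `current_df_provider` is a hypothetical helper, and neither exists in pins.

```python
import contextlib
import contextvars
from typing import Iterator

# None means "no override active"; board or global settings apply instead.
_forced_provider = contextvars.ContextVar("forced_df_provider", default=None)


@contextlib.contextmanager
def force_df_provider(name: str) -> Iterator[None]:
    """Temporarily override every other DataFrame-library setting."""
    token = _forced_provider.set(name)
    try:
        yield
    finally:
        # Restore the previous value so nested overrides unwind correctly.
        _forced_provider.reset(token)


def current_df_provider(fallback: str = "pandas") -> str:
    """Return the active override, or `fallback` when none is set."""
    forced = _forced_provider.get()
    return forced if forced is not None else fallback
```

A context variable keeps the override local to the current thread or async task, which a plain module-level global would not.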

SamEdwardes · 2024-07-31T23:29:53Z

I don't have a strong preference on the implementation, but one outcome that would be nice is for things like IntelliSense to work. For example, VS Code should know whether it is getting a pandas DataFrame or a polars DataFrame back so that it can show me the correct auto-completion.
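
For a per-call argument, typing.overload with Literal is one way to give editors that information. The sketch below is an assumption about how it might look, not the pins API: `PandasFrame` and `PolarsFrame` are placeholder classes standing in for pandas.DataFrame and polars.DataFrame, and this `pin_read` is a free function written only for illustration.

```python
from typing import Literal, Union, overload


class PandasFrame:
    """Placeholder for pandas.DataFrame, so the sketch has no dependencies."""


class PolarsFrame:
    """Placeholder for polars.DataFrame."""


@overload
def pin_read(name: str, df_library: Literal["pandas"] = ...) -> PandasFrame: ...
@overload
def pin_read(name: str, df_library: Literal["polars"]) -> PolarsFrame: ...
def pin_read(name: str, df_library: str = "pandas") -> Union[PandasFrame, PolarsFrame]:
    # A real implementation would load the pin's data here.
    if df_library == "polars":
        return PolarsFrame()
    if df_library == "pandas":
        return PandasFrame()
    raise ValueError(f"unsupported df_library: {df_library!r}")
```

A global setting changed at runtime is not visible to a static type checker, so an explicit argument like this is easier to type precisely than a global default.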

isabelizimm added the .documentation Improvements or additions to documentation label Jul 19, 2024

This was referenced Jul 20, 2024:

Should pandas be an optional dependency? #261 (Closed)

Feature/153 support polars in pin_write to parquet #263 (Open)

isabelizimm added .enhancement New feature or request and removed .documentation Improvements or additions to documentation labels Jul 24, 2024
