Feature/153 support polars in pin_write to parquet #263

nathanjmcdougall · 2024-07-20T01:08:39Z

To resolve #153.

A harder one is #233 - at the moment, pandas is the hard-coded return type for pin_read.

I have added some lightweight documentation in the README files; perhaps it would be good to add the documentation associated with the current recommended approach for #233 at the same time to cover both the read and write cases.

isabelizimm · 2024-07-23T20:54:42Z

pins/drivers.py

+    elif is_pandas_df:
+        return "pandas"
+    else:
+        assert_never(df)


nit: are you able to remove the

if not is_polars_df and not is_pandas_df: return "unknown"

chunk and return "unknown" from the else branch instead, to make this a bit more concise?

OR, WDYT of raising NotImplementedError here rather than returning "unknown", since we already know "unknown" will cause a failure? (We could try/except NotImplementedError for creating the default title potentially)

I like the NotImplementedError idea a lot 😄

isabelizimm · 2024-07-23T21:28:09Z

pins/drivers.py

@@ -15,14 +16,38 @@


 def _assert_is_pandas_df(x):
-    import pandas as pd
+    df_family = _get_df_family(x)


Is there a benefit to using _get_df_family() over checking for a pandas dataframe?

The motivation is to centralize the logic for deciding a DataFrame's type into one single location.

I think my preference for consistency's sake is to refactor the _assert_is_pandas_df function away, and use a similar strategy of case checking as in these lines:

https://github.com/nathanjmcdougall/pins-python/blob/2c2e5029c300631822d1326d5b94cde895b4294b/pins/drivers.py#L200-L211

The disadvantage is that the error messages could get a bit repetitive.

I agree with eventually refactoring _assert_is_pandas_df out in favor of _get_df_family! I think we can streamline the errors a bit 😄

isabelizimm · 2024-07-24T21:58:20Z

pins/drivers.py

        raise NotImplementedError(
            f"Currently only pandas.DataFrame can be saved as type {file_type!r}."
        )


+def _get_df_family(df) -> Literal["pandas", "polars"]:


Suggested change

-def _get_df_family(df) -> Literal["pandas", "polars"]:
+def _get_df_family(df, file_type: str) -> Literal["pandas", "polars"]:

Good idea, I think I see a nice abstraction for this...

machow · 2024-07-30T20:42:43Z

From pairing w/ @isabelizimm, one potentially handy pattern for handling multiple DataFrames could be an adaptor. Basically, every place that currently holds a piece of DataFrame logic could become a method on an adaptor class. For example...

from abc import abstractmethod
from typing_extensions import TYPE_CHECKING, TypeAlias


class Adaptor:
    def __init__(self, data):
        self._d = data

    @abstractmethod
    def write_parquet(self, name: str):
        raise NotImplementedError()

    @abstractmethod
    def default_title(self):
        # return whatever is needed for use in default_title()
        raise NotImplementedError()


class PandasAdaptor(Adaptor):
    def write_parquet(self, name: str):
        self._d.to_parquet(name)
    
    def default_title(self):
        return "I'm a n row pandas DataFrame"


class PolarsAdaptor(Adaptor):
    def write_parquet(self, name: str):
        self._d.write_parquet(name)


if TYPE_CHECKING:
    import pandas as pd
    import polars as pl
    DataFrame: TypeAlias = pd.DataFrame | pl.DataFrame

def create_adaptor(d: "DataFrame") -> Adaptor:
    # TODO: some kind of conditional importing
    # can use databackend to avoid imports: https://github.com/machow/databackend
    # this is what Great Tables uses here: https://github.com/posit-dev/great-tables/blob/main/great_tables/_tbl_data.py

    import pandas as pd
    import polars as pl

    if isinstance(d, pd.DataFrame):
        return PandasAdaptor(d)
    elif isinstance(d, pl.DataFrame):
        return PolarsAdaptor(d)

    raise NotImplementedError()

import pandas as pd

df = pd.DataFrame({"x": [1,2]})
create_adaptor(df).default_title()

We could use narwhals to replace the DataFrame logic (which is what is happening in this shiny PR posit-dev/py-shiny#1570). But it might be easier just to start with pandas and polars adaptors (to airgap out DataFrame logic), and then slot narwhals in after.
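
A possible take on the "conditional importing" TODO in the sketch above, using only the standard library rather than databackend. This is an illustration under assumptions, not what pins or #298 does: it looks up pandas and polars in `sys.modules`, on the reasoning that an object can only be a DataFrame of a library that has already been imported. Inheriting from `ABC` and passing `name` to `default_title` are also assumptions added here.

```python
import sys
from abc import ABC, abstractmethod


class Adaptor(ABC):
    """Wraps a DataFrame and exposes the operations the rest of the code needs."""

    def __init__(self, data):
        self._d = data

    @abstractmethod
    def write_parquet(self, name: str) -> None: ...

    @abstractmethod
    def default_title(self, name: str) -> str: ...


class PandasAdaptor(Adaptor):
    def write_parquet(self, name: str) -> None:
        self._d.to_parquet(name)

    def default_title(self, name: str) -> str:
        n_rows, n_cols = self._d.shape
        return f"{name}: a pinned {n_rows} x {n_cols} pandas DataFrame"


class PolarsAdaptor(Adaptor):
    def write_parquet(self, name: str) -> None:
        self._d.write_parquet(name)

    def default_title(self, name: str) -> str:
        n_rows, n_cols = self._d.shape
        return f"{name}: a pinned {n_rows} x {n_cols} polars DataFrame"


def create_adaptor(d) -> Adaptor:
    # Never import pandas or polars here: only consult modules that are
    # already loaded, so neither library becomes a hard dependency.
    pd = sys.modules.get("pandas")
    if pd is not None and isinstance(d, pd.DataFrame):
        return PandasAdaptor(d)

    pl = sys.modules.get("polars")
    if pl is not None and isinstance(d, pl.DataFrame):
        return PolarsAdaptor(d)

    raise NotImplementedError(f"No adaptor available for {type(d)}")
```

Because `Adaptor` is an abstract base class here, a subclass that forgets one of the methods fails at construction time instead of when the missing method is first called.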

nathanjmcdougall · 2024-08-13T10:47:12Z

I've had a go at using that approach over in #298 - great idea.

isabelizimm reviewed Jul 23, 2024

isabelizimm mentioned this pull request Jul 23, 2024

Feature/249 support geoparquet #254

Open

nathanjmcdougall force-pushed the feature/153-support-polars branch from 2c2e502 to f41103e Compare July 23, 2024 22:52

isabelizimm reviewed Jul 24, 2024

nathanjmcdougall force-pushed the feature/153-support-polars branch from dcab779 to d682808 Compare July 24, 2024 22:34

nathanjmcdougall and others added 9 commits July 25, 2024 20:07

Add polars as a testing dependency

1a1536d

Support writing polars.DataFrame to parquet.

9667fc0

Support polars.DataFrame in default_title

998662a

Update docs.

f319afd

Use assert_never from typing_extensions not typing.

63895dc

Refactor _get_df_family to raise an error instead.

036c7e9

Write more robust df-library choosing logic.

85bd8c6

Use tmp_path instead of tmp_dir2

280cf52

Refactoring and adding tests.

fd2a12e

nathanjmcdougall force-pushed the feature/153-support-polars branch from 71e58c9 to fd2a12e Compare July 25, 2024 08:09

nathanjmcdougall mentioned this pull request Aug 13, 2024

Move to adaptor backend #298

Open
