Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/153 support polars in pin_write to parquet #263

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

nathanjmcdougall
Copy link
Contributor

@nathanjmcdougall nathanjmcdougall commented Jul 20, 2024

To resolve #153.

A harder one is #233 - at the moment, pandas is the hard-coded return type for pin_read.

I have added some lightweight documentation in the README files; perhaps it would be good to add the documentation associated with the current recommended approach for #233 at the same time to cover both the read and write cases.

pins/drivers.py Outdated
elif is_pandas_df:
return "pandas"
else:
assert_never(df)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: are you able to remove

if not is_polars_df and not is_pandas_df:
    return "unknown"

chunk then return "unknown" to make this a bit more concise?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OR, WDYT of raising NotImplementedError here rather than returning "unknown", since we already know "unknown" will cause a failure? (We could try/except NotImplementedError for creating the default title potentially)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the NotImplementedError idea a lot 😄

pins/drivers.py Outdated
@@ -15,14 +16,38 @@


def _assert_is_pandas_df(x):
import pandas as pd
df_family = _get_df_family(x)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a benefit to using _get_df_family() over checking for a pandas dataframe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation is to centralize the logic for deciding a DataFrame's type into one single location.

I think my preference for consistency's sake is to refactor the _assert_is_pandas_df function away, and use a similar strategy of case checking as in these lines:

https://github.com/nathanjmcdougall/pins-python/blob/2c2e5029c300631822d1326d5b94cde895b4294b/pins/drivers.py#L200-L211

The disadvantage is that the error messages could get a bit repetitive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with eventually refactoring _assert_is_pandas_df out in favor of _get_df_family! I think we can streamline the errors a bit 😄

pins/drivers.py Outdated
raise NotImplementedError(
f"Currently only pandas.DataFrame can be saved as type {file_type!r}."
)


def _get_df_family(df) -> Literal["pandas", "polars"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _get_df_family(df) -> Literal["pandas", "polars"]:
def _get_df_family(df, file_type: str) -> Literal["pandas", "polars"]:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I think I see a nice abstraction for this...

@machow
Copy link
Collaborator

machow commented Jul 30, 2024

From pairing w/ @isabelizimm, one potentially handy pattern for handling multiple DataFrames could be an adaptor. Basically, everywhere a piece of DataFrame logic is, could become a method on an adaptor class. For example...

from abc import abstractmethod
from typing_extensions import TYPE_CHECKING, TypeAlias


class Adaptor:
    def __init__(self, data):
        self._d = data

    @abstractmethod
    def write_parquet(self, name: str):
        raise NotImplementedError()

    @abstractmethod
    def default_title(self):
        # return whatever is needed for use in default_title()
        raise NotImplementedError()


class PandasAdaptor(Adaptor):
    def write_parquet(self, name: str):
        self._d.to_parquet(name)
    
    def default_title(self):
        return "I'm a n row pandas DataFrame"


class PolarsAdaptor(Adaptor):
    def write_parquet(self, name: str):
        self._d.write_parquet(name)


if TYPE_CHECKING:
    import pandas as pd
    import polars as pl
    DataFrame: TypeAlias = pd.DataFrame | pl.DataFrame

def create_adaptor(d: "DataFrame") -> Adaptor:
    # TODO: some kind of conditional importing
    # can use databackend to avoid imports: https://github.com/machow/databackend
    # this is what Great Tables uses here: https://github.com/posit-dev/great-tables/blob/main/great_tables/_tbl_data.py

    import pandas as pd
    import polars as pl

    if isinstance(d, pd.DataFrame):
        return PandasAdaptor(d)
    elif isinstance(d, pl.DataFrame):
        return PolarsAdaptor(d)

    raise NotImplementedError()

import pandas as pd

df = pd.DataFrame({"x": [1,2]})
create_adaptor(df).default_title()

We could use narwhals to replace the DataFrame logic (which is what is happening in this shiny PR posit-dev/py-shiny#1570). But it might be easier just to start with pandas and polars adaptors (to airgap out DataFrame logic), and then slot narwhals in after.

@nathanjmcdougall
Copy link
Contributor Author

I've had a go at using that approach over in #298 - great idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support for pinning polars dataframe
3 participants