Shorthand for getting only the preprocessing part of the TableVectorizer #925

Open
jeromedockes opened this issue Jun 3, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@jeromedockes
Member

Problem Description

Sometimes we want to apply only the preprocessing/cleaning steps of the TableVectorizer (parsing datetimes, handling pandas extension dtypes, etc.) and handle the actual encoding in separate pipeline steps.
This will probably become more relevant once the Recipe (or whatever it ends up being called) is introduced: we could use it to build exactly the pipeline we want, while still applying the default cleaning done by the TableVectorizer.

If this sounds like a plausible use case, maybe we could have a shorthand for

TableVectorizer(
    high_cardinality_transformer="passthrough",
    low_cardinality_transformer="passthrough",
    datetime_transformer="passthrough",
    numeric_transformer="passthrough",
    specific_transformers=(),
)

for example:

TableSkrubber()
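
Until such a shorthand exists, the same effect could be had with a small helper. The following is only a sketch: `PASSTHROUGH_KWARGS` and `make_skrubber` are hypothetical names, not part of skrub, and the keyword names are copied from the snippet above, so they may not match other skrub versions. The vectorizer class is passed in as an argument so that the helper itself does not depend on any particular import.

```python
# Hypothetical helper, not part of skrub.
# Keyword names are copied from the snippet in this issue and may differ
# in other skrub versions.
PASSTHROUGH_KWARGS = {
    "high_cardinality_transformer": "passthrough",
    "low_cardinality_transformer": "passthrough",
    "datetime_transformer": "passthrough",
    "numeric_transformer": "passthrough",
    "specific_transformers": (),
}


def make_skrubber(vectorizer_cls, **overrides):
    """Build a cleaning-only vectorizer: every encoder is a passthrough.

    `vectorizer_cls` is expected to be a class such as skrub's
    TableVectorizer; `overrides` replace individual defaults.
    """
    return vectorizer_cls(**{**PASSTHROUGH_KWARGS, **overrides})
```

With skrub installed, usage would presumably be `make_skrubber(TableVectorizer)` after `from skrub import TableVectorizer`.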

Feature Description

...

Alternative Solutions

No response

Additional Context

No response

@jeromedockes jeromedockes added the enhancement New feature or request label Jun 3, 2024
@jeromedockes
Member Author

Some examples of the kind of cleaning the TableVectorizer does:

>>> import pandas as pd
>>> from skrub import TableVectorizer


>>> skrubber = TableVectorizer(
...     high_cardinality_transformer="passthrough",
...     low_cardinality_transformer="passthrough",
...     datetime_transformer="passthrough",
...     numeric_transformer="passthrough",
...     specific_transformers=(),
... )

>>> df = pd.DataFrame({
...     'a': ['2020-01-02', '2020-01-03'],
...     'b': ['2.2', 'nan'],
...     'c': [1.5, pd.NA],
...     'd': [True, False],
...     'e': pd.Series([4.5, 'a'], dtype='category'),
... })
>>> df
            a    b     c      d    e
0  2020-01-02  2.2   1.5   True  4.5
1  2020-01-03  nan  <NA>  False    a
>>> df.dtypes
a      object
b      object
c      object
d        bool
e    category
dtype: object
>>> df['e'].cat.categories
Index([4.5, 'a'], dtype='object')

>>> skrubbed = skrubber.fit_transform(df)
>>> skrubbed
           a    b    c    d    e
0 2020-01-02  2.2  1.5  1.0  4.5
1 2020-01-03  NaN  NaN  0.0    a
>>> skrubbed.dtypes
a    datetime64[ns]
b           float32
c           float32
d           float32
e          category
dtype: object
>>> skrubbed['e'].cat.categories
Index(['4.5', 'a'], dtype='object')
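
For comparison, some of these conversions can be approximated by hand with plain pandas. This is only an illustration of what a user would otherwise write per column, not how skrub implements the cleaning; column 'c' is left out, and the choice of pandas calls is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["2020-01-02", "2020-01-03"],
    "b": ["2.2", "nan"],
    "d": [True, False],
    "e": pd.Series([4.5, "a"], dtype="category"),
})

# Manual, per-column equivalents of part of the cleaning shown above.
cleaned = pd.DataFrame({
    "a": pd.to_datetime(df["a"]),             # date strings -> datetime
    "b": df["b"].astype("float32"),           # numeric strings -> float32 ('nan' -> NaN)
    "d": df["d"].astype("float32"),           # booleans -> 1.0 / 0.0
    "e": df["e"].cat.rename_categories(str),  # mixed category labels -> strings
})
print(cleaned.dtypes)
```

Each column needs its own rule here, which is the bookkeeping the proposed shorthand would take care of.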

@GaelVaroquaux
Member

I like the name "Skrubber"
