Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP add the Recipe #1064

Draft
wants to merge 29 commits into
base: main
Choose a base branch
from
Draft

Conversation

jeromedockes
Copy link
Member

This is still in draft mode but I'll open the PR so we can discuss the example.

I still need to add more tests and reference documentation

@jeromedockes jeromedockes marked this pull request as draft September 9, 2024 14:35
@jeromedockes jeromedockes added this to the 0.4.0 milestone Sep 9, 2024
@jeromedockes
Copy link
Member Author

jeromedockes commented Sep 9, 2024

the example is example 10, "using the recipe"

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some first remarks on the example and high-level concepts of the Recipe. As the Recipe offers many new features, I think the example should be simplified and more focused on the Recipe itself.

examples/10_recipe.py Outdated Show resolved Hide resolved
# %%
from skrub import Recipe, datasets

dataset = datasets.fetch_employee_salaries()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a good time to change our "default" demo dataset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked a bit but haven't found a good replacement yet. but as I suspect we will merge the fraud data example before this one, maybe employee salaries can be replaced with that one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, to do so we need to take care of the join operation with the recipe first, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, so that would come later. if anyone has suggestions for another dataset for this example I'd be happy to diversify a bit from employee salaries.

also, in employee salaries should we remove the "year first hired" in the fetcher? both here and in the tablevectorizer examples, the datetime encoder isn't useful because the feature it extracts has already been inserted in the dataset

examples/10_recipe.py Show resolved Hide resolved
examples/10_recipe.py Show resolved Hide resolved
from skrub import DatetimeEncoder
from skrub import selectors as s

recipe = recipe.add(DatetimeEncoder(), cols=s.any_date())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the recipe: as a user, I'm quite upset that the DatetimeEncoder doesn't perform the parsing with ToDatetime() for me. Sure, uncoupling all elements makes sense from a pure computer science perspective, but from the practitioner's (and the beginner's) point of view, this is a bit cumbersome.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can (I guess should) very easily add a ToDatetime inside the DatetimeEncoder

they are 2 transformers because in the TableVectorizer they have to be separate because the user provides the datetime encoder but does not control the column assignments. so the datetime columns must have been parsed to decide column assignments before they get assigned to the datetime encoder and reach it.

before, the main use case for datetime encoder was in the tablevectorizer. but now that it will become more practical to use without the tablevectorizer thanks to the recipe, adding datetime parsing to do it all in one go makes sense. (and the tablevectorizer just won't use this feature)

but there are also several other cleaning steps besides datetime parsing that the tablevectorizer does, so we might want either a transformer or an option for the recipe to apply all the cleaning / preprocessing (ie everything in the tablevectorizer except the user-provided final transformers)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can (I guess should) very easily add a ToDatetime inside the DatetimeEncoder

That would be great IMO

they are 2 transformers because in the TableVectorizer they have to be separate because the user provides the datetime encoder but does not control the column assignments. so the datetime columns must have been parsed to decide column assignments before they get assigned to the datetime encoder and reach it.

Yes, I remember the choices that led to this design, and I agree with them.

before, the main use case for datetime encoder was in the tablevectorizer. but now that it will become more practical to use without the tablevectorizer thanks to the recipe, adding datetime parsing to do it all in one go makes sense. (and the tablevectorizer just won't use this feature)

Ok, if that doesn't introduce too much complexity on the TV part, I'm all for it.

but there are also several other cleaning steps besides datetime parsing that the tablevectorizer does, so we might want either a transformer or an option for the recipe to apply all the cleaning / preprocessing (ie everything in the tablevectorizer except the user-provided final transformers)

That's interesting; I need to refresh my memory regarding this part.

Side question: Would using the TV with the Recipe make sense in general? I'm thinking about CVing the transformers and their hyper-parameters more easily.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes using the TableVectorizer in the Recipe completely makes sense, and it will help tune the choice of the encoders and their hyperparameters (the choose_* can be arbitrarily nested). I didn't do it in this example because on this dataset the TableVectorizer does everything fine so there would be only one step and it would be harder to showcase some features of the recipe

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! What about showing the recipe with TV at the end? Or would that make the message less obvious?

examples/10_recipe.py Show resolved Hide resolved
examples/10_recipe.py Outdated Show resolved Hide resolved
examples/10_recipe.py Show resolved Hide resolved
examples/10_recipe.py Outdated Show resolved Hide resolved
# choices.

# %%
recipe.get_cv_results_table(randomized_search)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interaction between the HP tuner and the recipe is interesting. I like that the recipe ties different elements together and makes pragmatic assumptions about the user flow.

Would that work with another HP tuner e.g. HalvingRandomizedSearch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I think I haven't added the halving search yet because when I made the recipe it was still experimental in scikit-learn (not sure if that's still the case) and its parameters are a bit hard for users to wrap their head around but at some point we should definitely add it.

atm it also has the gridsearch (although you can only use it if you don't have any continuous distributions in the hyperparameters of course)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also curious to see whether people using hp tuning libraries like optuna or hyperopt could use the recipe easily, provided we know how to extract some sort of cv_results_ from their tuners.

People could try something along the lines of:

model = recipe.get_pipeline()
tuner.fit(model, recipe.get_X(), recipe.get_y())
recipe.plot_parallel(tuner)

Of course, that would require us to know the methods used by other libraries, but it could be worth it in a subsequent iteration. WDYT?

):
if self._has_predictor():
pred_name = self._get_step_names()[-1]
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a high prio: should we allow more flexibility here and have estimators working as transformers? The hard part is making sure that's what the user wants and they are not stacking estimators by mistake.

sklego introduced this concept that might make sense for us: https://github.com/koaning/scikit-lego/blob/main/sklego/meta/estimator_transformer.py

@Vincent-Maladiere
Copy link
Member

Hey @jeromedockes, could you write a small TL;DR regarding the recent changes?

@jeromedockes
Copy link
Member Author

yes:

  • a small change to be compatible with the current version of the tablereport (columns that match a filter must now be given by their indices not column names)
  • removing get_x_test etc as we discussed in the first round of review
  • adding (developer) documentation to the _tuning module

@Vincent-Maladiere
Copy link
Member

Great, thanks!

Make a copy of a dataclass instance with different values for some of the attributes
"""
return obj.__class__(
**({f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)} | fields)
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not:

from dataclasses import asdict
obj.__class__(**(asdict(obj) | fields))

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asdict recurses into attributes and makes a deep copy, here we want a shallow copy

@jeromedockes jeromedockes removed this from the 0.4.0 milestone Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants