WIP add the Recipe #1064
base: main

Conversation
The example is example 10, "using the recipe".
Here are some first remarks on the example and high-level concepts of the Recipe. As the Recipe offers many new features, I think the example should be simplified and more focused on the Recipe itself.
```python
# %%
from skrub import Recipe, datasets

dataset = datasets.fetch_employee_salaries()
```
Is this a good time to change our "default" demo dataset?
I looked a bit but haven't found a good replacement yet. But as I suspect we will merge the fraud data example before this one, maybe employee salaries can be replaced with that one.
Ok, to do so we need to take care of the join operation with the recipe first, right?
Good point, so that would come later. If anyone has suggestions for another dataset for this example, I'd be happy to diversify a bit from employee salaries.
Also, in employee salaries, should we remove "year first hired" in the fetcher? Both here and in the TableVectorizer examples, the DatetimeEncoder isn't useful because the feature it extracts has already been inserted in the dataset.
```python
from skrub import DatetimeEncoder
from skrub import selectors as s

recipe = recipe.add(DatetimeEncoder(), cols=s.any_date())
```
Unrelated to the recipe: as a user, I'm quite upset that the DatetimeEncoder doesn't perform the parsing with ToDatetime() for me. Sure, decoupling all the elements makes sense from a pure computer-science perspective, but from the practitioner's (and the beginner's) point of view, it is a bit cumbersome.
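For reference, a minimal sketch of the two-step flow being discussed (the toy column name is made up):

```python
# The current two-step flow: parsing and feature extraction are separate.
import pandas as pd
from skrub import DatetimeEncoder, ToDatetime

dates = pd.Series(["2021-01-15", "2019-06-30"], name="hire_date")
parsed = ToDatetime().fit_transform(dates)  # step 1: parse strings to datetimes
features = DatetimeEncoder().fit_transform(parsed)  # step 2: extract numeric features
```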
We can (and I guess should) very easily add a ToDatetime step inside the DatetimeEncoder.
They are two transformers because in the TableVectorizer they have to be separate: the user provides the datetime encoder but does not control the column assignments, so the datetime columns must already have been parsed in order to decide the column assignments before they reach the datetime encoder.
Before, the main use case for the DatetimeEncoder was in the TableVectorizer. But now that the recipe makes it more practical to use on its own, adding datetime parsing so it does everything in one go makes sense (and the TableVectorizer just won't use this feature).
There are also several other cleaning steps besides datetime parsing that the TableVectorizer performs, so we might want either a transformer or an option on the recipe that applies all the cleaning / preprocessing (i.e., everything in the TableVectorizer except the user-provided final transformers).
> We can (and I guess should) very easily add a ToDatetime step inside the DatetimeEncoder.

That would be great IMO.

> They are two transformers because in the TableVectorizer they have to be separate: the user provides the datetime encoder but does not control the column assignments, so the datetime columns must already have been parsed in order to decide the column assignments before they reach the datetime encoder.

Yes, I remember the choices that led to this design, and I agree with them.

> Before, the main use case for the DatetimeEncoder was in the TableVectorizer. But now that the recipe makes it more practical to use on its own, adding datetime parsing so it does everything in one go makes sense (and the TableVectorizer just won't use this feature).

Ok, if that doesn't introduce too much complexity on the TV part, I'm all for it.

> There are also several other cleaning steps besides datetime parsing that the TableVectorizer performs, so we might want either a transformer or an option on the recipe that applies all the cleaning / preprocessing (i.e., everything in the TableVectorizer except the user-provided final transformers).

That's interesting; I need to refresh my memory regarding this part.
Side question: Would using the TV with the Recipe make sense in general? I'm thinking about CVing the transformers and their hyper-parameters more easily.
Yes, using the TableVectorizer in the Recipe completely makes sense, and it will help tune the choice of the encoders and their hyperparameters (the choose_* functions can be arbitrarily nested). I didn't do it in this example because on this dataset the TableVectorizer does everything fine, so there would be only one step and it would be harder to showcase some features of the recipe.
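A hedged sketch of what that could look like, extending the `recipe` built earlier in the example; `choose_from` and the `high_cardinality` parameter are assumptions based on the draft API discussed in this PR, not a settled interface:

```python
# Sketch only: tune the TableVectorizer's high-cardinality encoder through
# the Recipe. `choose_from` and `high_cardinality` are assumed from this
# PR's draft API; `recipe` is the Recipe object built earlier in the example.
from skrub import GapEncoder, MinHashEncoder, TableVectorizer, choose_from

recipe = recipe.add(
    TableVectorizer(
        high_cardinality=choose_from(
            [MinHashEncoder(), GapEncoder()], name="high_cardinality_encoder"
        )
    )
)
```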
Ok! What about showing the recipe with TV at the end? Or would that make the message less obvious?
```python
# choices.

# %%
recipe.get_cv_results_table(randomized_search)
```
This interaction between the HP tuner and the recipe is interesting. I like that the recipe ties different elements together and makes pragmatic assumptions about the user flow.
Would that work with another HP tuner, e.g. HalvingRandomSearchCV?
Yes, I think so. I haven't added the halving search yet because when I made the recipe it was still experimental in scikit-learn (not sure if that's still the case), and its parameters are a bit hard for users to wrap their heads around, but at some point we should definitely add it.
At the moment the recipe also has the grid search (although you can only use it if you don't have any continuous distributions in the hyperparameters, of course).
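To make the discrete-vs-continuous constraint concrete, a sketch using the choose_* helpers from this PR's draft API (names and signatures are assumptions): a grid search can enumerate a finite set of options but cannot enumerate a continuous range.

```python
# Assumed draft API from this PR: choose_from enumerates a finite set
# (grid-searchable), choose_float samples a continuous range, which only
# a randomized search can handle.
from skrub import choose_float, choose_from

n_components = choose_from([10, 30, 100], name="n_components")  # finite grid
alpha = choose_float(0.01, 1.0, log=True, name="alpha")  # continuous range
```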
I'm also curious to see whether people using HP tuning libraries like Optuna or Hyperopt could use the recipe easily, provided we know how to extract some sort of cv_results_ from their tuners.
People could try something along the lines of:

```python
model = recipe.get_pipeline()
tuner.fit(model, recipe.get_X(), recipe.get_y())
recipe.plot_parallel(tuner)
```

Of course, that would require us to know the methods used by other libraries, but it could be worth it in a subsequent iteration. WDYT?
```python
):
    if self._has_predictor():
        pred_name = self._get_step_names()[-1]
        raise ValueError(
```
Not a high priority: should we allow more flexibility here and have estimators work as transformers? The hard part is making sure that's what the user wants and that they are not stacking estimators by mistake.
sklego introduced this concept, which might make sense for us: https://github.com/koaning/scikit-lego/blob/main/sklego/meta/estimator_transformer.py
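For context, a minimal sketch of that concept (my own condensation of the sklego idea, not skrub or sklego API):

```python
# Wrap a predictor so its predictions become the transform output,
# letting an estimator sit mid-pipeline like a transformer.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class EstimatorTransformer(TransformerMixin, BaseEstimator):
    """Use an estimator's predictions as a feature column (sketch)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # clone so repeated fits don't mutate the user's estimator
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # predictions become a single feature column
        return np.asarray(self.estimator_.predict(X)).reshape(-1, 1)
```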
This reverts commit f351627.
Hey @jeromedockes, could you write a small TL;DR regarding the recent changes?
yes:
Great, thanks!
```python
    Make a copy of a dataclass instance with different values for some of the attributes
    """
    return obj.__class__(
        **({f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)} | fields)
    )
```
Why not the following?

```python
from dataclasses import asdict

obj.__class__(**(asdict(obj) | fields))
```
asdict recurses into attributes and makes a deep copy; here we want a shallow copy.
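A small demonstration of the difference (hypothetical dataclasses, for illustration only):

```python
# asdict recursively converts nested dataclasses to dicts and deep-copies
# leaf values; the fields-based dict comprehension reuses the same objects.
import dataclasses
from dataclasses import asdict, dataclass


@dataclass
class Inner:
    values: list


@dataclass
class Outer:
    inner: Inner


obj = Outer(Inner([1, 2]))
deep = asdict(obj)
shallow = {f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)}

print(deep["inner"] is obj.inner)  # False: asdict replaced Inner with a dict
print(shallow["inner"] is obj.inner)  # True: the same object is reused
```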
This is still in draft mode, but I'll open the PR so we can discuss the example.
I still need to add more tests and reference documentation.