Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT add parameter validation using BaseEstimator #958

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

glemaitre
Copy link
Member

This bring parameter validation that is included in the supported scikit-learn version.

@TheooJ
Copy link
Contributor

TheooJ commented Jun 17, 2024

Should we add this for all BaseEstimators ?

@glemaitre glemaitre marked this pull request as draft June 18, 2024 08:46
@glemaitre
Copy link
Member Author

Yep, we could basically have this everywhere. I just wanted to make a test.
But apparently, we added the parameter validation in scikit-learn 1.3 and thus the tests are failing (the minimum version is 1.2.1).

So if we want to support, we would need to vendor a file like in imbalanced-learn: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/utils/_param_validation.py
Additionally, we would need to create a mixin.

I don't think this is worth the development time right now: we might bump the minimal version of scikit-learn and then it would be usable. So we can let this PR on the side for the moment.

Here are two future advantages that I foresee:

  • Having informative and consistent error that are complementary to the documentation
  • Making sure that the documentation is aligned with the code. Then, it would ensure effort that create automatic stubs (https://github.com/scientific-python/docstub) will provide the proper types. On the UI/UX side, I think this would be an interesting thing to have.

But let's postpone.

@jeromedockes
Copy link
Member

relevant discussion in scikit-learn: scikit-learn/scikit-learn#22722

I would say for now I am +0 on this, I suggest we discuss it in one of the skrub meetings.

some potential drawbacks IMHO:

  • this looks a lot like type annotations but that are not recognized by mypy or other tools (I realize this pr is about runtime checks but those could inspect annotations instead of custom attributes). this means we do the work of adding annotations but don't get the benefit of static checking, improved autocompletion in editors etc. also, contributors have to learn the scikit-learn (private) constraints instead of standard typing utilities which they may already know, eg StrOptions instead of Literal
  • I'm not sure it is really necessary to systematically check the types of all parameters at runtime and I don't think it is super common to do this in Python code. maybe we could allow specifying constraints only for some parameters and the others would be understood to have no constraints/no runtime checks
  • quite a few of the user errors we check (and I would say they are the most likely ones) cannot be covered by the constraints with the error messages we want AFAICT, eg checking that joining columns are found in the dataframe (which involves 2 parameters not just 1), or that the same kind of dataframe with the same columns is used at fit and test time. Also some checks would require using a callable as the constraint eg checking the contents of a dataframe column, in which case the benefit of the declarative way of specifying constraints is much reduced
  • adding the constraints means wrapping the methods that use them which causes slightly more complex and less helpful tracebacks
  • it would require using some private scikit-learn objects such as _fit_context and _param_validation
  • as you say there is the question of development time / priority choices

None of these are major issues and they may well be small drawbacks compared to the advantages listed in the scikit-learn issue 22722

@@ -67,7 +67,7 @@ class AggJoiner(TransformerMixin, BaseEstimator):
The placeholder string "X" can be provided to perform
self-aggregation on the input data.

key : str, default=None
key : str or iterable of str, default=None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a side note, we use "iterable" everywhere but I wonder if we should say "sequence" (or "list"?) because it is more understandable for users who are less familiar with the python/computer programming jargon. also it is arguably a bit more accurate because we iterate over these parameters several times and sometimes index them so some iterables would not be appropriate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to discuss this
I had the same question on the Joiner PR

Copy link
Member Author

@glemaitre glemaitre Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, this should be a list (and you can accept loosely tuple) but this is more friendly than stating sequence that is only meaningful for Python developer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with "list" -- and we already use that term in a bunch of places

@glemaitre
Copy link
Member Author

None of these are major issues and they may well be small drawbacks compared to the advantages listed in the scikit-learn issue 22722

They are valid point. I would say that this feature is supposed to become part of the developer API at some point.

Regarding the runtime checks, I think this is just a way to have consistent checks regarding the parameter instead of delegating to the dev that create the class. For the user, you get the benefit of a nice error message. I know that SciPy was looking at a similar checking system.

To me the real benefit is the next step regarding consistent documentation and stubs.

But as I said, this is really not a priority right now considering the effort of development. At least now, @TheooJ and @jeromedockes are aware that this is existing and it would cost defining a dictionary ;)

@jeromedockes
Copy link
Member

jeromedockes commented Jun 18, 2024

At least now, @TheooJ and @jeromedockes are aware that this is existing and it would cost defining a dictionary ;)

indeed, thanks :)
also, this doesn't have to be all-or-nothing: we could start using it for estimators for which it would replace or simplify already existing checks (as the datetimeencoder example in this PR)

(once we drop support for scikit-learn 1.2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants