
Feature request: min_group_size for other category encoders, e.g. one hot #279

Open
philip-khor opened this issue Oct 30, 2020 · 11 comments


@philip-khor
Contributor

CountEncoder() has a min_group_size parameter that sets the minimum number of observations a group must contain to avoid being lumped together with other small groups. It would be nice to have this for the other encoder classes.
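To illustrate the requested behavior, here is a minimal pandas sketch of what "lumping" small groups means; `lump_small_groups` and the `other` label are hypothetical names, not the library's API:

```python
import pandas as pd

def lump_small_groups(series: pd.Series, min_group_size: int,
                      other_name: str = "other") -> pd.Series:
    """Replace categories with fewer than min_group_size observations
    by a single combined label (hypothetical helper, not the library API)."""
    counts = series.value_counts()
    small = counts[counts < min_group_size].index
    return series.where(~series.isin(small), other_name)

colors = pd.Series(["red"] * 5 + ["blue"] * 4 + ["green"] + ["teal"])
lumped = lump_small_groups(colors, min_group_size=2)
print(lumped.value_counts().to_dict())  # {'red': 5, 'blue': 4, 'other': 2}
```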

@glevv
Contributor

glevv commented Feb 16, 2022

I added this parameter to OrdinalEncoder, since issue #81 was basically asking for the same thing. I can create a PR if this is still needed.

@PaulWestenthanner
Collaborator

PaulWestenthanner commented Jan 23, 2023

Hi @glevv
I'm sorry, I hadn't seen that you'd been working on this. Why did you close the pull request after only one day? I think it would be a useful feature and I'd be happy to include your changes.
If we add it to the OrdinalEncoder, we basically get it for free for all encoders that do ordinal encoding first (e.g. all target encoders).
One thing we need to think about, however, is whether values in transform that were not seen during training should be treated as the same 'other' category as the small groups seen in training. I would argue for that, although you could make it configurable as well.
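The question of unseen transform-time values joining the same 'other' bucket could look like this in a minimal sketch; `fit_mapping`, `transform`, and the `OTHER` label are hypothetical names, not the library's API:

```python
import pandas as pd

OTHER = "other"  # hypothetical combined-group label

def fit_mapping(train: pd.Series, min_group_size: int) -> dict:
    # Categories seen at fit time: frequent ones map to themselves,
    # rare ones collapse into the OTHER bucket.
    counts = train.value_counts()
    return {cat: (cat if n >= min_group_size else OTHER)
            for cat, n in counts.items()}

def transform(series: pd.Series, mapping: dict) -> pd.Series:
    # Unseen values fall back to OTHER, i.e. they join the same
    # bucket as the small groups seen in training.
    return series.map(lambda v: mapping.get(v, OTHER))

mapping = fit_mapping(pd.Series(["a", "a", "b", "c"]), min_group_size=2)
print(transform(pd.Series(["a", "b", "zzz"]), mapping).tolist())
# ['a', 'other', 'other']
```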

@glevv
Contributor

glevv commented Jan 24, 2023

@PaulWestenthanner I think I started on this and made a beta version, but after testing it I realized I would have to go deeper and change a lot of things, and I didn't have spare time for that at the time.

@PaulWestenthanner
Collaborator

Ok, nice! Do you want to continue working on it? Otherwise I might look into it in March/April.

@glevv
Contributor

glevv commented Jan 24, 2023

I don't think I will work on it in the near future. You can give it a go

@julia-kraus

I would like to work on it.

@PaulWestenthanner
Collaborator

Hi Julia,
great! Looking forward to your contribution :)

@julia-kraus

julia-kraus commented Feb 27, 2023

Hi @PaulWestenthanner! I refactored the code in count.py so the functionality can be extracted and pulled into the base class, so all encoder classes can inherit this feature.

Some comments that I'd like to make:

There's an attribute called self.combine_min_nan_groups. It takes three values and defines how NaN values in the categories are handled. The behavior is as follows:

  • If self.combine_min_nan_groups == True: treat NaN just like any other group. If it is a small group, combine it with the other small groups.
  • If self.combine_min_nan_groups == 'force': add the whole NaN group to the combined category group, no matter its size.
  • If self.combine_min_nan_groups == False: the NaN group is not added to the combined category group, no matter its size.
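For reference, the three modes can be sketched roughly like this; this is a simplified stand-in for the actual count.py logic, with a hypothetical `combine_small_groups` helper:

```python
import numpy as np
import pandas as pd

def combine_small_groups(series, min_group_size, combine_min_nan_groups):
    """Simplified stand-in for the count.py behavior described above."""
    counts = series.value_counts(dropna=False)
    small = {c for c, n in counts.items() if n < min_group_size}
    if combine_min_nan_groups == "force":
        # NaN always joins the combined group, regardless of its size
        small = {c for c in small if not pd.isna(c)} | {np.nan}
    elif combine_min_nan_groups is False:
        # NaN is never combined, even when it is a small group
        small = {c for c in small if not pd.isna(c)}
    # combine_min_nan_groups is True: NaN is treated like any other group
    combine_nan = any(pd.isna(c) for c in small)
    non_nan_small = {c for c in small if not pd.isna(c)}

    def relabel(v):
        if pd.isna(v):
            return "other" if combine_nan else v
        return "other" if v in non_nan_small else v

    return series.map(relabel, na_action=None)

s = pd.Series(["a", "a", "b", None])
print(combine_small_groups(s, 2, True).tolist())  # ['a', 'a', 'other', 'other']
```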

I would recommend leaving out this parameter and always going with the default, for the following reasons:

  • The 'force' setting is not compatible with several other settings of the CountEncoder class, e.g. handle_missing. This has to be caught in the code, which makes it harder to extract the behavior into other classes, and there might be additional problems.

  • The 'force' behavior is unintuitive and not what a user might expect, and the docstring is hard to understand. In my opinion, dealing with NaN values is not the category encoder's responsibility. Alternatively, we implement the feature at the BaseEncoder level so it is consistent for all modules. If we don't handle it here, the user can handle NaNs themselves in their preprocessing pipeline.

If we do this refactor, the code shrinks by about a third and becomes easier to understand and maintain.

What do you think?

@julia-kraus

@PaulWestenthanner Another simplification I'd suggest: there is currently another attribute called min_group_name. The logic is as follows:

If min_group_name is set for a column, the combined minimum category gets this custom name. Otherwise, all the small-category names are joined with '_' into one long string.

I think this parameter should be omitted because it leads to inconsistent behavior and changing names: the default name can be long and may keep changing, for example across cross-validation folds.
This is unexpected behavior. Moreover, the user should never see this name, because the data goes straight into the category encoder. The docs also say the custom option exists 'in case the combined string' gets too long; I wonder when that is, and whether the user should even be aware of it. I'd rather hide that behavior.

I suggest a default name-wrangling scheme without a custom option. Then we get rid of this parameter and the code gets easier to understand. For the default, I would use a name like __other__ plus a random hash, so we make sure the name does not appear in the original dataset by chance.
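A collision-safe default label along those lines could be generated like this; `make_other_name` and the exact label format are hypothetical, not something the library defines:

```python
import uuid

def make_other_name(existing_categories) -> str:
    """Hypothetical: generate a combined-group label guaranteed not to
    collide with any category already present in the data."""
    existing = set(existing_categories)
    while True:
        name = f"__other_{uuid.uuid4().hex[:8]}__"
        if name not in existing:
            return name

label = make_other_name(["red", "blue", "__other__"])
print(label.startswith("__other_"))  # True
```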

If we follow the above two suggestions, the min_group feature can be controlled by just one parameter, min_group_size. That would be much cleaner.

@julia-kraus

My suggestion for the integration into BaseEncoder: we add the min_group_size parameter to the BaseEncoder attributes, so every subclass has it. The min_group preprocessing happens before the actual fitting and transforming of the encoder, so we could add this behavior at the beginning of the BaseEncoder's fit() and transform() methods (called via super()).

Then all the subclasses get it without us having to change their code.
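The proposed hook might look roughly like this; a minimal sketch only, the real category_encoders BaseEncoder and its method names differ:

```python
import pandas as pd

class BaseEncoder:
    """Minimal sketch of the proposed base-class hook (not the real API)."""

    def __init__(self, min_group_size=None):
        self.min_group_size = min_group_size
        self._keep = {}  # per-column set of categories large enough to keep

    def fit(self, X: pd.DataFrame, y=None):
        # Combine small groups before subclass-specific fitting.
        X = self._combine_small_groups(X, fitting=True)
        return self._fit(X, y)

    def transform(self, X: pd.DataFrame):
        # Reuse the groups learned at fit time.
        X = self._combine_small_groups(X, fitting=False)
        return self._transform(X)

    def _combine_small_groups(self, X, fitting):
        if not self.min_group_size:
            return X
        X = X.copy()
        for col in X.columns:
            if fitting:
                counts = X[col].value_counts()
                self._keep[col] = set(counts[counts >= self.min_group_size].index)
            X[col] = X[col].where(X[col].isin(self._keep[col]), "other")
        return X

    def _fit(self, X, y):        # overridden by subclasses
        return self

    def _transform(self, X):     # overridden by subclasses
        return X

enc = BaseEncoder(min_group_size=2)
enc.fit(pd.DataFrame({"c": ["a", "a", "b"]}))
out = enc.transform(pd.DataFrame({"c": ["a", "b", "z"]}))
print(out["c"].tolist())  # ['a', 'other', 'other']
```

Note that unseen values ("z" above) land in the same 'other' bucket as the small training groups, matching the earlier discussion.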

@PaulWestenthanner
Collaborator

Hi Julia,

thanks for all the work you've put into this. I really like your suggestions:

  1. The CategoryEncoders library aims to handle missing values. Especially for target-encoding-based encoders this is very helpful, since a good default value for filling the nulls can only be calculated at fit time.
  2. I'm not sure I'd just drop the force option. Null values might carry meaning, so we need the option not to include them in the 'other' category, but also the option to include them, as other data sets might require otherwise. Whether the force option is strictly necessary, I'm not sure; I agree that most of the time you probably won't want it. On the other hand, dropping it would make the change backwards-incompatible, so I'm a little hesitant there. Also, if you can choose whether or not to include NaNs in 'other', you'll necessarily have conflicts with handle_missing='return_nan', for example, so it won't be easy to get rid of all these checks.
  3. Couldn't agree more on the naming. Good catch! We should definitely change this.
  4. Regarding BaseEncoder: this sounds perfect!
