
Feature request: min_group_size for other category encoders, e.g. one hot #279

Open
philip-khor opened this issue Oct 30, 2020 · 11 comments


@philip-khor
Contributor

CountEncoder() has a min_group_size parameter that sets the minimum number of observations a group must contain to avoid being lumped together with other small groups. It would be nice to have this for the other encoder classes.
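To illustrate the requested behavior, here is a minimal pandas sketch of what "lumping" small groups means; `lump_small_groups` and the `other` label are hypothetical names, not the library's API:

```python
import pandas as pd

def lump_small_groups(series: pd.Series, min_group_size: int,
                      other_name: str = "other") -> pd.Series:
    """Replace categories with fewer than min_group_size observations
    by a single combined label (hypothetical helper, not the library API)."""
    counts = series.value_counts()
    small = counts[counts < min_group_size].index
    return series.where(~series.isin(small), other_name)

colors = pd.Series(["red"] * 5 + ["blue"] * 4 + ["green"] + ["teal"])
lumped = lump_small_groups(colors, min_group_size=2)
print(lumped.value_counts().to_dict())  # {'red': 5, 'blue': 4, 'other': 2}
```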

@glevv
Contributor

glevv commented Feb 16, 2022

I added this parameter to OrdinalEncoder, since issue #81 was basically asking for the same thing. I can create a PR if this is still needed.

@PaulWestenthanner
Collaborator

PaulWestenthanner commented Jan 23, 2023

Hi @glevv
I'm sorry, I hadn't seen that you'd been working on this. Why did you close the pull request after only one day? I think it would be a useful feature and I'd be happy to include your changes.
If we add it to the OrdinalEncoder, we basically get it for free for all encoders that do ordinal encoding first (e.g. all target encoders).
One thing we need to think about, however, is whether values in transform that were not seen during training should be treated as the same 'other' category as the small groups seen in training. I would argue for that, although you could make it configurable as well.
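The question of unseen transform-time values joining the same 'other' bucket could look like this in a minimal sketch; `fit_mapping`, `transform`, and the `OTHER` label are hypothetical names, not the library's API:

```python
import pandas as pd

OTHER = "other"  # hypothetical combined-group label

def fit_mapping(train: pd.Series, min_group_size: int) -> dict:
    # Categories seen at fit time: frequent ones map to themselves,
    # rare ones collapse into the OTHER bucket.
    counts = train.value_counts()
    return {cat: (cat if n >= min_group_size else OTHER)
            for cat, n in counts.items()}

def transform(series: pd.Series, mapping: dict) -> pd.Series:
    # Unseen values fall back to OTHER, i.e. they join the same
    # bucket as the small groups seen in training.
    return series.map(lambda v: mapping.get(v, OTHER))

mapping = fit_mapping(pd.Series(["a", "a", "b", "c"]), min_group_size=2)
print(transform(pd.Series(["a", "b", "zzz"]), mapping).tolist())
# ['a', 'other', 'other']
```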

@glevv
Contributor

glevv commented Jan 24, 2023

@PaulWestenthanner I think I started on this and made a beta version, but after testing it I realized I would have to go deeper and change a lot of things, and I didn't have spare time for that at the time.

@PaulWestenthanner
Collaborator

Ok, nice! Do you want to continue working on it? Otherwise I might look into it in March/April.

@glevv
Contributor

glevv commented Jan 24, 2023

I don't think I will work on it in the near future. You can give it a go

@julia-kraus

I would like to work on it.

@PaulWestenthanner
Collaborator

Hi Julia,
great! Looking forward to your contribution :)

@julia-kraus

julia-kraus commented Feb 27, 2023

Hi @PaulWestenthanner! I refactored the code in count.py so the functionality can be extracted and pulled into the base class, so all encoder classes can inherit this feature.

Some comments that I'd like to make:

There's an attribute called self.combine_min_nan_groups. It takes three values and defines how NaN values in the categories are handled. The behavior is as follows:

  • If self.combine_min_nan_groups == True: treat NaN just like any other group. If it is a small group, combine it with the other small groups.
  • If self.combine_min_nan_groups == 'force': add the whole NaN group to the combined category group, no matter its size.
  • If self.combine_min_nan_groups == False: the NaN group is not added to the combined category group, no matter its size.
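For reference, the three modes can be sketched roughly like this; this is a simplified stand-in for the actual count.py logic, with a hypothetical `combine_small_groups` helper:

```python
import numpy as np
import pandas as pd

def combine_small_groups(series, min_group_size, combine_min_nan_groups):
    """Simplified stand-in for the count.py behavior described above."""
    counts = series.value_counts(dropna=False)
    small = {c for c, n in counts.items() if n < min_group_size}
    if combine_min_nan_groups == "force":
        # NaN always joins the combined group, regardless of its size
        small = {c for c in small if not pd.isna(c)} | {np.nan}
    elif combine_min_nan_groups is False:
        # NaN is never combined, even when it is a small group
        small = {c for c in small if not pd.isna(c)}
    # combine_min_nan_groups is True: NaN is treated like any other group
    combine_nan = any(pd.isna(c) for c in small)
    non_nan_small = {c for c in small if not pd.isna(c)}

    def relabel(v):
        if pd.isna(v):
            return "other" if combine_nan else v
        return "other" if v in non_nan_small else v

    return series.map(relabel, na_action=None)

s = pd.Series(["a", "a", "b", None])
print(combine_small_groups(s, 2, True).tolist())  # ['a', 'a', 'other', 'other']
```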

I would recommend leaving out this parameter and always going with the default, for the following reasons:

  • The 'force' setting is not compatible with several other settings of the CountEncoder class, e.g. handle_missing. This has to be caught in the code, which makes it harder to extract the behavior into other classes, and there might be additional problems.

  • The 'force' behavior is unintuitive and not what a user might expect, and the docstring is hard to understand. In my opinion, dealing with NaN values is not the category encoder's responsibility. Alternatively, we implement the feature at the BaseEncoder level so it is consistent for all modules. If we don't handle it here, the user can handle NaNs themselves in their preprocessing pipeline.

If we do this refactor, the code shrinks by about a third and becomes easier to understand and maintain.

What do you think?

@julia-kraus

@PaulWestenthanner Another simplification I'd suggest: there is currently another attribute called min_group_name. The logic is as follows:

If min_group_name is set for a column, the combined minimum category gets this custom name. Otherwise, all the small-category names are joined with '_' into one long string.

I think this parameter should be omitted because it leads to inconsistent behavior and changing names: the default name can be long and may keep changing, for example across cross-validation folds.
This is unexpected behavior. Moreover, the user should never see this name, because the data goes straight into the category encoder. The docs also say the custom option exists 'in case the combined string' gets too long; I wonder when that is, and whether the user should even be aware of it. I'd rather hide that behavior.

I suggest a default name-wrangling scheme without a custom option. Then we get rid of this parameter and the code gets easier to understand. For the default, I would use a name like __other__ plus a random hash, so we make sure the name does not appear in the original dataset by chance.
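A collision-safe default label along those lines could be generated like this; `make_other_name` and the exact label format are hypothetical, not something the library defines:

```python
import uuid

def make_other_name(existing_categories) -> str:
    """Hypothetical: generate a combined-group label guaranteed not to
    collide with any category already present in the data."""
    existing = set(existing_categories)
    while True:
        name = f"__other_{uuid.uuid4().hex[:8]}__"
        if name not in existing:
            return name

label = make_other_name(["red", "blue", "__other__"])
print(label.startswith("__other_"))  # True
```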

If we follow the above two suggestions, the min_group feature can be controlled by just one parameter, min_group_size. That would be much cleaner.

@julia-kraus

My suggestion for the integration into BaseEncoder: we add the min_group_size parameter to the BaseEncoder attributes, so every subclass has it. The min_group preprocessing happens before the actual fitting and transforming of the encoder, so we could add this behavior at the beginning of the BaseEncoder's fit() and transform() methods (called via super()).

Then all the subclasses get it without us having to change their code.
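The proposed hook might look roughly like this; a minimal sketch only, the real category_encoders BaseEncoder and its method names differ:

```python
import pandas as pd

class BaseEncoder:
    """Minimal sketch of the proposed base-class hook (not the real API)."""

    def __init__(self, min_group_size=None):
        self.min_group_size = min_group_size
        self._keep = {}  # per-column set of categories large enough to keep

    def fit(self, X: pd.DataFrame, y=None):
        # Combine small groups before subclass-specific fitting.
        X = self._combine_small_groups(X, fitting=True)
        return self._fit(X, y)

    def transform(self, X: pd.DataFrame):
        # Reuse the groups learned at fit time.
        X = self._combine_small_groups(X, fitting=False)
        return self._transform(X)

    def _combine_small_groups(self, X, fitting):
        if not self.min_group_size:
            return X
        X = X.copy()
        for col in X.columns:
            if fitting:
                counts = X[col].value_counts()
                self._keep[col] = set(counts[counts >= self.min_group_size].index)
            X[col] = X[col].where(X[col].isin(self._keep[col]), "other")
        return X

    def _fit(self, X, y):        # overridden by subclasses
        return self

    def _transform(self, X):     # overridden by subclasses
        return X

enc = BaseEncoder(min_group_size=2)
enc.fit(pd.DataFrame({"c": ["a", "a", "b"]}))
out = enc.transform(pd.DataFrame({"c": ["a", "b", "z"]}))
print(out["c"].tolist())  # ['a', 'other', 'other']
```

Note that unseen values ("z" above) land in the same 'other' bucket as the small training groups, matching the earlier discussion.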

@PaulWestenthanner
Collaborator

Hi Julia,

thanks for all the work you've put into this. I really like your suggestions:

  1. The CategoryEncoders library aims to handle missing values. Especially for target-encoding-based encoders this is very helpful, since a good default value for filling the nulls can only be calculated at fit time.
  2. I'm not sure I'd just drop the force option. Null values might carry meaning, so we need the option not to include them in the 'other' category, but also the option to include them, as other data sets might require otherwise. Whether the force option is strictly necessary, I'm not sure; I agree that most of the time you probably won't want it. On the other hand, dropping it would make the change backwards-incompatible, so I'm a little hesitant there. Also, if you can choose whether or not to include NaNs in 'other', you'll necessarily have conflicts with handle_missing='return_nan', for example, so it won't be easy to get rid of all these checks.
  3. Couldn't agree more on the naming. Good catch! We should definitely change this.
  4. Regarding BaseEncoder: this sounds perfect!
