CatBoostEncoder mapping back the categories #361

Open
mirix opened this issue Jun 15, 2022 · 1 comment

mirix commented Jun 15, 2022

Expected Behavior

I have not found a function to map the encoded values back to the categorical values when using category_encoders' CatBoostEncoder.

I tried to do it manually using the following equation: (TargetSum + Prior) / (FeatureCount + 1)

I am using the following example:

https://www.geeksforgeeks.org/categorical-encoding-with-catboost-encoder/

And the following code:

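import category_encoders as ce

# Fit the encoder; fit() returns the fitted encoder itself
cbe_encoder = ce.cat_boost.CatBoostEncoder()
mapp = cbe_encoder.fit(train, target)

# Prior: mean of the target over the training rows
prior = target['grade'].sum() / len(train)
# Per-category 'sum' and 'count' of the target for the column 'color'
color = mapp.mapping.get('color').reset_index()
color['encoder'] = (color['sum'] + prior) / (color['count'] + 1)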

Same for the column 'interests'.

Input data:

   color    interests  height  grade
0    red    sketching      68      1
1   blue     painting      64      2
2   blue  instruments      87      3
3  green    sketching      45      2
4    red     painting      54      3
5    red  video games      64      1
6  black     painting      67      4
7  black  instruments      98      4
8   blue    sketching      90      2
9  green    sketching      87      3

Encoder output:

   color  interests  height
0  1.875   2.100000      68
1  2.375   2.875000      64
2  2.375   3.166667      87
3  2.500   2.100000      45
4  1.875   2.875000      54
5  1.875   2.500000      64
6  3.500   2.875000      67
7  3.500   3.166667      98
8  2.375   2.100000      90
9  2.500   2.100000      87

Prior:

2.5

Mapping for 'color', with the manually computed 'encoder' column:

   index  sum  count  encoder
0  black    8      2    3.500
1   blue    7      3    2.375
2  green    5      2    2.500
3    red    5      3    1.875

Mapping for 'interests', with the manually computed 'encoder' column:

         index  sum  count   encoder
0  instruments    7      2  3.166667
1     painting    9      3  2.875000
2    sketching    8      4  2.100000
3  video games    1      1  1.750000
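
For anyone who wants to check the manual calculation without installing the library, here is a minimal sketch in plain Python. It only applies the equation from this post to the ten rows listed above; the helper name `manual_encoding` is made up for illustration and is not part of category_encoders.

```python
from collections import defaultdict

# The ten rows from the example above: (color, interests, grade).
rows = [
    ("red", "sketching", 1),
    ("blue", "painting", 2),
    ("blue", "instruments", 3),
    ("green", "sketching", 2),
    ("red", "painting", 3),
    ("red", "video games", 1),
    ("black", "painting", 4),
    ("black", "instruments", 4),
    ("blue", "sketching", 2),
    ("green", "sketching", 3),
]

# Prior = mean of the target over all rows.
prior = sum(grade for _, _, grade in rows) / len(rows)


def manual_encoding(column_index):
    """Return {category: (sum, count, (sum + prior) / (count + 1))}.

    Hypothetical helper: it applies the equation from this post as-is,
    with no special handling for categories that occur only once.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        category, grade = row[column_index], row[2]
        sums[category] += grade
        counts[category] += 1
    return {
        cat: (sums[cat], counts[cat], (sums[cat] + prior) / (counts[cat] + 1))
        for cat in sorted(sums)
    }


color_table = manual_encoding(0)
interests_table = manual_encoding(1)

print("prior:", prior)
for cat, (total, count, value) in interests_table.items():
    print(cat, total, count, round(value, 6))
```

The per-category sums and counts it produces should agree with the `sum` and `count` columns of the mapping tables.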

Actual Behavior

As you can see, all values match except for video games: the encoder assigns it 2.5, whereas the equation yields 1.75, which seems to me the correct value.
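
Written out for video games, using the sum (1), count (1) and prior (2.5) reported in this post, the equation gives:

```latex
\[
\frac{\text{TargetSum} + \text{Prior}}{\text{FeatureCount} + 1}
  = \frac{1 + 2.5}{1 + 1}
  = 1.75 \neq 2.5
\]
```

The value 2.5 returned by the encoder is equal to the prior itself.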

Or is the constant different from 1 when there is only one occurrence?

Steps to Reproduce the Problem

  1. Add the code above to the code from the link.
  2. Run it.

Specifications

  • Version: 2.5.0
  • Platform: Linux zboox 5.18.3-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Jun 9 09:54:55 UTC 2022 x86_64 GNU/Linux
  • Subsystem:
@PaulWestenthanner
Collaborator

> Or is the constant different from 1 when there is only one occurrence?

This is actually the case in the current implementation. It is done to avoid over-fitting. However, there is a discussion about dropping this behaviour and instead handling categories with a small sample size via regularization. This is also the case in, e.g., the target encoder. We discuss this in issue #327.

For reference, this is the critical line in the current code: https://github.com/scikit-learn-contrib/category_encoders/blob/master/category_encoders/cat_boost.py#L120=
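
To illustrate how the encoded values could be mapped back by hand, here is a rough sketch in plain Python. It rests on an assumption inferred from the output in this issue: a category that occurs only once receives the prior itself (video games gets 2.5, which is the prior). The names `encode_level` and `inverse_lookup` and the parameter `a` (standing in for the constant 1 in the equation) are made up for illustration; this is not the library's API and not a copy of its code.

```python
from collections import defaultdict

PRIOR = 2.5  # global target mean from the example in this issue

# (sum, count) per category, copied from the 'interests' mapping in this issue.
INTERESTS = {
    "instruments": (7, 2),
    "painting": (9, 3),
    "sketching": (8, 4),
    "video games": (1, 1),
}


def encode_level(total, count, prior, a=1.0):
    """Assumed rule: a category seen only once gets the prior itself."""
    if count <= 1:
        return prior
    return (total + prior) / (count + a)


def inverse_lookup(stats, prior, ndigits=6):
    """Map each rounded encoded value to the categories that produce it."""
    lookup = defaultdict(list)
    for category, (total, count) in stats.items():
        value = round(encode_level(total, count, prior), ndigits)
        lookup[value].append(category)
    return dict(lookup)


interests_lookup = inverse_lookup(INTERESTS, PRIOR)
for value, categories in sorted(interests_lookup.items()):
    print(value, categories)
```

The lookup stores a list per value because the mapping is not guaranteed to be invertible. In the color column, for example, green already encodes to 2.5, so under the assumption above any color seen only once would be indistinguishable from it.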
