
Add robust metric #122

Open
wants to merge 16 commits into main
Conversation

Contributor

@TimotheeMathieu TimotheeMathieu commented Jun 25, 2021

This PR uses the Huber robust mean estimator to build a robust metric.

Description: one of the big challenges of robust machine learning is that the usual scoring scheme (cross-validation with MSE, for instance) is not robust. If the dataset contains outliers, then the test sets in cross-validation may contain outliers too, and the cross-validation MSE would then report a huge error for our robust algorithm on any corrupted data. This is one reason why robust methods cannot be competitive in Kaggle regression challenges: the error computation itself is not robust.
This PR proposes a robust metric that allows us to compute, for instance, a robust cross-validation MSE.

Example:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn_extra.robust import make_huber_metric

robust_mse = make_huber_metric(mean_squared_error, c=9)
# c = 9 -> more than 99% of a standard normal is within [-3, 3],
# hence more than 99% of its square is within [0, 9].

y_true = np.random.normal(size=100)
y_true_cor = y_true.copy()
y_true_cor[42] = 20 # this is an outlier in the test set
y_pred = np.random.normal(size=100)


print('MSE on uncorrupted : %.3f ' % (mean_squared_error(y_true, y_pred)))
print('Robust MSE on uncorrupted : %.3f ' % (robust_mse(y_true, y_pred)))
print('MSE on corrupted : %.3f ' % (mean_squared_error(y_true_cor, y_pred)))
print('Robust MSE on corrupted : %.3f ' % (robust_mse(y_true_cor, y_pred)))

This returns

MSE on uncorrupted : 2.152 
Robust MSE on uncorrupted : 2.072 
MSE on corrupted : 7.202 
Robust MSE on corrupted : 2.072 

Contributor

@rth left a comment


Thanks @TimotheeMathieu !

I was wondering if @lorentzenchr has an opinion on this by any chance?

I think we would at least need more tests, for instance checking that cross-validation works with the resulting metric. Also, I think all scikit-learn metrics support sample_weight. Would it make sense to add it here?

@lorentzenchr

Having a Huber loss available as a metric makes sense for models fitted with the Huber loss.

Be aware that the Huber loss elicits something in between the median and the expectation, so it is not really clear what you get/estimate. The omnipresent point about the MSE not being robust has at least 2 important aspects:

  • The (estimation of the) expectation (mean) is not robust in general.
  • There are alternatives for the MSE, in particular for positive targets, that elicit the expectation, e.g. all tweedie deviances.
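As a quick illustration of the second point, scikit-learn already exposes Tweedie deviances as metrics; a minimal sketch (not part of this PR), assuming positive targets:

```python
import numpy as np
from sklearn.metrics import mean_tweedie_deviance

rng = np.random.default_rng(0)
y_true = rng.gamma(shape=2.0, scale=1.0, size=100)  # positive targets
y_pred = np.full_like(y_true, y_true.mean())        # constant baseline prediction

# power=2 is the Gamma deviance: it penalizes relative errors, so a single
# very large target inflates it far less than the squared error would.
d = mean_tweedie_deviance(y_true, y_pred, power=2)
print(d)
```

(`power=1` would give the Poisson deviance, `power=0` the plain MSE.)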

Last but not least, my all time favorite reference: https://arxiv.org/abs/0912.0902

Contributor Author

TimotheeMathieu commented Jun 26, 2021

Thanks for the comments.

@lorentzenchr what I did is not the Huber loss. It is a robust estimator of the mean applied to the squared errors.
I used the MSE only as an example; I can also build a robust version of the mean absolute error with make_huber_metric(mean_absolute_error, c=9). This is very different, because the aim is still to estimate the MSE or mean absolute error, just while ignoring the outliers. I don't use a different loss function; I use a different way to estimate the mean in MEAN squared error and MEAN absolute error, because the empirical mean is not robust while the Huber estimator is.
This can be confusing for people used to the Huber loss, but in fact this is very different, and since it is also due to Huber I can't really change the name.

  • This is always robust, and we know this from theory.
  • We still estimate the MSE or MAE; we don't really change the metric being estimated, we just make its estimation robust. So it can still be interpreted as one would usually interpret an MSE or MAE.
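To make the distinction concrete, here is a minimal sketch of the idea, not the actual sklearn-extra implementation (`huber_mean` is a hypothetical helper): a Huber location estimate, computed by iterative reweighting, applied to the squared errors.

```python
import numpy as np

def huber_mean(x, c=9.0, n_iter=50):
    """Illustrative Huber location estimate via iterative reweighting.

    Points with |x - mu| <= c keep weight 1; points further away are
    down-weighted by c / |x - mu|, so outliers barely move the estimate.
    """
    mu = np.median(x)  # robust starting point
    for _ in range(n_iter):
        dist = np.abs(x - mu)
        w = np.where(dist <= c, 1.0, c / np.maximum(dist, 1e-12))
        mu = np.sum(w * x) / np.sum(w)
    return mu

rng = np.random.default_rng(0)
errors = rng.normal(size=100) ** 2   # squared residuals
errors_cor = errors.copy()
errors_cor[42] = 400.0               # one corrupted squared error

# The plain mean jumps after corruption; the Huber estimate barely moves.
print(errors_cor.mean(), huber_mean(errors_cor))
```

With c = 9, all typical squared residuals keep full weight, so on clean data the estimate stays close to the plain mean; it only departs from it when extreme values appear.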

If you want references, there is for instance Robust estimation of a location parameter by Huber or, more recently, Challenging the empirical mean and empirical variance: a deviation study by Catoni.

EDIT : I added an explanation in the user guide that gives some equations to explain this.

@lorentzenchr

@TimotheeMathieu Thanks for the explanation. Now I get it. Something that could be mentioned in the example is the trimmed mean as a simpler entry point to robust estimation.
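For instance, the trimmed mean applied to the squared errors could be sketched with `scipy.stats.trim_mean` (an illustration, not part of this PR):

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(0)
errors = rng.normal(size=100) ** 2   # squared residuals
errors[42] = 400.0                   # one corrupted squared error

# Drop the most extreme 5% on each side before averaging, so the
# single corrupted value is discarded instead of dominating the mean.
print(errors.mean(), trim_mean(errors, proportiontocut=0.05))
```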
