Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FilterAUC and missing values #60

Open
mllg opened this issue Oct 30, 2019 · 9 comments
Open

FilterAUC and missing values #60

mllg opened this issue Oct 30, 2019 · 9 comments

Comments

@mllg
Copy link
Member

mllg commented Oct 30, 2019

FilterAUC operates on features with missing values by just ranking the missing values last (default in rank()). I'm not sure that this is statistically sound.

I'd suggest removing them and calculate the AUC on the remaining observations.

@berndbischl @pat-s ?

@pat-s
Copy link
Member

pat-s commented Oct 30, 2019

Could this also apply to more filters than just FilterAUC?

@berndbischl
Copy link
Member

why dont we throw an error? if NAs are there.
also we seem to be missing a generic test.
and we need to clearly doc / decide what happens in these cases for all filters

@berndbischl
Copy link
Member

i guess ignoring the NA-based obs in the calculation is "cleanest" and most robust as michel suggested. but we should probably then do this in a global place, unit test this properly and also document it visibly

@mllg
Copy link
Member Author

mllg commented Nov 1, 2019

a3e43f9 replaced Metrics and ignores NAs, but we still need tests and check the behaviour of other filters.

@berndbischl
Copy link
Member

Ignoring Nas is actually wrong after thinking about this more pls don't merge / release this without further discussion

@berndbischl
Copy link
Member

If you have a feature with 98% missing values and for the remainder there is a high or perfect correlation with the target that feature would get a very high score. That's wrong?

Nas should be an error for filters. And users should transparently impute them.

Agreed?

@pat-s
Copy link
Member

pat-s commented Oct 19, 2020

@mllg Looking at ?mlr3measures::auc(), the NA value is NaN. Is this something you added in the meantime which fixes the initial issue or does this have the same effect? (i.e. ordering the NaN features last).

@mllg
Copy link
Member Author

mllg commented Oct 19, 2020

@mllg Looking at ?mlr3measures::auc(), the NA value is NaN. Is this something you added in the meantime which fixes the initial issue or does this have the same effect? (i.e. ordering the NaN features last).

NaN is the return value if you cannot calculate the measure (div/0 etc). Having NA in truth or response always results in an error.

For filters, Bernd suggested throwing an error. I assume this is the safest way to deal with this. If we want to allow missing values by just removing them (as FilterVariance currently does), this should not be the default behavior.

@pat-s
Copy link
Member

pat-s commented Oct 20, 2020

Right now it seems like NAs are removed prior to score calculation

score = map_dbl(x, function(x) {
keep = !is.na(x)
auc(y[keep], x[keep])
})
abs(0.5 - score)

I would add an assertion which checks for NAs in any feature and apply this to every filter with a descriptive error message to use a pipeop to impute these values?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants