New Down-Sampling PipeOps (Tomek, Nearmiss) based on themis
#817
Merged
mb706
reviewed
Sep 17, 2024
mb706
reviewed
Sep 17, 2024
mb706
reviewed
Sep 24, 2024
mb706
reviewed
Sep 24, 2024
todo: test with task with nonstandard rowids
This implements two new PipeOps for down-sampling imbalanced data by calling `themis` functions:

- `PipeOpTomek`: Removes Tomek Links, i.e. pairs of observations that are nearest neighbors and of different classes. Note that this is only one possible implementation of Tomek Links, the one used for data cleaning. There also exists an algorithm for balancing data, in which only the majority class member of each Tomek Link is removed rather than both observations. However, the data-cleaning version is the only one currently implemented by `themis`.
- `PipeOpNearmiss`: Removes instances of the non-minority classes based on the NEARMISS algorithm, i.e. the instances that have the smallest mean distance to the closest instances of other classes. This is, again, only one possible implementation, but the only one in `themis`.

The documentation in `themis` seems to contain a few errors (probably due to being copied from another function).

As of right now, these two PipeOps ignore stratification completely.

I'm looking for some feedback about the way I filter the task based on the `themis` result. As the result is a `data.table`, called `dt`, I currently take the rownames of that result. This seems a bit clunky to me. An alternative I thought of would be one based on `fintersect`, which I'd expect to be more robust but also computationally more intensive (I don't know how efficient `fintersect` is).

Partially addresses #790.
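
To make the filtering question above concrete, here is a minimal sketch of the two general strategies: mapping surviving row positions to row ids, and matching surviving rows back by their values. It is not the code from this PR. The objects `row_ids`, `features`, `kept_pos` and `dt` are hypothetical stand-ins (in particular, `dt` is built by hand rather than returned by `themis`), and it assumes that the feature rows are unique. The row ids are deliberately not `1:n`, since that is the case in which positions and ids differ.

```r
library(data.table)

# Hypothetical stand-in for a task: non-standard row ids plus feature data.
row_ids = c(101L, 205L, 309L, 412L)
features = data.table(
  x = c(0.1, 0.2, 0.9, 1.0),
  y = c("a", "a", "b", "a")
)

# Hypothetical stand-in for the down-sampled result: rows 1, 3 and 4 survive.
kept_pos = c(1L, 3L, 4L)
dt = features[kept_pos]

# Strategy 1: translate surviving row positions into row ids.
# Positions are only valid ids if the ids happen to be 1:n.
ids_by_position = row_ids[kept_pos]

# Strategy 2: match the surviving rows back to the original data by value.
# fintersect() returns the rows common to both tables (duplicates dropped).
common = fintersect(features, dt)

# A join on all feature columns recovers the ids of the matched rows.
# This assumes feature rows are unique; duplicated rows would be ambiguous.
lookup = copy(features)[, id := row_ids]
ids_by_value = lookup[dt, on = names(dt), id]
```

In this toy case both strategies select the same ids. The value-based route does not depend on row order or row names being preserved, but it compares every column and cannot tell duplicated rows apart, which is where the extra cost and the ambiguity would come from.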