New Down-Sampling PipeOps (Tomek, Nearmiss) based on themis #817

Merged: advieser merged 28 commits into master from themis_pipeops on Sep 24, 2024

advieser (Collaborator) commented Aug 31, 2024

This implements two new PipeOps for down-sampling imbalanced data by calling themis functions:

  • PipeOpTomek: Removes Tomek links, i.e. pairs of observations that are each other's nearest neighbors and belong to different classes. Note that this is only one possible implementation of Tomek links, the one used for data cleaning. There is also a variant for balancing data, in which only the majority-class member of each pair is removed rather than both observations. However, the data-cleaning version is the only one currently implemented by themis.
  • PipeOpNearmiss: Removes instances of the non-minority classes based on the NEARMISS algorithm, i.e. the instances that have the smallest mean distance to the closest instances of other classes. This is, again, only one possible implementation, but the only one in themis.
    The documentation in themis seems to contain a few errors (probably due to being copied from another function).
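
To make the Tomek link definition above concrete, here is a minimal base R sketch of the data-cleaning variant. It is only an illustration of the idea, not the code themis runs; the helper name `find_tomek_links` and the toy data are made up for this example, and Euclidean distance is an assumption.

```r
# Illustrative helper (hypothetical name, not part of themis or mlr3pipelines):
# return the indices of all observations that take part in a Tomek link.
find_tomek_links = function(x, y) {
  d = as.matrix(dist(x))       # pairwise Euclidean distances (assumed metric)
  diag(d) = Inf                # an observation is not its own neighbor
  nn = apply(d, 1, which.min)  # index of each observation's nearest neighbor
  idx = seq_len(nrow(x))
  # A link needs mutual nearest neighbors with different class labels.
  is_link = nn[nn] == idx & y[nn] != y
  unname(which(is_link))
}

# Toy data: one numeric feature, two classes.
x = matrix(c(0, 1, 5, 6, 10), ncol = 1)
y = factor(c("a", "b", "a", "a", "b"))

links = find_tomek_links(x, y)  # observations 1 and 2 form the only link
keep = setdiff(seq_len(nrow(x)), links)
```

Observations 3 and 4 are mutual nearest neighbors but share a class, and observation 5 has no mutual neighbor, so only the pair (1, 2) is dropped. The balancing variant mentioned above would instead drop only the majority-class member of that pair.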

As of right now, these two pipeops ignore stratification completely.

I'm looking for some feedback about the way I filter the task based on the themis result. As the result is a data.table, called dt, I currently take the rownames of that result:

keep = as.integer(row.names(dt))
task$filter(keep)

This seems a bit clunky to me. An alternative I thought of would be

keep = as.integer(row.names(fintersect(task$data(), dt)))

which I'd expect to be more robust but also computationally more intensive (don't know how efficient fintersect is).
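
One thing worth checking with the row-name approach is what happens when the task's row ids are not simply 1..n. The sketch below uses plain base R stand-ins (the ids and data are hypothetical) and assumes that the down-sampled result keeps the input's row names, so that they encode positions in the input rather than row ids.

```r
# Stand-in for a task whose row ids are not 1..n (hypothetical values).
row_ids = c(101L, 205L, 309L, 412L)
data = data.frame(x = c(0, 1, 5, 6), y = factor(c("a", "b", "a", "a")))

# Stand-in for the down-sampled result: rows 3 and 4 survive, and their
# row names still record their position in the input.
dt = data[c(3, 4), ]

keep_pos = as.integer(row.names(dt))  # positions in the input
keep_ids = row_ids[keep_pos]          # translate positions into row ids
```

Under the same assumption, the equivalent for a task would presumably be `task$filter(task$row_ids[keep_pos])`, since `task$filter()` expects row ids rather than positions.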

Partially addresses #790.

@advieser advieser marked this pull request as ready for review September 21, 2024 18:06
mb706 (Collaborator) commented Sep 24, 2024

todo: test with task with nonstandard rowids

@advieser advieser merged commit 7a1b09a into master Sep 24, 2024
4 checks passed
@advieser advieser deleted the themis_pipeops branch September 24, 2024 19:47