Merge pull request #817 from mlr-org/themis_pipeops
New Down-Sampling PipeOps (Tomek, Nearmiss) based on `themis`
advieser authored Sep 24, 2024
2 parents f19fb8c + d3402ef commit 7a1b09a
Showing 85 changed files with 842 additions and 4 deletions.
4 changes: 3 additions & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: mlr3pipelines
Title: Preprocessing Operators and Pipelines for 'mlr3'
Version: 0.7.0
Version: 0.7.0-9000
Authors@R:
c(person(given = "Martin",
family = "Binder",
@@ -163,6 +163,7 @@ Collate:
'PipeOpMutate.R'
'PipeOpNMF.R'
'PipeOpNOP.R'
'PipeOpNearmiss.R'
'PipeOpOVR.R'
'PipeOpPCA.R'
'PipeOpProxy.R'
@@ -183,6 +184,7 @@ Collate:
'PipeOpSubsample.R'
'PipeOpTextVectorizer.R'
'PipeOpThreshold.R'
'PipeOpTomek.R'
'PipeOpTrafo.R'
'PipeOpTuneThreshold.R'
'PipeOpUnbranch.R'
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -132,6 +132,7 @@ export(PipeOpMultiplicityImply)
export(PipeOpMutate)
export(PipeOpNMF)
export(PipeOpNOP)
export(PipeOpNearmiss)
export(PipeOpOVRSplit)
export(PipeOpOVRUnite)
export(PipeOpPCA)
@@ -160,6 +161,7 @@ export(PipeOpTaskPreproc)
export(PipeOpTaskPreprocSimple)
export(PipeOpTextVectorizer)
export(PipeOpThreshold)
export(PipeOpTomek)
export(PipeOpTuneThreshold)
export(PipeOpUnbranch)
export(PipeOpVtreat)
8 changes: 5 additions & 3 deletions NEWS.md
@@ -1,3 +1,7 @@
# mlr3pipelines 0.7.0-9000

* New down-sampling PipeOps for imbalanced data: `PipeOpTomek` / `po("tomek")` and `PipeOpNearmiss` / `po("nearmiss")`

# mlr3pipelines 0.7.0

* New PipeOp `PipeOpRowApply` / `po("rowapply")`
@@ -16,9 +20,7 @@
* `as_data_table(po())` now works even when some `PipeOp`s can not be constructed.
For these `PipeOp`s, `NA` is reported in most columns.
* Compatibility with upcoming `mlr3` release.
* New PipeOp: `PipeOpRowApply` / `po("rowapply")`
* New PipeOps for handling imbalanced data: `PipeOpADAS` / `po("adas")` and `PipeOpBLSmote` / `po("blsmote")`
* New PipeOp for handling imbalanced data: `PipeOpSmoteNC` / `po("smotenc")`
* New PipeOps for handling imbalanced data: `PipeOpADAS` / `po("adas")`, `PipeOpBLSmote` / `po("blsmote")` and `PipeOpSmoteNC` / `po("smotenc")`

# mlr3pipelines 0.6.0

110 changes: 110 additions & 0 deletions R/PipeOpNearmiss.R
@@ -0,0 +1,110 @@
#' @title Nearmiss Down-Sampling
#'
#' @usage NULL
#' @name mlr_pipeops_nearmiss
#' @format [`R6Class`][R6::R6Class] object inheriting from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @description
#' Generates a more balanced data set by down-sampling the instances of non-minority classes using the NEARMISS algorithm.
#'
#' The algorithm down-samples by selecting instances from the non-minority classes that have the smallest mean distance
#' to their `k` nearest neighbors of different classes.
#' For this only numeric and integer features are taken into account. These must have no missing values.
#'
#' This can only be applied to [classification tasks][mlr3::TaskClassif]. Multiclass classification is supported.
#'
#' See [`themis::nearmiss`] for details.
#'
#' @section Construction:
#' ```
#' PipeOpNearmiss$new(id = "nearmiss", param_vals = list())
#' ```
#'
#' * `id` :: `character(1)`\cr
#' Identifier of resulting object, default `"nearmiss"`.
#' * `param_vals` :: named `list`\cr
#' List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default `list()`.
#'
#' @section Input and Output Channels:
#' Input and output channels are inherited from [`PipeOpTaskPreproc`].
#'
#' The output during training is the input [`Task`][mlr3::Task] with the rows removed from the non-minority classes.
#' The output during prediction is the unchanged input.
#'
#' @section State:
#' The `$state` is a named `list` with the `$state` elements inherited from [`PipeOpTaskPreproc`].
#'
#' @section Parameters:
#' The parameters are the parameters inherited from [`PipeOpTaskPreproc`], as well as
#' * `k` :: `integer(1)`\cr
#' Number of nearest neighbors used for calculating the mean distances. Default is `5`.
#' * `under_ratio` :: `numeric(1)`\cr
#' Ratio of the minority-to-majority frequencies. This specifies the ratio to which the number of instances
#' in the non-minority classes is down-sampled, relative to the number of instances of the minority class.
#' Default is `1`. For details, see [`themis::nearmiss`].
#'
#' @section Fields:
#' Only fields inherited from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @section Methods:
#' Only methods inherited from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @references
#' `r format_bib("zhang2003")`
#'
#' @family PipeOps
#' @template seealso_pipeopslist
#' @include PipeOpTaskPreproc.R
#' @export
#' @examples
#' \dontshow{ if (requireNamespace("themis")) \{ }
#' library("mlr3")
#'
#' # Create example task
#' task = tsk("wine")
#' task$head()
#' table(task$data(cols = "type"))
#'
#' # Down-sample and balance data
#' pop = po("nearmiss")
#' nearmiss_result = pop$train(list(task))[[1]]$data()
#' nrow(nearmiss_result)
#' table(nearmiss_result$type)
#' \dontshow{ \} }
PipeOpNearmiss = R6Class("PipeOpNearmiss",
inherit = PipeOpTaskPreproc,
public = list(
initialize = function(id = "nearmiss", param_vals = list()) {
ps = ps(
k = p_int(lower = 1, default = 5, tags = c("train", "nearmiss")),
under_ratio = p_dbl(lower = 0, default = 1, tags = c("train", "nearmiss"))
)
super$initialize(id, param_set = ps, param_vals = param_vals, packages = "themis", can_subset_cols = FALSE,
task_type = "TaskClassif", tags = "imbalanced data")
}
),
private = list(

.train_task = function(task) {
# Return task unchanged, if no feature columns exist
if (!length(task$feature_names)) {
return(task)
}
# At least one numeric or integer feature required
if (!any(task$feature_types$type %in% c("numeric", "integer"))) {
stop("Nearmiss needs at least one numeric or integer feature to work.")
}
# Subset columns to only include integer/numeric features and the target
type = id = NULL
cols = c(task$feature_types[type %in% c("integer", "numeric"), id], task$target_names)
# Down-sample data
dt = setDT(invoke(themis::nearmiss, df = task$data(cols = cols), var = task$target_names,
.args = self$param_set$get_values(tags = "nearmiss")))

keep = task$row_ids[as.integer(row.names(dt))]
task$filter(keep)
}
)
)

mlr_pipeops$add("nearmiss", PipeOpNearmiss)
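
The selection rule described in the roxygen text above (keep the non-minority instances with the smallest mean distance to their `k` nearest neighbors of other classes) can be illustrated with a small base-R sketch. This is not the `themis::nearmiss` implementation; `nearmiss_sketch` is a hypothetical helper written only for this illustration, and it assumes Euclidean distance on fully numeric, unscaled features and a target size of `floor(min_count * under_ratio)` per class.

```r
# Illustrative NearMiss-style down-sampling in base R.
# Hypothetical helper, NOT the themis implementation.
nearmiss_sketch = function(x, y, k = 5, under_ratio = 1) {
  y = droplevels(as.factor(y))
  # Number of instances per class, as a named integer vector
  counts = vapply(levels(y), function(l) sum(y == l), integer(1))
  # Assumed target size for every non-minority class
  target = floor(min(counts) * under_ratio)
  # Pairwise Euclidean distances between all instances
  d = as.matrix(dist(as.matrix(x)))
  keep = logical(length(y))
  for (cl in levels(y)) {
    in_class = which(y == cl)
    if (counts[[cl]] <= target) {
      keep[in_class] = TRUE  # class is already small enough, keep all rows
      next
    }
    others = which(y != cl)
    # Mean distance of each instance to its k nearest neighbors of other classes
    mean_dist = vapply(in_class, function(i) {
      mean(sort(d[i, others])[seq_len(min(k, length(others)))])
    }, numeric(1))
    # Keep the `target` instances with the smallest mean distance
    keep[in_class[order(mean_dist)[seq_len(target)]]] = TRUE
  }
  which(keep)
}

# Toy data: four majority points on a line, two minority points far to the right
x = data.frame(a = c(0, 1, 2, 3, 10, 11), b = 0)
y = c("maj", "maj", "maj", "maj", "min", "min")
kept = nearmiss_sketch(x, y, k = 2)
kept  # rows 3, 4, 5, 6: the two majority points closest to the minority class survive
```

With `under_ratio = 1.5` the same call would keep three majority rows instead of two, which is how the parameter loosens the balancing in this sketch.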
99 changes: 99 additions & 0 deletions R/PipeOpTomek.R
@@ -0,0 +1,99 @@
#' @title Tomek Down-Sampling
#'
#' @usage NULL
#' @name mlr_pipeops_tomek
#' @format [`R6Class`][R6::R6Class] object inheriting from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @description
#' Generates a cleaner data set by removing all majority-minority Tomek links.
#'
#' The algorithm down-samples the data by removing all pairs of observations that form a Tomek link,
#' i.e. a pair of observations that are nearest neighbors and belong to different classes.
#' For this only numeric and integer features are taken into account. These must have no missing values.
#'
#' This can only be applied to [classification tasks][mlr3::TaskClassif]. Multiclass classification is supported.
#'
#' See [`themis::tomek`] for details.
#'
#' @section Construction:
#' ```
#' PipeOpTomek$new(id = "tomek", param_vals = list())
#' ```
#'
#' * `id` :: `character(1)`\cr
#' Identifier of resulting object, default `"tomek"`.
#' * `param_vals` :: named `list`\cr
#' List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default `list()`.
#'
#' @section Input and Output Channels:
#' Input and output channels are inherited from [`PipeOpTaskPreproc`].
#'
#' The output during training is the input [`Task`][mlr3::Task] with the rows of all observations that form a Tomek link removed.
#' The output during prediction is the unchanged input.
#'
#' @section State:
#' The `$state` is a named `list` with the `$state` elements inherited from [`PipeOpTaskPreproc`].
#'
#' @section Parameters:
#' The parameters are the parameters inherited from [`PipeOpTaskPreproc`].
#'
#' @section Fields:
#' Only fields inherited from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @section Methods:
#' Only methods inherited from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @references
#' `r format_bib("tomek1976")`
#'
#' @family PipeOps
#' @template seealso_pipeopslist
#' @include PipeOpTaskPreproc.R
#' @export
#' @examples
#' \dontshow{ if (requireNamespace("themis")) \{ }
#' library("mlr3")
#'
#' # Create example task
#' task = tsk("iris")
#' task$head()
#' table(task$data(cols = "Species"))
#'
#' # Down-sample data
#' pop = po("tomek")
#' tomek_result = pop$train(list(task))[[1]]$data()
#' nrow(tomek_result)
#' table(tomek_result$Species)
#' \dontshow{ \} }
PipeOpTomek = R6Class("PipeOpTomek",
inherit = PipeOpTaskPreproc,
public = list(
initialize = function(id = "tomek", param_vals = list()) {
super$initialize(id, param_set = ps(), param_vals = param_vals, packages = "themis", can_subset_cols = FALSE,
task_type = "TaskClassif", tags = "imbalanced data")
}
),
private = list(

.train_task = function(task) {
# Return task unchanged, if no feature columns exist
if (!length(task$feature_names)) {
return(task)
}
# At least one numeric or integer feature required
if (!any(task$feature_types$type %in% c("numeric", "integer"))) {
stop("Tomek needs at least one numeric or integer feature to work.")
}
# Subset columns to only include integer/numeric features and the target
type = id = NULL
cols = c(task$feature_types[type %in% c("integer", "numeric"), id], task$target_names)
# Down-sample data
dt = setDT(invoke(themis::tomek, df = task$data(cols = cols), var = task$target_names))

keep = task$row_ids[as.integer(row.names(dt))]
task$filter(keep)
}
)
)

mlr_pipeops$add("tomek", PipeOpTomek)
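
To make the notion of a Tomek link from the roxygen description concrete, the following base-R sketch finds pairs of observations that are each other's nearest neighbor and belong to different classes, then drops both members of each pair. It is an illustration under stated assumptions (Euclidean distance, numeric features, no ties), not the `themis::tomek` implementation; `tomek_sketch` is a hypothetical name.

```r
# Illustrative Tomek-link removal in base R.
# Hypothetical helper, NOT the themis implementation.
tomek_sketch = function(x, y) {
  d = as.matrix(dist(as.matrix(x)))
  diag(d) = Inf                          # an observation is not its own neighbor
  nn = unname(apply(d, 1, which.min))    # index of each observation's nearest neighbor
  mutual = nn[nn] == seq_along(nn)       # nearest-neighbor relation holds in both directions
  is_link = mutual & (y != y[nn])        # mutual neighbors with different class labels
  which(!is_link)                        # indices of the rows that survive
}

# Toy data on a line: observations 2 ("a") and 3 ("b") are mutual nearest neighbors
x = matrix(c(0, 1, 1.4, 3, 5, 5.2), ncol = 1)
y = c("a", "a", "b", "b", "a", "a")
kept = tomek_sketch(x, y)
kept  # rows 1, 4, 5, 6: the Tomek link (2, 3) is removed
```

Observations 5 and 6 are also mutual nearest neighbors, but they share a class, so they do not form a Tomek link and are kept.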
20 changes: 20 additions & 0 deletions R/bibentries.R
@@ -54,6 +54,25 @@ bibentries = c(
journal = "Journal of the American Statistical Association"
),

zhang2003 = bibentry("inproceedings",
year = "2003",
author = "Zhang, J. and Mani, I.",
title = "KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction",
booktitle = "Proceedings of Workshop on Learning from Imbalanced Datasets (ICML)"
),

tomek1976 = bibentry("article",
doi = "10.1109/TSMC.1976.4309452",
author = "I. Tomek",
year = "1976",
title = "Two Modifications of CNN",
journal = "IEEE Transactions on Systems, Man and Cybernetics",
volume = "6",
number = "11",
pages = "769--772",
publisher = "IEEE"
),

he_2008 = bibentry("InProceedings",
author = "Haibo He and Yang Bai and Garcia, Edwardo A. and Shutao Li",
booktitle = "2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)",
@@ -75,4 +94,5 @@ bibentries = c(
pages = "878--887",
isbn = "978-3-540-31902-3"
)

)
2 changes: 2 additions & 0 deletions man/PipeOp.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions man/PipeOpEnsemble.Rd

2 changes: 2 additions & 0 deletions man/PipeOpImpute.Rd

2 changes: 2 additions & 0 deletions man/PipeOpTargetTrafo.Rd

2 changes: 2 additions & 0 deletions man/PipeOpTaskPreproc.Rd

2 changes: 2 additions & 0 deletions man/PipeOpTaskPreprocSimple.Rd
