Proper way of fitting classifiers before creating an heterogeneous pool #264

francescopisu · 2021-12-30T10:18:51Z

Hey, I'm working on a research paper focused on building a binary classification model in the biomedical domain. The dataset comprises approximately 800 data points. Let's say I want to feed a heterogeneous pool of classifiers to the dynamic selection methods. By following the instructions in the examples, I've found two different ways of splitting the dataset and fitting the base classifiers of the pool.

Split into train/test (e.g., 75/25) and then split the training set into train/DSEL (e.g., 50/50).
In this random forest example, the RF is fitted on the 75% training portion and the DS methods on the 50% DSEL portion.
In all the other examples, the 50% training portion is used to fit the classifier and the 50% DSEL portion is used to fit DS methods.
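The two options above can be sketched roughly as follows. This is only an illustration under assumptions: the data is a synthetic stand-in generated with `make_classification`, the names `X_fit`, `pool_overlap` and `pool_disjoint` are made up for this sketch, and the dynamic selection step itself is left as a comment rather than tied to a specific DESlib class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~800-sample dataset (an assumption, not the real data).
X, y = make_classification(n_samples=800, n_features=20, random_state=0)

# Step shared by both options: 75/25 train/test, then 50/50 train/DSEL.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_fit, X_dsel, y_fit, y_dsel = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=0)

# Option A (random forest example): the pool sees the whole 75%,
# so every DSEL sample was also used for training (overlap).
pool_overlap = RandomForestClassifier(n_estimators=10, random_state=0)
pool_overlap.fit(X_train, y_train)

# Option B (other examples): the pool sees only the half disjoint from DSEL.
pool_disjoint = RandomForestClassifier(n_estimators=10, random_state=0)
pool_disjoint.fit(X_fit, y_fit)

# In both options the DS method would then be fitted on (X_dsel, y_dsel).
sizes = {"train": len(X_train), "fit": len(X_fit),
         "dsel": len(X_dsel), "test": len(X_test)}
print(sizes)
```

The only difference between the two options is which rows the pool is fitted on; the DSEL rows are the same in both.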

Furthermore, I wanted to point out this tip taken from the tutorial:

An important point here is that in case of small datasets or when the base classifier models in the pool are weak estimators such as Decision Stumps or Perceptrons, an overlap between the training data and DSEL may be beneficial for achieving better performance.

That seems to be my case, as my dataset is rather small compared to most datasets in the ML domain. Hence, I was thinking of fitting my base classifiers on the 75% part and then leveraging some overlap to get better performance (and this is really the case! In fact, overlapping leads to a median AUC of 0.76, whereas non-overlapping gives 0.71).

What would be the best way of dealing with the problem?

francescopisu · 2022-01-03T07:31:55Z

Any update on this?

Menelau · 2022-01-04T23:08:38Z

Hello,

In your case (a small dataset with about 800 samples), I do believe in using the whole 75% for training both the base models and DSEL. I'm pretty sure the best way of handling it is to have some overlap between the datasets.
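A minimal sketch of the full-overlap setup, assuming synthetic data and a small heterogeneous pool chosen only for illustration. The selection step is a hand-written, simplified local-accuracy rule (pick the base classifier that is most accurate on the k nearest DSEL neighbours); it is a stand-in to show how DSEL is used, not DESlib's implementation, and `k = 7` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ~800-sample dataset (an assumption).
X, y = make_classification(n_samples=800, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Full overlap: the same 75% serves as training data and as DSEL.
X_dsel, y_dsel = X_train, y_train

# Heterogeneous pool of deliberately weak models (illustrative choice).
pool = [
    Perceptron(max_iter=1000, random_state=0),
    DecisionTreeClassifier(max_depth=3, random_state=0),
    GaussianNB(),
]
for clf in pool:
    clf.fit(X_train, y_train)

# Which DSEL samples does each base classifier get right? Shape: (n_dsel, n_clf).
hits = np.column_stack([clf.predict(X_dsel) == y_dsel for clf in pool])

# Region of competence: the k nearest DSEL neighbours of each test sample.
k = 7
knn = NearestNeighbors(n_neighbors=k).fit(X_dsel)
_, neighbours = knn.kneighbors(X_test)          # shape: (n_test, k)

# Local accuracy of every classifier around every test sample.
local_acc = hits[neighbours].mean(axis=1)       # shape: (n_test, n_clf)
best = local_acc.argmax(axis=1)                 # index of the selected classifier

all_preds = np.column_stack([clf.predict(X_test) for clf in pool])
y_pred = all_preds[np.arange(len(X_test)), best]
print("accuracy:", (y_pred == y_test).mean())
```

Because DSEL and the training set coincide here, the local accuracies are measured on data the base classifiers have already seen, which is the trade-off being accepted for small datasets.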

Does your dataset also suffer from class imbalance? If yes, this is another indication that you should consider an overlapping approach, as the minority class may be quite underrepresented if we divide the dataset too much. I have done that in quite a few papers, such as the FIRE-DES++ one:

FIRE-DES++: Enhanced online pruning of base classifiers for dynamic ensemble selection: https://www.etsmtl.ca/Unites-de-recherche/LIVIA/Recherche-et-innovation/Publications/Publications-2019/Cruz_PR_2019.pdf

Also, if your dataset suffers from class imbalance, maybe you could try applying data augmentation to increase the DSEL size later, as we did in a previous publication:

A study on combining dynamic selection and data preprocessing for imbalance learning: https://en.etsmtl.ca/Unites-de-recherche/LIVIA/Recherche-et-innovation/Publications/Publications-2018/Roy_Neurocomputing_2018_InPress.pdf

francescopisu · 2022-01-05T15:25:46Z

Hello Menelau, thank you very much for your time and valuable input. My dataset suffers from mild imbalance, but this time it is the negative class (let's say absence of condition) that is underrepresented, which is the class I'm less interested in. I'm going to read these two articles and try to apply what you suggested.

francescopisu · 2022-02-22T18:35:46Z

-- EDIT: In my previous answer I was reporting wrong results due to some implementation errors on my part. I removed that part as I am now getting more realistic results.

I have some feedback regarding the overlap. If I understood correctly, Xtrain and Xdsel should be taken such that set.intersection(Xnew_train, Xdsel) is not the empty set.

Let Xtrain, Xtest, ytrain, ytest = train_test_split(features, target, test_size=0.2)
be the train and test sets after the first split (the test set is 20% of the original dataset).

Then, I'm sampling 80% of entries from Xtrain to get the new training set Xnew_train. The remaining 20% is the Xdsel set. As of now there's no overlap.

To introduce overlap, I'm randomly sampling 10% of the entries in Xnew_train and adding them to Xdsel to get Xnew_dsel.

Example with easy numbers:
dataset = (1000, n_features)
Xtrain, Xtest = (800, n_features), (200, n_features)
Xnew_train = 80% of 800 -> (640, n_features)
Xdsel = 20% of 800 -> (160, n_features)
Xnew_dsel = take 10% from Xnew_train: 64 -> (64+160=224, n_features)
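
The arithmetic above can be checked with an index-based sketch. It works on row indices only, so no real data is needed; the variable names mirror the ones used in the example and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

n_total = 1000
indices = rng.permutation(n_total)

# First split: 80% train, 20% test.
train_idx, test_idx = indices[:800], indices[800:]

# Second split of the training part: 80% new_train, 20% dsel (disjoint so far).
new_train_idx, dsel_idx = train_idx[:640], train_idx[640:]

# Overlap: copy 10% of new_train (64 rows) into DSEL.
# The rows stay in new_train as well, which is what creates the overlap.
shared_idx = rng.choice(new_train_idx, size=len(new_train_idx) // 10, replace=False)
new_dsel_idx = np.concatenate([dsel_idx, shared_idx])

overlap = np.intersect1d(new_train_idx, new_dsel_idx)
print(len(new_train_idx), len(dsel_idx), len(new_dsel_idx), len(overlap))
# prints: 640 160 224 64
```

Sampling with `replace=False` matters here: it guarantees the 64 shared rows are distinct, so the new DSEL contains exactly 224 unique rows.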

I'm also doing it in the cross-validation procedure: I split the training portion into new_train and dsel (80%, 20%), sample 10% of rows from new_train and add them to dsel to get new_dsel.
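
Inside cross-validation, the same procedure might look like the sketch below. `overlapping_split` is a hypothetical helper written for this illustration, the labels are placeholders with a mildly underrepresented negative class, and the 80/20 split and 10% overlap are the fractions described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split


def overlapping_split(train_idx, y, dsel_frac=0.20, overlap_frac=0.10, seed=0):
    """Hypothetical helper: split fold-training indices into fit/DSEL,
    then copy a fraction of the fit rows into DSEL to create overlap."""
    rng = np.random.default_rng(seed)
    fit_idx, dsel_idx = train_test_split(
        train_idx, test_size=dsel_frac, stratify=y[train_idx], random_state=seed)
    n_shared = int(round(overlap_frac * len(fit_idx)))
    shared = rng.choice(fit_idx, size=n_shared, replace=False)
    return fit_idx, np.concatenate([dsel_idx, shared])


# Placeholder labels: 300 negatives, 500 positives (an assumption).
y = np.array([0] * 300 + [1] * 500)
X = np.zeros((len(y), 1))  # features are irrelevant for the index logic

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = []
for train_idx, val_idx in cv.split(X, y):
    fit_idx, dsel_idx = overlapping_split(train_idx, y)
    # The validation fold must never leak into the pool's training data or DSEL.
    assert np.intersect1d(fit_idx, val_idx).size == 0
    assert np.intersect1d(dsel_idx, val_idx).size == 0
    folds.append((fit_idx, dsel_idx, val_idx))
```

The overlap is created only within each fold's training portion, so the held-out fold stays untouched by both the base classifiers and DSEL.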

Am I implementing the overlapping correctly?

Thank you for your time
