Proper way of fitting classifiers before creating an heterogeneous pool #264
Comments
Any update on this?
Hello, in your case (a small dataset with about 800 samples), I'm pretty sure the best way of handling it is to have some overlap between the datasets, i.e., to use the whole 75% both for training the base models and as DSEL. Does your dataset also suffer from class imbalance? If yes, that is another indication that you should consider an overlapping approach, as the minority class may be quite underrepresented if we divide the dataset too much. I have done that in quite a few papers, such as the FIRE-DES++ one: FIRE-DES++: Enhanced online pruning of base classifiers for dynamic ensemble selection: https://www.etsmtl.ca/Unites-de-recherche/LIVIA/Recherche-et-innovation/Publications/Publications-2019/Cruz_PR_2019.pdf
Also, if your dataset suffers from class imbalance, you could try applying data augmentation to increase the DSEL size afterwards, as we did in a previous publication: A study on combining dynamic selection and data preprocessing for imbalance learning: https://en.etsmtl.ca/Unites-de-recherche/LIVIA/Recherche-et-innovation/Publications/Publications-2018/Roy_Neurocomputing_2018_InPress.pdf
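As a rough illustration of enlarging the DSEL for an imbalanced problem, here is a minimal sketch. It uses plain random oversampling of the minority class as the simplest stand-in; it does not reproduce the preprocessing methods studied in the publication linked above. The data, sizes, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Hypothetical DSEL: 30 rows, 24 of class 1 and 6 of class 0 (the minority).
X_dsel = rng.normal(size=(30, 4))
y_dsel = np.array([1] * 24 + [0] * 6)

# Find the minority class and how many rows it is missing.
classes, counts = np.unique(y_dsel, return_counts=True)
minority = classes[np.argmin(counts)]
n_extra = counts.max() - counts.min()

# Resample minority rows with replacement until both classes have equal size.
minority_rows = np.flatnonzero(y_dsel == minority)
extra = rng.choice(minority_rows, size=n_extra, replace=True)

X_dsel_aug = np.vstack([X_dsel, X_dsel[extra]])
y_dsel_aug = np.concatenate([y_dsel, y_dsel[extra]])

print(np.bincount(y_dsel_aug))  # class counts after oversampling
```

Only the DSEL is augmented here; the test set should stay untouched so that the evaluation reflects the original class distribution.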
Hello Menelau, thank you very much for your time and valuable input. My dataset suffers from mild imbalance, but in this case it is the negative class (say, absence of the condition) that is underrepresented, which is the class I'm less interested in. I'm going to read these two articles and try to apply what you suggested.
-- EDIT: In my previous answer I reported wrong results due to some implementation errors on my part. I removed that part, as I am now getting more realistic results.
I have some feedback regarding the overlap. If I understood correctly, Xnew_train and Xdsel should be chosen such that set.intersection(Xnew_train, Xdsel) is not the empty set. Let Xtrain, Xtest, ytrain, ytest = train_test_split(features, target, test_size=0.2). Then I sample 80% of the entries from Xtrain to get the new training set Xnew_train; the remaining 20% is the Xdsel set. At this point there is no overlap. To introduce overlap, I randomly sample 10% of the entries in Xnew_train and add them to Xdsel to get Xnew_dsel.
I'm also doing this in the cross-validation procedure: I split the training portion into new_train and dsel (80%, 20%), sample 10% of the rows from new_train, and add them to dsel to get new_dsel.
Am I implementing the overlap correctly? Thank you for your time.
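The partial-overlap procedure described in this comment can be sketched on row indices as follows. The numbers are made up for illustration (100 rows in the training portion), and the variable names are hypothetical; this is one possible reading of the procedure, not a confirmed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Hypothetical easy numbers: 100 rows in the training portion.
train_idx = np.arange(100)
rng.shuffle(train_idx)

# 80/20 split of the training portion: the two index sets are disjoint.
new_train_idx = train_idx[:80]
dsel_idx = train_idx[80:]

# Sample 10% of new_train (8 rows) without replacement and add them to DSEL.
n_shared = len(new_train_idx) // 10
shared_idx = rng.choice(new_train_idx, size=n_shared, replace=False)
new_dsel_idx = np.concatenate([dsel_idx, shared_idx])

# The overlap is exactly the shared rows.
overlap = np.intersect1d(new_train_idx, new_dsel_idx)
print(len(new_train_idx), len(new_dsel_idx), len(overlap))  # 80 28 8
```

Working on indices rather than on the feature matrix keeps features and labels aligned: the same index arrays can be used to slice both.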
Hey, I'm working on a research paper focused on building a binary classification model in the biomedical domain. The dataset comprises approximately 800 data points. Let's say I want to feed a heterogeneous pool of classifiers to the dynamic selection methods. Following the examples, I've found two different ways of splitting the dataset and fitting the base classifiers of the pool.
In this random forest example, the RF is fitted on the 75% training portion and the DS methods on the 50% DSEL portion.
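To make the two splitting strategies concrete, here is a minimal sketch using scikit-learn only. The synthetic data, the particular base classifiers, and the helper name `fit_pool` are assumptions for illustration; the actual dataset and pool from this issue are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ~800-sample dataset (not the real data).
X, y = make_classification(n_samples=800, n_features=20, random_state=0)

# 75% training portion, 25% held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Split the training portion in half; the second half is the DSEL.
X_pool, X_dsel, y_pool, y_dsel = train_test_split(
    X_train, y_train, test_size=0.50, stratify=y_train, random_state=0)


def fit_pool(X_fit, y_fit):
    """Fit a small heterogeneous pool (hypothetical choice of models)."""
    pool = [
        LogisticRegression(max_iter=1000),
        GaussianNB(),
        KNeighborsClassifier(),
        DecisionTreeClassifier(max_depth=5, random_state=0),
    ]
    return [clf.fit(X_fit, y_fit) for clf in pool]


# Option A (disjoint): the pool never sees the DSEL rows.
pool_disjoint = fit_pool(X_pool, y_pool)

# Option B (overlapping): the pool is fitted on the whole 75% portion,
# so every DSEL row was also used to train the base classifiers.
pool_overlap = fit_pool(X_train, y_train)

# Either pool, together with (X_dsel, y_dsel), would then be handed to a
# dynamic selection method, e.g. in DESlib:
#   KNORAE(pool_classifiers=pool_overlap).fit(X_dsel, y_dsel)
```

With 800 samples this gives 600 training rows and a 300-row DSEL; option A trains the pool on only 300 rows, which is the cost the overlapping variant avoids.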
Furthermore, I wanted to point out this tip taken from the tutorial:
That seems to be my case, as my dataset is rather small compared to most datasets in the ML domain. Hence, I was thinking of fitting my base classifiers on the 75% part and then leveraging some overlap to get better performance. And it does help: overlapping leads to a median AUC of 0.76, whereas non-overlapping gives 0.71.
What would be the best way of dealing with the problem?