
How to add minimum sentence threshold for prediction of single-label SciBERT in Colab notebook? #202

Open
hoangcuongnguyen2001 opened this issue Oct 2, 2023 · 1 comment


hoangcuongnguyen2001 commented Oct 2, 2023

I am currently trying to evaluate the SciBERT model (used in the TRAM project) that I successfully fine-tuned on my own CTI dataset, using this notebook: Predict single-label SciBERT. However, I noticed that for each report I feed into SciBERT, I get far more techniques, and therefore more false positives, than I get from the TRAM website.
As far as I know, the notebook has no minimum-accepted-sentences threshold (ML_ACCEPT_THRESHOLD, which equals 4) like the TRAM website has, as in the image below. I think that could be why I get so many techniques supported by only 1-2 sentences, and therefore a higher false positive rate.

[Screenshot of the ML_ACCEPT_THRESHOLD setting on the TRAM website]

Thus, could you provide some guidance on how to add this threshold to the Colab notebook, so that I can test it myself? And more importantly, could you add this feature to the notebook for future users, so that they have a more reliable way to test the performance of their models without loading them directly into TRAM's website?


mehaase commented Mar 18, 2024

Yes, as you noticed, there is no ML_ACCEPT_THRESHOLD for the SciBERT models. The original models for TRAM 1 were lightweight and could be trained at runtime, and so the threshold value was used to decide which techniques to train on. Since SciBERT is pre-trained, we don't have a similar feature for pruning the training data. But you could achieve the same effect by removing labels from your training data that don't have many examples.
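A minimal sketch of that pruning step, assuming your training data is a list of (sentence, label) pairs (the function name, data, and technique IDs here are hypothetical, for illustration only):

```python
from collections import Counter

def prune_rare_labels(examples, min_examples=4):
    """Drop training examples whose label has fewer than `min_examples`
    sentences, mimicking the effect TRAM 1's ML_ACCEPT_THRESHOLD (4)
    had on its runtime-trained models.

    examples: list of (sentence, label) tuples.
    """
    counts = Counter(label for _, label in examples)
    return [(s, l) for s, l in examples if counts[l] >= min_examples]

# Toy data: four sentences labeled T1059, one labeled T1027.
data = [
    ("sentence a", "T1059"), ("sentence b", "T1059"),
    ("sentence c", "T1059"), ("sentence d", "T1059"),
    ("sentence e", "T1027"),
]
pruned = prune_rare_labels(data, min_examples=4)
# T1027 has only one example, so it is dropped; all four T1059
# examples survive.
```

After pruning, fine-tune SciBERT on `pruned` instead of the full dataset, so the model never learns (or predicts) labels with too little support.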

One limitation of the SciBERT approach is that it does not have a "null" label, i.e. one for the absence of any technique. To address this, the notebook includes a "probability" field, which is a threshold on the prediction probability. The model chooses the label with the highest probability, and if that probability is less than the threshold, it predicts no label.

So if you are getting too many false positives, try increasing the probability threshold, e.g. try 0.95, 0.99, etc.
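The thresholding logic can be sketched like this, assuming you already have a per-label probability vector from the model's softmax (the function name and technique IDs are hypothetical, not the notebook's actual identifiers):

```python
import numpy as np

def predict_with_threshold(probs, labels, threshold=0.95):
    """Return the most likely label, or None if its probability is
    below the threshold (i.e. treat the sentence as 'no technique')."""
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None
    return labels[best]

labels = ["T1059", "T1027", "T1105"]
# Confident prediction: passes a 0.95 threshold.
print(predict_with_threshold(np.array([0.97, 0.02, 0.01]), labels))
# Uncertain prediction: suppressed, counts as no technique.
print(predict_with_threshold(np.array([0.60, 0.30, 0.10]), labels))
```

Raising the threshold trades recall for precision: fewer sentences clear the bar, so fewer spurious techniques appear in the report-level output.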
