Problem/question with random_mask in masking.py #12
Comments
Hi there, I apologize for the delayed response - I totally missed the notification for this issue. Unfortunately it has been a long time since I wrote this code, so I can't answer very confidently, but it looks like you're right. That said, I double-checked, and this logic is used in all three masking methods (including the one we trained with); if our masking was genuinely completely broken, I'm not sure how the model would have learned anything. Edit: If you come to a conclusion you are confident about and fix something, please feel free to open a PR though.
@sven-nm does this affect training? We are planning to start training on our English corpus, and I was wondering if this bug has a major effect on the training loss.
@ganeshkrishnan1 If this is actually broken, it should pretty much break training entirely (which may or may not be reflected in the loss). It should be fairly easy to figure out whether there's a problem, and if there is, it would be very easy to fix - it would just be the change suggested in the original issue.
I haven't started the training for this yet. But how do you figure out if there is a problem if it's not evident from the loss? Do you mean run it on real-world classification (or similar task) scenarios?
@ganeshkrishnan1 The easiest ways are either to step through the training code with a debugger and just look at what is actually being masked, or, if you prefer, to train on an extremely small simplified dataset (like strings of consecutive letters or something) and see if the model can learn that properly.
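As a rough sketch of the first suggestion, something like the following works; the toy batch, fixed length, and padding id of 0 below are illustrative assumptions, not the repo's actual data pipeline:

```python
import torch

# Toy batch of codepoint ids for "hello" and "hi", padded with 0 (padding id assumed).
texts = ["hello", "hi"]
max_len = 6
input_ids = torch.zeros(len(texts), max_len, dtype=torch.long)
attention_mask = torch.zeros(len(texts), max_len, dtype=torch.long)
for i, text in enumerate(texts):
    ids = torch.tensor([ord(c) for c in text])
    input_ids[i, : len(ids)] = ids
    attention_mask[i, : len(ids)] = 1

# Call the masking function under test here (signature omitted since it varies),
# then compare the masked batch against input_ids and print which positions changed.
# If no real (non-padding) characters ever get replaced, the masking is broken.
```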
@ganeshkrishnan1 Btw, if you do figure this out one way or the other, please let me know. I can update the code in the main branch (or you can open a PR) if changes are necessary.
Our team is going to start testing work on this early next week. I will keep you updated.
The encode/decode worked with minor issues. There are some quirks, though: some words have a space in the middle after decoding, e.g. "presiden tial". We will try to run again with the fix above and check, but from what we have seen it seems OK.
I'm not entirely sure why there are extra spaces after decoding (the tokenization code is dead simple); I'd need a better understanding of exactly what you're doing. Regarding the possible masking issue mentioned here, that wouldn't affect encoding/decoding, just the batched inputs to the model during training.
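For reference, a bare codepoint round trip in the CANINE/SHIBA spirit (plain ord/chr here rather than the repo's tokenizer, so purely illustrative) reproduces the input exactly, which suggests the extra spaces come from somewhere else in the pipeline:

```python
text = "presidential election"
ids = [ord(c) for c in text]            # character-level "encode": one id per codepoint
decoded = "".join(chr(i) for i in ids)  # "decode": map ids straight back to characters
assert decoded == text                  # no extra spaces can appear at this level
```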
@Mindful you can check this repo; after encoding and decoding there's a random gap in each line, occurring in a consistent pattern: https://github.com/arnavgupta16/shiba-canine_tokenization/blob/main/main.py
Hey guys, thanks for this awesome adaptation of CANINE 😊 I've been working on adapting it for any language and I came across weird empty masks. I think the problem is in training/masking.py, in the function random_mask. We have the following (starting at line 42); I've added my comments with a ⚠️. Am I missing something here? My hunch is that line 43 should be:
special_tokens_mask = special_tokens_mask | ~attention_mask.bool()
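For anyone reading later, here is a minimal sketch of how the suggested fix would slot into a standard MLM-style masking step; the function name, signature, mask probability, mask id, and label convention are assumptions for illustration, not the repo's exact random_mask code:

```python
import torch

def random_mask_sketch(input_ids, attention_mask, special_tokens_mask,
                       mask_prob=0.15, mask_token_id=0):
    # Positions that must never be masked: special tokens and padding.
    # Suggested fix: also exclude padding by OR-ing in the inverse of the attention mask.
    never_mask = special_tokens_mask.bool() | ~attention_mask.bool()

    # Sample masked positions everywhere else with probability mask_prob.
    probability_matrix = torch.full(input_ids.shape, mask_prob)
    probability_matrix.masked_fill_(never_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()

    # Standard MLM bookkeeping: compute loss only on masked positions,
    # replace masked inputs with the mask id.
    labels = input_ids.clone()
    labels[~masked_indices] = -100
    masked_inputs = input_ids.clone()
    masked_inputs[masked_indices] = mask_token_id
    return masked_inputs, labels
```

With the padding positions folded into the never-mask set, no padding character can ever be selected for masking, which is what the suggested change on line 43 is meant to guarantee.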