Problem/question with random_mask in masking.py #12

Open
sven-nm opened this issue Jul 20, 2024 · 10 comments


sven-nm commented Jul 20, 2024

Hey guys, thanks for this awesome adaptation of CANINE 😊 I've been adapting it to work with any language and I came across weird empty masks. I think the problem is in training/masking.py, in the function random_mask. We have the following (starting at line 42):

    special_tokens_mask = _special_tokens_mask_from_range(input_ids, range(MIN_SPECIAL_TOKEN, MAX_SPECIAL_TOKEN))
    # ⚠️ So far okay
    
    special_tokens_mask = special_tokens_mask | attention_mask.bool()
    # ⚠️ Here we have a problem: say the attention mask is a `torch.ones()` of shape (batch_size, seq_len) (e.g. 8, 2048); then we end up with `special_tokens_mask` as a `torch.ones()` of the same shape, which we use later on (see below)
    
    mask_count = math.floor(input_ids.shape[1] * masking_percent)

    indices_to_mask = []
    for unmaskable_indices, inputs in zip(special_tokens_mask, input_ids):
        # compute the possible indices we could mask for this input
        maskable_indices = torch.arange(inputs.shape[0]).masked_select(~unmaskable_indices)
        # ⚠️ Here we are using `~unmaskable_indices`, so basically no input id can be masked!
        
        # take mask_count random indices, get the maskable indices using random indices
        indices_to_mask.append(maskable_indices[torch.randperm(maskable_indices.shape[0])[:mask_count]])

I've added my comments with a ⚠️. Am I missing something here? My hunch is that line 43 should be `special_tokens_mask = special_tokens_mask | ~attention_mask.bool()`.
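
For what it's worth, here is a minimal sketch of what I mean (the shapes and the special-token position are made up, not the repo's real configuration). With an all-ones attention mask the current OR marks every position as unmaskable, while the proposed `~attention_mask.bool()` only excludes special tokens and padding:

    import torch

    # Toy shapes; the real values come from the training batch.
    batch_size, seq_len = 2, 16
    attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)      # no padding anywhere
    special_tokens_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
    special_tokens_mask[:, 0] = True  # pretend position 0 holds a special token

    # Current behaviour: OR-ing with an all-ones attention mask marks every
    # position as unmaskable, so masked_select(~unmaskable_indices) finds nothing.
    current = special_tokens_mask | attention_mask.bool()
    print((~current).sum().item())   # 0 -> empty masks

    # Proposed fix: only special tokens and padding (attention_mask == 0)
    # should be unmaskable.
    proposed = special_tokens_mask | ~attention_mask.bool()
    print((~proposed).sum().item())  # 30 -> everything except the special tokens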

Mindful (Collaborator) commented Aug 15, 2024

@sven-nm

Hi there, I apologize for the delayed response - I totally missed the notification for this issue.

Unfortunately it has been a long time since I wrote this code, so I can't answer very confidently, but it looks like you're right. That said, I double-checked, and this logic is used in all three masking methods (including the one we trained with); if our masking was genuinely completely broken, I'm not sure how the model would have learned anything.
Given that, I am... not sure what to tell you, to be honest.

Edit: If you come to a conclusion you are confident about and fix something please feel free to open a PR though.

@ganeshkrishnan1

@sven-nm does this affect training? We are planning to start training on our English corpus, and I was wondering whether this bug has a major effect on the training loss.

Mindful (Collaborator) commented Oct 20, 2024

@ganeshkrishnan1 If this is actually broken, it should pretty much break training entirely (which may or may not be reflected in the loss). It should be pretty easy to figure out whether there's a problem or not, and if there is, it would be very easy to fix: just the change suggested in the original issue.

@ganeshkrishnan1

I haven't started the training for this yet. But how do you figure out whether there is a problem if it isn't evident from the loss? Do you mean running it on real-world classification (or similar task) scenarios?

Mindful (Collaborator) commented Oct 20, 2024

@ganeshkrishnan1 The easiest ways are either to step through the training code with a debugger and look at what is actually being masked, or, if you prefer, to train on an extremely small, simplified dataset (like strings of consecutive letters) and see whether the model can learn it properly.
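
As a rough illustration (not code from the repo; the special-token range and masking percentage below are placeholders), replaying the loop quoted in the original issue on dummy inputs shows immediately which variant actually selects any indices:

    import math
    import torch

    masking_percent = 0.15  # placeholder value
    batch_size, seq_len = 4, 64
    input_ids = torch.randint(5, 100, (batch_size, seq_len))
    attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
    special_tokens_mask = input_ids < 10  # stand-in for _special_tokens_mask_from_range

    for variant, unmaskable in {
        "current":  special_tokens_mask | attention_mask.bool(),
        "proposed": special_tokens_mask | ~attention_mask.bool(),
    }.items():
        mask_count = math.floor(seq_len * masking_percent)
        counts = []
        for unmaskable_indices, inputs in zip(unmaskable, input_ids):
            maskable_indices = torch.arange(inputs.shape[0]).masked_select(~unmaskable_indices)
            counts.append(len(maskable_indices[torch.randperm(maskable_indices.shape[0])[:mask_count]]))
        print(variant, counts)  # "current" prints all zeros; "proposed" prints mask_count per example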

Mindful (Collaborator) commented Oct 26, 2024

@ganeshkrishnan1 Btw, if you do figure this out one way or the other please let me know. I can update the code in the main branch (or you can open a PR) if changes are necessary.

@ganeshkrishnan1

Our team is going to start testing this early next week. I will keep you updated.

@ganeshkrishnan1

The encode/decode worked with minor issues:
https://pastebin.com/1kaXCV8C

There are some quirks, though: some words have a space inserted in the middle after decoding, e.g. "presiden tial".

We will try to run it again with the fix above and check, but from what we have seen it seems OK.

Mindful (Collaborator) commented Nov 1, 2024

I'm not entirely sure why there are extra spaces after decoding (the tokenization code is dead simple); I'd need a better understanding of exactly what you're doing.
If you want to talk about tokenization I'm happy to do that, but it would be better to open a separate issue.

Regarding the possible masking issue mentioned here: that wouldn't affect encoding/decoding, just the batched inputs to the model during training.

@arnavgupta16

@Mindful you can check this repo (linked below): after encoding and decoding there is a stray gap in each line, occurring at the same point in a pattern.
[Screenshot of the decoded output, 2024-11-04]

https://github.com/arnavgupta16/shiba-canine_tokenization/blob/main/main.py
