Does training a new tokenizer affect the pre-trained model? #1

dinhngoc267 · 2024-07-03T03:27:30Z

Hi,

It's nice to see your repository. I can see that you trained a new tokenizer on your new corpus, but I wonder: does that change the token IDs of the original base model? If it does, it might affect the weights of the original pre-trained model, since you continued pre-training from that pre-trained model, right?

w11wo · 2024-08-02T03:54:54Z

Hi @dinhngoc267, very sorry for the late reply. I might have missed the notification for some reason.

To clarify, we did not continue training from a pre-trained model. IndoT5 was trained completely from scratch, with a new vocabulary/tokenizer. Fine-tuning only happened for downstream tasks such as QA and summarization, where we started from our own pre-trained model and kept the same vocabulary.
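For context on why the question matters in the general case, here is a minimal toy sketch of what would go wrong if a new tokenizer were paired with embeddings learned under an old one. The vocabularies, embedding values, and the `lookup` helper are all made up for illustration; they are not IndoT5's actual vocabulary or weights.

```python
# Toy illustration with hypothetical vocabularies (not IndoT5's real ones).
# A token ID is just a row index into the embedding table, so the table is
# only meaningful together with the vocabulary it was trained with.
old_vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
new_vocab = {"<pad>": 0, "kucing": 1, "duduk": 2, "the": 3}

# Pretend these rows were learned with old_vocab: row i belongs to token ID i.
old_embeddings = [
    [0.0, 0.0],  # <pad>
    [0.1, 0.2],  # the
    [0.3, 0.4],  # cat
    [0.5, 0.6],  # sat
]


def lookup(token, vocab, embeddings):
    """Return the embedding row that the given vocabulary maps the token to."""
    return embeddings[vocab[token]]


# Same token, same weights, different tokenizer: a different row comes back.
right = lookup("the", old_vocab, old_embeddings)  # row 1, learned for "the"
wrong = lookup("the", new_vocab, old_embeddings)  # row 3, learned for "sat"
mismatch = right != wrong
```

Training from scratch avoids this mismatch because the embedding rows are learned together with the new vocabulary, and keeping that vocabulary during fine-tuning keeps the rows aligned.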

Hope that clears it up.
