Fix handling of Sequence post-processors in train_new_from_iterator #34246

taidopurason · 2024-10-18T10:40:49Z

What does this PR do?

This PR fixes an issue where the special token IDs stored in the post-processor are not updated when train_new_from_iterator is called on a tokenizer that has a Sequence post-processor. Instead, the special token IDs are copied unchanged from the original tokenizer.

For example, this affects training a new tokenizer from Llama-3 tokenizers, as reported in #33998 and #30752.

Running the following code:

import json
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
ds = load_dataset("wikimedia/wikipedia", "20231101.et", streaming=True, split="train")

# Train a new tokenizer with a 1000-token vocabulary on the first 100 articles.
new_tokenizer = tokenizer.train_new_from_iterator([x["text"] for x in islice(ds, 100)], vocab_size=1000)

print(f"bos_token_id={new_tokenizer.bos_token_id}")
print(f"'Hello world!' tokenized as {new_tokenizer('Hello world!')['input_ids']}")
print(json.dumps(json.loads(new_tokenizer._tokenizer.to_str())['post_processor'], indent=2))

produces the following output:

bos_token_id=0
'Hello world!' tokenized as [128000, 294, 569, 727, 399, 338, 541, 327, 319, 256]
{
  "type": "Sequence",
  "processors": [
    {
      "type": "ByteLevel",
      "add_prefix_space": true,
      "trim_offsets": false,
      "use_regex": true
    },
    {
      "type": "TemplateProcessing",
      "single": [
        {
          "SpecialToken": {
            "id": "<|begin_of_text|>",
            "type_id": 0
          }
        },
        {
          "Sequence": {
            "id": "A",
            "type_id": 0
          }
        }
      ],
      "pair": [
        {
          "SpecialToken": {
            "id": "<|begin_of_text|>",
            "type_id": 0
          }
        },
        {
          "Sequence": {
            "id": "A",
            "type_id": 0
          }
        },
        {
          "SpecialToken": {
            "id": "<|begin_of_text|>",
            "type_id": 1
          }
        },
        {
          "Sequence": {
            "id": "B",
            "type_id": 1
          }
        }
      ],
      "special_tokens": {
        "<|begin_of_text|>": {
          "id": "<|begin_of_text|>",
          "ids": [
            128000
          ],
          "tokens": [
            "<|begin_of_text|>"
          ]
        }
      }
    }
  ]
}

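As shown, the new tokenizer reports bos_token_id=0 but still prepends 128000, the ID from the original tokenizer, because the special_tokens entry of the TemplateProcessing step inside the Sequence post-processor was never updated.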
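To illustrate the kind of update that is missing, here is a minimal sketch that works on the post-processor JSON alone. The helper name refresh_special_token_ids and the token_to_id mapping are hypothetical and exist only for this example; it is not the code changed in this PR, and it assumes the JSON layout shown in the output above.

```python
import copy


def refresh_special_token_ids(post_processor, token_to_id):
    """Return a copy of a post-processor config with special-token IDs re-resolved.

    Hypothetical helper for illustration; token_to_id maps token strings to
    their IDs in the newly trained vocabulary.
    """
    updated = copy.deepcopy(post_processor)
    # A Sequence wraps several processors; treat a bare processor as a one-item list.
    processors = updated["processors"] if updated["type"] == "Sequence" else [updated]
    for processor in processors:
        if processor["type"] != "TemplateProcessing":
            continue
        for entry in processor["special_tokens"].values():
            # Look up each special token again instead of keeping the old ID.
            entry["ids"] = [token_to_id[token] for token in entry["tokens"]]
    return updated


# Trimmed version of the post-processor printed above (templates omitted).
old_config = {
    "type": "Sequence",
    "processors": [
        {"type": "ByteLevel", "add_prefix_space": True, "trim_offsets": False, "use_regex": True},
        {
            "type": "TemplateProcessing",
            "special_tokens": {
                "<|begin_of_text|>": {
                    "id": "<|begin_of_text|>",
                    "ids": [128000],
                    "tokens": ["<|begin_of_text|>"],
                }
            },
        },
    ],
}

new_config = refresh_special_token_ids(old_config, {"<|begin_of_text|>": 0})
print(new_config["processors"][1]["special_tokens"]["<|begin_of_text|>"]["ids"])  # [0]
```

The sketch copies the config before editing it so the original tokenizer's data is left untouched.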

Fixes #33998 #30752

I welcome feedback and suggestions on this fix.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

tokenizers: @ArthurZucker

LysandreJik · 2024-10-22T09:12:04Z

Thanks for the issue! Also cc @itazap for knowledge (but I know your bandwidth is very limited right now)

itazap · 2024-10-25T14:56:55Z

src/transformers/tokenization_utils_fast.py

-                            "Attempted to set a token in the post processor that does not exist in the mapping"
-                        )
-                    post_processor[special_token] = [token, token_id]
+                        _post_processor[special_token] = [token, token_id]

            trained_tokenizer_json["post_processor"] = post_processor


Thanks for working on this!! 👍
Should this line also be updated to assign to _post_processor?

Also, would be great to add a test for this so that we can catch this in the future! Let me know if you would like to contribute the test or I can add it! 🤗

taidopurason · 2024-10-27

Thanks for the feedback! I used _post_processor in the final block so that if Roberta or BERT processors are ever part of a Sequence post-processor, their cls and sep token IDs are updated correctly. I can revert this if you think it's unnecessary.

Including a unit test is a great idea - I’ll start working on it.
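For reference, a rough sketch of the Roberta/BERT case mentioned above. It assumes these processors serialize their cls and sep entries as [token, id] pairs; the helper name refresh_cls_sep_ids, the sample config, and the new IDs are made up for illustration and are not taken from the PR diff.

```python
import copy


def refresh_cls_sep_ids(post_processor, token_to_id):
    """Return a copy of the config with cls/sep IDs re-resolved (hypothetical helper)."""
    updated = copy.deepcopy(post_processor)
    # Handle both a Sequence wrapper and a single bare processor.
    processors = updated["processors"] if updated["type"] == "Sequence" else [updated]
    for processor in processors:
        if processor["type"] not in ("BertProcessing", "RobertaProcessing"):
            continue
        for key in ("cls", "sep"):
            token, _old_id = processor[key]
            # Keep the token string, replace only the ID.
            processor[key] = [token, token_to_id[token]]
    return updated


# Illustrative config: a Roberta processor nested inside a Sequence.
config = {
    "type": "Sequence",
    "processors": [
        {"type": "ByteLevel", "add_prefix_space": True, "trim_offsets": False, "use_regex": True},
        {
            "type": "RobertaProcessing",
            "sep": ["</s>", 2],
            "cls": ["<s>", 0],
            "trim_offsets": True,
            "add_prefix_space": False,
        },
    ],
}

# Made-up IDs standing in for the newly trained vocabulary.
refreshed = refresh_cls_sep_ids(config, {"<s>": 5, "</s>": 6})
print(refreshed["processors"][1]["cls"], refreshed["processors"][1]["sep"])
```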

taidopurason · 2024-10-29

Added a test case for the Sequence post-processor. @itazap, let me know what you think. Should anything else be added or changed?
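As a sketch of the kind of assertion such a test could make (not the test added in this PR), the check below compares the IDs stored in the post-processor with the IDs in the tokenizer's own vocabulary. The helper name find_stale_special_tokens is hypothetical, and it assumes the tokenizer JSON keeps its vocabulary under model.vocab and its added tokens as a list of entries with content and id keys.

```python
def find_stale_special_tokens(tokenizer_json):
    """List (token, stored_id, vocab_id) triples where the post-processor disagrees with the vocabulary."""
    token_to_id = dict(tokenizer_json["model"]["vocab"])
    token_to_id.update({t["content"]: t["id"] for t in tokenizer_json.get("added_tokens", [])})

    post_processor = tokenizer_json["post_processor"]
    processors = post_processor["processors"] if post_processor["type"] == "Sequence" else [post_processor]

    stale = []
    for processor in processors:
        # Only TemplateProcessing carries a special_tokens table; others are skipped.
        for entry in processor.get("special_tokens", {}).values():
            for token, stored_id in zip(entry["tokens"], entry["ids"]):
                if token_to_id.get(token) != stored_id:
                    stale.append((token, stored_id, token_to_id.get(token)))
    return stale


def make_tokenizer_json(bos_id_in_post_processor):
    """Build a tiny, made-up tokenizer JSON whose vocabulary puts the BOS token at ID 0."""
    return {
        "added_tokens": [{"id": 0, "content": "<|begin_of_text|>"}],
        "model": {"vocab": {"<|begin_of_text|>": 0, "a": 1}},
        "post_processor": {
            "type": "Sequence",
            "processors": [
                {"type": "ByteLevel"},
                {
                    "type": "TemplateProcessing",
                    "special_tokens": {
                        "<|begin_of_text|>": {
                            "id": "<|begin_of_text|>",
                            "ids": [bos_id_in_post_processor],
                            "tokens": ["<|begin_of_text|>"],
                        }
                    },
                },
            ],
        },
    }


buggy = find_stale_special_tokens(make_tokenizer_json(128000))
fixed = find_stale_special_tokens(make_tokenizer_json(0))
print(buggy)  # [('<|begin_of_text|>', 128000, 0)]
print(fixed)  # []
```

Against a real tokenizer, the input could be obtained the same way as in the PR description, with json.loads(new_tokenizer._tokenizer.to_str()). A tokenizer without a post-processor would need an extra guard that this sketch leaves out.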

Commits

ba97f5d: Fix Sequence post-processor handling in train_new_from_iterator.

30acc00: Added a test for Sequence post-processor handling in train_new_from_iterator.
