Support Llama 3 tokenizer (implement `ignore_merges` behavior) #453

robertknight · 2024-12-08T20:23:46Z

Attempting to load the tokenizer.json file from Llama 3.2 fails with an error processing the BPE merge entries:

`Error: BpeError(InvalidMergeEntry("Ġ ĠĠĠ"))`

If rten-text is modified to ignore this error, then the qwen2_chat example works with Llama 3.2, after a minor modification to the special token IDs.

Edit: I have just noticed the `ignore_merges: true` setting in the tokenizer.json file. This seems relevant.
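As a quick way to check whether a given tokenizer.json sets this flag, here is a minimal Python sketch. It assumes the flag sits under the top-level `model` object, which is where Hugging Face tokenizer files store BPE settings. The helper name `read_ignore_merges` and the inline sample document are hypothetical and are not part of rten-text.

```python
import json


def read_ignore_merges(tokenizer_json: str) -> bool:
    """Return the value of model.ignore_merges, defaulting to False if absent."""
    config = json.loads(tokenizer_json)
    model = config.get("model", {})
    return bool(model.get("ignore_merges", False))


# Hypothetical, heavily trimmed stand-in for a real tokenizer.json file.
sample = '{"model": {"type": "BPE", "ignore_merges": true, "vocab": {}, "merges": []}}'
print(read_ignore_merges(sample))  # True
```

Older tokenizer files omit the key entirely, so the sketch treats a missing flag as `false`.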


robertknight · 2024-12-08T20:54:57Z

ignore_merges was added in huggingface/tokenizers@914576f. See also https://github.com/huggingface/tokenizers/pull/1493/files.

The documentation says:

ignore_merges (bool, optional) — Whether or not to match tokens with the vocab before using merges.
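To illustrate what that description implies, here is a simplified Python sketch of a BPE encoder for one pre-tokenized piece. It is an assumption-laden toy, not the rten-text or huggingface/tokenizers implementation: the function `bpe_encode_piece`, the toy vocabulary, and the merge list are all made up for illustration. When `ignore_merges` is set, a piece that already appears in the vocab is returned directly and the merge loop is skipped.

```python
def bpe_encode_piece(piece, vocab, merges, ignore_merges=False):
    """Encode one pre-tokenized piece into token IDs (toy BPE)."""
    # With ignore_merges, a piece that is already a vocab entry is
    # emitted as-is, before any merge rules are consulted.
    if ignore_merges and piece in vocab:
        return [vocab[piece]]

    # Lower rank means the merge is applied earlier.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    parts = list(piece)
    while len(parts) > 1:
        # Pick the adjacent pair with the lowest merge rank.
        best = min(
            range(len(parts) - 1),
            key=lambda i: ranks.get((parts[i], parts[i + 1]), float("inf")),
        )
        if (parts[best], parts[best + 1]) not in ranks:
            break  # no applicable merges remain
        parts[best:best + 2] = [parts[best] + parts[best + 1]]
    return [vocab[p] for p in parts]


# Toy data: "abc" is in the vocab but cannot be produced by the merge list.
vocab = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
merges = [("a", "b")]

print(bpe_encode_piece("abc", vocab, merges))                      # [3, 2]
print(bpe_encode_piece("abc", vocab, merges, ignore_merges=True))  # [4]
```

The toy vocab shows why the flag matters: a token can exist in the vocab without being reachable through the merge list, so the two modes can produce different IDs for the same input.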

robertknight added the tokenizers label Dec 8, 2024

robertknight mentioned this issue Dec 9, 2024

Clean up outdated comments in BPE tokenizer, pass configuration as a struct #455

Merged

robertknight changed the title ~~Investigate InvalidMergeEntry error when loading Llama 3 tokenizer~~ Support Llama 3 tokenizer (implement ignore_merges behavior) Dec 9, 2024
