Support Llama 3 tokenizer (implement ignore_merges behavior) #453

Open
robertknight opened this issue Dec 8, 2024 · 1 comment

Comments

robertknight commented Dec 8, 2024

Attempting to load the tokenizer.json file from Llama 3.2 fails with an error processing the BPE merge entries:

Error: BpeError(InvalidMergeEntry("Ġ ĠĠĠ"))

If rten-text is modified to ignore this error, then the qwen2_chat example works with Llama 3.2, after a minor modification to the special token IDs.

Edit: I have just noticed that the tokenizer.json file sets ignore_merges: true, which seems relevant.

robertknight commented Dec 8, 2024

ignore_merges was added in huggingface/tokenizers@914576f. See also https://github.com/huggingface/tokenizers/pull/1493/files.

The documentation says:

ignore_merges (bool, optional) — Whether or not to match tokens with the vocab before using merges.
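
To make the documented behavior concrete, here is a minimal sketch of how an encoder might honor ignore_merges. The `Bpe` struct, its fields, `encode_piece` and `toy_model` are hypothetical names invented for illustration; they are not the actual types of rten-text or huggingface/tokenizers. The sketch assumes the input has already been pre-tokenized into pieces and that every single character of a piece has a vocab entry.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified BPE model (not rten-text's real types).
struct Bpe {
    /// Token string -> token ID.
    vocab: HashMap<String, u32>,
    /// (left ID, right ID) -> (rank, merged ID). Lower rank merges first.
    merges: HashMap<(u32, u32), (usize, u32)>,
    /// When true, look a piece up in the vocab before applying merges.
    ignore_merges: bool,
}

impl Bpe {
    /// Encode one pre-tokenized piece into token IDs.
    fn encode_piece(&self, piece: &str) -> Vec<u32> {
        // With ignore_merges, a piece that is itself a vocab entry becomes
        // a single token, even if no sequence of merges could produce it.
        if self.ignore_merges {
            if let Some(&id) = self.vocab.get(piece) {
                return vec![id];
            }
        }

        // Otherwise start from single characters. Characters missing from
        // the vocab are silently dropped here to keep the sketch short.
        let mut ids: Vec<u32> = piece
            .chars()
            .filter_map(|c| self.vocab.get(c.to_string().as_str()).copied())
            .collect();

        // Repeatedly apply the lowest-ranked merge among adjacent pairs
        // (leftmost pair wins ties) until no merge applies.
        loop {
            let best = ids
                .windows(2)
                .enumerate()
                .filter_map(|(i, w)| {
                    self.merges
                        .get(&(w[0], w[1]))
                        .map(|&(rank, merged)| (rank, i, merged))
                })
                .min();
            match best {
                Some((_, i, merged)) => {
                    ids[i] = merged;
                    ids.remove(i + 1);
                }
                None => return ids,
            }
        }
    }
}

/// Toy model: "abc" is in the vocab, but the only merge is "a" + "b".
fn toy_model(ignore_merges: bool) -> Bpe {
    let vocab = [("a", 0), ("b", 1), ("c", 2), ("ab", 3), ("abc", 4)]
        .iter()
        .map(|&(tok, id)| (tok.to_string(), id))
        .collect();
    let merges = HashMap::from([((0, 1), (0, 3))]);
    Bpe { vocab, merges, ignore_merges }
}
```

In this toy model, `toy_model(false).encode_piece("abc")` yields the IDs for "ab" and "c", while `toy_model(true).encode_piece("abc")` yields the single ID for "abc". Whether the Llama 3 merge list has the same property is not established in this thread.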

robertknight changed the title from "Investigate InvalidMergeEntry error when loading Llama 3 tokenizer" to "Support Llama 3 tokenizer (implement ignore_merges behavior)" on Dec 9, 2024