Pre-tokenizers that support multi-word/non-whitespace BPE in single pass #1753


Open

mjbommar wants to merge 3 commits into main

Conversation

mjbommar

Inspired by the SuperBPE results this week, we had Claude Code clean up some old R&D work that others might find interesting.

We had some early success with 170M-parameter model training but didn't pursue this further. Hopefully someone else can take it further or test it more thoroughly.

This PR implements two pre-tokenizers for training:

  • RandomChunkSplit: Splits text into chunks of random length (configurable min/max), ignoring whitespace boundaries completely

  • RandomWhitespaceSplit: Probabilistically decides whether to split on whitespace, allowing for multi-word expressions

The key idea is that, unlike SuperBPE, these pre-tokenizers:

  1. Can be trained in a single pass of standard BPE training
  2. Can be used by tokenizers/transformers out of the box (by removing the pre-tokenizer from the trained model; a minimal sketch follows below)
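
For concreteness, here is a minimal, hedged training sketch. It assumes the Python bindings expose the new pre-tokenizer as `tokenizers.pre_tokenizers.RandomChunkSplit` with min/max length arguments; the exact names may differ from what this PR finally exposes.

```python
# Hedged sketch, not the PR's confirmed API: RandomChunkSplit and its
# min_length/max_length arguments are assumptions based on the description above.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Chunk the training text into random 2-10 character spans, ignoring
# whitespace boundaries, so BPE merges can cross word boundaries.
tokenizer.pre_tokenizer = pre_tokenizers.RandomChunkSplit(min_length=2, max_length=10)

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Drop the random pre-tokenizer so the trained vocabulary (including its
# multi-word tokens) can be used out of the box at inference time.
# Depending on the tokenizers version, this may instead require editing
# the saved tokenizer.json directly.
tokenizer.pre_tokenizer = None
tokenizer.save("multi-word-bpe.json")
```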

Examples of training are included in the PR (see the train_random_chunk_bpe.py and train_random_whitespace_bpe.py scripts).

Example of a very small trained model that produces multi-word tokens with PreTrainedTokenizerFast:

In [1]: from transformers import PreTrainedTokenizerFast

In [2]: t = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")
tokenizer_config.json: 100%|██████████| 304/304 [00:00<00:00, 3.65MB/s]
tokenizer.json: 100%|██████████| 3.08M/3.08M [00:00<00:00, 14.1MB/s]

In [3]: [t.decode(x) for x in t.encode("This is a test of the emergency broadcast siren.")]
Out[3]: ['This ',
 'is a ',
 'test ',
 'of the ',
 'emergency ',
 'broadcast ',
 'si',
 'ren',
 '.']

mjbommar and others added 3 commits March 22, 2025 10:54
This commit introduces a new pre-tokenizer that enables BPE models to learn
tokens that span across whitespace boundaries. The RandomChunkSplit pre-tokenizer
segments text into random-length chunks, allowing multi-word expressions to be
learned as single tokens.

Features:
- Rust implementation of RandomChunkSplit pre-tokenizer with configurable min/max chunk lengths
- Python bindings with full test coverage
- Documentation and example code
- Benchmarks comparing performance with traditional pre-tokenizers

This feature is especially useful for domain-specific applications where multi-word
technical terms and expressions are semantically meaningful as a single unit.

This commit adds:
1. RandomWhitespaceSplit pre-tokenizer that probabilistically decides whether to split at whitespace characters
2. Updated Python bindings and tests for the new pre-tokenizer
3. Added train_random_whitespace_bpe.py and train_random_chunk_bpe.py scripts
4. Added deterministic mode to both RandomChunkSplit and RandomWhitespaceSplit for consistent inference

This complements the RandomChunkSplit pre-tokenizer by providing an alternative approach to enabling BPE models to learn multi-word tokens.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
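
To make the second commit's additions concrete, here is a hedged sketch of how RandomWhitespaceSplit might be wired into training. The `split_probability` and `deterministic` parameter names are assumptions based on the commit description, not the PR's confirmed API:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Split at each whitespace character only 30% of the time during training, so
# most word boundaries stay inside a single pre-token and multi-word merges
# can be learned. A deterministic mode (per the commit above) would make the
# split decisions reproducible for consistent inference.
tokenizer.pre_tokenizer = pre_tokenizers.RandomWhitespaceSplit(
    split_probability=0.3, deterministic=False
)

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```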
@IsNoobgrammer

Let me tag

@ArthurZucker
@Narsil

But it wouldn't be as effective as the original SuperBPE tokenizer; even in the most ideal scenario, matching SuperBPE's compression would require a larger corpus of data than SuperBPE needs.

@mjbommar
Author

But it wouldn't be as effective as the original SuperBPE tokenizer; even in the most ideal scenario, matching SuperBPE's compression would require a larger corpus of data than SuperBPE needs.

Agreed that SuperBPE is almost certainly better for natural language. This was internal research from our first-generation models last spring, so it's an older, simpler technique.

If I were to summarize the differences, based on my understanding of SuperBPE:

  1. These methods immediately begin learning multi-word tokens, which might be better for smaller corpora
  2. There is no need to figure out the τ hyperparameter for the stage 1 -> stage 2 transition
  3. RandomWhitespaceSplit can parameterize the length of multi-word tokens through its split probability (see the sketch after this list), whereas SuperBPE doesn't seem to provide a mechanism for incentivizing longer/shorter stage 2 tokens
  4. RandomChunkSplit can learn tokens that are not whitespace-tokenized at all (e.g., code or structured data, especially things like minified JS/JSON, dense languages like Lisp, etc.), whereas SuperBPE still seems to assume an initial (whitespace?) tokenization in stage 1; this is also discussed a bit in C3 of the SuperBPE paper
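
To illustrate point 3, here is a quick back-of-the-envelope sketch (my own assumption, not something measured in the PR): if each whitespace boundary is split independently with probability p, the pre-token length in words is geometric with mean 1/p, so the split probability directly controls how long the multi-word candidates tend to be.

```python
import random

def mean_words_per_pretoken(p: float, n_boundaries: int = 1_000_000) -> float:
    """Simulate pre-token lengths (in words) when each whitespace splits with probability p."""
    lengths, words = [], 1
    for _ in range(n_boundaries):
        if random.random() < p:   # split here: close the current pre-token
            lengths.append(words)
            words = 1
        else:                     # keep this whitespace inside the pre-token
            words += 1
    lengths.append(words)
    return sum(lengths) / len(lengths)

for p in (0.9, 0.5, 0.3, 0.1):
    print(f"split_probability={p}: ~{mean_words_per_pretoken(p):.2f} words/pre-token (1/p = {1/p:.2f})")
```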
