Pre-tokenizers that support multi-word/non-whitespace BPE in single pass #1753


Open

mjbommar wants to merge 3 commits into main

Conversation

mjbommar

Inspired by the SuperBPE results this week, we had Claude Code clean up some old R&D work that others might find interesting.

We had some early success with 170M-parameter model training but didn't pursue this further. Hopefully someone else can take it further or test it more thoroughly.

This PR implements two pre-tokenizers for training:

  • RandomChunkSplit: Splits text into chunks of random length (configurable min/max), ignoring whitespace boundaries completely

  • RandomWhitespaceSplit: Probabilistically decides whether to split on whitespace, allowing for multi-word expressions

The key idea is that, unlike SuperBPE, these pre-tokenizers:

  1. Can be trained in a single pass of standard BPE training
  2. Can be used by tokenizers/transformers out of the box (by removing the pre-tokenizer from the trained model; a minimal sketch follows below)
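
For concreteness, here is a minimal, hedged training sketch. It assumes the Python bindings expose the new pre-tokenizer as `tokenizers.pre_tokenizers.RandomChunkSplit` with min/max length arguments; the exact names may differ from what this PR finally exposes.

```python
# Hedged sketch, not the PR's confirmed API: RandomChunkSplit and its
# min_length/max_length arguments are assumptions based on the description above.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Chunk the training text into random 2-10 character spans, ignoring
# whitespace boundaries, so BPE merges can cross word boundaries.
tokenizer.pre_tokenizer = pre_tokenizers.RandomChunkSplit(min_length=2, max_length=10)

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Drop the random pre-tokenizer so the trained vocabulary (including its
# multi-word tokens) can be used out of the box at inference time.
# Depending on the tokenizers version, this may instead require editing
# the saved tokenizer.json directly.
tokenizer.pre_tokenizer = None
tokenizer.save("multi-word-bpe.json")
```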

Examples of training are included in the PR (see the train_random_chunk_bpe.py and train_random_whitespace_bpe.py scripts).

Example of a very small trained model that produces multi-word tokens with PreTrainedTokenizerFast:

In [1]: from transformers import PreTrainedTokenizerFast

In [2]: t = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")
tokenizer_config.json: 100%|██████████| 304/304 [00:00<00:00, 3.65MB/s]
tokenizer.json: 100%|██████████| 3.08M/3.08M [00:00<00:00, 14.1MB/s]

In [3]: [t.decode(x) for x in t.encode("This is a test of the emergency broadcast siren.")]
Out[3]: ['This ',
 'is a ',
 'test ',
 'of the ',
 'emergency ',
 'broadcast ',
 'si',
 'ren',
 '.']

mjbommar and others added 3 commits March 22, 2025 10:54
This commit introduces a new pre-tokenizer that enables BPE models to learn
tokens that span across whitespace boundaries. The RandomChunkSplit pre-tokenizer
segments text into random-length chunks, allowing multi-word expressions to be
learned as single tokens.

Features:
- Rust implementation of RandomChunkSplit pre-tokenizer with configurable min/max chunk lengths
- Python bindings with full test coverage
- Documentation and example code
- Benchmarks comparing performance with traditional pre-tokenizers

This feature is especially useful for domain-specific applications where multi-word
technical terms and expressions are semantically meaningful as a single unit.

This commit adds:
1. RandomWhitespaceSplit pre-tokenizer that probabilistically decides whether to split at whitespace characters
2. Updated Python bindings and tests for the new pre-tokenizer
3. Added train_random_whitespace_bpe.py and train_random_chunk_bpe.py scripts
4. Added deterministic mode to both RandomChunkSplit and RandomWhitespaceSplit for consistent inference

This complements the RandomChunkSplit pre-tokenizer by providing an alternative approach to enabling BPE models to learn multi-word tokens.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
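
To make the second commit's additions concrete, here is a hedged sketch of how RandomWhitespaceSplit might be wired into training. The `split_probability` and `deterministic` parameter names are assumptions based on the commit description, not the PR's confirmed API:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Split at each whitespace character only 30% of the time during training, so
# most word boundaries stay inside a single pre-token and multi-word merges
# can be learned. A deterministic mode (per the commit above) would make the
# split decisions reproducible for consistent inference.
tokenizer.pre_tokenizer = pre_tokenizers.RandomWhitespaceSplit(
    split_probability=0.3, deterministic=False
)

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```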
@IsNoobgrammer

Let me tag

@ArthurZucker
@Narsil

But it wouldn't be as effective as the original SuperBPE tokenizer; even in the most ideal scenario, matching SuperBPE's compression would require a larger corpus of data than SuperBPE needs.

@mjbommar
Author

But it wouldn't be as effective as the original SuperBPE tokenizer; even in the most ideal scenario, matching SuperBPE's compression would require a larger corpus of data than SuperBPE needs.

Agreed that SuperBPE is almost certainly better for natural language. This was internal research from our first-generation models last spring, so it's an older, simpler technique.

If I were to summarize the differences, based on my understanding of SuperBPE:

  1. These methods immediately begin learning multi-word tokens, which might be better for smaller corpora
  2. There is no need to figure out the τ hyperparameter for the stage 1 -> stage 2 transition
  3. RandomWhitespaceSplit can parameterize the length of multi-word tokens through its split probability (see the sketch after this list), whereas SuperBPE doesn't seem to provide a mechanism for incentivizing longer/shorter stage 2 tokens
  4. RandomChunkSplit can learn tokens that are not whitespace-tokenized at all (e.g., code or structured data, especially things like minified JS/JSON, dense languages like Lisp, etc.), whereas SuperBPE still seems to assume an initial (whitespace?) tokenization in stage 1; this is also discussed a bit in C3 of the SuperBPE paper
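
To illustrate point 3, here is a quick back-of-the-envelope sketch (my own assumption, not something measured in the PR): if each whitespace boundary is split independently with probability p, the pre-token length in words is geometric with mean 1/p, so the split probability directly controls how long the multi-word candidates tend to be.

```python
import random

def mean_words_per_pretoken(p: float, n_boundaries: int = 1_000_000) -> float:
    """Simulate pre-token lengths (in words) when each whitespace splits with probability p."""
    lengths, words = [], 1
    for _ in range(n_boundaries):
        if random.random() < p:   # split here: close the current pre-token
            lengths.append(words)
            words = 1
        else:                     # keep this whitespace inside the pre-token
            words += 1
    lengths.append(words)
    return sum(lengths) / len(lengths)

for p in (0.9, 0.5, 0.3, 0.1):
    print(f"split_probability={p}: ~{mean_words_per_pretoken(p):.2f} words/pre-token (1/p = {1/p:.2f})")
```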
