Run today's most used tokenizers directly in your browser or Node.js application. No heavy dependencies, no server required. Just fast, client-side tokenization compatible with thousands of models on the Hugging Face Hub. These tokenizers are also used in 🤗 Transformers.js.
- Lightweight (~8.3 kB gzipped)
- Zero dependencies
- Works in browsers and Node.js
```bash
npm install @huggingface/tokenizers
```

Alternatively, you can use it via a CDN as follows:
```html
<script type="module">
  import { Tokenizer } from "https://cdn.jsdelivr.net/npm/@huggingface/tokenizers";
</script>
```

```js
import { Tokenizer } from "@huggingface/tokenizers";
// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());
// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
// Tokenize text
const tokens = tokenizer.tokenize("Hello World"); // ['Hello', 'ĠWorld']
const encoded = tokenizer.encode("Hello World"); // { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }
const decoded = tokenizer.decode(encoded.ids); // 'Hello World'
```

This library expects two files from Hugging Face models:
- `tokenizer.json` - Contains the tokenizer configuration
- `tokenizer_config.json` - Contains additional metadata
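In Node.js, the same two files can also be loaded from disk instead of fetched from the Hub. A minimal sketch, assuming a local directory that already contains both files (the path below is hypothetical):

```js
import { readFile } from "node:fs/promises";
import { Tokenizer } from "@huggingface/tokenizers";

// Hypothetical local directory containing tokenizer.json and tokenizer_config.json,
// e.g. from `git clone https://huggingface.co/HuggingFaceTB/SmolLM3-3B`
const dir = "./SmolLM3-3B";

// Parse both files and construct the tokenizer exactly as in the fetch-based example
const tokenizerJson = JSON.parse(await readFile(`${dir}/tokenizer.json`, "utf8"));
const tokenizerConfig = JSON.parse(await readFile(`${dir}/tokenizer_config.json`, "utf8"));

const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
console.log(tokenizer.encode("Hello World").ids);
```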
Tokenizers.js supports Hugging Face tokenizer components (a small example combining several of them follows the lists):

**Normalizers**
- NFD
- NFKC
- NFC
- NFKD
- Lowercase
- Strip
- StripAccents
- Replace
- BERT Normalizer
- Precompiled
- Sequence

**Pre-tokenizers**
- BERT
- ByteLevel
- Whitespace
- WhitespaceSplit
- Metaspace
- CharDelimiterSplit
- Split
- Punctuation
- Digits

**Models**
- BPE (Byte-Pair Encoding)
- WordPiece
- Unigram
- Legacy

**Post-processors**
- ByteLevel
- TemplateProcessing
- RobertaProcessing
- BertProcessing
- Sequence

**Decoders**
- ByteLevel
- WordPiece
- Metaspace
- BPE
- CTC
- Replace
- Fuse
- Strip
- ByteFallback
- Sequence
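Because the `Tokenizer` constructor takes the parsed contents of `tokenizer.json`, these components can also be wired together by hand. A minimal sketch, assuming the constructor accepts any object following the `tokenizer.json` schema; the tiny WordPiece vocabulary and the empty tokenizer config are made up for illustration:

```js
import { Tokenizer } from "@huggingface/tokenizers";

// Hand-written tokenizer.json-style definition combining components from the
// lists above: a Lowercase normalizer, a Whitespace pre-tokenizer, a WordPiece
// model, and a WordPiece decoder.
const tokenizerJson = {
  version: "1.0",
  normalizer: { type: "Lowercase" },
  pre_tokenizer: { type: "Whitespace" },
  model: {
    type: "WordPiece",
    unk_token: "[UNK]",
    continuing_subword_prefix: "##",
    max_input_chars_per_word: 100,
    vocab: { "[UNK]": 0, "hello": 1, "world": 2 }, // illustrative vocabulary
  },
  decoder: { type: "WordPiece", prefix: "##", cleanup: true },
};

const tokenizer = new Tokenizer(tokenizerJson, {}); // empty tokenizer_config for this sketch
console.log(tokenizer.tokenize("Hello World")); // expected: ['hello', 'world']
```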