huggingface/tokenizers.js

A lightweight tokenizer for the Web

Run today's most used tokenizers directly in your browser or Node.js application. No heavy dependencies, no server required. Just fast, client-side tokenization compatible with thousands of models on the Hugging Face Hub. These tokenizers are also used in 🤗 Transformers.js.

Features

  • Lightweight (~8.3 kB gzipped)
  • Zero dependencies
  • Works in browsers and Node.js

Installation

npm install @huggingface/tokenizers

Alternatively, you can use it via a CDN as follows:

<script type="module">
  import { Tokenizer } from "https://cdn.jsdelivr.net/npm/@huggingface/tokenizers";
</script>

Usage

import { Tokenizer } from "@huggingface/tokenizers";

// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World"); // ['Hello', 'ĠWorld']
const encoded = tokenizer.encode("Hello World"); // { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }
const decoded = tokenizer.decode(encoded.ids); // 'Hello World'
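
The object returned by `encode` holds parallel arrays (`ids`, `tokens`, `attention_mask`). When inspecting how a string was split, it can help to pair each token with its id. The following is a minimal sketch, not part of the library: `pairTokens` is a hypothetical helper, and it assumes `ids` and `tokens` always have the same length, as in the example output above.

```javascript
// Hypothetical helper (not part of @huggingface/tokenizers): pairs each token
// string with its id, assuming `tokens` and `ids` are parallel arrays.
function pairTokens(encoded) {
  if (encoded.tokens.length !== encoded.ids.length) {
    throw new Error("tokens and ids must have the same length");
  }
  return encoded.tokens.map((token, i) => ({ token, id: encoded.ids[i] }));
}

// Sample value modeled on the encode() output shown in the usage example.
const sampleEncoding = {
  ids: [9906, 4435],
  tokens: ["Hello", "ĠWorld"],
  attention_mask: [1, 1],
};

console.log(pairTokens(sampleEncoding));
```

In a real application, the result of `tokenizer.encode(...)` would be passed in place of `sampleEncoding`.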

Requirements

This library expects two files from a model's repository on the Hugging Face Hub:

  • tokenizer.json - Contains the tokenizer configuration
  • tokenizer_config.json - Contains additional metadata
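
Both files can be located from the model ID alone. The sketch below builds their URLs using the same `resolve/main` pattern as the usage example; `tokenizerFileUrls` and its `revision` parameter are hypothetical names introduced here for illustration, not part of the library.

```javascript
// Hypothetical helper (not part of @huggingface/tokenizers): builds the Hub
// URLs for the two required files. The `revision` parameter is an assumption
// added for illustration; "main" matches the URLs in the usage example.
function tokenizerFileUrls(modelId, revision = "main") {
  const base = `https://huggingface.co/${modelId}/resolve/${revision}`;
  return {
    tokenizerJson: `${base}/tokenizer.json`,
    tokenizerConfig: `${base}/tokenizer_config.json`,
  };
}

const fileUrls = tokenizerFileUrls("HuggingFaceTB/SmolLM3-3B");
console.log(fileUrls.tokenizerJson);
// https://huggingface.co/HuggingFaceTB/SmolLM3-3B/resolve/main/tokenizer.json
console.log(fileUrls.tokenizerConfig);
// https://huggingface.co/HuggingFaceTB/SmolLM3-3B/resolve/main/tokenizer_config.json
```

Each URL can then be fetched and parsed as JSON (checking `res.ok` before calling `res.json()` is a sensible precaution), and the two resulting objects passed to `new Tokenizer(...)`.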

Components

Tokenizers.js supports the following Hugging Face tokenizer components:

Normalizers

  • NFD
  • NFKC
  • NFC
  • NFKD
  • Lowercase
  • Strip
  • StripAccents
  • Replace
  • BERT Normalizer
  • Precompiled
  • Sequence

Pre-tokenizers

  • BERT
  • ByteLevel
  • Whitespace
  • WhitespaceSplit
  • Metaspace
  • CharDelimiterSplit
  • Split
  • Punctuation
  • Digits

Models

  • BPE (Byte-Pair Encoding)
  • WordPiece
  • Unigram
  • Legacy

Post-processors

  • ByteLevel
  • TemplateProcessing
  • RobertaProcessing
  • BertProcessing
  • Sequence

Decoders

  • ByteLevel
  • WordPiece
  • Metaspace
  • BPE
  • CTC
  • Replace
  • Fuse
  • Strip
  • ByteFallback
  • Sequence
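
To see which of these components a given model relies on, one option is to inspect the parsed `tokenizer.json` before constructing the tokenizer. The sketch below is an illustration only: `listComponentTypes` is a hypothetical helper, and it assumes the file has top-level `normalizer`, `pre_tokenizer`, `model`, `post_processor`, and `decoder` entries that each carry a `type` string. Files that omit a section or a `type` field are reported as `null`.

```javascript
// Hypothetical helper (not part of @huggingface/tokenizers). It assumes the
// tokenizer.json layout with top-level "normalizer", "pre_tokenizer", "model",
// "post_processor", and "decoder" entries, each carrying a "type" string.
function listComponentTypes(tokenizerJson) {
  const sections = ["normalizer", "pre_tokenizer", "model", "post_processor", "decoder"];
  const summary = {};
  for (const section of sections) {
    const component = tokenizerJson[section];
    // A section may be null or missing when the tokenizer does not use it.
    summary[section] = component?.type ?? null;
  }
  return summary;
}

// Minimal hand-written stand-in for a parsed tokenizer.json (not a real model's file).
const exampleTokenizerJson = {
  normalizer: null,
  pre_tokenizer: { type: "ByteLevel" },
  model: { type: "BPE" },
  post_processor: { type: "TemplateProcessing" },
  decoder: { type: "ByteLevel" },
};

console.log(listComponentTypes(exampleTokenizerJson));
```

The reported type names can then be compared against the lists above to confirm that every stage of a model's pipeline is supported.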