Skip to content

[Enhancement] Tokenization Correctness & Robustness Improvements #7

Description

@farhan-syah

Context

This issue tracks enhancements based on feedback from ReturningTarzan on Reddit:

The main thing about tokenization is always correctness. Throughput is nice but secondary.

While splintr is already fast and functional, these enhancements will ensure production-grade correctness and robustness.


Status Summary

Already Implemented ✅

Feature Status Reference
Control tokens (multiple BPE runs) ✅ Implemented src/core/tokenizer.rs:863-899 - encode_with_special() splits at special tokens and runs BPE on each segment separately
Decoding to byte strings ✅ Implemented src/core/tokenizer.rs:905 - decode_bytes(&[u32]) -> Vec<u8>
Buffer/queue for incomplete chars ✅ Implemented src/core/streaming.rs - StreamingDecoder and ByteLevelStreamingDecoder (commit 9e45c14)
Basic test coverage ✅ Implemented ~1,537 lines in tests/ covering 4 tokenizers (commit d6d836f)

Proposed Enhancements

1. Comprehensive Correctness Testing

Priority: HIGH | Complexity: MEDIUM

Current State: Integration tests cover basic functionality and exact token IDs (~1,537 lines).

Gaps:

  • Add correctness validation suite comparing against tiktoken/HuggingFace on large text corpus
  • Add property-based testing (fuzzing) with cargo-fuzz or proptest
  • Add edge case tests: malformed UTF-8, empty strings, very long inputs, pathological BPE cases
  • Document test coverage metrics and correctness validation methodology
  • Add "silent degradation" detection: tests that catch minor tokenization differences

Why it matters: LLMs are robust to small differences, but those differences can accumulate and silently degrade performance over large datasets.


2. Trimming/Padding/Normalization for Added Tokens

Priority: MEDIUM | Complexity: LOW-MEDIUM

Current State: No normalization layer exists.

Tasks:

  • Research normalization requirements for tiktoken/HuggingFace implementations
  • Determine if normalization (NFKC, whitespace trimming, etc.) is needed for correctness
  • If needed: Add optional Normalizer trait and implementations
  • Document normalization behavior (or lack thereof) in API docs
  • Add tests for special token boundary cases (leading/trailing whitespace, etc.)

3. Preprocessing/Postprocessing Validation

Priority: HIGH | Complexity: LOW

Current State: Has PCRE2 regex, ByteLevel preprocessing, Aho-Corasick. No formal validation against reference implementations.

Tasks:

  • Add validation tests comparing regex split output to tiktoken for diverse examples
  • Validate ByteLevel encode/decode against HuggingFace tokenizers
  • Test preprocessing edge cases: emoji, rare unicode, control characters
  • Document preprocessing/postprocessing guarantees in API docs

4. Ergonomic API Improvements for Single Token Decoding

Priority: LOW | Complexity: LOW

Current State: decode_bytes(&[u32]) -> Vec<u8> and streaming decoders work well, but single-token convenience methods could be added.

Tasks:

  • Consider adding decode_token_bytes(u32) -> Option<Vec<u8>> convenience method
  • Consider adding decode_token(u32) -> Option<String> method
  • Document streaming and single-token decode patterns more prominently in API guide

Success Criteria

  • Correctness validated against reference implementations (tiktoken/HuggingFace) on large sample sets
  • Zero tokenization differences from reference implementations for standard inputs
  • Property-based testing catches edge cases automatically
  • All preprocessing/postprocessing validated and documented

Resources

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions