[Enhancement] Tokenization Correctness & Robustness Improvements

## Context

This issue tracks enhancements based on feedback from [ReturningTarzan](https://www.reddit.com/user/ReturningTarzan/) on [Reddit](https://www.reddit.com/r/LocalLLaMA/comments/1p71luf/comment/nqv7upz/):

> The main thing about tokenization is always correctness. Throughput is nice but secondary.

While splintr is already fast and functional, these enhancements will ensure production-grade correctness and robustness.

---

## Status Summary

### Already Implemented ✅

| Feature | Status | Reference |
|---------|--------|-----------|
| Control tokens (multiple BPE runs) | ✅ Implemented | `src/core/tokenizer.rs:863-899` - `encode_with_special()` splits at special tokens and runs BPE on each segment separately |
| Decoding to byte strings | ✅ Implemented | `src/core/tokenizer.rs:905` - `decode_bytes(&[u32]) -> Vec<u8>` |
| Buffer/queue for incomplete chars | ✅ Implemented | `src/core/streaming.rs` - `StreamingDecoder` and `ByteLevelStreamingDecoder` (commit `9e45c14`) |
| Basic test coverage | ✅ Implemented | ~1,537 lines in `tests/` covering 4 tokenizers (commit `d6d836f`) |

---

## Proposed Enhancements

### 1. Comprehensive Correctness Testing
**Priority: HIGH | Complexity: MEDIUM**

**Current State**: Integration tests cover basic functionality and exact token IDs (~1,537 lines).

**Gaps:**
- [ ] Add correctness validation suite comparing against tiktoken/HuggingFace on large text corpus
- [ ] Add property-based testing (fuzzing) with `cargo-fuzz` or `proptest`
- [ ] Add edge case tests: malformed UTF-8, empty strings, very long inputs, pathological BPE cases
- [ ] Document test coverage metrics and correctness validation methodology
- [ ] Add "silent degradation" detection: tests that catch minor tokenization differences

**Why it matters**: LLMs are robust to small differences, but those differences can accumulate and silently degrade performance over large datasets.

---

### 2. Trimming/Padding/Normalization for Added Tokens
**Priority: MEDIUM | Complexity: LOW-MEDIUM**

**Current State**: No normalization layer exists.

**Tasks:**
- [ ] Research normalization requirements for tiktoken/HuggingFace implementations
- [ ] Determine if normalization (NFKC, whitespace trimming, etc.) is needed for correctness
- [ ] If needed: Add optional `Normalizer` trait and implementations
- [ ] Document normalization behavior (or lack thereof) in API docs
- [ ] Add tests for special token boundary cases (leading/trailing whitespace, etc.)

---

### 3. Preprocessing/Postprocessing Validation
**Priority: HIGH | Complexity: LOW**

**Current State**: Has PCRE2 regex, ByteLevel preprocessing, Aho-Corasick. No formal validation against reference implementations.

**Tasks:**
- [ ] Add validation tests comparing regex split output to tiktoken for diverse examples
- [ ] Validate ByteLevel encode/decode against HuggingFace tokenizers
- [ ] Test preprocessing edge cases: emoji, rare unicode, control characters
- [ ] Document preprocessing/postprocessing guarantees in API docs

---

### 4. Ergonomic API Improvements for Single Token Decoding
**Priority: LOW | Complexity: LOW**

**Current State**: `decode_bytes(&[u32]) -> Vec<u8>` and streaming decoders work well, but single-token convenience methods could be added.

**Tasks:**
- [ ] Consider adding `decode_token_bytes(u32) -> Option<Vec<u8>>` convenience method
- [ ] Consider adding `decode_token(u32) -> Option<String>` method
- [ ] Document streaming and single-token decode patterns more prominently in API guide

---

## Success Criteria

- [ ] Correctness validated against reference implementations (tiktoken/HuggingFace) on large sample sets
- [ ] Zero tokenization differences from reference implementations for standard inputs
- [ ] Property-based testing catches edge cases automatically
- [ ] All preprocessing/postprocessing validated and documented

---

## Resources

- Original Reddit feedback: https://www.reddit.com/r/LocalLLaMA/comments/1p71luf/comment/nqv7upz/
- Tiktoken (reference): https://github.com/openai/tiktoken
- HuggingFace tokenizers: https://github.com/huggingface/tokenizers

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Enhancement] Tokenization Correctness & Robustness Improvements #7

Context

Status Summary

Already Implemented ✅

Proposed Enhancements

1. Comprehensive Correctness Testing

2. Trimming/Padding/Normalization for Added Tokens

3. Preprocessing/Postprocessing Validation

4. Ergonomic API Improvements for Single Token Decoding

Success Criteria

Resources

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Feature	Status	Reference
Control tokens (multiple BPE runs)	✅ Implemented	`src/core/tokenizer.rs:863-899` - `encode_with_special()` splits at special tokens and runs BPE on each segment separately
Decoding to byte strings	✅ Implemented	`src/core/tokenizer.rs:905` - `decode_bytes(&[u32]) -> Vec<u8>`
Buffer/queue for incomplete chars	✅ Implemented	`src/core/streaming.rs` - `StreamingDecoder` and `ByteLevelStreamingDecoder` (commit `9e45c14`)
Basic test coverage	✅ Implemented	~1,537 lines in `tests/` covering 4 tokenizers (commit `d6d836f`)

Uh oh!

[Enhancement] Tokenization Correctness & Robustness Improvements #7

Description

Context

Status Summary

Already Implemented ✅

Proposed Enhancements

1. Comprehensive Correctness Testing

2. Trimming/Padding/Normalization for Added Tokens

3. Preprocessing/Postprocessing Validation

4. Ergonomic API Improvements for Single Token Decoding

Success Criteria

Resources

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions