
[clarification] Chunking Token Limit Behavior #153

@ww2283

Description


I was processing a larger codebase and noticed that chunking overlap is added earlier than I thought it should be, which could potentially lead to embedding failures. I had Claude Code prepare the following description, and I'd like to hear the devs' opinion on whether this observation is valid.


Chunking Token Limit Issue

Issue Summary

When using Ollama embeddings (or other embedding models with hard token limits), batch processing can fail even when the chunk size parameters are set below the model's context limit. The failures affect both traditional and AST-aware chunking and are caused by the overlap settings producing final chunks that exceed the embedding model's token limit.

Root Cause

Critical Unit Mismatch

Traditional Chunking (SentenceSplitter):

  • CLI parameters --doc-chunk-size and --code-chunk-size are in TOKENS
  • SentenceSplitter measures chunk_size in tokens by default
  • Example: chunk_size=100 produces ~350 characters (~100 tokens)

AST Chunking:

  • CLI parameter --ast-chunk-size is passed to astchunk library
  • astchunk measures max_chunk_size in CHARACTERS (non-whitespace count)
  • This creates a unit mismatch with traditional chunking parameters

Overlap Behavior

Both chunking methods apply overlap after the base chunk size, not as part of it:

Traditional Chunking:

--doc-chunk-size 480 (tokens) + --doc-chunk-overlap 128 (tokens, default) = 608 tokens max
--code-chunk-size 512 (tokens, default) + --code-chunk-overlap 50 (tokens, default) = 562 tokens max

AST Chunking:
From astchunk documentation:

# NOTE: max_chunk_size apply to the chunks before overlapping or chunk expansion.
# The final chunk size after overlapping or chunk expansion may exceed max_chunk_size.

Example:

  • User sets: --ast-chunk-size 480 (characters)
  • AST chunking produces: Base chunks up to 480 chars
  • With overlap: Final chunks can be 480 + 64 = 544 characters
  • After tokenization: 544 chars × ~1.05 tokens/char ≈ 570 tokens
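
For anyone who wants to verify the traditional-chunking numbers locally, here is a minimal sketch. It assumes llama_index and tiktoken are installed, and cl100k_base is only a rough stand-in for the embedding model's own tokenizer:

import tiktoken
from llama_index.core.node_parser import SentenceSplitter

# Mirror --doc-chunk-size 480 and --doc-chunk-overlap 128
splitter = SentenceSplitter(chunk_size=480, chunk_overlap=128)
encoder = tiktoken.get_encoding("cl100k_base")

text = open("some_document.md").read()  # any file from the build set
for i, chunk in enumerate(splitter.split_text(text)):
    n_tokens = len(encoder.encode(chunk))
    if n_tokens > 512:
        print(f"chunk {i}: {n_tokens} tokens, exceeds a 512 token embedding limit")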

Embedding Model Limits

Many embedding models have hard token limits:

  • nomic-embed-text-v2: 512 tokens
  • mxbai-embed-large: 512 tokens
  • all-minilm: 256-512 tokens

When a chunk exceeds the model's token limit, Ollama returns a 500 error for the entire batch (typically 32 chunks), causing cascade failures.

Observed Symptoms

Computing Ollama embeddings (batched):  14%|██████ | 61/422 [01:15<07:06,  1.18s/it]
ERROR - leann.embedding_compute - Failed to get embeddings for batch: 500 Server Error: Internal Server Error

Ollama server logs show:

time=2025-10-23T12:44:57.762-04:00 level=INFO source=routes.go:694 msg="" ctxLen=510 tokenCount=537
init: embeddings required but some input tokens were not marked as outputs -> overriding
time=2025-10-23T12:44:58.232-04:00 level=INFO source=server.go:1635 msg="llm embedding error: Failed to create new sequence: the input length exceeds the context length"

Note: A single 510-character chunk produces 537 tokens, exceeding the 512 token limit.

Character-to-Token Ratio Analysis

For code content, the character-to-token ratio is poor:

  • Observed ratio: ~1.05 tokens per character
  • Reason: Code contains symbols, camelCase identifiers, and special characters that tokenize poorly

This is significantly worse than natural language text (~0.25 tokens per character).
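
The ratio is easy to sanity-check on real files. A minimal sketch, again using tiktoken's cl100k_base purely as an approximation of the embedding model's tokenizer:

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")
source = open("example.py").read()  # any code file from the indexed repository
ratio = len(encoder.encode(source)) / len(source)
print(f"{ratio:.2f} tokens per character")
# Prose usually comes out around 0.25; code-heavy files land noticeably higher.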

Impact on Users

  1. Unit confusion: CLI parameters use different units (tokens vs characters) between traditional and AST chunking
  2. Unpredictable failures: Setting --doc-chunk-size 480 seems safe for a 512 token limit, but overlap pushes the final chunk to 608 tokens
  3. Batch cascade failures: One oversized chunk causes the 31 other valid chunks in the same batch to fail
  4. Misleading parameter names: Both chunk_size and max_chunk_size suggest hard limits but don't enforce them
  5. Silent accumulation: Overlap increases chunk sizes beyond the specified limits without warning
  6. Affects both modes: The problem occurs with and without --use-ast-chunking

Proposed Solutions

Solution 1: Safe Chunk Size Calculation (Recommended)

Add conservative default calculations that account for overlap in both traditional and AST chunking:

def calculate_safe_chunk_size(
    model_token_limit: int,
    overlap_tokens: int,
    chunking_mode: str = "traditional"
) -> int:
    """
    Calculate safe chunk size accounting for overlap.

    Args:
        model_token_limit: Maximum tokens supported by embedding model
        overlap_tokens: Overlap size (tokens for traditional, needs conversion for AST)
        chunking_mode: "traditional" (tokens) or "ast" (characters)

    Returns:
        Safe chunk size: tokens for traditional, characters for AST
    """
    safety_factor = 0.9  # 10% safety margin
    safe_limit = int(model_token_limit * safety_factor)

    if chunking_mode == "traditional":
        # Traditional chunking uses tokens
        # Max chunk = chunk_size + overlap, so chunk_size = limit - overlap
        return max(1, safe_limit - overlap_tokens)
    else:  # AST chunking
        # AST chunking measures size in characters, so convert the token budget
        # using a conservative estimate of ~1.2 tokens per character for code
        tokens_per_char = 1.2
        safe_chars = int(safe_limit / tokens_per_char)
        overlap_chars = int(overlap_tokens / tokens_per_char)
        return max(1, safe_chars - overlap_chars)

# Examples for a 512 token limit:
# Traditional: int(512 * 0.9) - 128 = 460 - 128 = 332 tokens (with the default 128 token overlap)
# AST: int(460 / 1.2) - int(64 / 1.2) = 383 - 53 = 330 characters (with a 64 token overlap budget)

Solution 2: Post-Chunking Truncation

Add hard truncation after AST chunking in leann/chunking_utils.py:

def create_ast_chunks(...):
    chunks = chunk_builder.chunkify(code_content)

    # Enforce hard character limit
    safe_limit = max_chunk_size
    for i, chunk in enumerate(chunks):
        chunk_text = extract_chunk_text(chunk)
        if len(chunk_text) > safe_limit:
            chunks[i] = chunk_text[:safe_limit]
            logger.warning(
                f"Truncated AST chunk from {len(chunk_text)} to {safe_limit} chars"
            )

Solution 3: Token-Aware Truncation (Most Robust)

Add token-based truncation in leann/embedding_compute.py before API calls:

def truncate_to_token_limit(texts: list[str], max_tokens: int = 512) -> list[str]:
    """
    Truncate texts to token limit using tiktoken or similar.
    Falls back to conservative character truncation if tokenizer unavailable.
    """
    try:
        import tiktoken
        encoder = tiktoken.get_encoding("cl100k_base")
        truncated = []
        for text in texts:
            tokens = encoder.encode(text)
            if len(tokens) > max_tokens:
                truncated_tokens = tokens[:max_tokens]
                truncated_text = encoder.decode(truncated_tokens)
                truncated.append(truncated_text)
            else:
                truncated.append(text)
        return truncated
    except ImportError:
        # Fallback: Conservative character truncation
        # Assume worst case: 1.5 tokens per character for code
        char_limit = int(max_tokens / 1.5)
        return [text[:char_limit] if len(text) > char_limit else text for text in texts]
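
Usage would then be a one-line guard right before the batch request. The snippet below is illustrative only, since the actual batching code in embedding_compute.py is not shown here:

# Illustrative: truncate a batch before handing it to the embedding backend.
batch = ["def handler(event):\n    return event\n" * 300, "short snippet"]
safe_batch = truncate_to_token_limit(batch, max_tokens=512)
# safe_batch is now capped at ~512 tokens per text (as measured by the stand-in
# tokenizer); the tail of any oversized chunk is lost, but the rest of the batch
# no longer fails with a 500 error.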

Solution 4: Better User Guidance

Update CLI help text and documentation to clarify units and overlap behavior:

--doc-chunk-size: Document chunk size in TOKENS. Final chunks may be larger
                  due to overlap. For 512 token models (e.g., nomic-embed-text-v2):
                  Recommended: 350 tokens (with default 128 overlap = 478 max)

--code-chunk-size: Code chunk size in TOKENS. Final chunks may be larger
                   due to overlap. For 512 token models:
                   Recommended: 400 tokens (with default 50 overlap = 450 max)

--ast-chunk-size: AST chunk size in CHARACTERS (non-whitespace). Final chunks
                  may be larger due to overlap and expansion. For 512 token models:
                  Recommended: 300 characters (with default 64 overlap = 364 chars,
                  roughly 440 tokens at ~1.2 tokens/char, before any expansion)

--doc-chunk-overlap: Overlap in TOKENS (default: 128)
--code-chunk-overlap: Overlap in TOKENS (default: 50)
--ast-chunk-overlap: Overlap in CHARACTERS (default: 64)

Important User Warning:
Add to documentation: "⚠️ Chunk sizes do NOT include overlap. Final chunks = chunk_size + overlap. For embedding models with token limits, always leave headroom for overlap."

Recommended Implementation Strategy

Phase 1: Immediate Fix

  • Implement Solution 3 (token-aware truncation) in embedding_compute.py
  • Add warning logs when truncation occurs
  • No breaking changes for users

Phase 2: Better Defaults

  • Implement Solution 1 (safe calculation) in chunking_utils.py
  • Auto-detect model token limits and calculate safe defaults
  • Warn users if manual settings may exceed limits

Phase 3: Documentation

  • Update CLI help text with Solution 4 guidance
  • Add troubleshooting guide for token limit errors
  • Document character-to-token ratios for different content types

Additional Considerations

Model-Specific Token Limits

Consider adding a registry of known model token limits:

EMBEDDING_MODEL_LIMITS = {
    "nomic-embed-text": 512,
    "nomic-embed-text-v2": 512,
    "mxbai-embed-large": 512,
    "all-minilm": 512,
    "bge-m3": 8192,
    # ... etc
}
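
A small lookup helper could then fall back to a conservative default when a model is not in the registry (a sketch; the name-normalization rule is an assumption):

def get_model_token_limit(model_name: str, default: int = 512) -> int:
    """Return the known token limit for an embedding model, or a conservative default."""
    base_name = model_name.split(":")[0]  # "mxbai-embed-large:latest" -> "mxbai-embed-large"
    return EMBEDDING_MODEL_LIMITS.get(base_name, default)

# get_model_token_limit("bge-m3:latest")   -> 8192
# get_model_token_limit("unknown-model")   -> 512 (conservative default)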

Batch Error Handling

Current behavior: Entire batch fails if one chunk exceeds limit.

Improved behavior:

  1. Catch batch failures
  2. Retry individual chunks to identify problematic ones
  3. Truncate and retry failed chunks
  4. Continue processing remaining batches

This prevents cascade failures from single oversized chunks.
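
A sketch of that retry path is below. embed_batch stands in for whatever call embedding_compute.py actually makes to Ollama, and the truncation helper is the one from Solution 3:

def embed_with_fallback(texts: list[str], embed_batch, max_tokens: int = 512) -> list:
    """Embed a batch; on failure, retry chunk by chunk, truncating offenders."""
    try:
        return embed_batch(texts)
    except Exception:
        embeddings = []
        for text in texts:
            try:
                embeddings.extend(embed_batch([text]))
            except Exception:
                # This chunk is the likely offender: truncate it and retry once.
                embeddings.extend(embed_batch(truncate_to_token_limit([text], max_tokens)))
        return embeddings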

Immediate Workarounds for Users

Until fixes are implemented, users experiencing batch failures with 512 token limit models should use:

Traditional Chunking (without --use-ast-chunking)

# --doc-chunk-size 350 with --doc-chunk-overlap 100 -> 450 tokens max (overlap reduced for extra headroom)
# --code-chunk-size 400 with the default 50 overlap -> 450 tokens max
leann build --docs <paths> \
  --embedding-mode ollama \
  --embedding-model nomic-embed-text-v2 \
  --doc-chunk-size 350 \
  --doc-chunk-overlap 100 \
  --code-chunk-size 400 \
  --force

AST Chunking (with --use-ast-chunking)

# --doc-chunk-size 350 applies to non-code files
# --ast-chunk-size 300 with --ast-chunk-overlap 50 (reduced from the default 64) -> ~350 chars, roughly 420 tokens
leann build --docs <paths> \
  --embedding-mode ollama \
  --embedding-model nomic-embed-text-v2 \
  --doc-chunk-size 350 \
  --ast-chunk-size 300 \
  --ast-chunk-overlap 50 \
  --use-ast-chunking \
  --force

Conservative Settings (Maximum Safety)

For mission-critical builds, use extra conservative settings:

leann build --docs <paths> \
  --embedding-mode ollama \
  --embedding-model nomic-embed-text-v2 \
  --doc-chunk-size 300 \
  --doc-chunk-overlap 80 \
  --code-chunk-size 350 \
  --code-chunk-overlap 40 \
  --ast-chunk-size 250 \
  --ast-chunk-overlap 40 \
  --use-ast-chunking \
  --force

References

  • AST Chunking Documentation: /packages/astchunk-leann/README.md
  • Embedding Compute Module: /packages/leann-core/src/leann/embedding_compute.py
  • Chunking Utils Module: /packages/leann-core/src/leann/chunking_utils.py
  • llama_index SentenceSplitter: Uses tokens by default for chunk_size
  • Related Issue: Ollama batch embedding failures with code content
