
[clarification] Chunking Token Limit Behavior #153

@ww2283

Description


I was processing a larger codebase and noticed that chunking overlap is added earlier than I thought it should be, which could potentially lead to embedding failures. I had Claude Code prepare the following description, and I'd like to hear the devs' opinion on whether this observation is valid.


Chunking Token Limit Issue

Issue Summary

When using Ollama embeddings (or other embedding models with hard token limits), batch processing can fail even when the chunk size parameters are set below the model's context limit. The failures affect both traditional and AST-aware chunking and are caused by the overlap settings producing final chunks that exceed the embedding model's token limit.

Root Cause

Critical Unit Mismatch

Traditional Chunking (SentenceSplitter):

  • CLI parameters --doc-chunk-size and --code-chunk-size are in TOKENS
  • SentenceSplitter measures chunk_size in tokens by default
  • Example: chunk_size=100 produces ~350 characters (~100 tokens)

AST Chunking:

  • CLI parameter --ast-chunk-size is passed to astchunk library
  • astchunk measures max_chunk_size in CHARACTERS (non-whitespace count)
  • This creates a unit mismatch with traditional chunking parameters

Overlap Behavior

Both chunking methods apply overlap after the base chunk size, not as part of it:

Traditional Chunking:

--doc-chunk-size 480 (tokens) + --doc-chunk-overlap 128 (tokens, default) = 608 tokens max
--code-chunk-size 512 (tokens, default) + --code-chunk-overlap 50 (tokens, default) = 562 tokens max

AST Chunking:
From astchunk documentation:

# NOTE: max_chunk_size apply to the chunks before overlapping or chunk expansion.
# The final chunk size after overlapping or chunk expansion may exceed max_chunk_size.

Example:

  • User sets: --ast-chunk-size 480 (characters)
  • AST chunking produces: Base chunks up to 480 chars
  • With overlap: Final chunks can be 480 + 64 = 544 characters
  • After tokenization: 544 chars × ~1.05 tokens/char ≈ 570 tokens
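
For anyone who wants to verify the traditional-chunking numbers locally, here is a minimal sketch. It assumes llama_index and tiktoken are installed, and cl100k_base is only a rough stand-in for the embedding model's own tokenizer:

import tiktoken
from llama_index.core.node_parser import SentenceSplitter

# Mirror --doc-chunk-size 480 and --doc-chunk-overlap 128
splitter = SentenceSplitter(chunk_size=480, chunk_overlap=128)
encoder = tiktoken.get_encoding("cl100k_base")

text = open("some_document.md").read()  # any file from the build set
for i, chunk in enumerate(splitter.split_text(text)):
    n_tokens = len(encoder.encode(chunk))
    if n_tokens > 512:
        print(f"chunk {i}: {n_tokens} tokens, exceeds a 512 token embedding limit")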

Embedding Model Limits

Many embedding models have hard token limits:

  • nomic-embed-text-v2: 512 tokens
  • mxbai-embed-large: 512 tokens
  • all-minilm: 256-512 tokens

When a chunk exceeds the model's token limit, Ollama returns a 500 error for the entire batch (typically 32 chunks), causing cascade failures.

Observed Symptoms

Computing Ollama embeddings (batched):  14%|██████ | 61/422 [01:15<07:06,  1.18s/it]
ERROR - leann.embedding_compute - Failed to get embeddings for batch: 500 Server Error: Internal Server Error

Ollama server logs show:

time=2025-10-23T12:44:57.762-04:00 level=INFO source=routes.go:694 msg="" ctxLen=510 tokenCount=537
init: embeddings required but some input tokens were not marked as outputs -> overriding
time=2025-10-23T12:44:58.232-04:00 level=INFO source=server.go:1635 msg="llm embedding error: Failed to create new sequence: the input length exceeds the context length"

Note: A single 510-character chunk produces 537 tokens, exceeding the 512 token limit.

Character-to-Token Ratio Analysis

For code content, the character-to-token ratio is poor:

  • Observed ratio: ~1.05 tokens per character
  • Reason: Code contains symbols, camelCase identifiers, and special characters that tokenize poorly

This is significantly worse than natural language text (~0.25 tokens per character).
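
The ratio is easy to sanity-check on real files. A minimal sketch, again using tiktoken's cl100k_base purely as an approximation of the embedding model's tokenizer:

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")
source = open("example.py").read()  # any code file from the indexed repository
ratio = len(encoder.encode(source)) / len(source)
print(f"{ratio:.2f} tokens per character")
# Prose usually comes out around 0.25; code-heavy files land noticeably higher.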

Impact on Users

  1. Unit confusion: CLI parameters use different units (tokens vs characters) between traditional and AST chunking
  2. Unpredictable failures: Setting --doc-chunk-size 480 seems safe for a 512 token limit, but overlap pushes the final chunk to 608 tokens
  3. Batch cascade failures: One oversized chunk causes the 31 other valid chunks in the same batch to fail
  4. Misleading parameter names: Both chunk_size and max_chunk_size suggest hard limits but don't enforce them
  5. Silent accumulation: Overlap increases chunk sizes beyond the specified limits without warning
  6. Affects both modes: The problem occurs with and without --use-ast-chunking

Proposed Solutions

Solution 1: Safe Chunk Size Calculation (Recommended)

Add conservative default calculations that account for overlap in both traditional and AST chunking:

def calculate_safe_chunk_size(
    model_token_limit: int,
    overlap_tokens: int,
    chunking_mode: str = "traditional"
) -> int:
    """
    Calculate safe chunk size accounting for overlap.

    Args:
        model_token_limit: Maximum tokens supported by embedding model
        overlap_tokens: Overlap size (tokens for traditional, needs conversion for AST)
        chunking_mode: "traditional" (tokens) or "ast" (characters)

    Returns:
        Safe chunk size: tokens for traditional, characters for AST
    """
    safety_factor = 0.9  # 10% safety margin
    safe_limit = int(model_token_limit * safety_factor)

    if chunking_mode == "traditional":
        # Traditional chunking uses tokens
        # Max chunk = chunk_size + overlap, so chunk_size = limit - overlap
        return max(1, safe_limit - overlap_tokens)
    else:  # AST chunking
        # AST chunking measures size in characters, so convert the token budget
        # using a conservative estimate of ~1.2 tokens per character for code
        tokens_per_char = 1.2
        safe_chars = int(safe_limit / tokens_per_char)
        overlap_chars = int(overlap_tokens / tokens_per_char)
        return max(1, safe_chars - overlap_chars)

# Examples for a 512 token limit:
# Traditional: int(512 * 0.9) - 128 = 460 - 128 = 332 tokens (with the default 128 token overlap)
# AST: int(460 / 1.2) - int(64 / 1.2) = 383 - 53 = 330 characters (with a 64 token overlap budget)

Solution 2: Post-Chunking Truncation

Add hard truncation after AST chunking in leann/chunking_utils.py:

def create_ast_chunks(...):
    chunks = chunk_builder.chunkify(code_content)

    # Enforce hard character limit
    safe_limit = max_chunk_size
    for i, chunk in enumerate(chunks):
        chunk_text = extract_chunk_text(chunk)
        if len(chunk_text) > safe_limit:
            chunks[i] = chunk_text[:safe_limit]
            logger.warning(
                f"Truncated AST chunk from {len(chunk_text)} to {safe_limit} chars"
            )

Solution 3: Token-Aware Truncation (Most Robust)

Add token-based truncation in leann/embedding_compute.py before API calls:

def truncate_to_token_limit(texts: list[str], max_tokens: int = 512) -> list[str]:
    """
    Truncate texts to token limit using tiktoken or similar.
    Falls back to conservative character truncation if tokenizer unavailable.
    """
    try:
        import tiktoken
        encoder = tiktoken.get_encoding("cl100k_base")
        truncated = []
        for text in texts:
            tokens = encoder.encode(text)
            if len(tokens) > max_tokens:
                truncated_tokens = tokens[:max_tokens]
                truncated_text = encoder.decode(truncated_tokens)
                truncated.append(truncated_text)
            else:
                truncated.append(text)
        return truncated
    except ImportError:
        # Fallback: Conservative character truncation
        # Assume worst case: 1.5 tokens per character for code
        char_limit = int(max_tokens / 1.5)
        return [text[:char_limit] if len(text) > char_limit else text for text in texts]
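
Usage would then be a one-line guard right before the batch request. The snippet below is illustrative only, since the actual batching code in embedding_compute.py is not shown here:

# Illustrative: truncate a batch before handing it to the embedding backend.
batch = ["def handler(event):\n    return event\n" * 300, "short snippet"]
safe_batch = truncate_to_token_limit(batch, max_tokens=512)
# safe_batch is now capped at ~512 tokens per text (as measured by the stand-in
# tokenizer); the tail of any oversized chunk is lost, but the rest of the batch
# no longer fails with a 500 error.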

Solution 4: Better User Guidance

Update CLI help text and documentation to clarify units and overlap behavior:

--doc-chunk-size: Document chunk size in TOKENS. Final chunks may be larger
                  due to overlap. For 512 token models (e.g., nomic-embed-text-v2):
                  Recommended: 350 tokens (with default 128 overlap = 478 max)

--code-chunk-size: Code chunk size in TOKENS. Final chunks may be larger
                   due to overlap. For 512 token models:
                   Recommended: 400 tokens (with default 50 overlap = 450 max)

--ast-chunk-size: AST chunk size in CHARACTERS (non-whitespace). Final chunks
                  may be larger due to overlap and expansion. For 512 token models:
                  Recommended: 300 characters (with default 64 overlap = 364 chars,
                  roughly 440 tokens at ~1.2 tokens/char, before any expansion)

--doc-chunk-overlap: Overlap in TOKENS (default: 128)
--code-chunk-overlap: Overlap in TOKENS (default: 50)
--ast-chunk-overlap: Overlap in CHARACTERS (default: 64)

Important User Warning:
Add to documentation: "⚠️ Chunk sizes do NOT include overlap. Final chunks = chunk_size + overlap. For embedding models with token limits, always leave headroom for overlap."

Recommended Implementation Strategy

Phase 1: Immediate Fix

  • Implement Solution 3 (token-aware truncation) in embedding_compute.py
  • Add warning logs when truncation occurs
  • No breaking changes for users

Phase 2: Better Defaults

  • Implement Solution 1 (safe calculation) in chunking_utils.py
  • Auto-detect model token limits and calculate safe defaults
  • Warn users if manual settings may exceed limits

Phase 3: Documentation

  • Update CLI help text with Solution 4 guidance
  • Add troubleshooting guide for token limit errors
  • Document character-to-token ratios for different content types

Additional Considerations

Model-Specific Token Limits

Consider adding a registry of known model token limits:

EMBEDDING_MODEL_LIMITS = {
    "nomic-embed-text": 512,
    "nomic-embed-text-v2": 512,
    "mxbai-embed-large": 512,
    "all-minilm": 512,
    "bge-m3": 8192,
    # ... etc
}
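
A small lookup helper could then fall back to a conservative default when a model is not in the registry (a sketch; the name-normalization rule is an assumption):

def get_model_token_limit(model_name: str, default: int = 512) -> int:
    """Return the known token limit for an embedding model, or a conservative default."""
    base_name = model_name.split(":")[0]  # "mxbai-embed-large:latest" -> "mxbai-embed-large"
    return EMBEDDING_MODEL_LIMITS.get(base_name, default)

# get_model_token_limit("bge-m3:latest")   -> 8192
# get_model_token_limit("unknown-model")   -> 512 (conservative default)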

Batch Error Handling

Current behavior: Entire batch fails if one chunk exceeds limit.

Improved behavior:

  1. Catch batch failures
  2. Retry individual chunks to identify problematic ones
  3. Truncate and retry failed chunks
  4. Continue processing remaining batches

This prevents cascade failures from single oversized chunks.
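
A sketch of that retry path is below. embed_batch stands in for whatever call embedding_compute.py actually makes to Ollama, and the truncation helper is the one from Solution 3:

def embed_with_fallback(texts: list[str], embed_batch, max_tokens: int = 512) -> list:
    """Embed a batch; on failure, retry chunk by chunk, truncating offenders."""
    try:
        return embed_batch(texts)
    except Exception:
        embeddings = []
        for text in texts:
            try:
                embeddings.extend(embed_batch([text]))
            except Exception:
                # This chunk is the likely offender: truncate it and retry once.
                embeddings.extend(embed_batch(truncate_to_token_limit([text], max_tokens)))
        return embeddings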

Immediate Workarounds for Users

Until fixes are implemented, users experiencing batch failures with 512 token limit models should use:

Traditional Chunking (without --use-ast-chunking)

# --doc-chunk-size 350 with --doc-chunk-overlap 100 -> 450 tokens max (overlap reduced for extra headroom)
# --code-chunk-size 400 with the default 50 overlap -> 450 tokens max
leann build --docs <paths> \
  --embedding-mode ollama \
  --embedding-model nomic-embed-text-v2 \
  --doc-chunk-size 350 \
  --doc-chunk-overlap 100 \
  --code-chunk-size 400 \
  --force

AST Chunking (with --use-ast-chunking)

# --doc-chunk-size 350 applies to non-code files
# --ast-chunk-size 300 with --ast-chunk-overlap 50 (reduced from the default 64) -> ~350 chars, roughly 420 tokens
leann build --docs <paths> \
  --embedding-mode ollama \
  --embedding-model nomic-embed-text-v2 \
  --doc-chunk-size 350 \
  --ast-chunk-size 300 \
  --ast-chunk-overlap 50 \
  --use-ast-chunking \
  --force

Conservative Settings (Maximum Safety)

For mission-critical builds, use extra conservative settings:

leann build --docs <paths> \
  --embedding-mode ollama \
  --embedding-model nomic-embed-text-v2 \
  --doc-chunk-size 300 \
  --doc-chunk-overlap 80 \
  --code-chunk-size 350 \
  --code-chunk-overlap 40 \
  --ast-chunk-size 250 \
  --ast-chunk-overlap 40 \
  --use-ast-chunking \
  --force

References

  • AST Chunking Documentation: /packages/astchunk-leann/README.md
  • Embedding Compute Module: /packages/leann-core/src/leann/embedding_compute.py
  • Chunking Utils Module: /packages/leann-core/src/leann/chunking_utils.py
  • llama_index SentenceSplitter: Uses tokens by default for chunk_size
  • Related Issue: Ollama batch embedding failures with code content
