
@ASuresh0524
Collaborator

Fixes Issue #153: Chunking Token Limit Behavior

This PR addresses the cascade batch failures that occur when using Ollama embeddings with token-limited models (e.g., nomic-embed-text-v2 with a 512-token limit).

Root Cause Analysis

  • Unit mismatch: Traditional chunking uses tokens, AST chunking uses characters
  • Overlap behavior: Overlap is added AFTER chunk size, not included in it
  • No token awareness: No validation against embedding model token limits
  • Cascade failures: One oversized chunk causes the entire batch (32 chunks) to fail, as the arithmetic sketch below illustrates
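
A rough arithmetic sketch of why this bites; the values and the chars-per-token ratio below are illustrative assumptions, not LEANN's actual defaults:

```python
# Illustration only: example values, not LEANN's real defaults.
TOKEN_LIMIT = 512                  # e.g. nomic-embed-text-v2

# Traditional chunking: sizes are in TOKENS, and overlap is added ON TOP.
chunk_size, chunk_overlap = 512, 64
worst_case_tokens = chunk_size + chunk_overlap                  # 576 > 512

# AST chunking: sizes are in CHARACTERS; the token count depends on content.
ast_chunk_size, ast_chunk_overlap = 768, 96
chars_per_token = 1.5              # dense code / CJK can be this low (assumption)
worst_case_ast_tokens = (ast_chunk_size + ast_chunk_overlap) / chars_per_token  # 576

# Nothing validated either result against TOKEN_LIMIT, so a single oversized
# chunk made Ollama return a 500 for the entire 32-chunk batch.
```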

Solutions Implemented

1. Token-Aware Truncation (Critical Fix)

  • Added EMBEDDING_MODEL_LIMITS registry for known models
  • Implemented truncate_to_token_limit() with tiktoken support
  • Applied truncation before Ollama API calls to prevent 500 errors (sketched below)
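
A minimal sketch of the idea, not the exact code in embedding_compute.py; the registry contents, the use of tiktoken's cl100k_base encoding as an approximation of the model's own tokenizer, and the 4-chars-per-token fallback are all assumptions here:

```python
# Known limits for common embedding models (illustrative subset).
EMBEDDING_MODEL_LIMITS = {
    "nomic-embed-text-v2": 512,
    "nomic-embed-text": 512,
}
DEFAULT_TOKEN_LIMIT = 512


def truncate_to_token_limit(text: str, model_name: str) -> str:
    """Truncate text so it fits within the embedding model's token limit."""
    limit = EMBEDDING_MODEL_LIMITS.get(model_name, DEFAULT_TOKEN_LIMIT)
    try:
        import tiktoken  # optional dependency

        enc = tiktoken.get_encoding("cl100k_base")  # approximation of the model tokenizer
        tokens = enc.encode(text)
        return text if len(tokens) <= limit else enc.decode(tokens[:limit])
    except ImportError:
        # Graceful fallback when tiktoken is unavailable: assume ~4 chars per token.
        return text[: limit * 4]
```

The truncation runs on each text right before the Ollama embeddings request, so an over-long chunk degrades to a truncated embedding instead of a 500 error for the whole batch.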

2. CLI Help Text Clarification

  • Updated all help text to clearly specify TOKENS vs CHARACTERS
  • Added explicit warnings about overlap behavior
  • Provided safe recommendations for 512 token models
  • Changed AST defaults: 768→300 chars, 96→64 overlap (safer); see the illustrative help text below
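
Roughly what the clarified help text looks like in spirit; the flag names and exact wording are illustrative rather than copied from cli.py (the 300/64 AST defaults come from this PR, the 256-token default is just an example):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--chunk-size", type=int, default=256,
    help="Chunk size in TOKENS for traditional chunking. Overlap is added ON TOP "
         "of this, so keep chunk-size + chunk-overlap <= your embedding model's "
         "limit (e.g. 512 for nomic-embed-text-v2).",
)
parser.add_argument(
    "--ast-chunk-size", type=int, default=300,
    help="Chunk size in CHARACTERS for AST chunking (default 300, was 768).",
)
parser.add_argument(
    "--ast-chunk-overlap", type=int, default=64,
    help="Overlap in CHARACTERS for AST chunking, added on top of the chunk size "
         "(default 64, was 96).",
)
```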

3. Post-Chunking Validation

  • Added validate_chunk_token_limits() with real token counting
  • Implemented a safety net that validates that all chunks stay ≤ 512 tokens
  • Added warnings when AST parameters might exceed token limits (see the sketch below)
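
A minimal sketch of the safety net, reusing the truncate_to_token_limit() and registry sketches above; the real validate_chunk_token_limits() in chunking_utils.py may differ in signature and behavior:

```python
import logging

logger = logging.getLogger(__name__)


def validate_chunk_token_limits(chunks: list[str], model_name: str) -> list[str]:
    """Safety net: make sure every chunk fits the embedding model's token limit."""
    limit = EMBEDDING_MODEL_LIMITS.get(model_name, DEFAULT_TOKEN_LIMIT)
    safe_chunks = []
    for i, chunk in enumerate(chunks):
        truncated = truncate_to_token_limit(chunk, model_name)
        if truncated != chunk:
            logger.warning(
                "Chunk %d exceeds the ~%d-token limit of %s and was truncated; "
                "consider smaller chunk-size/overlap settings.",
                i, limit, model_name,
            )
        safe_chunks.append(truncated)
    return safe_chunks
```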

4. Improved Batch Error Handling

  • Enhanced error detection for token limit violations
  • Added individual text recovery on batch failures
  • Prevented cascade failures from single oversized chunks
  • Added exponential backoff for retries (see the recovery sketch below)
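
Roughly how the recovery path works; this is a sketch in which embed_batch is a hypothetical callable standing in for the actual Ollama request helper in embedding_compute.py, and truncate_to_token_limit() is the sketch from above:

```python
import time


def embed_with_recovery(texts, embed_batch, model_name, max_retries=3):
    """Embed a batch; retry with backoff, then fall back to per-text embedding."""
    for attempt in range(max_retries):
        try:
            return embed_batch(texts)
        except Exception as exc:
            if "token" in str(exc).lower() or "500" in str(exc):
                break                 # likely a token-limit violation; retrying won't help
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s

    # Individual recovery: a single oversized chunk no longer fails the whole batch.
    embeddings = []
    for text in texts:
        try:
            embeddings.extend(embed_batch([text]))
        except Exception:
            embeddings.extend(embed_batch([truncate_to_token_limit(text, model_name)]))
    return embeddings
```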

Testing

  • No linting errors
  • Backward compatibility maintained
  • Safe defaults for 512 token models
  • Graceful fallbacks when tiktoken unavailable

Files Changed

  • packages/leann-core/src/leann/embedding_compute.py - Token truncation & batch handling
  • packages/leann-core/src/leann/chunking_utils.py - Validation & safe calculations
  • packages/leann-core/src/leann/cli.py - Help text & safe defaults
  • apps/base_rag_example.py - Consistent parameter documentation

Resolves

  • Eliminates Ollama 500 Server Error batch failures
  • Prevents token limit violations for common embedding models
  • Provides clear user guidance for safe parameter settings
  • Maintains backward compatibility while adding safety features

Closes #153

ww2283 mentioned this pull request on Oct 28, 2025
@yichuan-w
Owner

@ASuresh0524 thanks for the great PR. I will review it later; can you resolve the conflict in packages/leann-core/src/leann/embedding_compute.py first? Let me know when it is ready.

@ASuresh0524
Collaborator Author

Thanks @yichuan-w! Please check out #156 and see if the fixes are okay; I would love your input as well, @ww2283.

ASuresh0524 force-pushed the fix/chunking-token-limit-behavior branch from d6ed618 to 64b92a0 on November 1, 2025 at 00:15
@yichuan-w
Owner

What's the difference between #154 and #156?

@yichuan-w
Owner

Again, please don't open two PRs for what is exactly the same change (correct me if I am wrong). Which one should I merge?

@ASuresh0524
Collaborator Author

I tried to combine all the changes we made before into #156 to make it cleaner, but if this new #154 is fine then we can close the #156 PR.

@ASuresh0524
Collaborator Author

We can merge #154 and close #156. Sorry, I tried to synthesize everything into one PR; bad practice on my part.

@ww2283
Contributor

ww2283 commented Nov 1, 2025

Sorry for being late; I was investigating the fixes in #154. So far my understanding of #154 is:

  1. EMBEDDING_MODEL_LIMITS registry
  2. truncate_to_token_limit() function with tiktoken
  3. Token truncation before Ollama API calls
  4. Prevents 500 errors from oversized chunks

This potentially has some issues:

  • EMBEDDING_MODEL_LIMITS is hard to keep up to date, as new embedding models appear so quickly.
  • Users may also configure a different embedding sequence length on their hosting end.
  • Most importantly, I tested Ollama and LM Studio, which both depend on llama.cpp, and both silently truncate. This means that with or without truncate_to_token_limit(), you get the same result: the text is not fully encoded.

So I'm investigating a potential solution. The major difference is that document and code chunking (non-AST) cut by tokens, while AST chunking cuts by characters. I think there is a way to solve the AST character-to-token conversion issue, and I'm working on it.

So my personal opinion is that EMBEDDING_MODEL_LIMITS and truncate_to_token_limit() do not solve the underlying problem. Let me know if my interpretation is correct.

@yichuan-w
Owner

I have no comments on this PR, and the AST feature is a totally community-driven feature. I think we should keep it simple, e.g. just use character counts as the signal for chunking.

@yichuan-w
Owner

But I think EMBEDDING_MODEL_LIMITS, at least, is not sustainable.

@yichuan-w
Owner

Whatever is production-ready is good.

@ww2283
Contributor

ww2283 commented Nov 1, 2025

Understood, I have no intention of holding up this PR. If @yichuan-w and @ASuresh0524 feel this is good to go, then I have no objection to merging, and I will rebase my part on my end.

@ASuresh0524
Collaborator Author

Should we merge this one then, or are there any updates to make?

ASuresh0524 merged commit 366984e into main on Nov 3, 2025; 27 checks passed
ww2283 added a commit to ww2283/LEANN that referenced this pull request Nov 3, 2025
Improves upon upstream PR yichuan-w#154 with the following enhancements:

1. **Hybrid Token Limit Discovery**
   - Dynamic: Query Ollama /api/show for context limits
   - Fallback: Registry for LM Studio/OpenAI
   - Zero maintenance for Ollama users
   - Respects custom num_ctx settings

2. **AST Metadata Preservation**
   - create_ast_chunks() returns dict format with metadata
   - Preserves file_path, file_name, timestamps
   - Includes astchunk metadata (line numbers, node counts)
   - Fixes content extraction bug (checks "content" key)
   - Enables --show-metadata flag

3. **Better Token Limits**
   - nomic-embed-text: 2048 tokens (vs 512)
   - nomic-embed-text-v1.5: 2048 tokens
   - Added OpenAI models: 8192 tokens

4. **Comprehensive Tests**
   - 11 tests for token truncation
   - 545 new lines in test_astchunk_integration.py
   - All metadata preservation tests passing
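
For reference, a rough sketch of the hybrid discovery idea described in that commit. Ollama's /api/show endpoint is real, but the model_info field names vary by model architecture, so this simply scans for a key ending in context_length; EMBEDDING_MODEL_LIMITS and DEFAULT_TOKEN_LIMIT are the illustrative fallbacks from the earlier sketch:

```python
import requests


def get_token_limit(model_name: str, ollama_url: str = "http://localhost:11434") -> int:
    """Prefer Ollama's reported context length; fall back to a static registry."""
    try:
        # Older Ollama versions expect "name", newer ones "model"; send both.
        resp = requests.post(
            f"{ollama_url}/api/show",
            json={"model": model_name, "name": model_name},
            timeout=5,
        )
        resp.raise_for_status()
        model_info = resp.json().get("model_info", {})
        for key, value in model_info.items():
            if key.endswith("context_length"):  # e.g. "nomic-bert.context_length"
                return int(value)
    except Exception:
        pass  # not an Ollama server (LM Studio, OpenAI, ...) or the request failed
    return EMBEDDING_MODEL_LIMITS.get(model_name, DEFAULT_TOKEN_LIMIT)
```
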
ww2283 added a commit to ww2283/LEANN that referenced this pull request Nov 3, 2025
…dling

- Remove duplicate truncate_to_token_limit and get_model_token_limit functions
- Restore version handling logic (model:latest -> model) from PR yichuan-w#154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing