
@ASuresh0524
Collaborator

Fixes Issue #153: Chunking Token Limit Behavior

This PR addresses the cascade batch failures that occur when using Ollama embeddings with token-limited models (e.g., nomic-embed-text-v2 with a 512-token limit).

Root Cause Analysis

  • Unit mismatch: Traditional chunking uses tokens, AST chunking uses characters
  • Overlap behavior: Overlap is added AFTER chunk size, not included in it
  • No token awareness: No validation against embedding model token limits
  • Cascade failures: One oversized chunk causes the entire batch (32 chunks) to fail, as the arithmetic sketch below illustrates
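
A rough arithmetic sketch of why this bites; the values and the chars-per-token ratio below are illustrative assumptions, not LEANN's actual defaults:

```python
# Illustration only: example values, not LEANN's real defaults.
TOKEN_LIMIT = 512                  # e.g. nomic-embed-text-v2

# Traditional chunking: sizes are in TOKENS, and overlap is added ON TOP.
chunk_size, chunk_overlap = 512, 64
worst_case_tokens = chunk_size + chunk_overlap                  # 576 > 512

# AST chunking: sizes are in CHARACTERS; the token count depends on content.
ast_chunk_size, ast_chunk_overlap = 768, 96
chars_per_token = 1.5              # dense code / CJK can be this low (assumption)
worst_case_ast_tokens = (ast_chunk_size + ast_chunk_overlap) / chars_per_token  # 576

# Nothing validated either result against TOKEN_LIMIT, so a single oversized
# chunk made Ollama return a 500 for the entire 32-chunk batch.
```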

Solutions Implemented

1. Token-Aware Truncation (Critical Fix)

  • Added EMBEDDING_MODEL_LIMITS registry for known models
  • Implemented truncate_to_token_limit() with tiktoken support
  • Applied truncation before Ollama API calls to prevent 500 errors (sketched below)
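
A minimal sketch of the idea, not the exact code in embedding_compute.py; the registry contents, the use of tiktoken's cl100k_base encoding as an approximation of the model's own tokenizer, and the 4-chars-per-token fallback are all assumptions here:

```python
# Known limits for common embedding models (illustrative subset).
EMBEDDING_MODEL_LIMITS = {
    "nomic-embed-text-v2": 512,
    "nomic-embed-text": 512,
}
DEFAULT_TOKEN_LIMIT = 512


def truncate_to_token_limit(text: str, model_name: str) -> str:
    """Truncate text so it fits within the embedding model's token limit."""
    limit = EMBEDDING_MODEL_LIMITS.get(model_name, DEFAULT_TOKEN_LIMIT)
    try:
        import tiktoken  # optional dependency

        enc = tiktoken.get_encoding("cl100k_base")  # approximation of the model tokenizer
        tokens = enc.encode(text)
        return text if len(tokens) <= limit else enc.decode(tokens[:limit])
    except ImportError:
        # Graceful fallback when tiktoken is unavailable: assume ~4 chars per token.
        return text[: limit * 4]
```

The truncation runs on each text right before the Ollama embeddings request, so an over-long chunk degrades to a truncated embedding instead of a 500 error for the whole batch.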

2. CLI Help Text Clarification

  • Updated all help text to clearly specify TOKENS vs CHARACTERS
  • Added explicit warnings about overlap behavior
  • Provided safe recommendations for 512 token models
  • Changed AST defaults: 768→300 chars, 96→64 overlap (safer); see the illustrative help text below
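
Roughly what the clarified help text looks like in spirit; the flag names and exact wording are illustrative rather than copied from cli.py (the 300/64 AST defaults come from this PR, the 256-token default is just an example):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--chunk-size", type=int, default=256,
    help="Chunk size in TOKENS for traditional chunking. Overlap is added ON TOP "
         "of this, so keep chunk-size + chunk-overlap <= your embedding model's "
         "limit (e.g. 512 for nomic-embed-text-v2).",
)
parser.add_argument(
    "--ast-chunk-size", type=int, default=300,
    help="Chunk size in CHARACTERS for AST chunking (default 300, was 768).",
)
parser.add_argument(
    "--ast-chunk-overlap", type=int, default=64,
    help="Overlap in CHARACTERS for AST chunking, added on top of the chunk size "
         "(default 64, was 96).",
)
```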

3. Post-Chunking Validation

  • Added validate_chunk_token_limits() with real token counting
  • Implemented a safety net that validates that all chunks stay ≤ 512 tokens
  • Added warnings when AST parameters might exceed token limits (see the sketch below)
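
A minimal sketch of the safety net, reusing the truncate_to_token_limit() and registry sketches above; the real validate_chunk_token_limits() in chunking_utils.py may differ in signature and behavior:

```python
import logging

logger = logging.getLogger(__name__)


def validate_chunk_token_limits(chunks: list[str], model_name: str) -> list[str]:
    """Safety net: make sure every chunk fits the embedding model's token limit."""
    limit = EMBEDDING_MODEL_LIMITS.get(model_name, DEFAULT_TOKEN_LIMIT)
    safe_chunks = []
    for i, chunk in enumerate(chunks):
        truncated = truncate_to_token_limit(chunk, model_name)
        if truncated != chunk:
            logger.warning(
                "Chunk %d exceeds the ~%d-token limit of %s and was truncated; "
                "consider smaller chunk-size/overlap settings.",
                i, limit, model_name,
            )
        safe_chunks.append(truncated)
    return safe_chunks
```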

4. Improved Batch Error Handling

  • Enhanced error detection for token limit violations
  • Added individual text recovery on batch failures
  • Prevented cascade failures from single oversized chunks
  • Added exponential backoff for retries (see the recovery sketch below)
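
Roughly how the recovery path works; this is a sketch in which embed_batch is a hypothetical callable standing in for the actual Ollama request helper in embedding_compute.py, and truncate_to_token_limit() is the sketch from above:

```python
import time


def embed_with_recovery(texts, embed_batch, model_name, max_retries=3):
    """Embed a batch; retry with backoff, then fall back to per-text embedding."""
    for attempt in range(max_retries):
        try:
            return embed_batch(texts)
        except Exception as exc:
            if "token" in str(exc).lower() or "500" in str(exc):
                break                 # likely a token-limit violation; retrying won't help
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s

    # Individual recovery: a single oversized chunk no longer fails the whole batch.
    embeddings = []
    for text in texts:
        try:
            embeddings.extend(embed_batch([text]))
        except Exception:
            embeddings.extend(embed_batch([truncate_to_token_limit(text, model_name)]))
    return embeddings
```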

Testing

  • No linting errors
  • Backward compatibility maintained
  • Safe defaults for 512 token models
  • Graceful fallbacks when tiktoken unavailable

Files Changed

  • packages/leann-core/src/leann/embedding_compute.py - Token truncation & batch handling
  • packages/leann-core/src/leann/chunking_utils.py - Validation & safe calculations
  • packages/leann-core/src/leann/cli.py - Help text & safe defaults
  • apps/base_rag_example.py - Consistent parameter documentation

Resolves

  • Eliminates Ollama 500 Server Error batch failures
  • Prevents token limit violations for common embedding models
  • Provides clear user guidance for safe parameter settings
  • Maintains backward compatibility while adding safety features

Closes #153

ww2283 mentioned this pull request on Oct 28, 2025
@yichuan-w
Owner

@ASuresh0524 thanks for the great PR. I will review it later; can you resolve the conflict in packages/leann-core/src/leann/embedding_compute.py first? Let me know when it is ready.

@ASuresh0524
Collaborator Author

Thanks @yichuan-w! Please check out #156 and see if the fixes are okay; I would love your input as well, @ww2283.

ASuresh0524 force-pushed the fix/chunking-token-limit-behavior branch from d6ed618 to 64b92a0 on November 1, 2025 at 00:15
@yichuan-w
Owner

What's the difference between #154 and #156?

@yichuan-w
Owner

Again, please don't open two PRs for what is exactly the same change (correct me if I am wrong). Which one should I merge?

@ASuresh0524
Collaborator Author

I tried to combine all the changes we made before into #156 to make it cleaner, but if this new #154 is fine then we can close the #156 PR.

@ASuresh0524
Collaborator Author

We can merge #154 and close #156. Sorry, I tried to synthesize everything into one PR; bad practice on my part.

@ww2283
Contributor

ww2283 commented Nov 1, 2025

Sorry for being late; I was investigating the fixes in #154. So far my understanding of #154 is:

  1. EMBEDDING_MODEL_LIMITS registry
  2. truncate_to_token_limit() function with tiktoken
  3. Token truncation before Ollama API calls
  4. Prevents 500 errors from oversized chunks

This potentially has some issues:

  • EMBEDDING_MODEL_LIMITS is hard to keep up to date, as new embedding models appear so quickly.
  • Users may also configure a different embedding sequence length on their hosting end.
  • Most importantly, I tested Ollama and LM Studio, which both depend on llama.cpp, and both silently truncate. This means that with or without truncate_to_token_limit(), you get the same result: the text is not fully encoded.

So I'm investigating a potential solution. The major difference is that document and code chunking (non-AST) cut by tokens, while AST chunking cuts by characters. I think there is a way to solve the AST character-to-token conversion issue, and I'm working on it.

So my personal opinion is that EMBEDDING_MODEL_LIMITS and truncate_to_token_limit() do not solve the underlying problem. Let me know if my interpretation is correct.

@yichuan-w
Owner

I have no comments on this PR, and the AST feature is a totally community-driven feature. I think we should keep it simple, e.g. just use character counts as the signal for chunking.

@yichuan-w
Owner

But I think EMBEDDING_MODEL_LIMITS, at least, is not sustainable.

@yichuan-w
Owner

Whatever is production-ready is good.

@ww2283
Contributor

ww2283 commented Nov 1, 2025

Understood, I have no intention of holding up this PR. If @yichuan-w and @ASuresh0524 feel this is good to go, then I have no objection to merging, and I will rebase my part on my end.

@ASuresh0524
Collaborator Author

Should we merge this one then, or are there any updates to make?

ASuresh0524 merged commit 366984e into main on Nov 3, 2025; 27 checks passed
ww2283 added a commit to ww2283/LEANN that referenced this pull request Nov 3, 2025
Improves upon upstream PR yichuan-w#154 with the following enhancements:

1. **Hybrid Token Limit Discovery**
   - Dynamic: Query Ollama /api/show for context limits
   - Fallback: Registry for LM Studio/OpenAI
   - Zero maintenance for Ollama users
   - Respects custom num_ctx settings

2. **AST Metadata Preservation**
   - create_ast_chunks() returns dict format with metadata
   - Preserves file_path, file_name, timestamps
   - Includes astchunk metadata (line numbers, node counts)
   - Fixes content extraction bug (checks "content" key)
   - Enables --show-metadata flag

3. **Better Token Limits**
   - nomic-embed-text: 2048 tokens (vs 512)
   - nomic-embed-text-v1.5: 2048 tokens
   - Added OpenAI models: 8192 tokens

4. **Comprehensive Tests**
   - 11 tests for token truncation
   - 545 new lines in test_astchunk_integration.py
   - All metadata preservation tests passing
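
For reference, a rough sketch of the hybrid discovery idea described in that commit. Ollama's /api/show endpoint is real, but the model_info field names vary by model architecture, so this simply scans for a key ending in context_length; EMBEDDING_MODEL_LIMITS and DEFAULT_TOKEN_LIMIT are the illustrative fallbacks from the earlier sketch:

```python
import requests


def get_token_limit(model_name: str, ollama_url: str = "http://localhost:11434") -> int:
    """Prefer Ollama's reported context length; fall back to a static registry."""
    try:
        # Older Ollama versions expect "name", newer ones "model"; send both.
        resp = requests.post(
            f"{ollama_url}/api/show",
            json={"model": model_name, "name": model_name},
            timeout=5,
        )
        resp.raise_for_status()
        model_info = resp.json().get("model_info", {})
        for key, value in model_info.items():
            if key.endswith("context_length"):  # e.g. "nomic-bert.context_length"
                return int(value)
    except Exception:
        pass  # not an Ollama server (LM Studio, OpenAI, ...) or the request failed
    return EMBEDDING_MODEL_LIMITS.get(model_name, DEFAULT_TOKEN_LIMIT)
```
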
ww2283 added a commit to ww2283/LEANN that referenced this pull request Nov 3, 2025
…dling

- Remove duplicate truncate_to_token_limit and get_model_token_limit functions
- Restore version handling logic (model:latest -> model) from PR yichuan-w#154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing