Fix/chunking token limit behavior #154
Conversation
@ASuresh0524 thanks for the great PR, I will review it later. Can you resolve the conflicts first?
(force-pushed from 665c6d3 to d6ed618)
Thanks @yichuan-w! Please check out #156 and see if the fixes are okay; would love your input as well @ww2283
(force-pushed from d6ed618 to 64b92a0)
What's the difference between 154 and 156?
Again, please don't open two PRs if they are exactly the same (correct me if I am wrong). Which one should I merge?
Tried to combine all the changes we made before to make it cleaner in 156, but if this new 154 one is fine then we can delete the 156 PR.
We can merge 154 and delete 156. Sorry, I tried to synthesize it into one; bad practice on my part.
Sorry for being late; I was investigating the fixes in 154. So far my understanding for 154 is:
This has some potential issues. So my personal opinion is that EMBEDDING_MODEL_LIMITS and
I have no comments on this PR, and the AST feature is a totally community-driven feature. I think we should keep it simple, like just using the character count as the signal to truncate.
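For reference, the simpler character-based approach suggested here could look roughly like the sketch below; the ~4-characters-per-token ratio and the helper name are illustrative assumptions, not code from either PR.

```python
# Hypothetical character-count truncation, assuming roughly 4 characters per
# token for English text; this avoids any tokenizer dependency entirely.
def clamp_by_characters(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    max_chars = max_tokens * chars_per_token
    return text if len(text) <= max_chars else text[:max_chars]
```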
Regarding EMBEDDING_MODEL_LIMITS: I think this is not sustainable, at least.
Whatever is production-ready is good.
Understood, I have no intention of holding up this PR. If @yichuan-w and @ASuresh0524 feel this is good to go, then I have no objection to merging, and I will rebase my part on my end.
Should we merge this one then, or are there any updates to make?
Improves upon upstream PR yichuan-w#154 with the following enhancements:

1. **Hybrid Token Limit Discovery** (see the sketch after this list)
   - Dynamic: query Ollama /api/show for context limits
   - Fallback: registry for LM Studio/OpenAI
   - Zero maintenance for Ollama users
   - Respects custom num_ctx settings
2. **AST Metadata Preservation** (chunk shape sketched after this list)
   - create_ast_chunks() returns dict format with metadata
   - Preserves file_path, file_name, timestamps
   - Includes astchunk metadata (line numbers, node counts)
   - Fixes content extraction bug (checks "content" key)
   - Enables --show-metadata flag
3. **Better Token Limits**
   - nomic-embed-text: 2048 tokens (vs 512)
   - nomic-embed-text-v1.5: 2048 tokens
   - Added OpenAI models: 8192 tokens
4. **Comprehensive Tests**
   - 11 tests for token truncation
   - 545 new lines in test_astchunk_integration.py
   - All metadata preservation tests passing
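To make point 1 concrete, a minimal sketch of the hybrid lookup might look like the following. The /api/show request shape, the model_info / *.context_length fields, the registry contents, and the helper name are assumptions based on the summary above, not the exact code in #156.

```python
import requests

# Static fallback registry for backends that cannot report their own limits
# (e.g. LM Studio / OpenAI-compatible endpoints); values are illustrative.
EMBEDDING_MODEL_LIMITS = {
    "nomic-embed-text": 2048,
    "nomic-embed-text-v1.5": 2048,
    "text-embedding-3-small": 8192,
}
DEFAULT_TOKEN_LIMIT = 512


def get_model_token_limit(model_name: str, host: str = "http://localhost:11434") -> int:
    """Prefer the context length reported by Ollama; fall back to the registry."""
    try:
        resp = requests.post(f"{host}/api/show", json={"model": model_name}, timeout=5)
        resp.raise_for_status()
        model_info = resp.json().get("model_info", {})
        # Ollama exposes the context window as an architecture-prefixed key,
        # e.g. "bert.context_length" (an assumption about the response shape).
        for key, value in model_info.items():
            if key.endswith(".context_length"):
                return int(value)
    except requests.RequestException:
        pass  # Not an Ollama backend, or the server is unreachable.
    return EMBEDDING_MODEL_LIMITS.get(model_name, DEFAULT_TOKEN_LIMIT)
```

And for point 2, the dict-format chunks could be shaped roughly like this; the key names are taken from the bullet list above and the values are hypothetical, not the exact schema:

```python
chunk = {
    "content": "def build_index(docs): ...",  # chunk text lives under the "content" key
    "metadata": {
        "file_path": "src/example.py",        # hypothetical values
        "file_name": "example.py",
        "start_line": 10,
        "end_line": 42,
    },
}
```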
…dling
- Remove duplicate truncate_to_token_limit and get_model_token_limit functions
- Restore version handling logic (model:latest -> model) from PR yichuan-w#154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing
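A hedged sketch of the version handling and partial-matching fallback this commit refers to; the function name and lookup order are illustrative, not the merged code.

```python
def resolve_token_limit(model_name: str, limits: dict[str, int], default: int = 512) -> int:
    # Exact match first, e.g. "nomic-embed-text-v1.5".
    if model_name in limits:
        return limits[model_name]
    # Version handling: "nomic-embed-text:latest" -> "nomic-embed-text".
    base = model_name.split(":", 1)[0]
    if base in limits:
        return limits[base]
    # Partial-matching fallback for model name variations.
    for known, limit in limits.items():
        if known in base or base in known:
            return limit
    return default
```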
Fixes Issue #153: Chunking Token Limit Behavior
This PR addresses the cascading batch failures when using Ollama embeddings with token-limited models (e.g., nomic-embed-text-v2 with a 512-token limit).

Root Cause Analysis
Solutions Implemented
1. Token-Aware Truncation (Critical Fix)
- EMBEDDING_MODEL_LIMITS registry for known models
- truncate_to_token_limit() with tiktoken support
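A minimal sketch of what truncate_to_token_limit() could look like with tiktoken; the encoding name is an assumption, and tiktoken only approximates the embedding model's own tokenizer.

```python
import tiktoken

def truncate_to_token_limit(text: str, max_tokens: int, encoding: str = "cl100k_base") -> str:
    """Clamp text to max_tokens, using tiktoken as an approximate tokenizer."""
    enc = tiktoken.get_encoding(encoding)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])
```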
2. CLI Help Text Clarification
3. Post-Chunking Validation
- validate_chunk_token_limits() with real token counting (see the sketch after this list)
4. Improved Batch Error Handling
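A hedged sketch of the post-chunking validation in item 3: re-count tokens per chunk after chunking and clamp anything that still exceeds the limit. The function name follows the description above; the body is illustrative only.

```python
import tiktoken

_ENC = tiktoken.get_encoding("cl100k_base")  # approximation of the embedding tokenizer

def validate_chunk_token_limits(chunks: list[str], max_tokens: int) -> list[str]:
    """Verify real token counts after chunking; truncate any chunk that is still too long."""
    validated = []
    for i, chunk in enumerate(chunks):
        tokens = _ENC.encode(chunk)
        if len(tokens) > max_tokens:
            print(f"chunk {i}: {len(tokens)} tokens exceeds limit {max_tokens}; truncating")
            chunk = _ENC.decode(tokens[:max_tokens])
        validated.append(chunk)
    return validated
```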
Testing
Files Changed
- packages/leann-core/src/leann/embedding_compute.py - Token truncation & batch handling
- packages/leann-core/src/leann/chunking_utils.py - Validation & safe calculations
- packages/leann-core/src/leann/cli.py - Help text & safe defaults
- apps/base_rag_example.py - Consistent parameter documentation

Resolves
Closes #153