metadata reveal for ast-chunking; smart detection of seq length in ollama; auto adjust chunk length for ast to prevent silent truncation #157
base: main
Conversation
Improves upon upstream PR yichuan-w#154 with the following enhancements:

1. **Hybrid Token Limit Discovery**
   - Dynamic: query Ollama `/api/show` for context limits
   - Fallback: static registry for LM Studio/OpenAI
   - Zero maintenance for Ollama users
   - Respects custom `num_ctx` settings
2. **AST Metadata Preservation**
   - `create_ast_chunks()` returns dict format with metadata
   - Preserves `file_path`, `file_name`, timestamps
   - Includes astchunk metadata (line numbers, node counts)
   - Fixes content extraction bug (checks the `"content"` key)
   - Enables `--show-metadata` flag
3. **Better Token Limits**
   - `nomic-embed-text`: 2048 tokens (vs 512)
   - `nomic-embed-text-v1.5`: 2048 tokens
   - Added OpenAI models: 8192 tokens
4. **Comprehensive Tests**
   - 11 tests for token truncation
   - 545 new lines in `test_astchunk_integration.py`
   - All metadata preservation tests passing
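The hybrid discovery in point 1 can be sketched roughly as below. Ollama's `/api/show` endpoint is real, but the helper names, the registry contents, and the `model_info` key handling here are illustrative assumptions, not the PR's actual code.

```python
import json
import urllib.request

# Illustrative fallback registry (values taken from the PR description);
# the real registry in the PR may differ.
FALLBACK_TOKEN_LIMITS = {
    "nomic-embed-text": 2048,
    "nomic-embed-text-v1.5": 2048,
    "text-embedding-3-small": 8192,
}

def query_ollama_context_limit(model, base_url="http://localhost:11434"):
    """Ask Ollama's /api/show for the model's context length; None on failure."""
    try:
        req = urllib.request.Request(
            f"{base_url}/api/show",
            data=json.dumps({"model": model}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            info = json.load(resp).get("model_info", {})
        # Keys are architecture-prefixed, e.g. "nomic-bert.context_length"
        for key, value in info.items():
            if key.endswith(".context_length"):
                return int(value)
    except (OSError, ValueError):
        pass  # server unreachable or unexpected payload -> fall back to registry
    return None

def get_token_limit(model):
    """Dynamic discovery first, static registry fallback, conservative default."""
    limit = query_ollama_context_limit(model)
    if limit is not None:
        return limit
    return FALLBACK_TOKEN_LIMITS.get(model, 512)
```

This is the "zero maintenance" property: when the Ollama server answers, no registry entry is needed, and a custom `num_ctx` baked into the model is respected automatically.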
- Merged upstream's model list with our corrected token limits
- Kept our corrected `nomic-embed-text`: 2048 (not 512)
- Removed post-chunking validation (redundant with embedding-time truncation)
- All tests passing except 2 pre-existing integration test failures
…dling

- Remove duplicate `truncate_to_token_limit` and `get_model_token_limit` functions
- Restore version handling logic (`model:latest` -> `model`) from PR yichuan-w#154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing
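The restored version handling and partial-match fallback might look like this minimal sketch; the registry contents, default value, and function name are assumptions for illustration.

```python
# Illustrative registry; the PR's actual table is larger.
MODEL_TOKEN_LIMITS = {
    "nomic-embed-text": 2048,
    "mxbai-embed-large": 512,
}

def lookup_token_limit(model_name, default=512):
    """Exact match, then version-stripped match, then partial-match fallback."""
    if model_name in MODEL_TOKEN_LIMITS:
        return MODEL_TOKEN_LIMITS[model_name]
    # Version handling: "model:latest" or "model:v1.5" -> "model"
    base = model_name.split(":", 1)[0]
    if base in MODEL_TOKEN_LIMITS:
        return MODEL_TOKEN_LIMITS[base]
    # Partial-match fallback for name variations (e.g. registry-prefixed names)
    for known, limit in MODEL_TOKEN_LIMITS.items():
        if known in model_name:
            return limit
    return default
```

For example, `lookup_token_limit("nomic-embed-text:latest")` resolves through the version-stripping branch, while a prefixed variant like `"hf.co/nomic-embed-text-v1.5"` resolves through partial matching.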
@ASuresh0524 and @yichuan-w: take a look at this PR and let me know. Btw, where's the Slack yichuan mentioned to me several days ago? I was not smart enough to find it to join...
- Add module-level flag to track if warning shown
- Prevents spam when processing multiple files
- Add clarifying note that auto-truncation happens at embedding time
- Addresses issue where warning appeared for every code file
hey @ww2283 will look at this by tomorrow night, sorry for the delay
- Track and report truncation statistics (count, tokens removed, max length)
- Show first 3 individual truncations with exact token counts
- Provide comprehensive summary when truncation occurs
- Use WARNING level for data loss visibility
- Silent (DEBUG level only) when no truncation needed

Replaces misleading "truncated where necessary" message that appeared even when nothing was truncated.
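The statistics-reporting behavior described above can be sketched as follows; the function name and exact log wording are illustrative assumptions, not the PR's code.

```python
import logging

logger = logging.getLogger(__name__)

def truncate_all(token_lists, limit):
    """Truncate each token sequence to `limit`, logging stats only on data loss."""
    truncated_count = 0
    tokens_removed = 0
    max_length = 0
    out = []
    for tokens in token_lists:
        max_length = max(max_length, len(tokens))
        if len(tokens) > limit:
            removed = len(tokens) - limit
            truncated_count += 1
            tokens_removed += removed
            if truncated_count <= 3:  # show first 3 individual truncations
                logger.warning("Truncated chunk: %d -> %d tokens (-%d)",
                               len(tokens), limit, removed)
            tokens = tokens[:limit]
        out.append(tokens)
    if truncated_count:
        logger.warning("Truncation summary: %d chunks, %d tokens removed, "
                       "longest input %d", truncated_count, tokens_removed, max_length)
    else:
        # Silent unless debugging: nothing was lost
        logger.debug("No truncation needed (longest input %d <= limit %d)",
                     max_length, limit)
    return out
```

The key design point is the asymmetric log levels: data loss is loud (WARNING), the no-op case is quiet (DEBUG), so the summary can never mislead the way the old "truncated where necessary" message did.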
LGTM!
```python
logger = logging.getLogger(__name__)


# Flag to ensure AST token warning only shown once per session
_ast_token_warning_shown = False
```
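The flag above is typically paired with a check-and-set guard; a minimal sketch of that pattern (the wrapper function name is hypothetical):

```python
import logging

logger = logging.getLogger("ast_chunking")

# Flag to ensure AST token warning only shown once per session
_ast_token_warning_shown = False

def warn_ast_token_limit_once(message):
    """Emit the AST token warning only once per session; return True if emitted."""
    global _ast_token_warning_shown
    if _ast_token_warning_shown:
        return False
    logger.warning(message)
    _ast_token_warning_shown = True
    return True
```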
nit: let's use `warnings.filterwarnings`. it's thread-safe.
Thanks @ww2283 for the awesome PR!
LGTM as well! @yichuan-w @ww2283 |

Enhanced Token Limit Handling with Dynamic Discovery + AST Metadata
This PR builds on #152, extending metadata reveal to AST chunking, and on #154, adding dynamic token limit discovery for comprehensive token truncation with zero maintenance for Ollama deployments.
Key Enhancements
1. Dynamic Ollama Token Limit Discovery
- Queries the Ollama `/api/show` endpoint for model context limits

2. Corrected Model Token Limits
- `nomic-embed-text`: 2048 tokens (was incorrectly 512)
- Verified via `/api/show` endpoint inspection

3. AST Metadata Preservation
- Chunks returned in `List[Dict[str, Any]]` format

4. Comprehensive Test Coverage
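An illustrative example of the `List[Dict[str, Any]]` chunk shape described above; keys beyond those the PR names (`file_path`, `file_name`, timestamps, line numbers, node counts, the `"content"` key) are assumptions, and the helper itself is hypothetical.

```python
from datetime import datetime, timezone

def make_chunk(content, file_path, start_line, end_line, node_count):
    """Build one AST chunk dict: content plus preserved metadata."""
    return {
        "content": content,  # extraction reads the "content" key (the bug fix)
        "metadata": {
            "file_path": file_path,
            "file_name": file_path.rsplit("/", 1)[-1],
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "start_line": start_line,   # astchunk line-number metadata
            "end_line": end_line,
            "node_count": node_count,   # astchunk node-count metadata
        },
    }

chunk = make_chunk("def foo(): ...", "src/app.py", 10, 12, 1)
```

With this shape, a `--show-metadata` flag only has to print `chunk["metadata"]` alongside each result.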
Implementation Details
Files Changed:

- `embedding_compute.py`: Dynamic discovery, enhanced truncation, improved logging
- `chunking_utils.py`: Dict-based return format, AST metadata flow
- `cli.py`: AST chunking parameters
- `test_token_truncation.py`: Comprehensive unit tests (new)
- `test_astchunk_integration.py`: AST metadata validation (new)

Testing
Exclude diskann tests (native C++ threading bugs on macOS):

```
uv run pytest -k "not (diskann or test_backend_options or test_large_index)"
```

Benefits
Migration Notes
No breaking changes: the enhancement is additive. Existing code continues to work with the static registry, and Ollama users automatically benefit from dynamic discovery.
Checklist

- Tests pass (`uv run pytest`)
- Code formatted (`ruff format` and `ruff check`)
- Pre-commit hooks pass (`pre-commit run --all-files`)