
Conversation

@ww2283 (Contributor) commented Nov 3, 2025

Enhanced Token Limit Handling with Dynamic Discovery + AST Metadata

This PR builds on #152 by extending the metadata that is surfaced when AST chunking is used, and on #154 by adding dynamic token limit discovery, providing comprehensive token truncation with zero maintenance for Ollama deployments.

Key Enhancements

1. Dynamic Ollama Token Limit Discovery

  • Zero-maintenance: Auto-queries Ollama's /api/show endpoint for model context limits
  • Future-proof: Automatically adapts to new models without code changes when Ollama is used
  • Fallback safety: Registry-based limits for LM Studio, OpenAI, and unknown models served through an OpenAI-style API
# Detects limits automatically for Ollama
token_limit = get_model_token_limit("nomic-embed-text", base_url="http://localhost:11434")
# → Queries /api/show, returns actual context_length
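
As a rough sketch of how this lookup could work (not the PR's actual implementation): the helper name, the use of requests, and the assumption that /api/show returns a model_info mapping containing a *.context_length key are all illustrative.

```python
# Sketch only: queries Ollama's /api/show endpoint and extracts the context
# length, falling back to a conservative default when the call fails.
import requests

FALLBACK_LIMIT = 2048  # conservative default for unknown models


def discover_ollama_context_length(model: str, base_url: str = "http://localhost:11434") -> int:
    try:
        resp = requests.post(f"{base_url}/api/show", json={"name": model}, timeout=5)
        resp.raise_for_status()
        model_info = resp.json().get("model_info", {})
        # Keys typically look like "nomic-bert.context_length" or "llama.context_length"
        for key, value in model_info.items():
            if key.endswith(".context_length"):
                return int(value)
    except (requests.RequestException, ValueError):
        pass
    return FALLBACK_LIMIT
```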

2. Corrected Model Token Limits

  • Fixed nomic-embed-text: 2048 tokens (was incorrectly 512)
  • Verified via /api/show endpoint inspection
  • Added OpenAI embedding models (text-embedding-3-small/large: 8192); an illustrative registry is sketched below
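
For illustration, a static fallback registry reflecting the limits listed above might look like the following; the dict name is an assumption and the actual contents of embedding_compute.py may differ.

```python
# Illustrative registry used when dynamic discovery is unavailable.
MODEL_TOKEN_LIMITS = {
    "nomic-embed-text": 2048,         # corrected from 512
    "nomic-embed-text-v1.5": 2048,
    "text-embedding-3-small": 8192,
    "text-embedding-3-large": 8192,
}
DEFAULT_TOKEN_LIMIT = 2048  # conservative fallback for unknown models
```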

3. AST Metadata Preservation

  • Chunking functions now return a List[Dict[str, Any]] format (see the sketch after this list)
  • Preserves AST metadata (classes, functions, start_line, end_line)
  • Enables metadata display in search results
  • Backwards compatible with existing code
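
A hypothetical example of what one returned chunk dict could look like under this format; the exact key names used by chunking_utils.py may differ.

```python
# Illustrative chunk structure: content plus AST-derived metadata.
chunks = [
    {
        "content": "class Indexer:\n    def build(self): ...",
        "metadata": {
            "file_path": "src/indexer.py",
            "file_name": "indexer.py",
            "classes": ["Indexer"],
            "functions": ["build"],
            "start_line": 10,
            "end_line": 42,
        },
    },
]
```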

4. Comprehensive Test Coverage

  • 277 lines of token truncation unit tests
  • 553 lines of AST integration tests
  • Tests cover known/unknown models, single/batch truncation, and edge cases (an illustrative test is sketched after this list)
  • Current results: 84/89 tests passing (5 pre-existing RAG environment failures)
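
As a sketch of the style of test described above; the import path is a guess (adjust to wherever embedding_compute.py lives) and the assertions are illustrative, not copied from test_token_truncation.py.

```python
from embedding_compute import get_model_token_limit  # assumed import path


def test_known_model_uses_corrected_limit():
    # nomic-embed-text's limit was corrected from 512 to 2048 in this PR
    assert get_model_token_limit("nomic-embed-text") == 2048


def test_unknown_model_falls_back_to_conservative_default():
    # Unknown models should get the conservative 2048-token fallback
    assert get_model_token_limit("some-model-that-does-not-exist") == 2048
```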

Implementation Details

Files Changed:

  • embedding_compute.py: Dynamic discovery, enhanced truncation, improved logging
  • chunking_utils.py: Dict-based return format, AST metadata flow
  • cli.py: AST chunking parameters
  • test_token_truncation.py: Comprehensive unit tests (new)
  • test_astchunk_integration.py: AST metadata validation (new)

Testing

Exclude diskann tests (native C++ threading bugs on macOS):

uv run pytest -k "not (diskann or test_backend_options or test_large_index)"

Benefits

  1. Zero-maintenance for Ollama: No hardcoded limits to update
  2. Prevents silent truncation: Client-side truncation runs before the server can silently drop tokens (see the sketch after this list)
  3. Better debugging: Rich metadata for search result provenance
  4. Safer defaults: Conservative fallbacks (2048) for unknown models
  5. Comprehensive testing: Verified behavior across scenarios
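
A minimal sketch of what client-side truncation can look like, using tiktoken purely as a stand-in tokenizer; the project's actual truncate_to_token_limit may use a different tokenizer and signature.

```python
import tiktoken


def truncate_to_token_limit(text: str, limit: int, encoding: str = "cl100k_base") -> str:
    enc = tiktoken.get_encoding(encoding)
    tokens = enc.encode(text)
    if len(tokens) <= limit:
        return text
    # Truncate on the client so the embedding server never silently drops tokens.
    return enc.decode(tokens[:limit])
```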

Migration Notes

No breaking changes; the enhancement is additive. Existing code continues to work with the static registry, and Ollama users automatically benefit from dynamic discovery.

Checklist

  • Tests pass (uv run pytest)
  • Code formatted (ruff format and ruff check)
  • Pre-commit hooks pass (pre-commit run --all-files)

Improves upon upstream PR yichuan-w#154 with the following enhancements:

1. **Hybrid Token Limit Discovery**
   - Dynamic: Query Ollama /api/show for context limits
   - Fallback: Registry for LM Studio/OpenAI
   - Zero maintenance for Ollama users
   - Respects custom num_ctx settings

2. **AST Metadata Preservation**
   - create_ast_chunks() returns dict format with metadata
   - Preserves file_path, file_name, timestamps
   - Includes astchunk metadata (line numbers, node counts)
   - Fixes content extraction bug (checks "content" key)
   - Enables --show-metadata flag

3. **Better Token Limits**
   - nomic-embed-text: 2048 tokens (vs 512)
   - nomic-embed-text-v1.5: 2048 tokens
   - Added OpenAI models: 8192 tokens

4. **Comprehensive Tests**
   - 11 tests for token truncation
   - 545 new lines in test_astchunk_integration.py
   - All metadata preservation tests passing
- Merged upstream's model list with our corrected token limits
- Kept our corrected nomic-embed-text: 2048 (not 512)
- Removed post-chunking validation (redundant with embedding-time truncation)
- All tests passing except 2 pre-existing integration test failures
…dling

- Remove duplicate truncate_to_token_limit and get_model_token_limit functions
- Restore version handling logic (model:latest -> model) from PR yichuan-w#154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing
@ww2283 (Contributor, Author) commented Nov 3, 2025

@ASuresh0524 and @yichuan-w: take a look at this PR and let me know. Btw, where's the Slack yichuan mentioned to me several days ago? I was not smart enough to find it to join...

- Add module-level flag to track if warning shown
- Prevents spam when processing multiple files
- Add clarifying note that auto-truncation happens at embedding time
- Addresses issue where warning appeared for every code file
@ASuresh0524 (Collaborator)

hey @ww2283 will look at this by tomorrow night, sorry for the delay

@yichuan-w (Owner)

Oh, the Slack link is here: [image]

- Track and report truncation statistics (count, tokens removed, max length)
- Show first 3 individual truncations with exact token counts
- Provide comprehensive summary when truncation occurs
- Use WARNING level for data loss visibility
- Silent (DEBUG level only) when no truncation needed

Replaces misleading "truncated where necessary" message that appeared
even when nothing was truncated.
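
A rough sketch of the reporting behavior described in this commit; the variable names and log wording are illustrative rather than taken from the diff.

```python
import logging

logger = logging.getLogger(__name__)


def report_truncation(stats: list[tuple[int, int]]) -> None:
    """stats: (original_tokens, kept_tokens) per truncated text."""
    if not stats:
        logger.debug("No truncation needed for this batch.")
        return
    removed = sum(orig - kept for orig, kept in stats)
    longest = max(orig for orig, _ in stats)
    for orig, kept in stats[:3]:  # show the first 3 individual truncations
        logger.warning("Truncated text from %d to %d tokens", orig, kept)
    logger.warning(
        "Truncated %d text(s), removed %d tokens total (longest input: %d tokens)",
        len(stats), removed, longest,
    )
```
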
@ww2283 force-pushed the pr154-conflict-resolution branch from 494637d to 24d2971 on November 4, 2025 at 14:23
@andylizf self-requested a review on November 5, 2025 at 04:02
@andylizf (Collaborator) left a comment

LGTM!

logger = logging.getLogger(__name__)

# Flag to ensure AST token warning only shown once per session
_ast_token_warning_shown = False
Collaborator

nit: let's use warnings.filterwarnings. it's thread-safe.
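
For reference, a minimal sketch of the suggested alternative using the standard library's "once" warning filter in place of a module-level flag; the warning category and message here are hypothetical.

```python
import warnings


class ASTTokenLimitWarning(UserWarning):
    pass


# With the "once" action, a matching warning is shown only the first time it fires.
warnings.filterwarnings("once", category=ASTTokenLimitWarning)


def warn_ast_token_limit(model: str, limit: int) -> None:
    warnings.warn(
        f"AST chunks may exceed the {limit}-token limit for {model}; "
        "they will be auto-truncated at embedding time.",
        ASTTokenLimitWarning,
    )
```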

@andylizf (Collaborator) commented Nov 5, 2025

Thanks @ww2283 for the awesome PR!

@ASuresh0524 (Collaborator)

LGTM as well! @yichuan-w @ww2283
