Support configured tokenizers for embedding chunking#858
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughAdds an optional embedding tokenizer config, wires it through runtime config, and replaces fixed ChangesPluggable Tokenizer for Embedding Client
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/embedding_client.py (1)
156-203: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse repo-standard exceptions for tokenizer config failures.
These branches introduce new config/runtime failures as raw
ImportErrorandValueError. Please route invalid specs and missing optional dependencies through the exception types defined insrc/exceptions.pyso callers get the repo’s normal error contract. As per coding guidelines "Use explicit error handling with appropriate exception types from src/exceptions.py" and "Define custom exception types in src/exceptions.py and use them throughout the codebase".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/embedding_client.py` around lines 156 - 203, The tokenizer resolution paths in _load_huggingface_tokenizer and _resolve_tokenizer currently raise raw ImportError and ValueError for missing optional dependencies and invalid specs; switch these failures to the repo-standard exception types defined in src/exceptions.py so callers see the expected error contract. Update the hf:, file:, and tiktoken: validation branches, plus the missing tokenizers import case, to use the appropriate custom exceptions already used elsewhere in the codebase.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/embedding_client.py`:
- Around line 156-203: The tokenizer resolution paths in
_load_huggingface_tokenizer and _resolve_tokenizer currently raise raw
ImportError and ValueError for missing optional dependencies and invalid specs;
switch these failures to the repo-standard exception types defined in
src/exceptions.py so callers see the expected error contract. Update the hf:,
file:, and tiktoken: validation branches, plus the missing tokenizers import
case, to use the appropriate custom exceptions already used elsewhere in the
codebase.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fa05c1b1-4cc2-4900-a22e-3f3262953282
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
.env.templateconfig.toml.exampledocs/v3/contributing/configuration.mdxpyproject.tomlsrc/config.pysrc/embedding_client.pytests/llm/test_embedding_client.pytests/llm/test_model_config.py
Embedding chunking always picked a tokenizer from the model name through
tiktoken, which can be wrong for non-OpenAI embedding models likeBAAI/bge-m3.This adds an explicit embedding tokenizer setting and passes it into the chunking/token-count path, with support for
tiktoken:,hf:, andfile:tokenizer specs. Blank tokenizer config stays unset, so existing defaults continue to work.I checked the focused embedding/config tests with 49 passing, the full
tests/llmsuite with 206 passing, ruff, basedpyright,uv sync --all-extras --dev --frozen, andgit diff --check.Fixes #827
Summary by CodeRabbit
New Features
tiktoken:...,hf:..., and localfile:/.../tokenizer.jsonspecs.Documentation
EMBEDDING_MODEL_CONFIG__TOKENIZERoption and tokenizer spec examples.