Support configured tokenizers for embedding chunking#858

Open
morgan-coded wants to merge 2 commits into
plastic-labs:mainfrom
morgan-coded:fix/827-embedding-tokenizer-config

@morgan-coded morgan-coded commented Jun 28, 2026

Embedding chunking always chose its tokenizer by looking up the model name in tiktoken, which can select the wrong tokenizer for non-OpenAI embedding models such as BAAI/bge-m3.

This adds an explicit embedding tokenizer setting and passes it into the chunking/token-count path, with support for tiktoken:, hf:, and file: tokenizer specs. A blank tokenizer value is treated as unset, so existing defaults continue to work.
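
For readers skimming the thread, here is a minimal sketch of how a spec string with these prefixes could be split. It is not the code from this PR: `TokenizerSpec`, `parse_tokenizer_spec`, and the exact validation rules are assumptions made for illustration, and the only behavior taken from the description is the three prefixes and the blank-means-unset rule.

```python
from __future__ import annotations

from dataclasses import dataclass

# Prefixes named in the PR description; the parsing rules below are assumed.
KNOWN_SCHEMES = ("tiktoken", "hf", "file")


@dataclass(frozen=True)
class TokenizerSpec:
    """Hypothetical parsed form of a tokenizer spec string."""

    scheme: str  # "tiktoken", "hf", or "file"
    target: str  # encoding name, Hub repo id, or path to a tokenizer.json


def parse_tokenizer_spec(raw: str | None) -> TokenizerSpec | None:
    """Return None for unset or blank values, else split on the first colon."""
    if raw is None or not raw.strip():
        return None  # blank config stays unset, so the default path applies
    scheme, sep, target = raw.strip().partition(":")
    if not sep or scheme not in KNOWN_SCHEMES or not target:
        raise ValueError(f"Unsupported tokenizer spec: {raw!r}")
    return TokenizerSpec(scheme=scheme, target=target)
```

Splitting on the first colon only means the rest of the string, including any further colons in a file path, is passed through untouched as the target.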

I ran the focused embedding/config tests (49 passing) and the full tests/llm suite (206 passing), along with ruff, basedpyright, uv sync --all-extras --dev --frozen, and git diff --check.

Fixes #827

Summary by CodeRabbit

  • New Features

    • Added optional embedding tokenizer configuration (env/config and examples), including tiktoken:..., hf:..., and local file:/.../tokenizer.json specs.
    • Embedding token counting/chunking and batching now use the configured tokenizer.
    • Embedding tokenizer selection is now reflected in runtime behavior (changes trigger a tokenizer rebuild).
  • Documentation

    • Updated configuration docs and templates to include the new EMBEDDING_MODEL_CONFIG__TOKENIZER option and tokenizer spec examples.
    • Noted the optional extra needed for Hugging Face and file tokenizers.
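
The "changes trigger a tokenizer rebuild" point can be pictured with a small sketch. This is an assumed shape, not the PR's implementation: `EmbeddingSettings`, `EmbeddingClient`, and `get_client` are hypothetical names, and the real singleton may compare different fields.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSettings:
    """Hypothetical subset of the embedding config used for comparison."""

    model: str
    tokenizer: str | None = None  # e.g. "hf:BAAI/bge-m3"; None keeps the default


class EmbeddingClient:
    """Stand-in client that remembers the settings it was built with."""

    def __init__(self, settings: EmbeddingSettings) -> None:
        self.settings = settings


_client: EmbeddingClient | None = None


def get_client(settings: EmbeddingSettings) -> EmbeddingClient:
    """Reuse the cached client unless any setting, tokenizer included, changed."""
    global _client
    if _client is None or _client.settings != settings:
        _client = EmbeddingClient(settings)
    return _client
```

Including the tokenizer spec in the comparison is what keeps a stale tokenizer from surviving a config change.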

Copilot AI review requested due to automatic review settings June 28, 2026 16:10

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

coderabbitai Bot commented Jun 28, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c1bc1b8c-19b4-4f04-b1ce-23012113339f

📥 Commits

Reviewing files that changed from the base of the PR and between 1e7176b and 95a9f55.

📒 Files selected for processing (2)
  • src/embedding_client.py
  • tests/llm/test_embedding_client.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/llm/test_embedding_client.py
  • src/embedding_client.py

Walkthrough

Adds an optional embedding tokenizer config, wires it through runtime config, and replaces fixed tiktoken selection with a pluggable tokenizer resolver supporting tiktoken:, hf:, and file: specs. Updates docs, templates, dependency metadata, and tests.

Changes

Pluggable Tokenizer for Embedding Client

| Layer / File(s) | Summary |
| --- | --- |
| Config fields and optional dependency (pyproject.toml, src/config.py, .env.template, config.toml.example) | Adds tokenizer: str \| None = None to embedding config types, passes it through resolve_embedding_model_config, defines tokenizers>=0.22.0 as an optional dependency extra, and adds commented tokenizer examples to the config templates. |
| TokenizerLike protocol and _resolve_tokenizer implementation (src/embedding_client.py) | Defines TokenizerLike, adds tokenizer resolution for tiktoken:, hf:, and file: specs with default fallback behavior, uses it in _EmbeddingClient initialization and chunking, and includes tokenizer changes in singleton reinitialization. |
| Tests and documentation (tests/llm/test_embedding_client.py, tests/llm/test_model_config.py, docs/v3/contributing/configuration.mdx) | Adds tokenizer-focused tests for client behavior and config/env plumbing, updates .env.template coverage, and documents tokenizer spec formats and the honcho[tokenizers] extra. |
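
As a rough illustration of why a protocol makes the chunking path tokenizer-agnostic, the sketch below chunks text against any object with `encode` and `decode`. The method signatures are an assumption about what `TokenizerLike` might look like; `WhitespaceTokenizer` and `chunk_by_tokens` are hypothetical helpers that exist only so the example runs without tiktoken or tokenizers installed.

```python
from __future__ import annotations

from typing import Protocol


class TokenizerLike(Protocol):
    """Assumed shape: anything that can encode text and decode token ids."""

    def encode(self, text: str) -> list[int]: ...

    def decode(self, tokens: list[int]) -> str: ...


class WhitespaceTokenizer:
    """Toy tokenizer (one token per word) used purely for demonstration."""

    def __init__(self) -> None:
        self._vocab: list[str] = []
        self._ids: dict[str, int] = {}

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            if word not in self._ids:
                self._ids[word] = len(self._vocab)
                self._vocab.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, tokens: list[int]) -> str:
        return " ".join(self._vocab[t] for t in tokens)


def chunk_by_tokens(text: str, tokenizer: TokenizerLike, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenizer.encode(text)
    return [
        tokenizer.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

A tiktoken encoding already matches this shape, since its `encode` returns a list of ids. A Hugging Face `tokenizers.Tokenizer` returns an `Encoding` object whose ids live on `.ids`, so it would need a thin adapter to fit a protocol like this one.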

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • plastic-labs/honcho#647: Modifies the same embedding client tokenizer/token-counting flow that this PR makes configurable.

Suggested reviewers

  • VVoruganti

Poem

🐇 I hopped through tokens, light and spry,
tiktoken, hf:, and file:—oh my!
The right encoder now joins the choir,
So chunks stay true and don’t conspire.
Hop-hop hooray, the counts align just right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.62% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: configurable tokenizers for embedding chunking. |
| Linked Issues check | ✅ Passed | The PR implements the requested tokenizer config, supports tiktoken/hf/file specs, adds the optional dependency, and wires it through chunking. |
| Out of Scope Changes check | ✅ Passed | The changes stay focused on tokenizer configuration, runtime wiring, docs, and tests with no unrelated feature work. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/embedding_client.py (1)

156-203: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use repo-standard exceptions for tokenizer config failures.

These branches introduce new config/runtime failures as raw ImportError and ValueError. Please route invalid specs and missing optional dependencies through the exception types defined in src/exceptions.py so callers get the repo’s normal error contract. As per coding guidelines "Use explicit error handling with appropriate exception types from src/exceptions.py" and "Define custom exception types in src/exceptions.py and use them throughout the codebase".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/embedding_client.py` around lines 156 - 203, The tokenizer resolution
paths in _load_huggingface_tokenizer and _resolve_tokenizer currently raise raw
ImportError and ValueError for missing optional dependencies and invalid specs;
switch these failures to the repo-standard exception types defined in
src/exceptions.py so callers see the expected error contract. Update the hf:,
file:, and tiktoken: validation branches, plus the missing tokenizers import
case, to use the appropriate custom exceptions already used elsewhere in the
codebase.

Source: Coding guidelines
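
One way the suggested change could look, as a sketch only: the optional import is wrapped so callers see a project exception instead of a raw `ImportError`. `TokenizerConfigError` is a hypothetical placeholder, since the thread does not name the exception types in src/exceptions.py, and `import_optional` is likewise an illustrative helper.

```python
from __future__ import annotations

import importlib
from types import ModuleType


class TokenizerConfigError(Exception):
    """Hypothetical stand-in for a repo exception type from src/exceptions.py."""


def import_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, re-raising ImportError as the custom type."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the underlying cause stays in the traceback.
        raise TokenizerConfigError(
            f"{module_name!r} is required for this tokenizer spec; "
            f"install the {extra!r} extra"
        ) from exc
```

Raising with `from exc` keeps the original `ImportError` available as `__cause__`, so the switch to a custom type does not hide which module was missing.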

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fa05c1b1-4cc2-4900-a22e-3f3262953282

📥 Commits

Reviewing files that changed from the base of the PR and between eb386c3 and 1e7176b.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • .env.template
  • config.toml.example
  • docs/v3/contributing/configuration.mdx
  • pyproject.toml
  • src/config.py
  • src/embedding_client.py
  • tests/llm/test_embedding_client.py
  • tests/llm/test_model_config.py

Development

Successfully merging this pull request may close these issues.

[Bug] Embedding tokenizer mismatch causes silent chunking failures for non-OpenAI models

2 participants