Support configured tokenizers for embedding chunking by morgan-coded · Pull Request #858 · plastic-labs/honcho

morgan-coded · 2026-06-28T16:10:47Z

Embedding chunking always picked a tokenizer from the model name through tiktoken, which can be wrong for non-OpenAI embedding models like BAAI/bge-m3.

This adds an explicit embedding tokenizer setting and passes it into the chunking/token-count path, with support for tiktoken:, hf:, and file: tokenizer specs. Blank tokenizer config stays unset, so existing defaults continue to work.
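
To make the spec format concrete, here is a minimal sketch of how a `tiktoken:`, `hf:`, or `file:` string could be split into a kind and a value, with blank input treated as unset. The names `TokenizerSpec` and `parse_tokenizer_spec` are hypothetical and are not taken from this PR; the actual resolver in `src/embedding_client.py` may be structured differently.

```python
from __future__ import annotations

from typing import NamedTuple


class TokenizerSpec(NamedTuple):
    # Hypothetical container, not a type from the PR.
    kind: str   # "tiktoken", "hf", or "file"
    value: str  # encoding name, Hub repo id, or local path


_KNOWN_PREFIXES = ("tiktoken", "hf", "file")


def parse_tokenizer_spec(raw: str | None) -> TokenizerSpec | None:
    """Return None for unset or blank input, else a (kind, value) pair."""
    if raw is None or not raw.strip():
        return None  # blank stays unset, so the caller keeps its default
    # Split on the first colon only, so paths containing ":" survive intact.
    kind, sep, value = raw.strip().partition(":")
    if not sep or kind not in _KNOWN_PREFIXES or not value:
        raise ValueError(f"unsupported tokenizer spec: {raw!r}")
    return TokenizerSpec(kind, value)
```

With this sketch, `parse_tokenizer_spec("hf:BAAI/bge-m3")` yields the kind `"hf"` and the value `"BAAI/bge-m3"`, while an empty string yields `None` and leaves the default tokenizer selection in place.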

I ran the focused embedding/config tests (49 passing), the full tests/llm suite (206 passing), ruff, basedpyright, uv sync --all-extras --dev --frozen, and git diff --check.

Fixes #827

Summary by CodeRabbit

New Features
- Added optional embedding tokenizer configuration (env/config and examples), including tiktoken:..., hf:..., and local file:/.../tokenizer.json specs.
- Embedding token counting/chunking and batching now use the configured tokenizer.
- Embedding tokenizer selection is now reflected in runtime behavior (changes trigger a tokenizer rebuild).

Documentation
- Updated configuration docs and templates to include the new EMBEDDING_MODEL_CONFIG__TOKENIZER option and tokenizer spec examples.
- Noted the optional extra needed for Hugging Face and file tokenizers.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

coderabbitai · 2026-06-28T16:11:01Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c1bc1b8c-19b4-4f04-b1ce-23012113339f

📥 Commits

Reviewing files that changed from the base of the PR and between 1e7176b and 95a9f55.

📒 Files selected for processing (2)

src/embedding_client.py
tests/llm/test_embedding_client.py

🚧 Files skipped from review as they are similar to previous changes (2)

tests/llm/test_embedding_client.py
src/embedding_client.py

Walkthrough

Adds an optional embedding tokenizer config, wires it through runtime config, and replaces fixed tiktoken selection with a pluggable tokenizer resolver supporting tiktoken:, hf:, and file: specs. Updates docs, templates, dependency metadata, and tests.
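
The runtime wiring mentioned here includes rebuilding the client when the tokenizer setting changes. A rough sketch of that idea follows, assuming a module-level singleton keyed on a settings object. `EmbeddingSettings`, `Client`, and `get_client` are hypothetical stand-ins and do not mirror the real `_EmbeddingClient` code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSettings:
    # Hypothetical, trimmed-down settings; the real config has more fields.
    model: str
    tokenizer: str | None = None


class Client:
    # Stand-in for the embedding client; it only remembers its settings.
    def __init__(self, settings: EmbeddingSettings) -> None:
        self.settings = settings


_client: Client | None = None


def get_client(settings: EmbeddingSettings) -> Client:
    """Reuse the cached client unless any setting, tokenizer included, changed."""
    global _client
    if _client is None or _client.settings != settings:
        _client = Client(settings)
    return _client
```

Because the dataclass compares by value, adding `tokenizer` to the settings is enough for a changed spec to invalidate the cached instance in this sketch.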

Changes

Pluggable Tokenizer for Embedding Client

| Layer / File(s) | Summary |
| --- | --- |
| Config fields and optional dependency `pyproject.toml`, `src/config.py`, `.env.template`, `config.toml.example` | Adds `tokenizer: str \| None = None` to embedding config types, passes it through `resolve_embedding_model_config`, defines `tokenizers>=0.22.0` as an optional dependency extra, and adds commented tokenizer examples to the config templates. |
| TokenizerLike protocol and `_resolve_tokenizer` implementation `src/embedding_client.py` | Defines `TokenizerLike`, adds tokenizer resolution for `tiktoken:`, `hf:`, and `file:` specs with default fallback behavior, uses it in `_EmbeddingClient` initialization and chunking, and includes tokenizer changes in singleton reinitialization. |
| Tests and documentation `tests/llm/test_embedding_client.py`, `tests/llm/test_model_config.py`, `docs/v3/contributing/configuration.mdx` | Adds tokenizer-focused tests for client behavior and config/env plumbing, updates `.env.template` coverage, and documents tokenizer spec formats and the `honcho[tokenizers]` extra. |
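
As an illustration of why a shared tokenizer protocol helps chunking, the sketch below defines a structural interface and a chunker that depends only on it. The method set on `TokenizerLike` here (`encode` and `decode`) is an assumption, since the PR page does not show the real protocol; `WhitespaceTokenizer` and `chunk_by_tokens` are hypothetical and exist only for the example.

```python
from __future__ import annotations

from typing import Protocol


class TokenizerLike(Protocol):
    # Assumed shape; the protocol in src/embedding_client.py may differ.
    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


class WhitespaceTokenizer:
    """Toy stand-in: one token id per whitespace-separated word."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}
        self._words: list[str] = []

    def encode(self, text: str) -> list[int]:
        out: list[int] = []
        for word in text.split():
            if word not in self._ids:
                self._ids[word] = len(self._words)
                self._words.append(word)
            out.append(self._ids[word])
        return out

    def decode(self, tokens: list[int]) -> str:
        return " ".join(self._words[t] for t in tokens)


def chunk_by_tokens(text: str, tokenizer: TokenizerLike, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenizer.encode(text)
    return [
        tokenizer.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

Any object with matching `encode` and `decode` methods can be passed in, so chunk boundaries follow whichever tokenizer the configuration selects rather than a fixed one.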

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

plastic-labs/honcho#647: Modifies the same embedding client tokenizer/token-counting flow that this PR makes configurable.

Suggested reviewers

VVoruganti

Poem

🐇 I hopped through tokens, light and spry,
tiktoken, hf:, and file:—oh my!
The right encoder now joins the choir,
So chunks stay true and don’t conspire.
Hop-hop hooray, the counts align just right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.62% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: configurable tokenizers for embedding chunking. |
| Linked Issues check | ✅ Passed | The PR implements the requested tokenizer config, supports tiktoken/hf/file specs, adds the optional dependency, and wires it through chunking. |
| Out of Scope Changes check | ✅ Passed | The changes stay focused on tokenizer configuration, runtime wiring, docs, and tests with no unrelated feature work. |


coderabbitai

🧹 Nitpick comments (1)

src/embedding_client.py (1)
156-203: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use repo-standard exceptions for tokenizer config failures.

These branches introduce new config/runtime failures as raw ImportError and ValueError. Please route invalid specs and missing optional dependencies through the exception types defined in src/exceptions.py so callers get the repo’s normal error contract. As per coding guidelines "Use explicit error handling with appropriate exception types from src/exceptions.py" and "Define custom exception types in src/exceptions.py and use them throughout the codebase".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/embedding_client.py` around lines 156 - 203, The tokenizer resolution
paths in _load_huggingface_tokenizer and _resolve_tokenizer currently raise raw
ImportError and ValueError for missing optional dependencies and invalid specs;
switch these failures to the repo-standard exception types defined in
src/exceptions.py so callers see the expected error contract. Update the hf:,
file:, and tiktoken: validation branches, plus the missing tokenizers import
case, to use the appropriate custom exceptions already used elsewhere in the
codebase.
Source: Coding guidelines
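
One way to follow this suggestion is to catch the raw `ImportError` at the boundary and re-raise it as the repo's exception type. The sketch below assumes a `ValidationException` class, the name used in the follow-up commit title; the class defined here is only a stand-in for the one in `src/exceptions.py`, and `require_optional` is a hypothetical helper.

```python
import importlib
from types import ModuleType


class ValidationException(Exception):
    """Stand-in for the repo's type; the real one lives in src/exceptions.py."""


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with the repo-style exception."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible.
        raise ValidationException(
            f"{module_name!r} is not installed; install the {extra!r} extra "
            "to use this tokenizer spec"
        ) from exc
```

Callers then handle a single exception type for both invalid specs and missing extras, while `__cause__` still carries the original import failure for debugging.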


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fa05c1b1-4cc2-4900-a22e-3f3262953282

📥 Commits

Reviewing files that changed from the base of the PR and between eb386c3 and 1e7176b.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (8)

.env.template
config.toml.example
docs/v3/contributing/configuration.mdx
pyproject.toml
src/config.py
src/embedding_client.py
tests/llm/test_embedding_client.py
tests/llm/test_model_config.py

feat(embeddings): support configured tokenizers

1e7176b

Copilot AI review requested due to automatic review settings June 28, 2026 16:10

Copilot AI reviewed Jun 28, 2026

coderabbitai Bot reviewed Jun 28, 2026


fix(embeddings): use ValidationException for tokenizer config errors

95a9f55

morgan-coded wants to merge 2 commits into plastic-labs:main from morgan-coded:fix/827-embedding-tokenizer-config
