fix(deriver): truncate oversize observations so one can't drop the batch (#569) by leo922oel · Pull Request #864 · plastic-labs/honcho

leo922oel · 2026-07-01T01:49:15Z

Description

src/crud/representation.py collects every observation from a deriver pass and
embeds them in one call — embedding_client.simple_batch_embed(observation_texts).
If a single observation exceeded the provider's per-input token cap,
simple_batch_embed raised ValueError, save_representation re-raised
ValidationException, and the deriver caught it — so none of that pass's
observations were saved, not just the long one (#569). One long pasted artifact
(a PR review, verbose tool output, a stack trace) dropped the whole batch. Rare
under OpenAI's 8192 cap; routine under Gemini's smaller embedding cap.

simple_batch_embed has no sub-chunking path (only batch_embed, used for
messages, does), and truncation is the right call here: an observation is a short
derived fact, so a token-capped prefix still carries the meaningful content.

This adds an opt-in truncation mode and points the deriver at it:

- simple_batch_embed gains on_oversize: "raise" | "truncate" (keyword-only,
  default "raise"). "raise" preserves the exact historical contract for every
  existing caller. "truncate" embeds a token-capped prefix of the oversize input
  and logs a warning; the one-input→one-vector mapping is preserved.
- RepresentationManager opts into on_oversize="truncate", so one oversize
  observation is truncated instead of failing the whole observer's batch.
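
The two modes described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/embedding_client.py: the whitespace tokenizer, the `MAX_TOKENS` value, and `fake_provider_embed` are all stand-ins invented for the example.

```python
import logging
from typing import Literal

logger = logging.getLogger(__name__)

MAX_TOKENS = 8  # hypothetical per-input cap, kept tiny for illustration


def encode(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def decode(tokens: list[str]) -> str:
    """Inverse of the stand-in tokenizer."""
    return " ".join(tokens)


def fake_provider_embed(texts: list[str]) -> list[list[float]]:
    """Stand-in provider: one 1-d vector per input, holding its token count."""
    return [[float(len(encode(t)))] for t in texts]


def simple_batch_embed(
    texts: list[str],
    *,
    on_oversize: Literal["raise", "truncate"] = "raise",
) -> list[list[float]]:
    prepared: list[str] = []
    for idx, text in enumerate(texts):
        tokens = encode(text)
        if len(tokens) > MAX_TOKENS:
            if on_oversize == "raise":
                # Historical contract: any oversize input fails the whole call.
                raise ValueError(
                    f"input {idx} has {len(tokens)} tokens (cap {MAX_TOKENS})"
                )
            # Opt-in mode: keep a token-capped prefix and warn.
            logger.warning(
                "truncated oversize input at idx %d: %d->%d tokens",
                idx, len(tokens), MAX_TOKENS,
            )
            text = decode(tokens[:MAX_TOKENS])
        # Exactly one prepared text per input, so one vector per input.
        prepared.append(text)
    return fake_provider_embed(prepared)
```

Because `on_oversize` is keyword-only with a default of "raise", a call written as `simple_batch_embed(texts)` behaves as before; only a caller that passes `on_oversize="truncate"` gets the new behavior.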

Complements #860 (surface representation save failures instead of silent loss):
#569 keeps the common oversize case from ever failing; #860 makes a genuinely
total save failure loud instead of silent. Neither masks the other.

Quantified (tests/bench/bench_embedding_oversize_survival.py — real
save_representation path with a faked provider + faked DB write, batch_n=10,
exactly one over-length observation):

| Metric | Before (`main`) | After |
| --- | --- | --- |
| `batch_survival_rate` | 0% | 100% |
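
As a rough illustration of what the metric captures, the sketch below computes a survival rate for a batch of 10 observations with exactly one over-length entry. It is not the harness in tests/bench/; `save_batch`, `embed_strict`, `embed_truncating`, the `CAP` value, and the local `ValidationException` class are hypothetical stand-ins.

```python
from typing import Callable

CAP = 5  # hypothetical per-input cap in whitespace tokens

Embedder = Callable[[list[str]], list[list[float]]]


class ValidationException(Exception):
    """Stand-in for the project's validation error (assumption)."""


def embed_strict(texts: list[str]) -> list[list[float]]:
    # "Before" behavior: one oversize input fails the whole call.
    for text in texts:
        if len(text.split()) > CAP:
            raise ValueError("input exceeds token cap")
    return [[0.0] for _ in texts]


def embed_truncating(texts: list[str]) -> list[list[float]]:
    # "After" behavior: oversize inputs are cut to a capped prefix.
    capped = [" ".join(t.split()[:CAP]) for t in texts]
    return [[0.0] for _ in capped]


def save_batch(observations: list[str], embed: Embedder) -> int:
    # All observations go through a single embed call, so it is all-or-nothing.
    try:
        vectors = embed(observations)
    except ValueError as exc:
        raise ValidationException(str(exc)) from exc
    return len(vectors)  # pretend every embedded observation is persisted


def batch_survival_rate(observations: list[str], embed: Embedder) -> float:
    try:
        saved = save_batch(observations, embed)
    except ValidationException:
        return 0.0
    return saved / len(observations)


observations = [f"fact number {i}" for i in range(9)] + [" ".join(["token"] * 50)]
print(batch_survival_rate(observations, embed_strict))
print(batch_survival_rate(observations, embed_truncating))
```

The strict embedder loses all ten observations because of the single long one; the truncating embedder keeps all ten.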

Linked issues

Fixes #569

Type of change

fix — bug fix
feat — new feature
refactor — neither fixes a bug nor adds a feature
perf — performance improvement
docs — documentation only
test — adding or correcting tests
chore — build process or tooling

Test plan

New / updated tests:

- tests/llm/test_embedding_client.py: simple_batch_embed raises on oversize
  when on_oversize="raise" (default) and truncates the oversize prefix (keeping
  1 vector per input) when on_oversize="truncate". (18 passed)
- tests/crud/test_representation_manager.py::TestRepresentationManagerSave::test_save_representation_embeds_with_truncate_on_oversize:
  the manager passes on_oversize="truncate" through. (7 passed; needs a local
  pgvector DB)
- tests/conftest.py: the autouse embedding mock accepts the new on_oversize
  kwarg (via **_kwargs) so existing tests keep passing.
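
The conftest change in the last item can be pictured with two plain functions. This is a sketch only; the function names and the fixed 3-d vectors are made up, and the real mock in tests/conftest.py may be shaped differently.

```python
def fake_simple_batch_embed_old(texts: list[str]) -> list[list[float]]:
    # Old mock shape: positional texts only, so any extra kwarg is a TypeError.
    return [[0.0, 0.0, 0.0] for _ in texts]


def fake_simple_batch_embed(texts: list[str], **_kwargs: object) -> list[list[float]]:
    # New mock shape: extra keyword arguments such as on_oversize are accepted
    # and ignored; the leading underscore marks them as intentionally unused.
    return [[0.0, 0.0, 0.0] for _ in texts]


vectors = fake_simple_batch_embed(["a", "b"], on_oversize="truncate")
print(len(vectors))
```

Accepting `**_kwargs` rather than naming `on_oversize` keeps the mock tolerant of further keyword additions, at the cost of not catching a misspelled keyword in the caller.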

Quantitative before/after (standalone harness, not a CI test):

# AFTER (this branch)
PYTHONPATH=. uv run python tests/bench/bench_embedding_oversize_survival.py
# BEFORE (main source)
git checkout main -- src/embedding_client.py src/crud/representation.py
PYTHONPATH=. uv run python tests/bench/bench_embedding_oversize_survival.py
git checkout HEAD -- src/embedding_client.py src/crud/representation.py

Checklist

- Tests added/updated for the change
- Existing tests pass (tests/llm/test_embedding_client.py 18/18;
  tests/crud/test_representation_manager.py 7/7 against a local pgvector DB)
- ruff clean; basedpyright 0 errors on every changed file (4 pre-existing
  conftest.py header warnings, present identically on main)
- Commit messages follow Conventional Commits
- Documentation update: n/a (internal behavior; new param defaults to the
  prior raise-on-oversize contract, so no caller changes)

Summary by CodeRabbit

New Features
- Added support for handling oversized observation text during embedding using automatic truncation.
Bug Fixes
- Prevented embedding failures when inputs exceed the token limit by truncating oversized texts.
- Preserved the mapping between inputs and returned embeddings to keep output counts consistent.
Tests
- Added benchmarks and expanded coverage for oversize “survival” behavior and truncation edge cases.

simple_batch_embed raised ValueError when any input exceeded the per-input token cap, which failed the entire deriver batch when a single observation was over-length — losing all the other observations in that pass. Add an on_oversize="truncate" option: oversize inputs are embedded from a token-capped prefix (with a warning), preserving one vector per input. The default stays "raise" so existing callers are unchanged. Refs plastic-labs#569

… drop the batch save_representation now calls simple_batch_embed with on_oversize="truncate", so a single over-length observation is embedded from a capped prefix instead of raising and failing the whole batch. Also forward on_oversize through the public EmbeddingClient wrapper: the prior cycle only updated the inner _EmbeddingClient, so the singleton the deriver actually uses would have raised TypeError at runtime (caught by basedpyright; the unit tests mock the singleton). Refs plastic-labs#569
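
The wrapper issue in this commit can be sketched as below: a public client has to declare and pass along the keyword, or callers that use it fail with TypeError. The class names `_InnerClient` and `PublicClient` and the `last_mode` attribute are illustrative assumptions, not the project's actual classes.

```python
from typing import Literal

OnOversize = Literal["raise", "truncate"]


class _InnerClient:
    """Stand-in for the inner client that implements the behavior."""

    def __init__(self) -> None:
        self.last_mode: str | None = None

    def simple_batch_embed(
        self, texts: list[str], *, on_oversize: OnOversize = "raise"
    ) -> list[list[float]]:
        self.last_mode = on_oversize  # recorded only so the example can inspect it
        return [[0.0] for _ in texts]


class PublicClient:
    """Stand-in for the public wrapper that callers actually hold."""

    def __init__(self) -> None:
        self._inner = _InnerClient()

    def simple_batch_embed(
        self, texts: list[str], *, on_oversize: OnOversize = "raise"
    ) -> list[list[float]]:
        # Forward the keyword explicitly; a wrapper without this parameter
        # would reject on_oversize="truncate" with a TypeError.
        return self._inner.simple_batch_embed(texts, on_oversize=on_oversize)


client = PublicClient()
client.simple_batch_embed(["observation"], on_oversize="truncate")
print(client._inner.last_mode)
```

Unit tests that replace the whole client with a mock never exercise the wrapper signature, which is consistent with the commit's note that the type checker caught this rather than the tests.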

save_representation passes on_oversize="truncate" to simple_batch_embed so one over-length observation is embedded from a capped prefix instead of raising and dropping the whole batch's observations. Adds a test for the opt-in and updates the existing embed-call assertions to the new contract. (The public embedding client wrapper that forwards the option landed in the previous commit.) Refs plastic-labs#569

The shared embeddings autouse mock had signature simple_batch_embed(texts); once save_representation began passing on_oversize="truncate", any test that drives the real save path through this mock would raise TypeError. Mirror the new interface (accept and ignore on_oversize) so the mock stays compatible. Refs plastic-labs#569

The autouse embeddings mock now absorbs the new on_oversize kwarg via **_kwargs (no unused-parameter warning), and the truncate test casts the Mock-recorded provider input to list[str] so basedpyright doesn't flag a partially-unknown argument. No behavior change; keeps the contributed test files warning-free.

A standalone, DB-free harness quantifying the fix: saves a batch where one observation is over-length and reports batch_survival_rate via the real save_representation path with a faked provider and faked DB write. Runs unmodified against any checkout (before/after via main vs this branch). Measured: batch-survival 0% -> 100%. Refs plastic-labs#569

coderabbitai · 2026-07-01T01:49:31Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info


📥 Commits

Reviewing files that changed from the base of the PR and between 2ee6513 and cd02039.

📒 Files selected for processing (3)

src/embedding_client.py
tests/bench/bench_embedding_oversize_survival.py
tests/llm/test_embedding_client.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/embedding_client.py
tests/bench/bench_embedding_oversize_survival.py

Walkthrough

Adds truncate handling to embedding batch generation, routes representation saving through that mode, and updates tests plus a benchmark to cover oversize observation inputs.

Changes

Embedding oversize truncation

| Layer / File(s) | Summary |
| --- | --- |
| Embedding client truncate path `src/embedding_client.py` | `_EmbeddingClient.simple_batch_embed` now accepts `on_oversize="raise" \| "truncate"`. |
| Representation save path `src/crud/representation.py`, `tests/crud/test_representation_manager.py` | `save_representation` now passes `on_oversize="truncate"` for observation embeddings, and the manager tests assert that kwarg in the affected calls. |
| Embedding client oversize tests `tests/llm/test_embedding_client.py` | New tests cover truncate mode, re-encoding drift during truncation, and the default oversize error path. |
| Benchmark and shared mock updates `tests/bench/bench_embedding_oversize_survival.py`, `tests/conftest.py` | Adds a benchmark for oversize batch survival and widens the shared embedding mock to accept extra kwargs. |

Estimated code review effort: 3 (Moderate) | ~25 minutes

Possibly related issues

Deriver silently marks representation work units processed when the embedding save fails on a transient 429 (no retry, no fail-loud → silent memory loss) #728 — Also touches RepresentationManager.save_representation() and simple_batch_embed() on the embedding path.


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main fix: truncating oversize observations to prevent batch failure. |
| Linked Issues check | ✅ Passed | The PR implements truncate-on-oversize for storage embedding and updates representation.py so one long observation no longer aborts the batch. |
| Out of Scope Changes check | ✅ Passed | The added API, representation change, tests, and benchmark all relate directly to the oversize-batch fix. |


coderabbitai

🧹 Nitpick comments (2)

tests/bench/bench_embedding_oversize_survival.py (1)
121-131: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

Blind except Exception swallows unrelated failures.

Catching all exceptions to report 0.0 survival conflates "batch fully failed due to oversize" with any other unrelated bug (e.g., a typo in the fake fixture) — both silently report the same score with no diagnostic trace.
♻️ Suggested: log the exception before returning
             except Exception:
+                logger.exception("Batch save failed during survival benchmark")
                 return 0.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/bench/bench_embedding_oversize_survival.py` around lines 121 - 131, The
broad except in the benchmark helper around manager.save_representation is
swallowing unrelated failures and hiding diagnostics. Narrow the handling if
possible, and in the existing exception path for the survival calculation, log
the caught exception with enough context before returning 0.0 so oversize
failures can be distinguished from fixture or test bugs.
Source: Linters/SAST tools

src/embedding_client.py (1)
290-309: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Truncated token count isn't re-verified after decode.

After truncating token_ids[:self.max_embedding_tokens] and decoding back to text, the code assumes the resulting text re-encodes to exactly self.max_embedding_tokens tokens (token_counts.append(self.max_embedding_tokens)) without re-encoding to confirm. If encode(decode(token_ids[:n])) ever yields more tokens than n (edge case at BPE merge/regex-pretokenizer boundaries), the provider could still receive an over-cap input for the very fix this PR intends to guarantee — silently reintroducing the original bug in a rare case.

Since batching arithmetic (_create_batches) also relies on this assumed count for the per-request token budget, a mismatch could also skew batch sizing.
♻️ Suggested defensive re-check
                 if on_oversize == "truncate":
                     text = self.encoding.decode(token_ids[: self.max_embedding_tokens])
+                    # Re-encode to guard against any decode/encode boundary mismatch.
+                    actual_tokens = len(self.encoding.encode(text))
                     logger.warning(
                         "truncated oversize embedding input at idx %d: %d->%d tokens",
                         idx,
                         len(token_ids),
                         self.max_embedding_tokens,
                     )
                     prepared_texts.append(text)
-                    token_counts.append(self.max_embedding_tokens)
+                    token_counts.append(actual_tokens)
                     continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/embedding_client.py` around lines 290 - 309, The truncation path in the
text-preparation logic assumes the decoded text re-encodes to exactly
max_embedding_tokens, which may not always hold. In the embedding input loop in
the method that appends to prepared_texts/token_counts, re-encode the truncated
text after decoding and use that verified token count instead of blindly
appending self.max_embedding_tokens. If the re-encoded length still exceeds the
limit, trim again or fail fast so the batching logic remains accurate.


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between eb386c3 and 2ee6513.

📒 Files selected for processing (6)

src/crud/representation.py
src/embedding_client.py
tests/bench/bench_embedding_oversize_survival.py
tests/conftest.py
tests/crud/test_representation_manager.py
tests/llm/test_embedding_client.py

…ches provider decode(ids[:n]) can re-encode to more than n tokens at BPE boundaries, so the truncate path now re-encodes and trims again until it fits, and reports the verified token count (which _create_batches budgets on) instead of assuming the cap. Adds a drift test that forces the re-trim loop, and narrows the oversize-survival benchmark's except to ValidationException so an unrelated harness bug escapes loudly instead of being scored as a failed batch.
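
The drift this commit guards against can be reproduced with a byte-level stand-in tokenizer: cutting inside a multi-byte character makes decode emit a replacement character that re-encodes to more tokens than were kept. The sketch below is one possible re-trim loop under that assumption; `encode`, `decode`, and `truncate_to_cap` are illustrative names, and the actual loop in src/embedding_client.py may differ.

```python
def encode(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    # A partial multi-byte sequence decodes to U+FFFD, which is 3 bytes long.
    return bytes(ids).decode("utf-8", errors="replace")


def truncate_to_cap(text: str, cap: int) -> tuple[str, int]:
    """Return a prefix of text that re-encodes to at most cap tokens,
    together with its verified token count."""
    ids = encode(text)
    if len(ids) <= cap:
        return text, len(ids)
    keep = cap
    while keep > 0:
        candidate = decode(ids[:keep])
        count = len(encode(candidate))  # verify instead of assuming `cap`
        if count <= cap:
            return candidate, count
        keep -= 1  # drifted over the cap: keep one token fewer and retry
    return "", 0


print(truncate_to_cap("aaé", 3))
```

With a cap of 3, the first attempt keeps the bytes for "aa" plus half of "é", which decodes to "aa\ufffd" and re-encodes to 5 tokens. The loop then keeps 2 tokens and returns `("aa", 2)`, so the reported count is the verified one rather than the cap.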

leo.lc.chien added 6 commits June 30, 2026 19:10

coderabbitai Bot reviewed Jul 1, 2026

leo922oel wants to merge 7 commits into plastic-labs:main from leo922oel:fix/embedding-truncate-oversize
