fix(deriver): truncate oversize observations so one can't drop the batch (#569)#864
fix(deriver): truncate oversize observations so one can't drop the batch (#569)#864leo922oel wants to merge 7 commits into
Conversation
simple_batch_embed raised ValueError when any input exceeded the per-input token cap, which failed the entire deriver batch when a single observation was over-length — losing all the other observations in that pass. Add an on_oversize="truncate" option: oversize inputs are embedded from a token-capped prefix (with a warning), preserving one vector per input. The default stays "raise" so existing callers are unchanged. Refs plastic-labs#569
… drop the batch save_representation now calls simple_batch_embed with on_oversize="truncate", so a single over-length observation is embedded from a capped prefix instead of raising and failing the whole batch. Also forward on_oversize through the public EmbeddingClient wrapper: the prior cycle only updated the inner _EmbeddingClient, so the singleton the deriver actually uses would have raised TypeError at runtime (caught by basedpyright; the unit tests mock the singleton). Refs plastic-labs#569
save_representation passes on_oversize="truncate" to simple_batch_embed so one over-length observation is embedded from a capped prefix instead of raising and dropping the whole batch's observations. Adds a test for the opt-in and updates the existing embed-call assertions to the new contract. (The public embedding client wrapper that forwards the option landed in the previous commit.) Refs plastic-labs#569
The shared embeddings autouse mock had signature simple_batch_embed(texts); once save_representation began passing on_oversize="truncate", any test that drives the real save path through this mock would raise TypeError. Mirror the new interface (accept and ignore on_oversize) so the mock stays compatible. Refs plastic-labs#569
The autouse embeddings mock now absorbs the new on_oversize kwarg via **_kwargs (no unused-parameter warning), and the truncate test casts the Mock-recorded provider input to list[str] so basedpyright doesn't flag a partially-unknown argument. No behavior change; keeps the contributed test files warning-free.
A standalone, DB-free harness quantifying the fix: saves a batch where one observation is over-length and reports batch_survival_rate via the real save_representation path with a faked provider and faked DB write. Runs unmodified against any checkout (before/after via main vs this branch). Measured: batch-survival 0% -> 100%. Refs plastic-labs#569
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughAdds truncate handling to embedding batch generation, routes representation saving through that mode, and updates tests plus a benchmark to cover oversize observation inputs. ChangesEmbedding oversize truncation
Estimated code review effort: 3 (Moderate) | ~25 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/bench/bench_embedding_oversize_survival.py (1)
121-131: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low valueBlind
except Exceptionswallows unrelated failures.Catching all exceptions to report
0.0survival conflates "batch fully failed due to oversize" with any other unrelated bug (e.g., a typo in the fake fixture) — both silently report the same score with no diagnostic trace.♻️ Suggested: log the exception before returning
except Exception: + logger.exception("Batch save failed during survival benchmark") return 0.0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/bench/bench_embedding_oversize_survival.py` around lines 121 - 131, The broad except in the benchmark helper around manager.save_representation is swallowing unrelated failures and hiding diagnostics. Narrow the handling if possible, and in the existing exception path for the survival calculation, log the caught exception with enough context before returning 0.0 so oversize failures can be distinguished from fixture or test bugs.Source: Linters/SAST tools
src/embedding_client.py (1)
290-309: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winTruncated token count isn't re-verified after decode.
After truncating
token_ids[:self.max_embedding_tokens]and decoding back to text, the code assumes the resulting text re-encodes to exactlyself.max_embedding_tokenstokens (token_counts.append(self.max_embedding_tokens)) without re-encoding to confirm. Ifencode(decode(token_ids[:n]))ever yields more tokens thann(edge case at BPE merge/regex-pretokenizer boundaries), the provider could still receive an over-cap input for the very fix this PR intends to guarantee — silently reintroducing the original bug in a rare case.Since batching arithmetic (
_create_batches) also relies on this assumed count for the per-request token budget, a mismatch could also skew batch sizing.♻️ Suggested defensive re-check
if on_oversize == "truncate": text = self.encoding.decode(token_ids[: self.max_embedding_tokens]) + # Re-encode to guard against any decode/encode boundary mismatch. + actual_tokens = len(self.encoding.encode(text)) logger.warning( "truncated oversize embedding input at idx %d: %d->%d tokens", idx, len(token_ids), self.max_embedding_tokens, ) prepared_texts.append(text) - token_counts.append(self.max_embedding_tokens) + token_counts.append(actual_tokens) continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/embedding_client.py` around lines 290 - 309, The truncation path in the text-preparation logic assumes the decoded text re-encodes to exactly max_embedding_tokens, which may not always hold. In the embedding input loop in the method that appends to prepared_texts/token_counts, re-encode the truncated text after decoding and use that verified token count instead of blindly appending self.max_embedding_tokens. If the re-encoded length still exceeds the limit, trim again or fail fast so the batching logic remains accurate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/embedding_client.py`:
- Around line 290-309: The truncation path in the text-preparation logic assumes
the decoded text re-encodes to exactly max_embedding_tokens, which may not
always hold. In the embedding input loop in the method that appends to
prepared_texts/token_counts, re-encode the truncated text after decoding and use
that verified token count instead of blindly appending
self.max_embedding_tokens. If the re-encoded length still exceeds the limit,
trim again or fail fast so the batching logic remains accurate.
In `@tests/bench/bench_embedding_oversize_survival.py`:
- Around line 121-131: The broad except in the benchmark helper around
manager.save_representation is swallowing unrelated failures and hiding
diagnostics. Narrow the handling if possible, and in the existing exception path
for the survival calculation, log the caught exception with enough context
before returning 0.0 so oversize failures can be distinguished from fixture or
test bugs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f44d40ac-b230-4987-9a2a-2e75401c00a7
📒 Files selected for processing (6)
src/crud/representation.pysrc/embedding_client.pytests/bench/bench_embedding_oversize_survival.pytests/conftest.pytests/crud/test_representation_manager.pytests/llm/test_embedding_client.py
…ches provider decode(ids[:n]) can re-encode to more than n tokens at BPE boundaries, so the truncate path now re-encodes and trims again until it fits, and reports the verified token count (which _create_batches budgets on) instead of assuming the cap. Adds a drift test that forces the re-trim loop, and narrows the oversize-survival benchmark's except to ValidationException so an unrelated harness bug escapes loudly instead of being scored as a failed batch.
Description
src/crud/representation.pycollects every observation from a deriver pass andembeds them in one call —
embedding_client.simple_batch_embed(observation_texts).If a single observation exceeded the provider's per-input token cap,
simple_batch_embedraisedValueError,save_representationre-raisedValidationException, and the deriver caught it — so none of that pass'sobservations were saved, not just the long one (#569). One long pasted artifact
(a PR review, verbose tool output, a stack trace) dropped the whole batch. Rare
under OpenAI's 8192 cap; routine under Gemini's smaller embedding cap.
simple_batch_embedhas no sub-chunking path (onlybatch_embed, used formessages, does), and truncation is the right call here: an observation is a short
derived fact, so a token-capped prefix still carries the meaningful content.
This adds an opt-in truncation mode and points the deriver at it:
simple_batch_embedgainson_oversize: "raise" | "truncate"(keyword-only,default
"raise")."raise"preserves the exact historical contract for everyexisting caller.
"truncate"embeds a token-capped prefix of the oversize inputand logs a warning; the one-input→one-vector mapping is preserved.
RepresentationManageropts intoon_oversize="truncate", so one oversizeobservation is truncated instead of failing the whole observer's batch.
Complements #860 (surface representation save failures instead of silent loss):
#569 keeps the common oversize case from ever failing; #860 makes a genuinely
total save failure loud instead of silent. Neither masks the other.
Quantified (
tests/bench/bench_embedding_oversize_survival.py— realsave_representationpath with a faked provider + faked DB write,batch_n=10,exactly one over-length observation):
main)batch_survival_rateLinked issues
Fixes #569
Type of change
fix— bug fixfeat— new featurerefactor— neither fixes a bug nor adds a featureperf— performance improvementdocs— documentation onlytest— adding or correcting testschore— build process or toolingTest plan
New / updated tests:
tests/llm/test_embedding_client.py—simple_batch_embedraises on oversizewhen
on_oversize="raise"(default) and truncates the oversize prefix (keeping1 vector per input) when
on_oversize="truncate". (18 passed)tests/crud/test_representation_manager.py::TestRepresentationManagerSave::test_save_representation_embeds_with_truncate_on_oversize— the manager passes
on_oversize="truncate"through. (7 passed; needs a localpgvector DB)
tests/conftest.py— the autouse embedding mock accepts the newon_oversizekwarg (via
**_kwargs) so existing tests keep passing.Quantitative before/after (standalone harness, not a CI test):
Checklist
tests/llm/test_embedding_client.py18/18;tests/crud/test_representation_manager.py7/7 against a local pgvector DB)ruffclean;basedpyright0 errors on every changed file (4 pre-existingconftest.pyheader warnings, present identically onmain)prior raise-on-oversize contract, so no caller changes)
Summary by CodeRabbit
New Features
Bug Fixes
Tests