fix(deriver): truncate oversize observations so one can't drop the batch (#569) #864

Open
leo922oel wants to merge 7 commits into plastic-labs:main from leo922oel:fix/embedding-truncate-oversize

leo922oel commented Jul 1, 2026

Description

src/crud/representation.py collects every observation from a deriver pass and
embeds them in one call — embedding_client.simple_batch_embed(observation_texts).
If a single observation exceeded the provider's per-input token cap,
simple_batch_embed raised ValueError, save_representation re-raised
ValidationException, and the deriver caught it — so none of that pass's
observations were saved, not just the long one (#569). One long pasted artifact
(a PR review, verbose tool output, a stack trace) dropped the whole batch. Rare
under OpenAI's 8192 cap; routine under Gemini's smaller embedding cap.
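
The failure chain described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the functions and the exception reuse the names from the description but are stand-ins, and the whitespace token count, the `derive` helper, and `MAX_TOKENS = 5` are assumptions chosen to keep the example small.

```python
MAX_TOKENS = 5  # assumed toy cap; real providers use far larger limits


class ValidationException(Exception):
    """Stand-in for the project's exception of the same name."""


def simple_batch_embed(texts: list[str]) -> list[list[float]]:
    # Toy tokenizer (assumption): one token per whitespace-separated word.
    for text in texts:
        if len(text.split()) > MAX_TOKENS:
            raise ValueError("input exceeds the per-input token cap")
    return [[float(len(t))] for t in texts]


def save_representation(observations: list[str], store: list[str]) -> None:
    try:
        simple_batch_embed(observations)  # one call for the whole batch
    except ValueError as exc:
        raise ValidationException(str(exc)) from exc
    store.extend(observations)  # only reached if every input fit


def derive(observations: list[str]) -> list[str]:
    # Hypothetical deriver step: catches the error, so nothing is saved.
    store: list[str] = []
    try:
        save_representation(observations, store)
    except ValidationException:
        pass
    return store


batch = ["user likes tea", "one two three four five six seven", "user lives in Oslo"]
saved = derive(batch)  # the single long observation empties the whole pass
```

The two short observations are valid on their own, yet neither is stored, because the embedding call is all-or-nothing for the batch.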

simple_batch_embed has no sub-chunking path (only batch_embed, used for
messages, does), and truncation is the right call here: an observation is a short
derived fact, so a token-capped prefix still carries the meaningful content.

This adds an opt-in truncation mode and points the deriver at it:

  • simple_batch_embed gains on_oversize: "raise" | "truncate" (keyword-only,
    default "raise"). "raise" preserves the exact historical contract for every
    existing caller. "truncate" embeds a token-capped prefix of the oversize input
    and logs a warning; the one-input→one-vector mapping is preserved.
  • RepresentationManager opts into on_oversize="truncate", so one oversize
    observation is truncated instead of failing the whole observer's batch.
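
A rough sketch of the two modes, under stated assumptions: the tokenizer here is a whitespace splitter, the cap is a toy value, and the "embedding" is a placeholder vector. Only the parameter name, its two values, the keyword-only default, and the one-vector-per-input guarantee come from the description above.

```python
import logging
from typing import Literal

logger = logging.getLogger(__name__)

MAX_TOKENS = 5  # assumed toy cap


def encode(text: str) -> list[str]:
    # Toy tokenizer (assumption): one token per whitespace-separated word.
    return text.split()


def decode(tokens: list[str]) -> str:
    return " ".join(tokens)


def simple_batch_embed(
    texts: list[str],
    *,
    on_oversize: Literal["raise", "truncate"] = "raise",
) -> list[list[float]]:
    prepared: list[str] = []
    for idx, text in enumerate(texts):
        tokens = encode(text)
        if len(tokens) > MAX_TOKENS:
            if on_oversize == "raise":
                raise ValueError(f"input {idx} exceeds {MAX_TOKENS} tokens")
            logger.warning(
                "truncated oversize input at idx %d: %d->%d tokens",
                idx, len(tokens), MAX_TOKENS,
            )
            text = decode(tokens[:MAX_TOKENS])
        prepared.append(text)
    # Placeholder "embedding": one vector per prepared input, order preserved.
    return [[float(len(encode(t)))] for t in prepared]


vectors = simple_batch_embed(
    ["a b", "one two three four five six seven"],
    on_oversize="truncate",
)
```

Because the keyword defaults to `"raise"`, a caller that never passes `on_oversize` still gets the `ValueError` it got before.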

Complements #860 (surface representation save failures instead of silent loss):
this fix for #569 keeps the common oversize case from ever failing; #860 makes a
genuinely total save failure loud instead of silent. Neither masks the other.

Quantified (tests/bench/bench_embedding_oversize_survival.py — real
save_representation path with a faked provider + faked DB write, batch_n=10,
exactly one over-length observation):

| Metric | Before (main) | After |
| --- | --- | --- |
| batch_survival_rate | 0% | 100% |

Linked issues

Fixes #569

Type of change

  • fix — bug fix
  • feat — new feature
  • refactor — neither fixes a bug nor adds a feature
  • perf — performance improvement
  • docs — documentation only
  • test — adding or correcting tests
  • chore — build process or tooling

Test plan

New / updated tests:

  • tests/llm/test_embedding_client.py: simple_batch_embed raises on oversize
    when on_oversize="raise" (default) and, when on_oversize="truncate", embeds a
    token-capped prefix of the oversize input (keeping 1 vector per input).
    (18 passed)
  • tests/crud/test_representation_manager.py::TestRepresentationManagerSave::test_save_representation_embeds_with_truncate_on_oversize
    — the manager passes on_oversize="truncate" through. (7 passed; needs a local
    pgvector DB)
  • tests/conftest.py — the autouse embedding mock accepts the new on_oversize
    kwarg (via **_kwargs) so existing tests keep passing.
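
Why the shared mock needed widening can be shown with two plain functions. This is only a sketch: the function names and the three-element zero vector are hypothetical, and the real fixture in tests/conftest.py may be built differently.

```python
from typing import Any


def old_mock_simple_batch_embed(texts: list[str]) -> list[list[float]]:
    # Pre-change shape of the mock: it accepts the texts and nothing else.
    return [[0.0] * 3 for _ in texts]


def new_mock_simple_batch_embed(texts: list[str], **_kwargs: Any) -> list[list[float]]:
    # Accepts and ignores extra keyword arguments such as on_oversize.
    return [[0.0] * 3 for _ in texts]


try:
    old_mock_simple_batch_embed(["obs"], on_oversize="truncate")  # type: ignore[call-arg]
    old_mock_failed = False
except TypeError:
    old_mock_failed = True

vectors = new_mock_simple_batch_embed(["obs"], on_oversize="truncate")
```

Naming the catch-all `**_kwargs` (with a leading underscore) is a common way to signal that the arguments are intentionally unused.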

Quantitative before/after (standalone harness, not a CI test):

# AFTER (this branch)
PYTHONPATH=. uv run python tests/bench/bench_embedding_oversize_survival.py
# BEFORE (main source)
git checkout main -- src/embedding_client.py src/crud/representation.py
PYTHONPATH=. uv run python tests/bench/bench_embedding_oversize_survival.py
git checkout HEAD -- src/embedding_client.py src/crud/representation.py

Checklist

  • Tests added/updated for the change
  • Existing tests pass (tests/llm/test_embedding_client.py 18/18;
    tests/crud/test_representation_manager.py 7/7 against a local pgvector DB)
  • ruff clean; basedpyright 0 errors on every changed file (4 pre-existing
    conftest.py header warnings, present identically on main)
  • Commit messages follow Conventional Commits
  • Documentation update — n/a (internal behavior; new param defaults to the
    prior raise-on-oversize contract, so no caller changes)

Summary by CodeRabbit

  • New Features

    • Added support for handling oversized observation text during embedding using automatic truncation.
  • Bug Fixes

    • Prevented embedding failures when inputs exceed the token limit by truncating oversized texts.
    • Preserved the mapping between inputs and returned embeddings to keep output counts consistent.
  • Tests

    • Added benchmarks and expanded coverage for oversize “survival” behavior and truncation edge cases.

leo.lc.chien added 6 commits June 30, 2026 19:10
simple_batch_embed raised ValueError when any input exceeded the per-input
token cap, which failed the entire deriver batch when a single observation
was over-length — losing all the other observations in that pass. Add an
on_oversize="truncate" option: oversize inputs are embedded from a
token-capped prefix (with a warning), preserving one vector per input. The
default stays "raise" so existing callers are unchanged.

Refs plastic-labs#569

… drop the batch

save_representation now calls simple_batch_embed with on_oversize="truncate",
so a single over-length observation is embedded from a capped prefix instead
of raising and failing the whole batch. Also forward on_oversize through the
public EmbeddingClient wrapper: the prior cycle only updated the inner
_EmbeddingClient, so the singleton the deriver actually uses would have raised
TypeError at runtime (caught by basedpyright; the unit tests mock the singleton).

Refs plastic-labs#569
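
The forwarding problem this commit describes can be illustrated as follows. The class names match the commit message, but the bodies are placeholders: the lazy construction of the inner client and the `last_mode` attribute are assumptions made for the sketch.

```python
from typing import Literal, Optional

OnOversize = Literal["raise", "truncate"]


class _EmbeddingClient:
    def simple_batch_embed(
        self, texts: list[str], *, on_oversize: OnOversize = "raise"
    ) -> list[list[float]]:
        # Placeholder body: records the mode so forwarding can be observed.
        self.last_mode = on_oversize
        return [[0.0] for _ in texts]


class EmbeddingClient:
    """Public wrapper that delegates to a lazily created inner client."""

    def __init__(self) -> None:
        self._inner: Optional[_EmbeddingClient] = None

    def _client(self) -> _EmbeddingClient:
        if self._inner is None:
            self._inner = _EmbeddingClient()
        return self._inner

    def simple_batch_embed(
        self, texts: list[str], *, on_oversize: OnOversize = "raise"
    ) -> list[list[float]]:
        # The wrapper must declare and forward the keyword itself; if only
        # the inner class knew about it, callers would hit TypeError here.
        return self._client().simple_batch_embed(texts, on_oversize=on_oversize)


embedding_client = EmbeddingClient()
vectors = embedding_client.simple_batch_embed(["obs"], on_oversize="truncate")
```

A test that replaces the wrapper with a mock never exercises its signature, which is consistent with the commit's note that a type checker, not the unit tests, caught the gap.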
save_representation passes on_oversize="truncate" to simple_batch_embed so one
over-length observation is embedded from a capped prefix instead of raising and
dropping the whole batch's observations. Adds a test for the opt-in and updates
the existing embed-call assertions to the new contract. (The public embedding
client wrapper that forwards the option landed in the previous commit.)

Refs plastic-labs#569

The shared embeddings autouse mock had signature simple_batch_embed(texts);
once save_representation began passing on_oversize="truncate", any test that
drives the real save path through this mock would raise TypeError. Mirror the
new interface (accept and ignore on_oversize) so the mock stays compatible.

Refs plastic-labs#569

The autouse embeddings mock now absorbs the new on_oversize kwarg via **_kwargs
(no unused-parameter warning), and the truncate test casts the Mock-recorded
provider input to list[str] so basedpyright doesn't flag a partially-unknown
argument. No behavior change; keeps the contributed test files warning-free.

A standalone, DB-free harness quantifying the fix: saves a batch where one
observation is over-length and reports batch_survival_rate via the real
save_representation path with a faked provider and faked DB write. Runs
unmodified against any checkout (before/after via main vs this branch).

Measured: batch-survival 0% -> 100%.

Refs plastic-labs#569
coderabbitai Bot commented Jul 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e1046be-2c51-4878-9374-e30b767c352b

📥 Commits

Reviewing files that changed from the base of the PR and between 2ee6513 and cd02039.

📒 Files selected for processing (3)
  • src/embedding_client.py
  • tests/bench/bench_embedding_oversize_survival.py
  • tests/llm/test_embedding_client.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/embedding_client.py
  • tests/bench/bench_embedding_oversize_survival.py

Walkthrough

Adds truncate handling to embedding batch generation, routes representation saving through that mode, and updates tests plus a benchmark to cover oversize observation inputs.

Changes

Embedding oversize truncation

| Layer / File(s) | Summary |
| --- | --- |
| Embedding client truncate path (src/embedding_client.py) | _EmbeddingClient.simple_batch_embed now accepts `on_oversize="raise" \| "truncate"` … |
| Representation save path (src/crud/representation.py, tests/crud/test_representation_manager.py) | save_representation now passes on_oversize="truncate" for observation embeddings, and the manager tests assert that kwarg in the affected calls. |
| Embedding client oversize tests (tests/llm/test_embedding_client.py) | New tests cover truncate mode, re-encoding drift during truncation, and the default oversize error path. |
| Benchmark and shared mock updates (tests/bench/bench_embedding_oversize_survival.py, tests/conftest.py) | Adds a benchmark for oversize batch survival and widens the shared embedding mock to accept extra kwargs. |

Estimated code review effort: 3 (Moderate) | ~25 minutes

Poem

A bunny saw a text too wide,
and trimmed the tail with careful pride.
One batch stayed whole, no drop, no fuss,
just carrots saved for all of us. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main fix: truncating oversize observations to prevent batch failure. |
| Linked Issues check | ✅ Passed | The PR implements truncate-on-oversize for storage embedding and updates representation.py so one long observation no longer aborts the batch. |
| Out of Scope Changes check | ✅ Passed | The added API, representation change, tests, and benchmark all relate directly to the oversize-batch fix. |

coderabbitai Bot left a comment

🧹 Nitpick comments (2)
tests/bench/bench_embedding_oversize_survival.py (1)

121-131: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

Blind except Exception swallows unrelated failures.

Catching all exceptions to report 0.0 survival conflates "batch fully failed due to oversize" with any other unrelated bug (e.g., a typo in the fake fixture) — both silently report the same score with no diagnostic trace.

♻️ Suggested: log the exception before returning
             except Exception:
+                logger.exception("Batch save failed during survival benchmark")
                 return 0.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/bench/bench_embedding_oversize_survival.py` around lines 121 - 131, The
broad except in the benchmark helper around manager.save_representation is
swallowing unrelated failures and hiding diagnostics. Narrow the handling if
possible, and in the existing exception path for the survival calculation, log
the caught exception with enough context before returning 0.0 so oversize
failures can be distinguished from fixture or test bugs.

Source: Linters/SAST tools
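
One way to act on this finding is sketched below. It combines the reviewer's suggested logging with catching only the expected exception type, which is what the follow-up commit on this PR says it did. The `save` callable, the three fake save functions, and the local `ValidationException` class are hypothetical stand-ins for the benchmark's real fixtures.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


class ValidationException(Exception):
    """Stand-in for the project's exception of the same name."""


def batch_survival_rate(
    save: Callable[[list[str]], list[str]], observations: list[str]
) -> float:
    """Return the fraction of observations that `save` persisted."""
    try:
        saved = save(observations)
    except ValidationException:
        # Expected failure mode: the whole batch was rejected.
        logger.exception("Batch save failed during survival benchmark")
        return 0.0
    # Any other exception propagates, so a harness bug is not scored as 0.0.
    return len(saved) / len(observations)


def save_all(obs: list[str]) -> list[str]:
    return list(obs)


def save_reject(obs: list[str]) -> list[str]:
    raise ValidationException("oversize observation")


def save_buggy(obs: list[str]) -> list[str]:
    raise KeyError("typo in fixture")
```

With this shape, `save_reject` scores 0.0 while `save_buggy` raises `KeyError` out of the benchmark, so the two cases can no longer be confused.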

src/embedding_client.py (1)

290-309: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Truncated token count isn't re-verified after decode.

After truncating token_ids[:self.max_embedding_tokens] and decoding back to text, the code assumes the resulting text re-encodes to exactly self.max_embedding_tokens tokens (token_counts.append(self.max_embedding_tokens)) without re-encoding to confirm. If encode(decode(token_ids[:n])) ever yields more tokens than n (edge case at BPE merge/regex-pretokenizer boundaries), the provider could still receive an over-cap input for the very fix this PR intends to guarantee — silently reintroducing the original bug in a rare case.

Since batching arithmetic (_create_batches) also relies on this assumed count for the per-request token budget, a mismatch could also skew batch sizing.

♻️ Suggested defensive re-check
                 if on_oversize == "truncate":
                     text = self.encoding.decode(token_ids[: self.max_embedding_tokens])
+                    # Re-encode to guard against any decode/encode boundary mismatch.
+                    actual_tokens = len(self.encoding.encode(text))
                     logger.warning(
                         "truncated oversize embedding input at idx %d: %d->%d tokens",
                         idx,
                         len(token_ids),
                         self.max_embedding_tokens,
                     )
                     prepared_texts.append(text)
-                    token_counts.append(self.max_embedding_tokens)
+                    token_counts.append(actual_tokens)
                     continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/embedding_client.py` around lines 290 - 309, The truncation path in the
text-preparation logic assumes the decoded text re-encodes to exactly
max_embedding_tokens, which may not always hold. In the embedding input loop in
the method that appends to prepared_texts/token_counts, re-encode the truncated
text after decoding and use that verified token count instead of blindly
appending self.max_embedding_tokens. If the re-encoded length still exceeds the
limit, trim again or fail fast so the batching logic remains accurate.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f44d40ac-b230-4987-9a2a-2e75401c00a7

📥 Commits

Reviewing files that changed from the base of the PR and between eb386c3 and 2ee6513.

📒 Files selected for processing (6)
  • src/crud/representation.py
  • src/embedding_client.py
  • tests/bench/bench_embedding_oversize_survival.py
  • tests/conftest.py
  • tests/crud/test_representation_manager.py
  • tests/llm/test_embedding_client.py

…ches provider

decode(ids[:n]) can re-encode to more than n tokens at BPE boundaries,
so the truncate path now re-encodes and trims again until it fits, and
reports the verified token count (which _create_batches budgets on)
instead of assuming the cap. Adds a drift test that forces the re-trim
loop, and narrows the oversize-survival benchmark's except to
ValidationException so an unrelated harness bug escapes loudly instead
of being scored as a failed batch.
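
The re-trim idea in this commit can be sketched with a toy tokenizer. Everything here is an assumption made for illustration: the byte-level `encode`/`decode` pair stands in for the real encoding, and `truncate_to_cap` is a hypothetical helper, not the function in src/embedding_client.py. The toy still reproduces the drift: cutting inside a multi-byte character decodes to U+FFFD, which re-encodes to three bytes.

```python
def encode(text: str) -> list[int]:
    # Toy byte-level tokenizer (assumption): one token per UTF-8 byte.
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    # An incomplete trailing character becomes U+FFFD (three bytes long),
    # so a decoded prefix can re-encode to more tokens than were kept.
    return bytes(ids).decode("utf-8", errors="replace")


def truncate_to_cap(text: str, cap: int) -> tuple[str, int]:
    """Return a prefix of `text` and its verified token count (<= cap)."""
    ids = encode(text)
    if len(ids) <= cap:
        return text, len(ids)
    keep = cap
    while keep > 0:
        candidate = decode(ids[:keep])
        verified = len(encode(candidate))
        if verified <= cap:
            return candidate, verified
        keep -= 1  # drift detected: drop one more token and retry
    return "", 0


# Two copies of U+00E9, two bytes each; a cap of 3 cuts the second one in half.
prefix, count = truncate_to_cap("\u00e9\u00e9", 3)
```

Returning the verified count rather than the cap matters for any caller that budgets a request on the reported token totals, as the commit notes for _create_batches.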

Development

Successfully merging this pull request may close these issues.

Deriver loses an entire batch of observations when one exceeds the embedding token limit

1 participant