fix(deriver): surface representation save failures instead of silent loss (#728) by leo922oel · Pull Request #860 · plastic-labs/honcho

leo922oel · 2026-06-30T11:03:45Z

Description

When every observer's representation save failed (e.g. embedding 429 RESOURCE_EXHAUSTED past the in-batch retries), process_representation_tasks_batch
(src/deriver/deriver.py) caught the exception, logged it, and returned normally —
so the queue marked the work unit processed with zero documents saved:
silent, permanent memory loss with no error state or alertable signal. (The errored
path mark_queue_item_as_errored existed but was never reached.)

This makes a save failure either loud or recorded — never a silent success:

Total failure → errored. Collect per-observer errors; raise when no observer
succeeded, so the queue marks the unit errored.
Partial failure → processed + recorded. Observers are independent
(observer, observed) collections, so the saved ones are kept (re-running would
duplicate; the queue has no requeue). The failure is recorded via a new additive
failed_observer_count on RepresentationCompletedEvent (schema unchanged).
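
The two rules above can be sketched as follows. This is a minimal illustration, not the code in src/deriver/deriver.py: `CompletedEvent`, `save_for_observers`, and the `save`/`emit` parameters are hypothetical stand-ins for the real event class, batch function, and telemetry sink.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class CompletedEvent:
    """Hypothetical stand-in for RepresentationCompletedEvent."""

    observer_count: int
    failed_observer_count: int = 0


async def save_for_observers(
    observers: list[str],
    save: Callable[[str], Awaitable[None]],
    emit: Callable[[CompletedEvent], None],
) -> None:
    """Save once per observer; raise only when no observer succeeded."""
    successful = 0
    save_errors: list[str] = []
    for observer in observers:
        try:
            await save(observer)
            successful += 1
        except Exception as exc:
            # Observers are independent, so one failure must not stop the rest.
            save_errors.append(f"{observer}: {exc.__class__.__name__}: {exc}")

    # Emit before raising so a total failure still reaches telemetry.
    emit(CompletedEvent(successful, len(save_errors)))

    if save_errors and successful == 0:
        raise RuntimeError(
            f"save_representation failed for all {len(save_errors)} observer(s): "
            + "; ".join(save_errors)
        )
```

With one failing observer out of two, the call returns normally and the emitted event carries `failed_observer_count == 1`; with every observer failing, the same event is emitted and then `RuntimeError` propagates to the caller.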

Not a retry fix: the embedding client already retries transient errors; the bug is
the swallow after they're exhausted.

Quantified (tests/bench/bench_deriver_silent_loss.py — real deriver path with
a faked failing save, trials=20):

| Metric | Before (`main`) | After |
| --- | --- | --- |
| `silent_loss_rate` | 100% | 0% |
| `failure_visibility_rate` | 0% | 100% |
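
One plausible way to compute the two rates is sketched below. It assumes a trial counts as visible when the call raised or the captured telemetry reported a non-zero `failed_observer_count`, and that `silent_loss_rate` is the share of trials that were not visible; `TrialResult` and `compute_rates` are hypothetical names, and the harness itself may define the metrics differently.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    """Outcome of one trial in which every save was forced to fail."""

    raised: bool                 # did the batch call raise?
    failed_observer_count: int   # value seen in captured telemetry (0 if none)


def compute_rates(trials: list[TrialResult]) -> dict[str, float]:
    """Return both rates as fractions in [0, 1]."""
    if not trials:
        raise ValueError("at least one trial is required")
    visible = sum(1 for t in trials if t.raised or t.failed_observer_count > 0)
    total = len(trials)
    return {
        "failure_visibility_rate": visible / total,
        "silent_loss_rate": (total - visible) / total,
    }
```

Under these assumptions, 20 trials that all return quietly give a silent-loss rate of 1.0, and 20 trials that all raise give 0.0.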

#728 reported docs dropping ~300 → ~121 (~60%) in a 429 burst. Supersedes #660
(raised on any failure, no tests); coordinates with #839 (@akattelu).

Linked issues

Fixes #728

Type of change

fix — bug fix
feat — new feature
refactor — neither fixes a bug nor adds a feature
perf — performance improvement
docs — documentation only
test — adding or correcting tests
chore — build process or tooling

Test plan

New / updated tests (RED→GREEN):

tests/deriver/test_deriver_processing.py
- test_all_observer_saves_failing_surfaces_failure — every observer save
  raises ⇒ process_representation_tasks_batch raises (queue marks errored)
  instead of returning silently.
- test_partial_observer_failure_is_processed_and_surfaced — one observer
  succeeds, one fails ⇒ no raise (unit stays processed), and the emitted
  RepresentationCompletedEvent reports observer_count == 1 and
  failed_observer_count == 1.
tests/telemetry/test_representation_v2_fields.py — asserts the new
failed_observer_count defaults to 0 (additive; schema stays at v2).

Quantitative before/after (a standalone harness, not a CI test):

# AFTER (this branch)
PYTHONPATH=. uv run python tests/bench/bench_deriver_silent_loss.py
# BEFORE (main source)
git checkout main -- src/deriver/deriver.py src/telemetry/events/representation.py
PYTHONPATH=. uv run python tests/bench/bench_deriver_silent_loss.py
git checkout HEAD -- src/deriver/deriver.py src/telemetry/events/representation.py

Checklist

Tests added/updated for the change
Existing tests pass (tests/deriver/ + tests/telemetry/ green; 4 unrelated
failures in test_auth_route_policy.py / test_document.py dedup require a
local embedding API key and fail identically on main)
ruff and basedpyright clean on every changed file
Commit messages follow Conventional Commits
Documentation update — n/a (no user-facing/API change; failed_observer_count
is internal telemetry, additive, schema unchanged)

Summary by CodeRabbit

Bug Fixes
- Fixed a scenario where representation processing could complete successfully even when all observer saves failed.
- Failure details are now properly surfaced, and partial observer failures are tracked instead of being hidden.
New Features
- Added a new completion telemetry field for representation processing that reports how many observer save operations failed.
Tests
- Added coverage for both all-failure and partial-failure cases, plus a benchmark to measure the “silent loss” behavior.

When every observer's save_representation failed (e.g. embedding retries exhausted under a sustained 429), the deriver logged the error and returned normally, so the queue marked the work unit processed with zero documents saved — silent memory loss. Accumulate per-observer save errors and, after telemetry is emitted, raise when no observer succeeded so the queue marks the unit errored and visible. Refs plastic-labs#728

…s are visible

A partially-failed representation batch (some observers saved, one failed) stays processed — the saved observers' work must not be discarded by raising. Previously the failed observer was only logged. Add an additive failed_observer_count field to RepresentationCompletedEvent (schema unchanged, per the additive-field convention) and populate it from the deriver's per-observer save_errors, so partial (and total) save failures are discoverable and alertable in telemetry instead of silent. Refs plastic-labs#728 plastic-labs#839

A standalone, DB-free harness quantifying the fix: drives the real deriver representation path with a faked save under sustained failure and reports silent_loss_rate and failure_visibility_rate. Runs unmodified against any checkout (before/after via main vs this branch). Measured: silent-loss 100% -> 0%, visibility 0% -> 100%. Refs plastic-labs#728

The lambda emit-capture (`lambda e: emitted.append(e)`) left e/_sink untyped, which basedpyright reports as reportUnknownLambdaType — and the pre-commit / staticanalysis gate fails on warnings. Pass the bound `emitted.append` directly: no lambda, fully typed, and it also avoids the loop-variable closure concern.
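
The difference can be illustrated with a small sketch. `Event` and `run_with_sink` are hypothetical names used only here; the point is that a bound method such as `emitted.append` already has a known signature, so nothing is left for the type checker to infer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    """Hypothetical telemetry event used only for this sketch."""

    name: str


def run_with_sink(sink: Callable[[Event], None]) -> None:
    """Stand-in for code under test that emits one event to the sink."""
    sink(Event("representation.completed"))


emitted: list[Event] = []

# Bound method: typed as Callable[[Event], None] via list[Event].append,
# with no lambda parameter left untyped.
run_with_sink(emitted.append)
```

Because the bound method is tied to one specific list object when it is created, it also cannot pick up a later rebinding of a loop variable the way a closure can.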

coderabbitai · 2026-06-30T11:04:04Z

Walkthrough

Adds per-observer save-error tracking to process_representation_tasks_batch, emits failed_observer_count in telemetry, raises on total observer save failure, and adds tests plus a benchmark for partial and total failure behavior.

Changes

Fail-loud observer save failure

| Layer / File(s) | Summary |
| --- | --- |
| Telemetry field: `failed_observer_count` (`src/telemetry/events/representation.py`, `tests/telemetry/test_representation_v2_fields.py`) | Adds `failed_observer_count: int` (default `0`) to `RepresentationCompletedEvent` and asserts the default in the existing optional-fields test. |
| Deriver save-error tracking and fail-loud logic (`src/deriver/deriver.py`) | Introduces `save_errors` list, appends structured error entries on per-observer exceptions, passes `failed_observer_count` to telemetry, and raises `RuntimeError` after telemetry when `successful_observer_count == 0`. |
| Unit tests for total and partial observer failure (`tests/deriver/test_deriver_processing.py`) | Adds two async tests: total save failure asserts `RuntimeError` propagation; partial failure asserts no raise and correct `observer_count`/`failed_observer_count` in emitted telemetry. |
| Benchmark: silent-loss and failure-visibility rates (`tests/bench/bench_deriver_silent_loss.py`) | Measures `silent_loss_rate` and `failure_visibility_rate` across trials by forcing `save_representation` to raise and detecting exceptions or non-zero `failed_observer_count` in captured telemetry. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

VVoruganti
akattelu

Poem

🐇 I hopped through the saves in the moonlit air,
One fail, two fails — but we now beware.
If all go wrong, the queue hears the sound,
No silent loss can sink underground.
Telemetry twinkles, the errors are seen,
And bunny memory stays nice and clean. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.33%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main fix: making representation save failures surface instead of being silently lost. |
| Linked Issues check | ✅ Passed | The code now raises on total observer-save failure and records partial failures in telemetry, matching `#728`'s expected behavior. |
| Out of Scope Changes check | ✅ Passed | The benchmark and tests are directly tied to validating the silent-loss fix and do not appear unrelated to the PR scope. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

tests/telemetry/test_representation_v2_fields.py (1)

56-56: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Cover the serialized form of this new field too.

This assertion proves the constructor default, but the existing model_dump coverage in this file still will not catch failed_observer_count disappearing from emitted telemetry. Please add it to that field list as well.
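
A minimal sketch of the kind of check being requested follows. It uses a plain dataclass and `dataclasses.asdict` as a stand-in for the real event and its `model_dump`, so the class name and fields other than `failed_observer_count` are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class CompletedEventSketch:
    """Hypothetical stand-in for RepresentationCompletedEvent."""

    observer_count: int = 0
    failed_observer_count: int = 0  # additive field with a default


def test_failed_observer_count_is_serialized() -> None:
    """Assert the field survives serialization, not just construction."""
    dumped = asdict(CompletedEventSketch(observer_count=2))
    assert "failed_observer_count" in dumped
    assert dumped["failed_observer_count"] == 0
```

Checking the dumped mapping rather than the attribute is what would catch the field being dropped from the emitted payload.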
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/telemetry/test_representation_v2_fields.py` at line 56, The new
failed_observer_count field is only covered by the constructor default
assertion, so telemetry serialization could regress unnoticed. Update the
existing model_dump coverage in the telemetry representation v2 fields test to
include failed_observer_count in the asserted field list, using the same test
case around the representation/event object so the serialized output is
validated too.

tests/deriver/test_deriver_processing.py (1)

74-121: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the telemetry side of the total-failure contract here too.

The production path now emits RepresentationCompletedEvent before it raises. This test would still pass if someone accidentally moved emit(...) after the exception, which would reintroduce invisible total-save failures for telemetry consumers. Patch src.deriver.deriver.emit here and assert observer_count == 0 plus failed_observer_count == 1.

Example update

         failing_save = AsyncMock(side_effect=RuntimeError("429 RESOURCE_EXHAUSTED"))
+        emitted: list[Any] = []
         with (
             patch(
                 "src.deriver.deriver.honcho_llm_call",
                 new_callable=AsyncMock,
                 return_value=mock_response,
             ),
             patch.object(RepresentationManager, "save_representation", failing_save),
+            patch("src.deriver.deriver.emit", side_effect=emitted.append),
             pytest.raises(RuntimeError, match="save_representation failed"),
         ):
             await process_representation_tasks_batch(
                 messages=[message],
                 message_level_configuration=configuration,
@@
                 observed="alice",
                 queue_item_message_ids=[1],
             )
+
+        assert emitted, "expected telemetry before the raised failure"
+        assert emitted[-1].observer_count == 0
+        assert emitted[-1].failed_observer_count == 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/deriver/test_deriver_processing.py` around lines 74 - 121, This test
only verifies the exception path and should also assert telemetry for the
total-failure case. Patch src.deriver.deriver.emit alongside honcho_llm_call and
RepresentationManager.save_representation in
test_all_observer_saves_failing_surfaces_failure, then assert the emitted
RepresentationCompletedEvent reports observer_count as 0 and
failed_observer_count as 1 before the RuntimeError is raised. Keep the existing
failure setup, but add the emit assertion so the contract stays covered if emit
is moved after the exception.

src/deriver/deriver.py (1)

193-224: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Preserve the real save exception on the fail-loud path.

save_errors now keeps only formatted strings, so the all-failed branch throws a fresh RuntimeError with no original type or traceback attached. That makes root-cause debugging and any exception-based handling much weaker than the underlying save_representation() failure. Keep the exception objects, then raise a custom exception from src/exceptions.py with from ... so the queue still sees the batch as failed without losing the cause.

Proposed direction

-    save_errors: list[str] = []
+    save_errors: list[tuple[str, Exception]] = []
...
-            except Exception as e:
+            except Exception as e:
                 logger.error(
                     "Failed to save representation for observer %s: %s", observer, e
                 )
-                save_errors.append(f"{observer}: {e.__class__.__name__}: {e}")
+                save_errors.append((observer, e))
...
     if save_errors and successful_observer_count == 0:
-        raise RuntimeError(
-            f"save_representation failed for all {len(save_errors)} observer(s): "
-            + "; ".join(save_errors)
-        )
+        details = "; ".join(
+            f"{observer}: {exc.__class__.__name__}: {exc}"
+            for observer, exc in save_errors
+        )
+        raise exceptions.RepresentationSaveFailedError(
+            f"save_representation failed for all {len(save_errors)} observer(s): "
+            + details
+        ) from save_errors[0][1]

As per coding guidelines, "Use explicit error handling with appropriate exception types from src/exceptions.py" and "Define custom exception types in src/exceptions.py and use them throughout the codebase."

Also applies to: 318-329
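
For readers unfamiliar with the pattern, the sketch below shows what exception chaining preserves. `RepresentationSaveFailedError` is the name used in the proposal above and is defined locally here as an assumption; whether such a class exists in src/exceptions.py is not confirmed by this thread, and `raise_if_all_failed` is a hypothetical helper.

```python
class RepresentationSaveFailedError(Exception):
    """Assumed custom exception; defined here only for the sketch."""


def raise_if_all_failed(
    save_errors: list[tuple[str, Exception]],
    successful_observer_count: int,
) -> None:
    """Raise a chained error when every observer save failed."""
    if save_errors and successful_observer_count == 0:
        details = "; ".join(
            f"{observer}: {exc.__class__.__name__}: {exc}"
            for observer, exc in save_errors
        )
        # "from" stores the first underlying error on __cause__, so its
        # type and traceback remain available to handlers and logs.
        raise RepresentationSaveFailedError(
            f"save_representation failed for all {len(save_errors)} "
            f"observer(s): {details}"
        ) from save_errors[0][1]
```

A handler can then inspect `err.__cause__` to recover the original exception object instead of parsing it back out of the message.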

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/deriver/deriver.py` around lines 193 - 224, The fail-loud path in
deriver.save_representation handling is losing the original save exception
because save_errors stores only strings and the later RuntimeError has no cause.
Update the save loop around RepresentationManager.save_representation to retain
the actual exception objects (not just formatted text), then in the all-failed
branch raise an appropriate custom exception from src/exceptions.py using
exception chaining with from so the original traceback and type are preserved;
also apply the same pattern to the related batch-failure branch mentioned in the
comment.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/bench/bench_deriver_silent_loss.py`:
- Around line 87-96: The try/except around the process_representation_tasks_batch
call in the benchmark is too broad: it counts any Exception as a successful
visibility signal. Tighten the failure check around the
process_representation_tasks_batch call in the benchmark test so only the
expected fail-loud save-failure path sets raised=True, and let unexpected
exceptions propagate instead of being counted as success. Use the existing
raised flag logic in the bench_deriver_silent_loss benchmark to distinguish the
intended failure case from unrelated regressions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: af574ed4-fea4-4966-8c6f-f6b25158db80

📥 Commits

Reviewing files that changed from the base of the PR and between eb386c3 and 7ce4914.

📒 Files selected for processing (5)

src/deriver/deriver.py
src/telemetry/events/representation.py
tests/bench/bench_deriver_silent_loss.py
tests/deriver/test_deriver_processing.py
tests/telemetry/test_representation_v2_fields.py

Catching every exception counted any error as a visibility signal, so an unrelated regression could inflate failure_visibility_rate instead of failing the run. Catch only the expected observer save-failure and re-raise anything else.

coderabbitai

🧹 Nitpick comments (1)

tests/bench/bench_deriver_silent_loss.py (1)
95-100: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Substring-matching on the exception message is a bit brittle.

Re-raise gating depends on the literal strings "observer"/"RESOURCE_EXHAUSTED" appearing in str(exc). If the message format in process_representation_tasks_batch's RuntimeError (src/deriver/deriver.py) ever changes wording, this benchmark would silently start re-raising the expected failure (or, in the opposite direction, swallowing an unrelated regression that happens to mention those words). Since this is benchmark-only tooling, this is a low-priority nitpick rather than a blocker.
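
One way to avoid message matching is to gate on an exception type instead. The sketch below assumes a dedicated subclass, here called `ObserverSaveFailure`, which is a hypothetical name and not something this PR is known to define; `trial_raised_expected_failure` is likewise illustrative.

```python
from typing import Callable


class ObserverSaveFailure(RuntimeError):
    """Hypothetical type raised only for the expected total-save failure."""


def trial_raised_expected_failure(call: Callable[[], None]) -> bool:
    """Return True only when the expected failure type was raised.

    Any other exception, including a plain RuntimeError, propagates so an
    unrelated regression fails the run instead of counting as visibility.
    """
    try:
        call()
    except ObserverSaveFailure:
        return True
    return False
```

Matching on the class keeps the benchmark stable if the error message wording changes, at the cost of needing the production code to raise that specific type.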
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/bench/bench_deriver_silent_loss.py` around lines 95 - 100, Substring
checks on RuntimeError text in the benchmark are brittle and should be replaced
with a more stable signal. Update the exception gating in
bench_deriver_silent_loss.py near the RuntimeError handler to key off a
structured indicator from process_representation_tasks_batch in
src/deriver/deriver.py, or otherwise narrow the match to a stable error
attribute/type instead of relying on literal message fragments like observer and
RESOURCE_EXHAUSTED. Ensure the benchmark still counts only the expected save
failure while letting unrelated RuntimeError cases fail normally.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6da81a45-c08e-4e84-af1d-b512f838ae19

📥 Commits

Reviewing files that changed from the base of the PR and between 7ce4914 and 1fb31ce.

📒 Files selected for processing (1)

tests/bench/bench_deriver_silent_loss.py

leo.lc.chien added 4 commits June 30, 2026 18:34

coderabbitai Bot reviewed Jun 30, 2026

Comment thread tests/bench/bench_deriver_silent_loss.py

coderabbitai Bot reviewed Jul 1, 2026

leo922oel mentioned this pull request Jul 1, 2026

fix(deriver): truncate oversize observations so one can't drop the batch (#569) #864

Open

12 tasks

leo922oel wants to merge 5 commits into plastic-labs:main from leo922oel:fix/deriver-surface-save-failures
