fix(deriver): surface representation save failures instead of silent loss (#728) #860

Open
leo922oel wants to merge 5 commits into plastic-labs:main from leo922oel:fix/deriver-surface-save-failures

leo922oel commented Jun 30, 2026

Description

When every observer's representation save failed (e.g. embedding 429 RESOURCE_EXHAUSTED past the in-batch retries), process_representation_tasks_batch
(src/deriver/deriver.py) caught the exception, logged it, and returned normally —
so the queue marked the work unit processed with zero documents saved:
silent, permanent memory loss with no error state or alertable signal. (The errored
path mark_queue_item_as_errored existed but was never reached.)

This makes a save failure either loud or recorded — never a silent success:

  • Total failure → errored. Collect per-observer errors; raise when no observer
    succeeded, so the queue marks the unit errored.
  • Partial failure → processed + recorded. Observers are independent
    (observer, observed) collections, so the saved ones are kept (re-running would
    duplicate; the queue has no requeue). The failure is recorded via a new additive
    failed_observer_count on RepresentationCompletedEvent (schema unchanged).

Not a retry fix: the embedding client already retries transient errors; the bug is
the swallow after they're exhausted.
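
The two rules above (total failure raises, partial failure is recorded) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in src/deriver/deriver.py: `save_all`, `fake_save`, and the plain dict standing in for RepresentationCompletedEvent are hypothetical, and only the names `save_errors`, `successful_observer_count`, `observer_count`, and `failed_observer_count` come from this PR.

```python
import asyncio
from typing import Awaitable, Callable


async def save_all(
    observers: list[str],
    save: Callable[[str], Awaitable[None]],
    emit: Callable[[dict], None],
) -> None:
    """Save once per observer; never let a failure pass as a silent success."""
    save_errors: list[str] = []
    successful_observer_count = 0

    for observer in observers:
        try:
            await save(observer)
            successful_observer_count += 1
        except Exception as e:  # keep going: observers are independent
            save_errors.append(f"{observer}: {e.__class__.__name__}: {e}")

    # Emit telemetry first so the failure is recorded even if we raise below.
    emit(
        {
            "observer_count": successful_observer_count,
            "failed_observer_count": len(save_errors),
        }
    )

    # Total failure: raise so the caller can mark the work unit as errored.
    if save_errors and successful_observer_count == 0:
        raise RuntimeError(
            f"save_representation failed for all {len(save_errors)} observer(s): "
            + "; ".join(save_errors)
        )


async def fake_save(observer: str) -> None:
    # Hypothetical stand-in: only "bob" hits the rate limit.
    if observer == "bob":
        raise RuntimeError("429 RESOURCE_EXHAUSTED")


events: list[dict] = []
# Partial failure: "alice" is saved, "bob" fails, and nothing is raised.
asyncio.run(save_all(["alice", "bob"], fake_save, events.append))
```

Calling `save_all(["bob"], ...)` instead exercises the total-failure branch: the event is still emitted, then the RuntimeError propagates.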

Quantified (tests/bench/bench_deriver_silent_loss.py — real deriver path with
a faked failing save, trials=20):

| Metric | Before (main) | After |
| --- | --- | --- |
| silent_loss_rate | 100% | 0% |
| failure_visibility_rate | 0% | 100% |
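
One plausible reading of these two metrics, based on how the benchmark is described in this PR (a trial is visible when the call raises or telemetry reports a non-zero failed_observer_count). The `Trial` record and `rates` helper are hypothetical, and the exact definitions in tests/bench/bench_deriver_silent_loss.py may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Outcome of one forced-failure run (hypothetical record shape)."""

    raised: bool  # the call raised instead of returning normally
    failed_observer_count: int  # value seen in captured telemetry, 0 if none
    documents_saved: int  # documents actually persisted


def rates(trials: list[Trial]) -> dict[str, float]:
    """Fraction of trials that lost data silently vs. surfaced the failure."""
    visible = [t for t in trials if t.raised or t.failed_observer_count > 0]
    silent = [
        t
        for t in trials
        if t.documents_saved == 0 and not t.raised and t.failed_observer_count == 0
    ]
    n = len(trials)
    return {
        "silent_loss_rate": len(silent) / n,
        "failure_visibility_rate": len(visible) / n,
    }


# Illustrative inputs shaped like the behavior reported in the table,
# not output from the real harness.
before = rates([Trial(False, 0, 0) for _ in range(20)])
after = rates([Trial(True, 1, 0) for _ in range(20)])
```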

#728 reported docs dropping ~300 → ~121 (~60%) in a 429 burst. Supersedes #660
(raised on any failure, no tests); coordinates with #839 (@akattelu).

Linked issues

Fixes #728

Type of change

  • fix — bug fix
  • feat — new feature
  • refactor — neither fixes a bug nor adds a feature
  • perf — performance improvement
  • docs — documentation only
  • test — adding or correcting tests
  • chore — build process or tooling

Test plan

New / updated tests (RED→GREEN):

  • tests/deriver/test_deriver_processing.py
    • test_all_observer_saves_failing_surfaces_failure — every observer save
      raises ⇒ process_representation_tasks_batch raises (queue marks errored)
      instead of returning silently.
    • test_partial_observer_failure_is_processed_and_surfaced — one observer
      succeeds, one fails ⇒ no raise (unit stays processed), and the emitted
      RepresentationCompletedEvent reports observer_count == 1 and
      failed_observer_count == 1.
  • tests/telemetry/test_representation_v2_fields.py — asserts the new
    failed_observer_count defaults to 0 (additive; schema stays at v2).

Quantitative before/after (a standalone harness, not a CI test):

# AFTER (this branch)
PYTHONPATH=. uv run python tests/bench/bench_deriver_silent_loss.py
# BEFORE (main source)
git checkout main -- src/deriver/deriver.py src/telemetry/events/representation.py
PYTHONPATH=. uv run python tests/bench/bench_deriver_silent_loss.py
git checkout HEAD -- src/deriver/deriver.py src/telemetry/events/representation.py

Checklist

  • Tests added/updated for the change
  • Existing tests pass (tests/deriver/ + tests/telemetry/ green; 4 unrelated
    failures in test_auth_route_policy.py / test_document.py dedup require a
    local embedding API key and fail identically on main)
  • ruff and basedpyright clean on every changed file
  • Commit messages follow Conventional Commits
  • Documentation update — n/a (no user-facing/API change; failed_observer_count
    is internal telemetry, additive, schema unchanged)

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a scenario where representation processing could complete successfully even when all observer saves failed.
    • Failure details are now properly surfaced, and partial observer failures are tracked instead of being hidden.
  • New Features

    • Added a new completion telemetry field for representation processing that reports how many observer save operations failed.
  • Tests

    • Added coverage for both all-failure and partial-failure cases, plus a benchmark to measure the “silent loss” behavior.

leo.lc.chien added 4 commits June 30, 2026 18:34
When every observer's save_representation failed (e.g. embedding retries
exhausted under a sustained 429), the deriver logged the error and returned
normally, so the queue marked the work unit processed with zero documents
saved — silent memory loss. Accumulate per-observer save errors and, after
telemetry is emitted, raise when no observer succeeded so the queue marks
the unit errored and visible.

Refs plastic-labs#728
…s are visible

A partially-failed representation batch (some observers saved, one failed)
stays processed — the saved observers' work must not be discarded by raising.
Previously the failed observer was only logged. Add an additive
failed_observer_count field to RepresentationCompletedEvent (schema unchanged,
per the additive-field convention) and populate it from the deriver's
per-observer save_errors, so partial (and total) save failures are
discoverable and alertable in telemetry instead of silent.

Refs plastic-labs#728 plastic-labs#839

A standalone, DB-free harness quantifying the fix: drives the real deriver
representation path with a faked save under sustained failure and reports
silent_loss_rate and failure_visibility_rate. Runs unmodified against any
checkout (before/after via main vs this branch).

Measured: silent-loss 100% -> 0%, visibility 0% -> 100%.

Refs plastic-labs#728

The lambda emit-capture (`lambda e: emitted.append(e)`) left e/_sink untyped,
which basedpyright reports as reportUnknownLambdaType — and the pre-commit /
staticanalysis gate fails on warnings. Pass the bound `emitted.append` directly:
no lambda, fully typed, and it also avoids the loop-variable closure concern.

coderabbitai Bot commented Jun 30, 2026

Contributor

Walkthrough

Adds per-observer save-error tracking to process_representation_tasks_batch, emits failed_observer_count in telemetry, raises on total observer save failure, and adds tests plus a benchmark for partial and total failure behavior.

Fail-loud observer save failure

| Layer | File(s) | Summary |
| --- | --- | --- |
| Telemetry field: failed_observer_count | src/telemetry/events/representation.py, tests/telemetry/test_representation_v2_fields.py | Adds failed_observer_count: int (default 0) to RepresentationCompletedEvent and asserts the default in the existing optional-fields test. |
| Deriver save-error tracking and fail-loud logic | src/deriver/deriver.py | Introduces save_errors list, appends structured error entries on per-observer exceptions, passes failed_observer_count to telemetry, and raises RuntimeError after telemetry when successful_observer_count == 0. |
| Unit tests for total and partial observer failure | tests/deriver/test_deriver_processing.py | Adds two async tests: total save failure asserts RuntimeError propagation; partial failure asserts no raise and correct observer_count/failed_observer_count in emitted telemetry. |
| Benchmark: silent-loss and failure-visibility rates | tests/bench/bench_deriver_silent_loss.py | Measures silent_loss_rate and failure_visibility_rate across trials by forcing save_representation to raise and detecting exceptions or non-zero failed_observer_count in captured telemetry. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • VVoruganti
  • akattelu

Poem

🐇 I hopped through the saves in the moonlit air,
One fail, two fails — but we now beware.
If all go wrong, the queue hears the sound,
No silent loss can sink underground.
Telemetry twinkles, the errors are seen,
And bunny memory stays nice and clean. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main fix: making representation save failures surface instead of being silently lost. |
| Linked Issues check | ✅ Passed | The code now raises on total observer-save failure and records partial failures in telemetry, matching #728's expected behavior. |
| Out of Scope Changes check | ✅ Passed | The benchmark and tests are directly tied to validating the silent-loss fix and do not appear unrelated to the PR scope. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/telemetry/test_representation_v2_fields.py (1)

56-56: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Cover the serialized form of this new field too.

This assertion proves the constructor default, but the existing model_dump coverage in this file still will not catch failed_observer_count disappearing from emitted telemetry. Please add it to that field list as well.
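
A sketch of what covering the serialized form could look like. To stay dependency-free it uses a dataclass and `dataclasses.asdict` as a stand-in for the real event model and its model_dump; the class below is hypothetical and carries only two of the event's fields.

```python
from dataclasses import asdict, dataclass


@dataclass
class RepresentationCompletedEventSketch:
    """Hypothetical stand-in for the real telemetry event."""

    observer_count: int
    failed_observer_count: int = 0  # additive: older call sites need no change


event = RepresentationCompletedEventSketch(observer_count=2)

# Constructor default, which the existing assertion already checks.
assert event.failed_observer_count == 0

# Serialized form: the key must also be present in the emitted payload.
payload = asdict(event)
assert "failed_observer_count" in payload
```

Checking the dumped payload catches a regression where the field still exists on the object but is excluded from what is actually emitted.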

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/telemetry/test_representation_v2_fields.py` at line 56, The new
failed_observer_count field is only covered by the constructor default
assertion, so telemetry serialization could regress unnoticed. Update the
existing model_dump coverage in the telemetry representation v2 fields test to
include failed_observer_count in the asserted field list, using the same test
case around the representation/event object so the serialized output is
validated too.
tests/deriver/test_deriver_processing.py (1)

74-121: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the telemetry side of the total-failure contract here too.

The production path now emits RepresentationCompletedEvent before it raises. This test would still pass if someone accidentally moved emit(...) after the exception, which would reintroduce invisible total-save failures for telemetry consumers. Patch src.deriver.deriver.emit here and assert observer_count == 0 plus failed_observer_count == 1.

Example update
         failing_save = AsyncMock(side_effect=RuntimeError("429 RESOURCE_EXHAUSTED"))
+        emitted: list[Any] = []
         with (
             patch(
                 "src.deriver.deriver.honcho_llm_call",
                 new_callable=AsyncMock,
                 return_value=mock_response,
             ),
             patch.object(RepresentationManager, "save_representation", failing_save),
+            patch("src.deriver.deriver.emit", side_effect=emitted.append),
             pytest.raises(RuntimeError, match="save_representation failed"),
         ):
             await process_representation_tasks_batch(
                 messages=[message],
                 message_level_configuration=configuration,
@@
                 observed="alice",
                 queue_item_message_ids=[1],
             )
+
+        assert emitted, "expected telemetry before the raised failure"
+        assert emitted[-1].observer_count == 0
+        assert emitted[-1].failed_observer_count == 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/deriver/test_deriver_processing.py` around lines 74 - 121, This test
only verifies the exception path and should also assert telemetry for the
total-failure case. Patch src.deriver.deriver.emit alongside honcho_llm_call and
RepresentationManager.save_representation in
test_all_observer_saves_failing_surfaces_failure, then assert the emitted
RepresentationCompletedEvent reports observer_count as 0 and
failed_observer_count as 1 before the RuntimeError is raised. Keep the existing
failure setup, but add the emit assertion so the contract stays covered if emit
is moved after the exception.
src/deriver/deriver.py (1)

193-224: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Preserve the real save exception on the fail-loud path.

save_errors now keeps only formatted strings, so the all-failed branch throws a fresh RuntimeError with no original type or traceback attached. That makes root-cause debugging and any exception-based handling much weaker than the underlying save_representation() failure. Keep the exception objects, then raise a custom exception from src/exceptions.py with from ... so the queue still sees the batch as failed without losing the cause.

Proposed direction
-    save_errors: list[str] = []
+    save_errors: list[tuple[str, Exception]] = []
...
-            except Exception as e:
+            except Exception as e:
                 logger.error(
                     "Failed to save representation for observer %s: %s", observer, e
                 )
-                save_errors.append(f"{observer}: {e.__class__.__name__}: {e}")
+                save_errors.append((observer, e))
...
     if save_errors and successful_observer_count == 0:
-        raise RuntimeError(
-            f"save_representation failed for all {len(save_errors)} observer(s): "
-            + "; ".join(save_errors)
-        )
+        details = "; ".join(
+            f"{observer}: {exc.__class__.__name__}: {exc}"
+            for observer, exc in save_errors
+        )
+        raise exceptions.RepresentationSaveFailedError(
+            f"save_representation failed for all {len(save_errors)} observer(s): "
+            + details
+        ) from save_errors[0][1]

As per coding guidelines, "Use explicit error handling with appropriate exception types from src/exceptions.py" and "Define custom exception types in src/exceptions.py and use them throughout the codebase."

Also applies to: 318-329
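
The effect of `raise ... from` can be seen in isolation. This sketch defines RepresentationSaveFailedError locally; whether an exception of that name exists in, or would be added to, src/exceptions.py is an assumption taken from the proposal above, and `raise_all_failed` is a hypothetical helper.

```python
class RepresentationSaveFailedError(Exception):
    """Hypothetical custom exception, defined here only for illustration."""


def raise_all_failed(save_errors: list[tuple[str, Exception]]) -> None:
    """Raise one summary error chained to the first underlying failure."""
    details = "; ".join(
        f"{observer}: {exc.__class__.__name__}: {exc}" for observer, exc in save_errors
    )
    raise RepresentationSaveFailedError(
        f"save_representation failed for all {len(save_errors)} observer(s): {details}"
    ) from save_errors[0][1]


try:
    raise_all_failed([("alice", TimeoutError("embedding timed out"))])
except RepresentationSaveFailedError as err:
    caught = err

# The original exception object, with its type, stays reachable via __cause__.
assert isinstance(caught.__cause__, TimeoutError)
```

Only the first failure is chained here; the remaining ones survive only as text in the message, which is the same trade-off the proposed diff makes.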

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/deriver/deriver.py` around lines 193 - 224, The fail-loud path in
deriver.save_representation handling is losing the original save exception
because save_errors stores only strings and the later RuntimeError has no cause.
Update the save loop around RepresentationManager.save_representation to retain
the actual exception objects (not just formatted text), then in the all-failed
branch raise an appropriate custom exception from src/exceptions.py using
exception chaining with from so the original traceback and type are preserved;
also apply the same pattern to the related batch-failure branch mentioned in the
comment.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/bench/bench_deriver_silent_loss.py`:
- Around line 87-96: The benchmark in process_representation_tasks_batch
handling is too broad because the try/except marks any Exception as a successful
visibility signal. Tighten the failure check around the
process_representation_tasks_batch call in the benchmark test so only the
expected fail-loud save-failure path sets raised=True, and let unexpected
exceptions propagate instead of being counted as success. Use the existing
raised flag logic in the bench_deriver_silent_loss benchmark to distinguish the
intended failure case from unrelated regressions.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: af574ed4-fea4-4966-8c6f-f6b25158db80

📥 Commits

Reviewing files that changed from the base of the PR and between eb386c3 and 7ce4914.

📒 Files selected for processing (5)
  • src/deriver/deriver.py
  • src/telemetry/events/representation.py
  • tests/bench/bench_deriver_silent_loss.py
  • tests/deriver/test_deriver_processing.py
  • tests/telemetry/test_representation_v2_fields.py

Comment thread tests/bench/bench_deriver_silent_loss.py
Catching every exception counted any error as a visibility signal, so an
unrelated regression could inflate failure_visibility_rate instead of
failing the run. Catch only the expected observer save-failure and
re-raise anything else.

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
tests/bench/bench_deriver_silent_loss.py (1)

95-100: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Substring-matching on the exception message is a bit brittle.

Re-raise gating depends on the literal strings "observer"/"RESOURCE_EXHAUSTED" appearing in str(exc). If the message format in process_representation_tasks_batch's RuntimeError (src/deriver/deriver.py) ever changes wording, this benchmark would silently start re-raising the expected failure (or, in the opposite direction, swallowing an unrelated regression that happens to mention those words). Since this is benchmark-only tooling, this is a low-priority nitpick rather than a blocker.
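
One way to avoid the substring match is to gate on a dedicated exception type. A sketch, assuming the deriver raised such a type: `ObserverSaveFailedError` and `run_trial` are hypothetical names, not part of this PR.

```python
import asyncio
from typing import Awaitable, Callable


class ObserverSaveFailedError(RuntimeError):
    """Hypothetical marker type for the expected fail-loud path."""


async def run_trial(call: Callable[[], Awaitable[None]]) -> bool:
    """Return True only when the expected save failure was raised."""
    try:
        await call()
    except ObserverSaveFailedError:
        return True  # expected: counts as a visibility signal
    # Any other exception propagates and fails the benchmark run.
    return False


async def expected_failure() -> None:
    raise ObserverSaveFailedError("save_representation failed for all 1 observer(s)")


async def unrelated_regression() -> None:
    # Mentions "observer" but is not the expected failure.
    raise KeyError("observer")


raised = asyncio.run(run_trial(expected_failure))
```

With this shape, `run_trial(unrelated_regression)` raises KeyError instead of being counted, regardless of what the message says.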

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/bench/bench_deriver_silent_loss.py` around lines 95 - 100, Substring
checks on RuntimeError text in the benchmark are brittle and should be replaced
with a more stable signal. Update the exception gating in
bench_deriver_silent_loss.py near the RuntimeError handler to key off a
structured indicator from process_representation_tasks_batch in
src/deriver/deriver.py, or otherwise narrow the match to a stable error
attribute/type instead of relying on literal message fragments like observer and
RESOURCE_EXHAUSTED. Ensure the benchmark still counts only the expected save
failure while letting unrelated RuntimeError cases fail normally.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6da81a45-c08e-4e84-af1d-b512f838ae19

📥 Commits

Reviewing files that changed from the base of the PR and between 7ce4914 and 1fb31ce.

📒 Files selected for processing (1)
  • tests/bench/bench_deriver_silent_loss.py
