fix(deriver): surface representation save failures instead of silent loss (#728)#860
fix(deriver): surface representation save failures instead of silent loss (#728)#860leo922oel wants to merge 5 commits into
Conversation
When every observer's save_representation failed (e.g. embedding retries exhausted under a sustained 429), the deriver logged the error and returned normally, so the queue marked the work unit processed with zero documents saved — silent memory loss. Accumulate per-observer save errors and, after telemetry is emitted, raise when no observer succeeded so the queue marks the unit errored and visible. Refs plastic-labs#728
…s are visible A partially-failed representation batch (some observers saved, one failed) stays processed — the saved observers' work must not be discarded by raising. Previously the failed observer was only logged. Add an additive failed_observer_count field to RepresentationCompletedEvent (schema unchanged, per the additive-field convention) and populate it from the deriver's per-observer save_errors, so partial (and total) save failures are discoverable and alertable in telemetry instead of silent. Refs plastic-labs#728 plastic-labs#839
A standalone, DB-free harness quantifying the fix: drives the real deriver representation path with a faked save under sustained failure and reports silent_loss_rate and failure_visibility_rate. Runs unmodified against any checkout (before/after via main vs this branch). Measured: silent-loss 100% -> 0%, visibility 0% -> 100%. Refs plastic-labs#728
The lambda emit-capture (`lambda e: emitted.append(e)`) left e/_sink untyped, which basedpyright reports as reportUnknownLambdaType — and the pre-commit / staticanalysis gate fails on warnings. Pass the bound `emitted.append` directly: no lambda, fully typed, and it also avoids the loop-variable closure concern.
WalkthroughAdds per-observer save-error tracking to Fail-loud observer save failure
ChangesFail-loud observer save failure
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/telemetry/test_representation_v2_fields.py (1)
56-56: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winCover the serialized form of this new field too.
This assertion proves the constructor default, but the existing
model_dumpcoverage in this file still will not catchfailed_observer_countdisappearing from emitted telemetry. Please add it to that field list as well.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/telemetry/test_representation_v2_fields.py` at line 56, The new failed_observer_count field is only covered by the constructor default assertion, so telemetry serialization could regress unnoticed. Update the existing model_dump coverage in the telemetry representation v2 fields test to include failed_observer_count in the asserted field list, using the same test case around the representation/event object so the serialized output is validated too.tests/deriver/test_deriver_processing.py (1)
74-121: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert the telemetry side of the total-failure contract here too.
The production path now emits
RepresentationCompletedEventbefore it raises. This test would still pass if someone accidentally movedemit(...)after the exception, which would reintroduce invisible total-save failures for telemetry consumers. Patchsrc.deriver.deriver.emithere and assertobserver_count == 0plusfailed_observer_count == 1.Example update
failing_save = AsyncMock(side_effect=RuntimeError("429 RESOURCE_EXHAUSTED")) + emitted: list[Any] = [] with ( patch( "src.deriver.deriver.honcho_llm_call", new_callable=AsyncMock, return_value=mock_response, ), patch.object(RepresentationManager, "save_representation", failing_save), + patch("src.deriver.deriver.emit", side_effect=emitted.append), pytest.raises(RuntimeError, match="save_representation failed"), ): await process_representation_tasks_batch( messages=[message], message_level_configuration=configuration, @@ observed="alice", queue_item_message_ids=[1], ) + + assert emitted, "expected telemetry before the raised failure" + assert emitted[-1].observer_count == 0 + assert emitted[-1].failed_observer_count == 1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/deriver/test_deriver_processing.py` around lines 74 - 121, This test only verifies the exception path and should also assert telemetry for the total-failure case. Patch src.deriver.deriver.emit alongside honcho_llm_call and RepresentationManager.save_representation in test_all_observer_saves_failing_surfaces_failure, then assert the emitted RepresentationCompletedEvent reports observer_count as 0 and failed_observer_count as 1 before the RuntimeError is raised. Keep the existing failure setup, but add the emit assertion so the contract stays covered if emit is moved after the exception.src/deriver/deriver.py (1)
193-224: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPreserve the real save exception on the fail-loud path.
save_errorsnow keeps only formatted strings, so the all-failed branch throws a freshRuntimeErrorwith no original type or traceback attached. That makes root-cause debugging and any exception-based handling much weaker than the underlyingsave_representation()failure. Keep the exception objects, then raise a custom exception fromsrc/exceptions.pywithfrom ...so the queue still sees the batch as failed without losing the cause.Proposed direction
- save_errors: list[str] = [] + save_errors: list[tuple[str, Exception]] = [] ... - except Exception as e: + except Exception as e: logger.error( "Failed to save representation for observer %s: %s", observer, e ) - save_errors.append(f"{observer}: {e.__class__.__name__}: {e}") + save_errors.append((observer, e)) ... if save_errors and successful_observer_count == 0: - raise RuntimeError( - f"save_representation failed for all {len(save_errors)} observer(s): " - + "; ".join(save_errors) - ) + details = "; ".join( + f"{observer}: {exc.__class__.__name__}: {exc}" + for observer, exc in save_errors + ) + raise exceptions.RepresentationSaveFailedError( + f"save_representation failed for all {len(save_errors)} observer(s): " + + details + ) from save_errors[0][1]As per coding guidelines, "Use explicit error handling with appropriate exception types from src/exceptions.py" and "Define custom exception types in src/exceptions.py and use them throughout the codebase."
Also applies to: 318-329
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/deriver/deriver.py` around lines 193 - 224, The fail-loud path in deriver.save_representation handling is losing the original save exception because save_errors stores only strings and the later RuntimeError has no cause. Update the save loop around RepresentationManager.save_representation to retain the actual exception objects (not just formatted text), then in the all-failed branch raise an appropriate custom exception from src/exceptions.py using exception chaining with from so the original traceback and type are preserved; also apply the same pattern to the related batch-failure branch mentioned in the comment.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/bench/bench_deriver_silent_loss.py`:
- Around line 87-96: The benchmark in process_representation_tasks_batch
handling is too broad because the try/except marks any Exception as a successful
visibility signal. Tighten the failure check around the
process_representation_tasks_batch call in the benchmark test so only the
expected fail-loud save-failure path sets raised=True, and let unexpected
exceptions propagate instead of being counted as success. Use the existing
raised flag logic in the bench_deriver_silent_loss benchmark to distinguish the
intended failure case from unrelated regressions.
---
Nitpick comments:
In `@src/deriver/deriver.py`:
- Around line 193-224: The fail-loud path in deriver.save_representation
handling is losing the original save exception because save_errors stores only
strings and the later RuntimeError has no cause. Update the save loop around
RepresentationManager.save_representation to retain the actual exception objects
(not just formatted text), then in the all-failed branch raise an appropriate
custom exception from src/exceptions.py using exception chaining with from so
the original traceback and type are preserved; also apply the same pattern to
the related batch-failure branch mentioned in the comment.
In `@tests/deriver/test_deriver_processing.py`:
- Around line 74-121: This test only verifies the exception path and should also
assert telemetry for the total-failure case. Patch src.deriver.deriver.emit
alongside honcho_llm_call and RepresentationManager.save_representation in
test_all_observer_saves_failing_surfaces_failure, then assert the emitted
RepresentationCompletedEvent reports observer_count as 0 and
failed_observer_count as 1 before the RuntimeError is raised. Keep the existing
failure setup, but add the emit assertion so the contract stays covered if emit
is moved after the exception.
In `@tests/telemetry/test_representation_v2_fields.py`:
- Line 56: The new failed_observer_count field is only covered by the
constructor default assertion, so telemetry serialization could regress
unnoticed. Update the existing model_dump coverage in the telemetry
representation v2 fields test to include failed_observer_count in the asserted
field list, using the same test case around the representation/event object so
the serialized output is validated too.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: af574ed4-fea4-4966-8c6f-f6b25158db80
📒 Files selected for processing (5)
src/deriver/deriver.pysrc/telemetry/events/representation.pytests/bench/bench_deriver_silent_loss.pytests/deriver/test_deriver_processing.pytests/telemetry/test_representation_v2_fields.py
Catching every exception counted any error as a visibility signal, so an unrelated regression could inflate failure_visibility_rate instead of failing the run. Catch only the expected observer save-failure and re-raise anything else.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/bench/bench_deriver_silent_loss.py (1)
95-100: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueSubstring-matching on the exception message is a bit brittle.
Re-raise gating depends on the literal strings
"observer"/"RESOURCE_EXHAUSTED"appearing instr(exc). If the message format inprocess_representation_tasks_batch'sRuntimeError(src/deriver/deriver.py) ever changes wording, this benchmark would silently start re-raising the expected failure (or, in the opposite direction, swallowing an unrelated regression that happens to mention those words). Since this is benchmark-only tooling, this is a low-priority nitpick rather than a blocker.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/bench/bench_deriver_silent_loss.py` around lines 95 - 100, Substring checks on RuntimeError text in the benchmark are brittle and should be replaced with a more stable signal. Update the exception gating in bench_deriver_silent_loss.py near the RuntimeError handler to key off a structured indicator from process_representation_tasks_batch in src/deriver/deriver.py, or otherwise narrow the match to a stable error attribute/type instead of relying on literal message fragments like observer and RESOURCE_EXHAUSTED. Ensure the benchmark still counts only the expected save failure while letting unrelated RuntimeError cases fail normally.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/bench/bench_deriver_silent_loss.py`:
- Around line 95-100: Substring checks on RuntimeError text in the benchmark are
brittle and should be replaced with a more stable signal. Update the exception
gating in bench_deriver_silent_loss.py near the RuntimeError handler to key off
a structured indicator from process_representation_tasks_batch in
src/deriver/deriver.py, or otherwise narrow the match to a stable error
attribute/type instead of relying on literal message fragments like observer and
RESOURCE_EXHAUSTED. Ensure the benchmark still counts only the expected save
failure while letting unrelated RuntimeError cases fail normally.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6da81a45-c08e-4e84-af1d-b512f838ae19
📒 Files selected for processing (1)
tests/bench/bench_deriver_silent_loss.py
Description
When every observer's representation save failed (e.g. embedding
429 RESOURCE_EXHAUSTEDpast the in-batch retries),process_representation_tasks_batch(
src/deriver/deriver.py) caught the exception, logged it, and returned normally —so the queue marked the work unit
processedwith zero documents saved:silent, permanent memory loss with no error state or alertable signal. (The errored
path
mark_queue_item_as_erroredexisted but was never reached.)This makes a save failure either loud or recorded — never a silent success:
succeeded, so the queue marks the unit
errored.(observer, observed)collections, so the saved ones are kept (re-running wouldduplicate; the queue has no requeue). The failure is recorded via a new additive
failed_observer_countonRepresentationCompletedEvent(schema unchanged).Not a retry fix: the embedding client already retries transient errors; the bug is
the swallow after they're exhausted.
Quantified (
tests/bench/bench_deriver_silent_loss.py— real deriver path witha faked failing save, trials=20):
main)silent_loss_ratefailure_visibility_rate#728 reported docs dropping ~300 → ~121 (~60%) in a 429 burst. Supersedes #660
(raised on any failure, no tests); coordinates with #839 (@akattelu).
Linked issues
Fixes #728
Type of change
fix— bug fixfeat— new featurerefactor— neither fixes a bug nor adds a featureperf— performance improvementdocs— documentation onlytest— adding or correcting testschore— build process or toolingTest plan
New / updated tests (RED→GREEN):
tests/deriver/test_deriver_processing.pytest_all_observer_saves_failing_surfaces_failure— every observer saveraises ⇒
process_representation_tasks_batchraises (queue marks errored)instead of returning silently.
test_partial_observer_failure_is_processed_and_surfaced— one observersucceeds, one fails ⇒ no raise (unit stays processed), and the emitted
RepresentationCompletedEventreportsobserver_count == 1andfailed_observer_count == 1.tests/telemetry/test_representation_v2_fields.py— asserts the newfailed_observer_countdefaults to 0 (additive; schema stays at v2).Quantitative before/after (a standalone harness, not a CI test):
Checklist
tests/deriver/+tests/telemetry/green; 4 unrelatedfailures in
test_auth_route_policy.py/test_document.pydedup require alocal embedding API key and fail identically on
main)ruffandbasedpyrightclean on every changed filefailed_observer_countis internal telemetry, additive, schema unchanged)
Summary by CodeRabbit
Bug Fixes
New Features
Tests