fix(tuning): harden extraction runs and prompts #131

Open
yilu331 wants to merge 1 commit into main from feature/swe-bench-prompt-tuning-extractor

yilu331 commented Jun 7, 2026

Collaborator

Summary

  • harden prompt-tuning extraction runs by preserving caller request IDs through generation and storage
  • add hard LiteLLM timeout handling so timed-out extractor calls cannot continue writing late results
  • add extractor and consolidator prompt revisions for SWE-bench failures found during the tuning loop
  • add tests for timeout behavior, request-id persistence, prompt version mapping, and consolidation output shape
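
The hard-timeout bullet above can be illustrated with a minimal sketch. It assumes a hypothetical helper `call_with_hard_timeout` and a stand-in `slow_call` in place of `litellm.completion`; the names, the grace-period parameter, and the exception type are illustrative and may not match what `litellm_client.py` actually does.

```python
import concurrent.futures
import time


class HardTimeoutError(TimeoutError):
    """Raised when a call exceeds its wall-clock budget (hypothetical name)."""


def call_with_hard_timeout(fn, *args, timeout_s, grace_s=0.0, **kwargs):
    """Run fn in a worker thread and stop waiting after timeout_s + grace_s.

    A Python thread cannot be killed, so the worker may keep running; the
    point is that the caller never receives (or stores) its late result.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s + grace_s)
    except concurrent.futures.TimeoutError as exc:
        future.cancel()  # only effective if the call has not started yet
        raise HardTimeoutError(f"call exceeded {timeout_s + grace_s:.2f}s") from exc
    finally:
        # Do not block on the worker; a late completion is simply dropped.
        executor.shutdown(wait=False)


def slow_call():
    """Stand-in for a slow LLM request."""
    time.sleep(0.3)
    return "late result"


try:
    call_with_hard_timeout(slow_call, timeout_s=0.05)
    outcome = "completed"
except HardTimeoutError:
    outcome = "timed out"
```

Because the worker thread is abandoned rather than terminated, a wrapper like this only bounds how long the caller waits. Keeping the late result out of storage still needs a guard on the write path.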

Verification

  • uv run ruff check on staged Python files
  • uv run pyright on staged Python files
  • uv run pytest --no-cov -q tests/server/llm/test_litellm_client_unit.py tests/server/services/extraction/test_resumable_agent.py tests/server/services/storage/sqlite_storage/test_agent_run_storage.py tests/server/services/playbook/test_consolidation_output.py tests/server/services/prompt/test_prompt_manager.py tests/server/services/test_prompt_model_mapping.py tests/eval/extraction/test_extraction_eval.py

Notes

This replaces the closed, unmerged PR #130 with a fresh PR containing the latest branch head ee57851.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added optional request ID tracking for idempotent operations and local tracing correlation
    • Enhanced hard timeout protection layer for LLM calls
    • Auto-repair for playbook consolidation outputs with missing identifiers
  • Bug Fixes

    • Late outputs are now properly discarded after extraction timeouts
    • Improved status validation to prevent stale agent run updates
  • Documentation

    • Updated playbook extraction and consolidation prompts with refined policy-mining logic

Honor caller request IDs, isolate timed-out extractor runs, enforce hard LiteLLM timeouts, and add prompt updates for extractor/consolidator failures found during the SWE-bench tuning loop.

coderabbitai Bot commented Jun 7, 2026

Walkthrough

This PR hardens extraction reliability and request lifecycle safety by introducing caller-supplied request IDs for idempotency, hard wall-clock timeout enforcement on LLM calls with executor-based cancellation, conditional storage updates to prevent late outputs from overwriting timeout states, and updated prompts for playbook extraction and consolidation with refined decision semantics.
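
The conditional storage update described above is essentially a compare-and-set on the run status. The following is a rough sketch using the standard-library `sqlite3` module. The table layout, the status names, and the free-function signatures are assumptions for illustration; the PR's real `update_agent_run_status` and `fail_running_agent_runs_for_request` are storage methods and, per the walkthrough, also filter by extractor and user.

```python
import sqlite3

ACTIVE_STATUSES = ("pending", "running")  # assumed status names


def update_agent_run_status(conn, run_id, new_status, expected_statuses=None):
    """Set a run's status; return True only if a row was actually changed."""
    sql = "UPDATE agent_runs SET status = ? WHERE id = ?"
    params = [new_status, run_id]
    if expected_statuses:
        placeholders = ", ".join("?" for _ in expected_statuses)
        sql += f" AND status IN ({placeholders})"
        params.extend(expected_statuses)
    cursor = conn.execute(sql, params)
    conn.commit()
    return cursor.rowcount == 1


def fail_running_agent_runs_for_request(conn, request_id):
    """Mark every still-active run of a request as failed; return the count."""
    placeholders = ", ".join("?" for _ in ACTIVE_STATUSES)
    cursor = conn.execute(
        "UPDATE agent_runs SET status = 'failed' "
        f"WHERE request_id = ? AND status IN ({placeholders})",
        [request_id, *ACTIVE_STATUSES],
    )
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_runs (id INTEGER PRIMARY KEY, request_id TEXT, status TEXT)"
)
conn.execute(
    "INSERT INTO agent_runs (id, request_id, status) VALUES (1, 'req-1', 'running')"
)

# The timeout path wins the race and fails the run first.
failed_count = fail_running_agent_runs_for_request(conn, "req-1")

# The late worker then tries to record success, but the guard rejects it.
late_write_applied = update_agent_run_status(
    conn, 1, "completed", expected_statuses=ACTIVE_STATUSES
)
final_status = conn.execute(
    "SELECT status FROM agent_runs WHERE id = 1"
).fetchone()[0]
```

Putting the status check in the `WHERE` clause keeps the test and the write in a single statement, so a late worker cannot slip in between a separate read and update.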

Changes

Timeout Safety and Request Lifecycle

| Layer | File(s) | Summary |
|---|---|---|
| Request ID propagation for idempotency | reflexio/models/api_schema/domain/entities.py, reflexio/server/services/generation_service.py | API schema adds optional request_id field for caller-supplied correlation; generation service uses provided ID or falls back to UUID generation. |
| Storage layer conditional status updates | reflexio/server/services/storage/storage_base/_agent_run.py, reflexio/server/services/storage/sqlite_storage/_agent_run.py, tests/server/services/storage/sqlite_storage/test_agent_run_storage.py | Storage interfaces add expected_statuses parameter to update_agent_run_status for conditional updates and new fail_running_agent_runs_for_request method for bulk timeout-failure marking by request/extractor/user. |
| Hard wall-clock timeout enforcement | reflexio/server/llm/litellm_client.py, tests/server/llm/test_litellm_client_unit.py | LiteLLM client implements executor-based hard-timeout wrapper around litellm.completion with grace period; cancels futures and discards late completions; includes wall-clock timeout test and mock adjustment for embedding local routing. |
| Extractor timeout failure marking | reflexio/server/services/base_generation_service.py | On extractor timeout, generation service marks active agent runs for the request as failed via storage update; defensive error handling for storage failures. |
| Late output discard after timeout | reflexio/server/services/extraction/resumable_agent.py, tests/server/services/extraction/test_resumable_agent.py | ResumableExtractionAgent gates successful completion on run status being in active states; finish_extraction outputs received after timeout-induced status transition are discarded and marked late_output_discarded; both success and failure paths use status guards. |
| Consolidation output self-repair | reflexio/server/services/playbook/playbook_consolidator.py, tests/server/services/playbook/test_consolidation_output.py | PlaybookConsolidationOutput schema adds pre-validation repair that auto-injects new_id="NEW-0" when single unify decision omits it; validates against spurious inference in multi-decision payloads. |
| Playbook extraction and consolidation prompts | reflexio/server/prompt/prompt_bank/playbook_consolidation/*, reflexio/server/prompt/prompt_bank/playbook_extraction_context/*, tests/server/services/prompt/test_prompt_manager.py, tests/server/services/test_prompt_model_mapping.py | Deactivates v2.3.0 consolidation and v4.2.0 extraction prompts; introduces v2.3.1 consolidation with decision-id contract, unify re-synthesis rules, and self-contradiction guard; introduces v4.2.1 and v4.2.2 extraction prompts with resumable-mode tool discipline, Correction SOP/Success Path Recipe categories, grounding rules, and worked examples; updates version mappings. |
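
The consolidation self-repair row can be pictured with a small dictionary-based sketch. The PR applies the repair as a pre-validation step on the `PlaybookConsolidationOutput` schema; the version below uses plain dicts, and the payload keys `decisions` and `action` are assumptions made for illustration, not confirmed field names.

```python
import copy


def repair_missing_new_id(payload):
    """Return a copy of payload with new_id filled in only when unambiguous.

    The id is inferred only for a payload holding exactly one 'unify'
    decision; multi-decision payloads are left untouched so that no id is
    guessed where several NEW rows could be meant.
    """
    repaired = copy.deepcopy(payload)
    decisions = repaired.get("decisions", [])
    if len(decisions) != 1:
        return repaired
    decision = decisions[0]
    if decision.get("action") == "unify" and not decision.get("new_id"):
        decision["new_id"] = "NEW-0"
    return repaired


single = {"decisions": [{"action": "unify"}]}
multi = {"decisions": [{"action": "unify"}, {"action": "reject_new"}]}

repaired_single = repair_missing_new_id(single)
repaired_multi = repair_missing_new_id(multi)
```

Here `repaired_single` gains `new_id="NEW-0"` while `repaired_multi` is returned unchanged, leaving the missing id for normal schema validation to report.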

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • ReflexioAI/reflexio#107: Both PRs modify reflexio/server/services/extraction/resumable_agent.py—the main PR changes completion/late-output and status-expectation handling, while PR #107 updates tool-choice and async-info tool behavior in the same function.
  • ReflexioAI/reflexio#101: Main PR's extractor execution and timeout handling in BaseGenerationService overlap with PR #101's refactor to a single configured extractor flow in the same service.

Poem

🐰 A rabbit's ballad for timeouts true:
Request IDs trace the path askew,
Hard timeouts fire with executor's might,
Late outputs falter—discarded right.
Storage whispers: "late work won't do,"
Safety renewed, one checkpoint through! 🏁

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 39.39% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'harden extraction runs and prompts' directly captures the main changes across the PR: timeout handling for extraction runs, request-ID persistence, and prompt revisions. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@reflexio/models/api_schema/domain/entities.py`:
- Around line 559-565: The request_id field currently accepts any string and can
be persisted with only-whitespace or control characters; update handling so
caller-provided request_id is validated/normalized before use: add a helper
(e.g., normalize_request_id or validate_request_id) and invoke it where the code
currently uses request_id || uuid (GenerationService), trimming surrounding
whitespace, rejecting empty/control-only values, and only accepting a safe
pattern (e.g., alphanumeric plus -._) — if validation fails, fall back to
generating a UUID; reference the request_id attribute on the entity and the
GenerationService code path that chooses request_id or uuid to locate where to
add this normalization.

In `@reflexio/server/prompt/prompt_bank/playbook_consolidation/v2.3.1.prompt.md`:
- Around line 30-33: Update the prompt schema text to explicitly require integer
DB ids for the decision fields: ensure `new_id` remains required and add that
`reject_new.superseded_by_existing_id` and `differentiate.existing_id` must be
integer DB `user_playbook_id` values (not label tokens like EXISTING-N) and that
`differentiate.existing_id` is mandatory; also clarify that
`archive_existing_ids` is position-based and distinct from the strict DB-id
fields so it must not be treated the same as
`superseded_by_existing_id`/`existing_id`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bdc4f51b-ee62-469e-9cdf-83fbf2cda3c1

📥 Commits

Reviewing files that changed from the base of the PR and between 65abb88 and ee57851.

📒 Files selected for processing (19)
  • reflexio/models/api_schema/domain/entities.py
  • reflexio/server/llm/litellm_client.py
  • reflexio/server/prompt/prompt_bank/playbook_consolidation/v2.3.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_consolidation/v2.3.1.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_extraction_context/v4.2.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_extraction_context/v4.2.1.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_extraction_context/v4.2.2.prompt.md
  • reflexio/server/services/base_generation_service.py
  • reflexio/server/services/extraction/resumable_agent.py
  • reflexio/server/services/generation_service.py
  • reflexio/server/services/playbook/playbook_consolidator.py
  • reflexio/server/services/storage/sqlite_storage/_agent_run.py
  • reflexio/server/services/storage/storage_base/_agent_run.py
  • tests/server/llm/test_litellm_client_unit.py
  • tests/server/services/extraction/test_resumable_agent.py
  • tests/server/services/playbook/test_consolidation_output.py
  • tests/server/services/prompt/test_prompt_manager.py
  • tests/server/services/storage/sqlite_storage/test_agent_run_storage.py
  • tests/server/services/test_prompt_model_mapping.py

Comment on lines +559 to +565
request_id: str | None = None
"""Optional caller-supplied request id for idempotent/local tracing.

If omitted, the backend generates a UUID as before. Supplying this is useful
for async callers that need to observe the exact request in extraction
state without waiting for the full publish response.
"""

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate/normalize caller-provided request_id before using it as a persisted key.

At Line 559, request_id accepts unrestricted str. Because the runtime path uses request_id or uuid (GenerationService Line 175), whitespace/control-character values are treated as valid IDs and can propagate into storage/correlation unexpectedly.

Suggested patch
 class PublishUserInteractionRequest(BaseModel):
     request_id: str | None = None
+
+    @field_validator("request_id", mode="before")
+    @classmethod
+    def normalize_request_id(cls, value: str | None) -> str | None:
+        if value is None:
+            return None
+        normalized = str(value).strip()
+        return normalized or None
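
The patch above only trims whitespace. A stricter variant, along the lines of the safe-pattern idea in the review prompt, might look like the sketch below. The helper name `resolve_request_id`, the 128-character cap, and the exact character set are assumptions, not part of the PR.

```python
import re
import uuid

# Assumed policy: 1-128 characters drawn from letters, digits, '-', '.', '_'.
_SAFE_REQUEST_ID = re.compile(r"[A-Za-z0-9._-]{1,128}")


def resolve_request_id(raw):
    """Return a trimmed caller id if it is safe, otherwise a fresh UUID string."""
    if isinstance(raw, str):
        candidate = raw.strip()
        if _SAFE_REQUEST_ID.fullmatch(candidate):
            return candidate
    # None, empty, whitespace-only, or unsafe input falls back to a UUID.
    return str(uuid.uuid4())


trimmed = resolve_request_id("  run-42 ")
from_blank = resolve_request_id("   ")
from_control = resolve_request_id("bad\x00id")
from_none = resolve_request_id(None)
```

Falling back to a generated UUID, rather than raising, keeps the existing behaviour for callers that send nothing useful. A service that prefers to reject malformed ids outright could raise a validation error at the same point instead.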
Comment on lines +30 to +33
- **reject_new** — an EXISTING row already covers NEW or makes NEW redundant. Name the EXISTING id that wins via `superseded_by_existing_id`. Storage-stability tie-break: when same-situation opposite advice is balanced, default here.

- **differentiate** — both valid in distinct contexts (typically same trigger, opposite advice, where the contexts differ). Set `refined_new_trigger` and `refined_existing_trigger` to be strictly narrower than the originals AND mutually exclusive.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify and require DB-ID fields for reject_new/differentiate to match runtime schema.

The prompt now hard-requires new_id, but it still leaves reject_new.superseded_by_existing_id and differentiate.existing_id under-specified. Please explicitly require both as integer DB user_playbook_id values (not EXISTING-N labels), and require differentiate.existing_id to be present. This avoids schema rejections and misrouted decisions.

Suggested prompt patch
- - **reject_new** — an EXISTING row already covers NEW or makes NEW redundant. Name the EXISTING id that wins via `superseded_by_existing_id`. Storage-stability tie-break: when same-situation opposite advice is balanced, default here.
+ - **reject_new** — an EXISTING row already covers NEW or makes NEW redundant. Set `superseded_by_existing_id` to the EXISTING row's integer DB id (`user_playbook_id`), not an `EXISTING-N` label. Storage-stability tie-break: when same-situation opposite advice is balanced, default here.

- - **differentiate** — both valid in distinct contexts (typically same trigger, opposite advice, where the contexts differ). Set `refined_new_trigger` and `refined_existing_trigger` to be strictly narrower than the originals AND mutually exclusive.
+ - **differentiate** — both valid in distinct contexts (typically same trigger, opposite advice, where the contexts differ). Include `existing_id` as the EXISTING row's integer DB id (`user_playbook_id`), and set `refined_new_trigger` and `refined_existing_trigger` to be strictly narrower than the originals AND mutually exclusive.

Based on learnings, archive_existing_ids is position-based while superseded_by_existing_id/existing_id are strict DB ids and must not be treated uniformly.

Also applies to: 54-57
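
To make the id contract concrete, here is a small illustrative check for the two strict DB-id fields. It works on plain dicts; the key `action` that names the decision kind is an assumption, and the runtime schema in the PR may enforce this differently.

```python
def check_decision_ids(decision):
    """Return a list of problems with the DB-id field of one decision dict."""
    problems = []
    required_field = {
        "reject_new": "superseded_by_existing_id",
        "differentiate": "existing_id",
    }.get(decision.get("action"))
    if required_field is None:
        # Other decision kinds carry no strict DB-id field in this sketch.
        return problems
    value = decision.get(required_field)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        problems.append(
            f"{required_field} must be an integer DB id, got {value!r}"
        )
    return problems


ok = check_decision_ids({"action": "reject_new", "superseded_by_existing_id": 17})
label = check_decision_ids(
    {"action": "reject_new", "superseded_by_existing_id": "EXISTING-1"}
)
missing = check_decision_ids({"action": "differentiate"})
```

An `EXISTING-N` label or a missing `existing_id` each yields one reported problem, which is the class of schema rejection the comment aims to prevent at the prompt level.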

Source: Learnings
