fix(tuning): harden extraction runs and prompts by yilu331 · Pull Request #131 · ReflexioAI/reflexio

yilu331 · 2026-06-07T06:50:01Z

Summary

harden prompt-tuning extraction runs by preserving caller request IDs through generation and storage
add hard LiteLLM timeout handling so timed-out extractor calls cannot continue writing late results
add extractor and consolidator prompt revisions for SWE-bench failures found during the tuning loop
add tests for timeout behavior, request-id persistence, prompt version mapping, and consolidation output shape

Verification

uv run ruff check on staged Python files
uv run pyright on staged Python files
uv run pytest --no-cov -q tests/server/llm/test_litellm_client_unit.py tests/server/services/extraction/test_resumable_agent.py tests/server/services/storage/sqlite_storage/test_agent_run_storage.py tests/server/services/playbook/test_consolidation_output.py tests/server/services/prompt/test_prompt_manager.py tests/server/services/test_prompt_model_mapping.py tests/eval/extraction/test_extraction_eval.py

Notes

This replaces the closed, unmerged PR #130 with a fresh PR containing the latest branch head ee57851.

Summary by CodeRabbit

Release Notes

New Features
- Added optional request ID tracking for idempotent operations and local tracing correlation
- Enhanced hard timeout protection layer for LLM calls
- Auto-repair for playbook consolidation outputs with missing identifiers
Bug Fixes
- Late outputs are now properly discarded after extraction timeouts
- Improved status validation to prevent stale agent run updates
Documentation
- Updated playbook extraction and consolidation prompts with refined policy-mining logic
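The consolidation auto-repair listed under New Features can be illustrated with a minimal sketch. It assumes a payload shaped as a dict with a `decisions` list whose items carry an `action` key; those two names are hypothetical, and only `new_id="NEW-0"` and the single-unify condition come from this PR's walkthrough.

```python
from typing import Any


def repair_missing_new_id(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of payload with new_id filled in for the unambiguous case only.

    The "decisions" and "action" keys are assumptions made for this sketch.
    """
    decisions = payload.get("decisions", [])
    # Repair only when there is exactly one decision, it is a unify,
    # and new_id is missing or empty.
    if (
        len(decisions) == 1
        and decisions[0].get("action") == "unify"
        and not decisions[0].get("new_id")
    ):
        repaired = {**decisions[0], "new_id": "NEW-0"}
        return {**payload, "decisions": [repaired]}
    # With several decisions the target is ambiguous, so nothing is inferred.
    return payload
```

Returning a copy rather than mutating the input keeps the raw model output available for logging when validation still fails afterwards.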

Honor caller request IDs, isolate timed-out extractor runs, enforce hard LiteLLM timeouts, and add prompt updates for extractor/consolidator failures found during the SWE-bench tuning loop.

coderabbitai · 2026-06-07T06:50:13Z

📝 Walkthrough

This PR hardens extraction reliability and request lifecycle safety by introducing caller-supplied request IDs for idempotency, hard wall-clock timeout enforcement on LLM calls with executor-based cancellation, conditional storage updates to prevent late outputs from overwriting timeout states, and updated prompts for playbook extraction and consolidation with refined decision semantics.
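The executor-based hard timeout described in the walkthrough might look roughly like the sketch below. It is not the code in `litellm_client.py`: the helper name, the exception class, the pool size, and the `grace_s` default are all assumptions, and the wrapped callable stands in for a `litellm.completion` call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical shared pool; the real client may size or scope this differently.
_EXECUTOR = ThreadPoolExecutor(max_workers=4)


class HardTimeoutError(RuntimeError):
    """Raised when a call exceeds its wall-clock budget (hypothetical name)."""


def call_with_hard_timeout(
    fn: Callable[[], T], timeout_s: float, grace_s: float = 0.0
) -> T:
    """Run fn in a worker thread and stop waiting after timeout_s + grace_s."""
    started = time.monotonic()
    future = _EXECUTOR.submit(fn)
    try:
        return future.result(timeout=timeout_s + grace_s)
    except FutureTimeoutError:
        # cancel() only stops a call that has not started yet. A running
        # thread cannot be killed, so its eventual result is never read.
        future.cancel()
        elapsed = time.monotonic() - started
        raise HardTimeoutError(
            f"call exceeded hard timeout after {elapsed:.2f}s"
        ) from None
```

Because the worker thread may keep running after the caller gives up, a wrapper like this only bounds how long the caller waits. Discarding the late result still has to be enforced where results are written, which is what the status guards in this PR are for.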

Changes

Timeout Safety and Request Lifecycle

Layer / File(s)	Summary
Request ID propagation for idempotency `reflexio/models/api_schema/domain/entities.py`, `reflexio/server/services/generation_service.py`	API schema adds optional `request_id` field for caller-supplied correlation; generation service uses provided ID or falls back to UUID generation.
Storage layer conditional status updates `reflexio/server/services/storage/storage_base/_agent_run.py`, `reflexio/server/services/storage/sqlite_storage/_agent_run.py`, `tests/server/services/storage/sqlite_storage/test_agent_run_storage.py`	Storage interfaces add `expected_statuses` parameter to `update_agent_run_status` for conditional updates and new `fail_running_agent_runs_for_request` method for bulk timeout-failure marking by request/extractor/user.
Hard wall-clock timeout enforcement `reflexio/server/llm/litellm_client.py`, `tests/server/llm/test_litellm_client_unit.py`	LiteLLM client implements executor-based hard-timeout wrapper around `litellm.completion` with grace period; cancels futures and discards late completions; includes wall-clock timeout test and mock adjustment for embedding local routing.
Extractor timeout failure marking `reflexio/server/services/base_generation_service.py`	On extractor timeout, generation service marks active agent runs for the request as failed via storage update; defensive error handling for storage failures.
Late output discard after timeout `reflexio/server/services/extraction/resumable_agent.py`, `tests/server/services/extraction/test_resumable_agent.py`	ResumableExtractionAgent gates successful completion on run status being in active states; finish_extraction outputs received after timeout-induced status transition are discarded and marked `late_output_discarded`; both success and failure paths use status guards.
Consolidation output self-repair `reflexio/server/services/playbook/playbook_consolidator.py`, `tests/server/services/playbook/test_consolidation_output.py`	PlaybookConsolidationOutput schema adds pre-validation repair that auto-injects `new_id="NEW-0"` when single unify decision omits it; validates against spurious inference in multi-decision payloads.
Playbook extraction and consolidation prompts `reflexio/server/prompt/prompt_bank/playbook_consolidation/`, `reflexio/server/prompt/prompt_bank/playbook_extraction_context/`, `tests/server/services/prompt/test_prompt_manager.py`, `tests/server/services/test_prompt_model_mapping.py`	Deactivates v2.3.0 consolidation and v4.2.0 extraction prompts; introduces v2.3.1 consolidation with decision-id contract, unify re-synthesis rules, and self-contradiction guard; introduces v4.2.1 and v4.2.2 extraction prompts with resumable-mode tool discipline, Correction SOP/Success Path Recipe categories, grounding rules, and worked examples; updates version mappings.
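The conditional status update and late-output discard rows above can be sketched with plain `sqlite3`. The table name, column names, and the status strings `running`, `failed`, and `completed` are assumptions for illustration; only the `expected_statuses` idea and the `late_output_discarded` outcome come from the summary.

```python
import sqlite3
from typing import Sequence

# Assumed set of "active" states; the real storage layer may define more.
ACTIVE_STATUSES = ("running",)


def update_status_if(
    conn: sqlite3.Connection,
    run_id: int,
    new_status: str,
    expected_statuses: Sequence[str],
) -> bool:
    """Set status only if the row is currently in one of expected_statuses."""
    placeholders = ",".join("?" for _ in expected_statuses)
    cur = conn.execute(
        f"UPDATE agent_runs SET status = ? "
        f"WHERE id = ? AND status IN ({placeholders})",
        (new_status, run_id, *expected_statuses),
    )
    conn.commit()
    # rowcount is 0 when another writer already moved the run out of
    # the expected states, for example the timeout handler.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent_runs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO agent_runs (id, status) VALUES (1, 'running')")

# The timeout path wins the race and marks the run as failed.
timeout_applied = update_status_if(conn, 1, "failed", ACTIVE_STATUSES)
# The extractor finishes late; its completion is rejected by the guard.
late_applied = update_status_if(conn, 1, "completed", ACTIVE_STATUSES)
outcome = "completed" if late_applied else "late_output_discarded"
```

Putting the check inside the `UPDATE` statement avoids a separate read-then-write step, so the guard holds even when the timeout handler and the extractor race each other.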

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

ReflexioAI/reflexio#107: Both PRs modify reflexio/server/services/extraction/resumable_agent.py—the main PR changes completion/late-output and status-expectation handling, while PR #107 updates tool-choice and async-info tool behavior in the same function.
ReflexioAI/reflexio#101: Main PR's extractor execution and timeout handling in BaseGenerationService overlap with PR #101's refactor to a single configured extractor flow in the same service.

Poem

🐰 A rabbit's ballad for timeouts true:
Request IDs trace the path askew,
Hard timeouts fire with executor's might,
Late outputs falter—discarded right.
Storage whispers: "late work won't do,"
Safety renewed, one checkpoint through! 🏁

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 39.39% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'harden extraction runs and prompts' directly captures the main changes across the PR: timeout handling for extraction runs, request-ID persistence, and prompt revisions.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@reflexio/models/api_schema/domain/entities.py`:
- Around line 559-565: The request_id field currently accepts any string and can
be persisted with only-whitespace or control characters; update handling so
caller-provided request_id is validated/normalized before use: add a helper
(e.g., normalize_request_id or validate_request_id) and invoke it where the code
currently uses `request_id or uuid` (GenerationService), trimming surrounding
whitespace, rejecting empty/control-only values, and only accepting a safe
pattern (e.g., alphanumeric plus -._) — if validation fails, fall back to
generating a UUID; reference the request_id attribute on the entity and the
GenerationService code path that chooses request_id or uuid to locate where to
add this normalization.

In `@reflexio/server/prompt/prompt_bank/playbook_consolidation/v2.3.1.prompt.md`:
- Around line 30-33: Update the prompt schema text to explicitly require integer
DB ids for the decision fields: ensure `new_id` remains required and add that
`reject_new.superseded_by_existing_id` and `differentiate.existing_id` must be
integer DB `user_playbook_id` values (not label tokens like EXISTING-N) and that
`differentiate.existing_id` is mandatory; also clarify that
`archive_existing_ids` is position-based and distinct from the strict DB-id
fields so it must not be treated the same as
`superseded_by_existing_id`/`existing_id`.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bdc4f51b-ee62-469e-9cdf-83fbf2cda3c1

📥 Commits

Reviewing files that changed from the base of the PR and between 65abb88 and ee57851.

📒 Files selected for processing (19)

reflexio/models/api_schema/domain/entities.py
reflexio/server/llm/litellm_client.py
reflexio/server/prompt/prompt_bank/playbook_consolidation/v2.3.0.prompt.md
reflexio/server/prompt/prompt_bank/playbook_consolidation/v2.3.1.prompt.md
reflexio/server/prompt/prompt_bank/playbook_extraction_context/v4.2.0.prompt.md
reflexio/server/prompt/prompt_bank/playbook_extraction_context/v4.2.1.prompt.md
reflexio/server/prompt/prompt_bank/playbook_extraction_context/v4.2.2.prompt.md
reflexio/server/services/base_generation_service.py
reflexio/server/services/extraction/resumable_agent.py
reflexio/server/services/generation_service.py
reflexio/server/services/playbook/playbook_consolidator.py
reflexio/server/services/storage/sqlite_storage/_agent_run.py
reflexio/server/services/storage/storage_base/_agent_run.py
tests/server/llm/test_litellm_client_unit.py
tests/server/services/extraction/test_resumable_agent.py
tests/server/services/playbook/test_consolidation_output.py
tests/server/services/prompt/test_prompt_manager.py
tests/server/services/storage/sqlite_storage/test_agent_run_storage.py
tests/server/services/test_prompt_model_mapping.py

coderabbitai · 2026-06-07T06:59:26Z

+    request_id: str | None = None
+    """Optional caller-supplied request id for idempotent/local tracing.
+
+    If omitted, the backend generates a UUID as before. Supplying this is useful
+    for async callers that need to observe the exact request in extraction
+    state without waiting for the full publish response.
+    """


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate/normalize caller-provided request_id before using it as a persisted key.

At Line 559, request_id accepts unrestricted str. Because the runtime path uses request_id or uuid (GenerationService Line 175), whitespace/control-character values are treated as valid IDs and can propagate into storage/correlation unexpectedly.

Suggested patch

 class PublishUserInteractionRequest(BaseModel):
     request_id: str | None = None
+
+    @field_validator("request_id", mode="before")
+    @classmethod
+    def normalize_request_id(cls, value: str | None) -> str | None:
+        if value is None:
+            return None
+        normalized = str(value).strip()
+        return normalized or None

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    request_id: str | None = None
-    """Optional caller-supplied request id for idempotent/local tracing.
-
-    If omitted, the backend generates a UUID as before. Supplying this is useful
-    for async callers that need to observe the exact request in extraction
-    state without waiting for the full publish response.
-    """
+    request_id: str | None = None
+    """Optional caller-supplied request id for idempotent/local tracing.
+
+    If omitted, the backend generates a UUID as before. Supplying this is useful
+    for async callers that need to observe the exact request in extraction
+    state without waiting for the full publish response.
+    """
+
+    @field_validator("request_id", mode="before")
+    @classmethod
+    def normalize_request_id(cls, value: str | None) -> str | None:
+        if value is None:
+            return None
+        normalized = str(value).strip()
+        return normalized or None
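The agent prompt for this finding asks for more than trimming: it also wants control-only values rejected, a safe character pattern enforced, and a UUID fallback. A stdlib-only sketch of that stricter variant follows. The function name, the exact pattern, and the 128-character cap are assumptions, not part of the PR or the suggested patch.

```python
import re
import uuid
from typing import Optional

# Assumed safe pattern: letters, digits, and "-._"; the length cap is arbitrary.
_SAFE_REQUEST_ID = re.compile(r"[A-Za-z0-9._-]{1,128}")


def resolve_request_id(candidate: Optional[str]) -> str:
    """Return a trimmed caller id if it looks safe, otherwise a fresh UUID."""
    if candidate is not None:
        trimmed = candidate.strip()
        # fullmatch rejects empty strings, inner whitespace, and control characters.
        if _SAFE_REQUEST_ID.fullmatch(trimmed):
            return trimmed
    return str(uuid.uuid4())
```

Silently falling back to a UUID keeps the publish call working, but the caller can no longer correlate on the id it sent. Rejecting the request with a validation error is the other reasonable choice.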


coderabbitai · 2026-06-07T06:59:26Z

+- **reject_new** — an EXISTING row already covers NEW or makes NEW redundant. Name the EXISTING id that wins via `superseded_by_existing_id`. Storage-stability tie-break: when same-situation opposite advice is balanced, default here.
+
+- **differentiate** — both valid in distinct contexts (typically same trigger, opposite advice, where the contexts differ). Set `refined_new_trigger` and `refined_existing_trigger` to be strictly narrower than the originals AND mutually exclusive.
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify and require DB-ID fields for reject_new/differentiate to match runtime schema.

The prompt now hard-requires new_id, but it still leaves reject_new.superseded_by_existing_id and differentiate.existing_id under-specified. Please explicitly require both as integer DB user_playbook_id values (not EXISTING-N labels), and require differentiate.existing_id to be present. This avoids schema rejections and misrouted decisions.

Suggested prompt patch

-- **reject_new** — an EXISTING row already covers NEW or makes NEW redundant. Name the EXISTING id that wins via `superseded_by_existing_id`. Storage-stability tie-break: when same-situation opposite advice is balanced, default here.
+- **reject_new** — an EXISTING row already covers NEW or makes NEW redundant. Set `superseded_by_existing_id` to the EXISTING row's integer DB id (`user_playbook_id`), not an `EXISTING-N` label. Storage-stability tie-break: when same-situation opposite advice is balanced, default here.
 
-- **differentiate** — both valid in distinct contexts (typically same trigger, opposite advice, where the contexts differ). Set `refined_new_trigger` and `refined_existing_trigger` to be strictly narrower than the originals AND mutually exclusive.
+- **differentiate** — both valid in distinct contexts (typically same trigger, opposite advice, where the contexts differ). Include `existing_id` as the EXISTING row's integer DB id (`user_playbook_id`), and set `refined_new_trigger` and `refined_existing_trigger` to be strictly narrower than the originals AND mutually exclusive.

Based on learnings, archive_existing_ids is position-based while superseded_by_existing_id/existing_id are strict DB ids and must not be treated uniformly.

Also applies to: 54-57
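To make the DB-id requirement concrete, here is a small sketch of a check that would flag label tokens such as `EXISTING-0` where an integer `user_playbook_id` is expected. The dict shape and the `action` key are assumptions for illustration; the field names `superseded_by_existing_id` and `existing_id` are taken from the comment above.

```python
from typing import Any


def check_decision_ids(decision: dict[str, Any]) -> list[str]:
    """Return a list of problems with the strict DB-id field of one decision."""
    action = decision.get("action")  # hypothetical discriminator key
    if action == "reject_new":
        field = "superseded_by_existing_id"
    elif action == "differentiate":
        field = "existing_id"
    else:
        # Other actions carry no strict DB-id field in this sketch.
        return []
    value = decision.get(field)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        return [f"{field} must be an integer DB id, got {value!r}"]
    return []
```

A check like this only covers the strict DB-id fields. `archive_existing_ids` is position-based according to the review and would need its own handling.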


Source: Learnings

fix(tuning): harden extraction runs and prompts

ee57851

Honor caller request IDs, isolate timed-out extractor runs, enforce hard LiteLLM timeouts, and add prompt updates for extractor/consolidator failures found during the SWE-bench tuning loop.

yilu331 wants to merge 1 commit into main from feature/swe-bench-prompt-tuning-extractor
