fix(sqs,routing): consolidate SQS config defaults and routing recovery #2810
amtulifra wants to merge 5 commits into
Adds a read-only SQS tool that lists queues by prefix and returns per-queue attributes (depth, in-flight count, oldest-message age, visibility timeout, DLQ wiring, FIFO flag), including the three fields that expose the poison-pill stuck-consumer pattern, which is invisible in logs and metrics today.

- app/integrations/sqs.py: SQSConfig, sqs_is_available, sqs_extract_params
- app/tools/SQSQueueAttributesTool: list_queues → get_queue_attributes per queue, normalized output, aws_backend short-circuit for synthetic tests
- app/types/evidence.py: register 'sqs' as a valid EvidenceSource
- tests/tools/test_sqs_queue_attributes.py: 7 tests incl. poison-pill fixture and backend short-circuit guard
- docs/sqs_queues.mdx + docs/docs.json: user-facing page registered in nav
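The "normalized output" step in the commit above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `normalize_queue_attributes` and the helper `_to_int` are hypothetical, while the output field names follow the "Tool output highlights" list in the PR description and the input keys are the attribute names returned by the SQS GetQueueAttributes API. The sketch leaves out oldest_message_age_seconds because the description does not show where the tool reads that value from.

```python
import json
from typing import Any, Dict, Optional


def _to_int(value: Any) -> Optional[int]:
    """Convert an SQS string attribute to int; None when absent or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def normalize_queue_attributes(queue_url: str, attrs: Dict[str, str]) -> Dict[str, Any]:
    """Flatten the raw `Attributes` dict from GetQueueAttributes (hypothetical helper)."""
    redrive_policy = None
    redrive_raw = attrs.get("RedrivePolicy")
    if redrive_raw:
        try:
            # RedrivePolicy arrives as a JSON-encoded string.
            redrive_policy = json.loads(redrive_raw)
        except json.JSONDecodeError:
            redrive_policy = None
    return {
        "queue_url": queue_url,
        "visible_count": _to_int(attrs.get("ApproximateNumberOfMessages")),
        "in_flight_count": _to_int(attrs.get("ApproximateNumberOfMessagesNotVisible")),
        "visibility_timeout_seconds": _to_int(attrs.get("VisibilityTimeout")),
        "has_dlq": redrive_policy is not None,
        "redrive_policy": redrive_policy,
        # SQS reports boolean attributes as the strings "true" / "false".
        "is_fifo": attrs.get("FifoQueue") == "true",
        "content_based_deduplication": attrs.get("ContentBasedDeduplication") == "true",
    }
```

A queue with zero visible messages, a non-zero in-flight count, and no DLQ would surface here as `visible_count == 0`, `in_flight_count > 0`, `has_dlq is False`, which is one way the stuck-consumer shape could become visible to a caller.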
Add contract and integration-helper tests for the SQS queue attributes tool so issue Tracer-Cloud#2803 acceptance criteria explicitly cover schema metadata, availability gating, and parameter extraction behavior.
Harden planner postprocessing for non-actionable/meta prompts, align supported-integrations docs intent to assistant handoff, and preserve runtime API-key env values across provider switches so live routing runs do not lose auth mid-session. Also improve live test resilience to transient provider throttling and include targeted reliability fixes for github_mcp coroutine cleanup, telemetry tool classification, and ReplDriver startup fallback when uv is unavailable.
… grounding

Unify SQS defaults and normalization in the integration layer, remove misleading SQS configuration semantics, and align tool/docs/tests to the same contract so behavior cannot drift. Add session-aware planner recovery for follow-up prompts when live planner output is empty or unavailable, so prior-investigation turns resolve to deterministic assistant handoff intents instead of fail-closing under provider instability.
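A single normalization path for the queue limit, as described in the commit above, might look like the sketch below. The names `DEFAULT_SQS_MAX_QUEUES` and `coerce_sqs_max_queues` appear in the PR description; the numeric default, the ceiling `SQS_MAX_QUEUES_CEILING`, and the exact fallback rules are assumptions made for illustration.

```python
from typing import Any

# Assumed values: the PR description does not state the real default or any upper bound.
DEFAULT_SQS_MAX_QUEUES = 25
SQS_MAX_QUEUES_CEILING = 1000


def coerce_sqs_max_queues(value: Any) -> int:
    """Return a usable queue limit, falling back to the shared default."""
    if isinstance(value, bool):
        # bool is a subclass of int; treat True/False as invalid input.
        return DEFAULT_SQS_MAX_QUEUES
    try:
        limit = int(value)
    except (TypeError, ValueError, OverflowError):
        return DEFAULT_SQS_MAX_QUEUES
    if limit < 1:
        return DEFAULT_SQS_MAX_QUEUES
    return min(limit, SQS_MAX_QUEUES_CEILING)
```

Routing every caller (config model, parameter extraction, the tool itself) through one function like this is what keeps the default from drifting between layers.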
Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads (see CONTRIBUTING.md). Run a review by adding a PR comment, then give it ~5-10 minutes (sometimes longer) for results. Fix the feedback and re-trigger until you reach Confidence Score: 5/5. Optional: automate with the greploop skill.
Greptile Summary

This PR addresses two concerns: a new read-only SQS queue-attributes tool (with integration-layer config normalization, per-queue attribute parsing, and pagination) and a routing determinism fix that coerces empty-planner follow-up turns into recoverable handoffs instead of fail-closing, guarded by session state and regex pattern matching.

Confidence Score: 5/5

Safe to merge: the routing changes are logically sound and the SQS tool correctly handles pagination and error cases. The stop_when bypass in finalize_planner_result_with_trace is intentional: FAIL_CLOSED_UNCONFIGURED_INTEGRATION_DETAIL produces a non-empty handoff action (not an empty-unhandled state), so it is never affected by the changed gate. The _recover_when_planner_unavailable path correctly converts planner-unavailable follow-ups from a hard deny into an LLM fallback. The SQS pagination loop handles NextToken correctly for all documented AWS response shapes. The one comment (empty-page loop guard) is minor defensive hardening against a non-documented AWS response, not a reachable bug today.

app/tools/SQSQueueAttributesTool/__init__.py: minor pagination robustness suggestion.

Reviews (2): Last reviewed commit: "fix(sqs,routing): resolve remaining revi..."
…hanges

Replace brittle policy-tag string checks with enum-backed values, remove redundant SQS availability branching, and implement paginated queue listing with explicit truncation signaling so queue diagnostics remain complete and truthful at scale. Also simplify follow-up regex case handling to match the function's lowercase normalization path.
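Paginated listing with an explicit truncation flag, as the commit above describes, can be sketched like this. It is an illustration under assumptions: `list_queue_urls`, its return shape, and the `page_size` parameter are hypothetical, and `client` is expected to behave like a boto3 SQS client, whose `list_queues` accepts `QueueNamePrefix`, `MaxResults`, and `NextToken`. The empty-page check corresponds to the defensive guard the review mentions.

```python
from typing import Any, Dict, List, Optional


def list_queue_urls(client: Any, prefix: str, max_queues: int, page_size: int = 100) -> Dict[str, Any]:
    """Collect up to max_queues queue URLs, following NextToken across pages."""
    urls: List[str] = []
    token: Optional[str] = None
    truncated = False
    while True:
        kwargs: Dict[str, Any] = {"MaxResults": page_size}
        if prefix:
            kwargs["QueueNamePrefix"] = prefix
        if token:
            kwargs["NextToken"] = token
        page = client.list_queues(**kwargs)
        page_urls = page.get("QueueUrls", [])
        token = page.get("NextToken")
        urls.extend(page_urls)
        if len(urls) > max_queues:
            # More results than the limit: cut and report it.
            urls = urls[:max_queues]
            truncated = True
            break
        if len(urls) == max_queues:
            # Exactly at the limit: truncated only if another page exists.
            truncated = bool(token)
            break
        if not token or not page_urls:
            # No further pages, or an empty page that still carries a token.
            break
    return {"queue_urls": urls, "truncated": truncated}
```

Returning `truncated` alongside the URLs lets the caller state that the diagnostics cover only part of the account's queues, instead of silently presenting a capped list as complete.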
586a17d to a6fb8f0

@greptile review
Fixes #2803

This PR closes the SQS queue-visibility gap identified in #2803 by adding a read-only queue-attributes tool and then hardening the implementation based on review feedback so defaults, parsing, docs, and tests all share one consistent contract.
In addition, it includes a root-level routing stability fix uncovered during live validation, so follow-up turns with prior investigation context degrade safely and deterministically instead of intermittently fail-closing.
What changed
1) SQS integration + tool groundwork
- app/integrations/sqs.py
  - DEFAULT_SQS_MAX_QUEUES
  - coerce_sqs_max_queues() as the single normalization path for queue limits
  - SQSConfig.max_queues to use shared default
  - is_configured semantics (region default made it effectively always true)
  - sqs_extract_params() to consume centralized defaults/coercion
- app/tools/SQSQueueAttributesTool/__init__.py
  - aws_backend short-circuit path for synthetic runtime compatibility
- app/types/evidence.py
  - sqs as a valid EvidenceSource
- tests/tools/test_sqs_queue_attributes.py
- docs/sqs_queues.mdx + docs/docs.json
  - content_based_deduplication (previously missing)

2) Routing root fix (determinism, no test-only patching)
During full live routing validation, follow-up prompts with prior state could intermittently degrade when planner output was empty/unavailable. This PR includes a policy-level fix (not scenario hardcoding):
- app/cli/interactive_shell/routing/policy_tags.py
  - COERCE_FOLLOW_UP_WITH_PRIOR_STATE
- app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/llm_action_planner/postprocessing.py
  - follow_up:last_failure
  - follow_up:spike_cause
  - follow_up:last_investigation_summary
- app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/agent_actions.py
- app/cli/interactive_shell/routing/tests/test_policy_traces.py

Why this approach?
I intentionally avoided quick test-only fixes and instead addressed source-of-truth and orchestration behavior:
This keeps behavior robust in real CLI sessions and under live LLM variance.
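The policy-level recovery described in this section can be sketched as below. Only the tag name `COERCE_FOLLOW_UP_WITH_PRIOR_STATE` and the three `follow_up:*` intents come from the PR description; the `PolicyTag` class, the function `recover_follow_up_intent`, its parameters, and the regex patterns are hypothetical stand-ins, since the real patterns and signatures are not shown.

```python
import re
from enum import Enum
from typing import List, Optional, Tuple


class PolicyTag(str, Enum):
    # Member name from the PR description; the string value is an assumption.
    COERCE_FOLLOW_UP_WITH_PRIOR_STATE = "coerce_follow_up_with_prior_state"


# Hypothetical patterns, written in lowercase because the message is
# lowercased before matching (so no re.IGNORECASE flag is needed).
_FOLLOW_UP_PATTERNS = [
    (re.compile(r"\b(why|what).*\bfail"), "follow_up:last_failure"),
    (re.compile(r"\bspike\b"), "follow_up:spike_cause"),
    (re.compile(r"\b(summar|recap)"), "follow_up:last_investigation_summary"),
]


def recover_follow_up_intent(
    message: str,
    planner_actions: List[dict],
    has_prior_investigation: bool,
) -> Optional[Tuple[str, PolicyTag]]:
    """Return a handoff intent and policy tag when the planner produced nothing."""
    if planner_actions:
        return None  # planner output exists; no recovery needed
    if not has_prior_investigation:
        return None  # without prior session state the turn is not coerced
    text = message.strip().lower()
    for pattern, intent in _FOLLOW_UP_PATTERNS:
        if pattern.search(text):
            return intent, PolicyTag.COERCE_FOLLOW_UP_WITH_PRIOR_STATE
    return None
```

Gating on both session state and a pattern match keeps the recovery narrow: a first-turn prompt or an unrelated message with empty planner output still returns `None`, so the caller can fall through to its existing fail-closed handling.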
Validation
Quality gates
- make lint ✅
- make format-check ✅
- make typecheck ✅

Tests

- pytest tests/tools/test_sqs_queue_attributes.py -q ✅
- pytest app/cli/interactive_shell/routing/tests/test_policy_traces.py app/cli/interactive_shell/routing/tests/test_policy_contracts.py -q ✅
- pytest tests/cli/interactive_shell/orchestration/test_agent_actions_harness.py tests/cli/interactive_shell/orchestration/test_agent_actions.py -q ✅
- make test-scope (escalated to make test-cov) ✅ on latest run

Demo / usage
Terminal demo snippet:
opensre integrations verify sqs
opensre investigate --input <alert_with_sqs_context>.json

Tool output highlights:

- visible_count
- in_flight_count
- oldest_message_age_seconds
- visibility_timeout_seconds
- has_dlq / redrive_policy
- is_fifo
- content_based_deduplication

Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.