Skip to content

fix(sqs,routing): consolidate SQS config defaults and routing recovery#2810

Open
amtulifra wants to merge 5 commits into
Tracer-Cloud:mainfrom
amtulifra:fix/sqs-routing-root-stability
Open

fix(sqs,routing): consolidate SQS config defaults and routing recovery#2810
amtulifra wants to merge 5 commits into
Tracer-Cloud:mainfrom
amtulifra:fix/sqs-routing-root-stability

Conversation

@amtulifra

Copy link
Copy Markdown

Fixes #2803

This PR closes the SQS queue-visibility gap identified in #2803 by adding a read-only queue-attributes tool and then hardening the implementation based on review feedback so defaults, parsing, docs, and tests all share one consistent contract.

In addition, it includes a root-level routing stability fix uncovered during live validation, so follow-up turns with prior investigation context degrade safely and deterministically instead of intermittently fail-closing.

What changed

1) SQS integration + tool groundwork

  • app/integrations/sqs.py

    • Introduced shared DEFAULT_SQS_MAX_QUEUES
    • Added coerce_sqs_max_queues() as the single normalization path for queue limits
    • Updated SQSConfig.max_queues to use shared default
    • Removed misleading is_configured semantics (region default made it effectively always true)
    • Updated sqs_extract_params() to consume centralized defaults/coercion
  • app/tools/SQSQueueAttributesTool/__init__.py

    • Switched tool defaults/clamping to integration-layer shared constants/helpers
    • Preserves read-only queue listing + per-queue attributes with normalized typed output
    • Keeps aws_backend short-circuit path for synthetic runtime compatibility
  • app/types/evidence.py

    • Registers sqs as a valid EvidenceSource
  • tests/tools/test_sqs_queue_attributes.py

    • Expanded coverage around default/coercion consistency and existing behavior
    • Includes poison-pill/stuck-consumer fixture assertions and backend short-circuit guardrails
  • docs/sqs_queues.mdx + docs/docs.json

    • Added/registered SQS docs
    • Updated output field table to include content_based_deduplication (previously missing)

2) Routing root fix (determinism, no test-only patching)

During full live routing validation, follow-up prompts with prior state could intermittently degrade when planner output was empty/unavailable. This PR includes a policy-level fix (not scenario hardcoding):

  • app/cli/interactive_shell/routing/policy_tags.py

    • Added COERCE_FOLLOW_UP_WITH_PRIOR_STATE
  • app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/llm_action_planner/postprocessing.py

    • Added session-aware follow-up coercion for:
      • follow_up:last_failure
      • follow_up:spike_cause
      • follow_up:last_investigation_summary
    • Adjusted transform stop condition so follow-up recovery can execute instead of premature fail-close
  • app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/agent_actions.py

    • Added planner-unavailable recovery path that reuses postprocessing recovery logic when feasible, rather than immediately denying
  • app/cli/interactive_shell/routing/tests/test_policy_traces.py

    • Added explicit policy-trace test coverage for follow-up-with-prior-state coercion

Why this approach?

I intentionally avoided quick test-only fixes and instead addressed source-of-truth and orchestration behavior:

  1. SQS defaults/validation now live in one place (integration layer), preventing drift across config/tool/tests/docs.
  2. Follow-up recovery is policy-driven and state-aware, not fixture-specific.
  3. Planner-unavailable behavior degrades gracefully for valid follow-up intents.

This keeps behavior robust in real CLI sessions and under live LLM variance.

Validation

Quality gates

  • make lint
  • make format-check
  • make typecheck

Tests

  • pytest tests/tools/test_sqs_queue_attributes.py -q
  • pytest app/cli/interactive_shell/routing/tests/test_policy_traces.py app/cli/interactive_shell/routing/tests/test_policy_contracts.py -q
  • pytest tests/cli/interactive_shell/orchestration/test_agent_actions_harness.py tests/cli/interactive_shell/orchestration/test_agent_actions.py -q
  • Targeted live routing reruns for previously unstable cases ✅
  • Full gate: make test-scope (escalated to make test-cov) ✅ on latest run

Demo / usage

Terminal demo snippet:

  • opensre integrations verify sqs
  • opensre investigate --input <alert_with_sqs_context>.json

Tool output highlights:

  • visible_count
  • in_flight_count
  • oldest_message_age_seconds
  • visibility_timeout_seconds
  • has_dlq / redrive_policy
  • is_fifo
  • content_based_deduplication
Screenshot 2026-06-13 at 21 18 01

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

Adds a read-only SQS tool that lists queues by prefix and returns per-queue
attributes (depth, in-flight count, oldest-message age, visibility timeout,
DLQ wiring, FIFO flag) — the three fields that expose the poison-pill
stuck-consumer pattern invisible in logs and metrics today.

- app/integrations/sqs.py: SQSConfig, sqs_is_available, sqs_extract_params
- app/tools/SQSQueueAttributesTool: list_queues → get_queue_attributes per queue,
  normalized output, aws_backend short-circuit for synthetic tests
- app/types/evidence.py: register 'sqs' as a valid EvidenceSource
- tests/tools/test_sqs_queue_attributes.py: 7 tests incl. poison-pill fixture
  and backend short-circuit guard
- docs/sqs_queues.mdx + docs/docs.json: user-facing page registered in nav
Add contract and integration-helper tests for the SQS queue attributes tool so issue Tracer-Cloud#2803 acceptance criteria explicitly cover schema metadata, availability gating, and parameter extraction behavior.
Harden planner postprocessing for non-actionable/meta prompts, align supported-integrations docs intent to assistant handoff, and preserve runtime API-key env values across provider switches so live routing runs do not lose auth mid-session. Also improve live test resilience to transient provider throttling and include targeted reliability fixes for github_mcp coroutine cleanup, telemetry tool classification, and ReplDriver startup fallback when uv is unavailable.
… grounding

Unify SQS defaults and normalization in the integration layer, remove misleading SQS configuration semantics, and align tool/docs/tests to the same contract so behavior cannot drift. Add session-aware planner recovery for follow-up prompts when live planner output is empty or unavailable, so prior-investigation turns resolve to deterministic assistant handoff intents instead of fail-closing under provider instability.
@github-actions

Copy link
Copy Markdown
Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@amtulifra amtulifra changed the title fix(sqs,routing): consolidate SQS config defaults and harden follow-up routing recovery fix(sqs,routing): consolidate SQS config defaults and routing recovery Jun 14, 2026
@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR addresses two concerns: a new read-only SQS queue-attributes tool (with integration-layer config normalization, per-queue attribute parsing, and pagination) and a routing determinism fix that coerces empty-planner follow-up turns into recoverable handoffs instead of fail-closing, guarded by session state and regex pattern matching.

  • SQS tool: app/integrations/sqs.py introduces shared defaults (DEFAULT_SQS_MAX_QUEUES, coerce_sqs_max_queues), SQSQueueAttributesTool follows NextToken pagination and emits a truncated flag, and sqs is registered as a valid EvidenceSource.
  • Routing recovery: finalize_planner_result_with_trace now bypasses the early stop_when gate when allow_follow_up_recovery is true, adds COERCE_FOLLOW_UP_WITH_PRIOR_STATE as a new pipeline phase, and _recover_when_planner_unavailable in agent_actions.py attempts a postprocessing-based rescue before falling back to a hard deny when the planner is unavailable.
  • Misc hardening: env_sync.py preserves sensitive env-var credentials during .env resync, github_mcp.py throws GeneratorExit to finalize never-started coroutines on Python 3.12, and test helpers gain a transient-provider-failure skip path.

Confidence Score: 5/5

Safe to merge — the routing changes are logically sound and the SQS tool correctly handles pagination and error cases.

The stop_when bypass in finalize_planner_result_with_trace is intentional: FAIL_CLOSED_UNCONFIGURED_INTEGRATION_DETAIL produces a non-empty handoff action (not an empty-unhandled state), so it is never affected by the changed gate. The _recover_when_planner_unavailable path correctly converts planner-unavailable follow-ups from a hard deny into an LLM fallback. The SQS pagination loop handles NextToken correctly for all documented AWS response shapes. The one comment (empty-page loop guard) is minor defensive hardening against a non-documented AWS response, not a reachable bug today.

app/tools/SQSQueueAttributesTool/init.py — minor pagination robustness suggestion.

Important Files Changed

Filename Overview
app/tools/SQSQueueAttributesTool/init.py New SQS tool with correct pagination loop and truncation handling; minor robustness gap if AWS ever returns an empty page with a NextToken.
app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/llm_action_planner/postprocessing.py Adds COERCE_FOLLOW_UP_WITH_PRIOR_STATE and COERCE_SUPPORTED_INTEGRATIONS_TO_HANDOFF phases, correctly scopes stop_when bypass to allow_follow_up_recovery, and preserves existing early-return denials.
app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/agent_actions.py _recover_when_planner_unavailable correctly routes follow-up messages to LLM fallback (not fail-close) when planner is unavailable; FAIL_CLOSED_META_SELF_IMPROVEMENT check correctly sets denied=True from policy_trace.
app/integrations/sqs.py Clean integration module; coerce_sqs_max_queues is the single normalization point, sqs_config_from_env correctly requires an explicit region, and sqs_is_available handles both source-dict and env-var paths.
app/integrations/github_mcp.py Adds GeneratorExit throw + suppress(BaseException) cleanup for never-started coroutines on Python 3.12; both cleanup paths are safe and order-independent due to suppress.
app/cli/wizard/env_sync.py Correctly skips os.environ.pop for sensitive keys during env resync, preventing mid-session auth regressions; the guard depends on pre-existing _is_sensitive_env_key.
tests/tools/test_sqs_queue_attributes.py Comprehensive test coverage for the new SQS tool including pagination truncation, poison-pill fixture, FIFO detection, partial-attr-failure, and aws_backend short-circuit.

Reviews (2): Last reviewed commit: "fix(sqs,routing): resolve remaining revi..." | Re-trigger Greptile

Comment thread app/tools/SQSQueueAttributesTool/__init__.py Outdated
Comment thread app/integrations/sqs.py
…hanges

Replace brittle policy-tag string checks with enum-backed values, remove redundant SQS availability branching, and implement paginated queue listing with explicit truncation signaling so queue diagnostics remain complete and truthful at scale. Also simplify follow-up regex case handling to match the function's lowercase normalization path.
@amtulifra amtulifra force-pushed the fix/sqs-routing-root-stability branch from 586a17d to a6fb8f0 Compare June 14, 2026 13:02
@amtulifra

Copy link
Copy Markdown
Author

@greptile review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant