fix(sqs,routing): consolidate SQS config defaults and routing recovery by amtulifra · Pull Request #2810 · Tracer-Cloud/opensre

amtulifra · 2026-06-14T12:45:22Z

Fixes #2803

This PR closes the SQS queue-visibility gap identified in #2803 by adding a read-only queue-attributes tool and then hardening the implementation based on review feedback so defaults, parsing, docs, and tests all share one consistent contract.

In addition, it includes a root-level routing stability fix uncovered during live validation, so follow-up turns with prior investigation context degrade safely and deterministically instead of intermittently fail-closing.

What changed

1) SQS integration + tool groundwork

app/integrations/sqs.py
- Introduced shared DEFAULT_SQS_MAX_QUEUES
- Added coerce_sqs_max_queues() as the single normalization path for queue limits
- Updated SQSConfig.max_queues to use shared default
- Removed misleading is_configured semantics (region default made it effectively always true)
- Updated sqs_extract_params() to consume centralized defaults/coercion
app/tools/SQSQueueAttributesTool/__init__.py
- Switched tool defaults/clamping to integration-layer shared constants/helpers
- Preserves read-only queue listing + per-queue attributes with normalized typed output
- Keeps aws_backend short-circuit path for synthetic runtime compatibility
app/types/evidence.py
- Registers sqs as a valid EvidenceSource
tests/tools/test_sqs_queue_attributes.py
- Expanded coverage around default/coercion consistency and existing behavior
- Includes poison-pill/stuck-consumer fixture assertions and backend short-circuit guardrails
docs/sqs_queues.mdx + docs/docs.json
- Added/registered SQS docs
- Updated output field table to include content_based_deduplication (previously missing)

2) Routing root fix (determinism, no test-only patching)

During full live routing validation, follow-up prompts with prior state could intermittently degrade when planner output was empty/unavailable. This PR includes a policy-level fix (not scenario hardcoding):

app/cli/interactive_shell/routing/policy_tags.py
- Added COERCE_FOLLOW_UP_WITH_PRIOR_STATE
app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/llm_action_planner/postprocessing.py
- Added session-aware follow-up coercion for:
  - follow_up:last_failure
  - follow_up:spike_cause
  - follow_up:last_investigation_summary
- Adjusted transform stop condition so follow-up recovery can execute instead of premature fail-close
app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/agent_actions.py
- Added planner-unavailable recovery path that reuses postprocessing recovery logic when feasible, rather than immediately denying
app/cli/interactive_shell/routing/tests/test_policy_traces.py
- Added explicit policy-trace test coverage for follow-up-with-prior-state coercion

Why this approach?

I intentionally avoided quick test-only fixes and instead addressed source-of-truth and orchestration behavior:

SQS defaults/validation now live in one place (integration layer), preventing drift across config/tool/tests/docs.
Follow-up recovery is policy-driven and state-aware, not fixture-specific.
Planner-unavailable behavior degrades gracefully for valid follow-up intents.

This keeps behavior robust in real CLI sessions and under live LLM variance.

Validation

Quality gates

make lint ✅
make format-check ✅
make typecheck ✅

Tests

pytest tests/tools/test_sqs_queue_attributes.py -q ✅
pytest app/cli/interactive_shell/routing/tests/test_policy_traces.py app/cli/interactive_shell/routing/tests/test_policy_contracts.py -q ✅
pytest tests/cli/interactive_shell/orchestration/test_agent_actions_harness.py tests/cli/interactive_shell/orchestration/test_agent_actions.py -q ✅
Targeted live routing reruns for previously unstable cases ✅
Full gate: make test-scope (escalated to make test-cov) ✅ on latest run

Demo / usage

Terminal demo snippet:

opensre integrations verify sqs
opensre investigate --input <alert_with_sqs_context>.json

Tool output highlights:

visible_count
in_flight_count
oldest_message_age_seconds
visibility_timeout_seconds
has_dlq / redrive_policy
is_fifo
content_based_deduplication

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

Adds a read-only SQS tool that lists queues by prefix and returns per-queue attributes (depth, in-flight count, oldest-message age, visibility timeout, DLQ wiring, FIFO flag) — the three fields that expose the poison-pill stuck-consumer pattern invisible in logs and metrics today. - app/integrations/sqs.py: SQSConfig, sqs_is_available, sqs_extract_params - app/tools/SQSQueueAttributesTool: list_queues → get_queue_attributes per queue, normalized output, aws_backend short-circuit for synthetic tests - app/types/evidence.py: register 'sqs' as a valid EvidenceSource - tests/tools/test_sqs_queue_attributes.py: 7 tests incl. poison-pill fixture and backend short-circuit guard - docs/sqs_queues.mdx + docs/docs.json: user-facing page registered in nav

Add contract and integration-helper tests for the SQS queue attributes tool so issue Tracer-Cloud#2803 acceptance criteria explicitly cover schema metadata, availability gating, and parameter extraction behavior.

Harden planner postprocessing for non-actionable/meta prompts, align supported-integrations docs intent to assistant handoff, and preserve runtime API-key env values across provider switches so live routing runs do not lose auth mid-session. Also improve live test resilience to transient provider throttling and include targeted reliability fixes for github_mcp coroutine cleanup, telemetry tool classification, and ReplDriver startup fallback when uv is unavailable.

… grounding Unify SQS defaults and normalization in the integration layer, remove misleading SQS configuration semantics, and align tool/docs/tests to the same contract so behavior cannot drift. Add session-aware planner recovery for follow-up prompts when live planner output is empty or unavailable, so prior-investigation turns resolve to deterministic assistant handoff intents instead of fail-closing under provider instability.

github-actions · 2026-06-14T12:45:31Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

greptile-apps · 2026-06-14T12:53:40Z

Greptile Summary

This PR addresses two concerns: a new read-only SQS queue-attributes tool (with integration-layer config normalization, per-queue attribute parsing, and pagination) and a routing determinism fix that coerces empty-planner follow-up turns into recoverable handoffs instead of fail-closing, guarded by session state and regex pattern matching.

SQS tool: app/integrations/sqs.py introduces shared defaults (DEFAULT_SQS_MAX_QUEUES, coerce_sqs_max_queues), SQSQueueAttributesTool follows NextToken pagination and emits a truncated flag, and sqs is registered as a valid EvidenceSource.
Routing recovery: finalize_planner_result_with_trace now bypasses the early stop_when gate when allow_follow_up_recovery is true, adds COERCE_FOLLOW_UP_WITH_PRIOR_STATE as a new pipeline phase, and _recover_when_planner_unavailable in agent_actions.py attempts a postprocessing-based rescue before falling back to a hard deny when the planner is unavailable.
Misc hardening: env_sync.py preserves sensitive env-var credentials during .env resync, github_mcp.py throws GeneratorExit to finalize never-started coroutines on Python 3.12, and test helpers gain a transient-provider-failure skip path.

Confidence Score: 5/5

Safe to merge — the routing changes are logically sound and the SQS tool correctly handles pagination and error cases.

The stop_when bypass in finalize_planner_result_with_trace is intentional: FAIL_CLOSED_UNCONFIGURED_INTEGRATION_DETAIL produces a non-empty handoff action (not an empty-unhandled state), so it is never affected by the changed gate. The _recover_when_planner_unavailable path correctly converts planner-unavailable follow-ups from a hard deny into an LLM fallback. The SQS pagination loop handles NextToken correctly for all documented AWS response shapes. The one comment (empty-page loop guard) is minor defensive hardening against a non-documented AWS response, not a reachable bug today.

app/tools/SQSQueueAttributesTool/init.py — minor pagination robustness suggestion.

Important Files Changed

Filename	Overview
app/tools/SQSQueueAttributesTool/init.py	New SQS tool with correct pagination loop and truncation handling; minor robustness gap if AWS ever returns an empty page with a NextToken.
app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/llm_action_planner/postprocessing.py	Adds COERCE_FOLLOW_UP_WITH_PRIOR_STATE and COERCE_SUPPORTED_INTEGRATIONS_TO_HANDOFF phases, correctly scopes stop_when bypass to allow_follow_up_recovery, and preserves existing early-return denials.
app/cli/interactive_shell/routing/handle_message_with_agent/orchestration/agent_actions.py	_recover_when_planner_unavailable correctly routes follow-up messages to LLM fallback (not fail-close) when planner is unavailable; FAIL_CLOSED_META_SELF_IMPROVEMENT check correctly sets denied=True from policy_trace.
app/integrations/sqs.py	Clean integration module; coerce_sqs_max_queues is the single normalization point, sqs_config_from_env correctly requires an explicit region, and sqs_is_available handles both source-dict and env-var paths.
app/integrations/github_mcp.py	Adds GeneratorExit throw + suppress(BaseException) cleanup for never-started coroutines on Python 3.12; both cleanup paths are safe and order-independent due to suppress.
app/cli/wizard/env_sync.py	Correctly skips os.environ.pop for sensitive keys during env resync, preventing mid-session auth regressions; the guard depends on pre-existing _is_sensitive_env_key.
tests/tools/test_sqs_queue_attributes.py	Comprehensive test coverage for the new SQS tool including pagination truncation, poison-pill fixture, FIFO detection, partial-attr-failure, and aws_backend short-circuit.

_{Reviews (2): Last reviewed commit: "fix(sqs,routing): resolve remaining revi..." | Re-trigger Greptile}

…hanges Replace brittle policy-tag string checks with enum-backed values, remove redundant SQS availability branching, and implement paginated queue listing with explicit truncation signaling so queue diagnostics remain complete and truthful at scale. Also simplify follow-up regex case handling to match the function's lowercase normalization path.

amtulifra · 2026-06-14T13:04:26Z

@greptile review

amtulifra added 4 commits June 13, 2026 20:46

test(sqs): add availability and extract param coverage

93ffcc3

Add contract and integration-helper tests for the SQS queue attributes tool so issue Tracer-Cloud#2803 acceptance criteria explicitly cover schema metadata, availability gating, and parameter extraction behavior.

amtulifra changed the title ~~fix(sqs,routing): consolidate SQS config defaults and harden follow-up routing recovery~~ fix(sqs,routing): consolidate SQS config defaults and routing recovery Jun 14, 2026

greptile-apps Bot reviewed Jun 14, 2026

View reviewed changes

amtulifra force-pushed the fix/sqs-routing-root-stability branch from 586a17d to a6fb8f0 Compare June 14, 2026 13:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sqs,routing): consolidate SQS config defaults and routing recovery#2810

fix(sqs,routing): consolidate SQS config defaults and routing recovery#2810
amtulifra wants to merge 5 commits into
Tracer-Cloud:mainfrom
amtulifra:fix/sqs-routing-root-stability

amtulifra commented Jun 14, 2026

Uh oh!

github-actions Bot commented Jun 14, 2026

Uh oh!

greptile-apps Bot commented Jun 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

amtulifra commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

amtulifra commented Jun 14, 2026

What changed

1) SQS integration + tool groundwork

2) Routing root fix (determinism, no test-only patching)

Why this approach?

Validation

Quality gates

Tests

Demo / usage

Code Understanding and AI Usage

Checklist before requesting a review

Uh oh!

github-actions Bot commented Jun 14, 2026

Greptile code review

Uh oh!

greptile-apps Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

amtulifra commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 14, 2026 •

edited

Loading