
feat(whatsapp): retry pattern with silent dedup, backoff, DLQ + UI replay#74

Open
mt-alarcon wants to merge 4 commits into evolution-foundation:main from mt-alarcon:pr/whatsapp-retry-pattern

Conversation


@mt-alarcon mt-alarcon commented May 11, 2026

Problem

EvoNexus delivers webhook events via Cloudflare Tunnel. Under flaky connectivity, CF Tunnel occasionally re-delivers the same event, causing duplicate WhatsApp message executions — no dedup layer existed.

Additionally, transient HTTP 5xx and network errors during send_whatsapp / api_request caused immediate failures with no retry, leading to silent message loss.

Solution — 3 layers

Layer 1 — Idempotency key + silent dedup

  • Migration adds idempotency_key column (unique, nullable) to triggers table
  • Same idempotency_key → HTTP 200 + {"status":"duplicate","skipped":true} (silent)
  • Lookup checks both top-level payload and event_data.data
  • Backward-compatible: events without idempotency_key pass through unchanged
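The dedup contract above can be sketched with an in-memory set standing in for the unique-indexed idempotency_key column (`handle_event` and `_seen` are illustrative names, not the PR's code):

```python
# In-memory stand-in for the (trigger_id, idempotency_key) unique index.
_seen = set()

def handle_event(trigger_id, idem_key):
    """Return (status_code, body). Events without a key pass through unchanged."""
    if idem_key is None:
        return 200, {"status": "ok"}  # legacy webhooks: no dedup
    if (trigger_id, idem_key) in _seen:
        # Silent dedup: the duplicate gets the same 200 a fresh event would.
        return 200, {"status": "duplicate", "skipped": True}
    _seen.add((trigger_id, idem_key))
    return 200, {"status": "ok"}
```

In the real implementation the membership test is the DB unique index, so a concurrent duplicate surfaces as an IntegrityError at commit time and is rolled back to the same silent 200.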

Layer 2 — Exponential backoff + jitter

  • ADWs/runner.py: _retry_http_call() — up to 3 attempts, backoff 2^n + jitter(0–0.5s), max 8s, never retries HTTP 4xx
  • evolution_go_client.py: _retry_http_call_client() — same policy
  • send_whatsapp_file(): uploads to Cloudflare R2, sends via /send/media
  • Tests: tests/whatsapp/test_retry_backoff.py (289 lines)
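The retry policy can be sketched as follows (`retry_http_call` is an illustrative stand-in for `_retry_http_call`; the injectable `sleep` parameter is added here only to keep the sketch testable):

```python
import random
import time
import urllib.error

def retry_http_call(do_call, max_attempts=3, cap=8.0, sleep=time.sleep):
    """Retry only on HTTP 5xx / network errors, never on 4xx; between
    attempts sleep min(2**attempt + jitter(0-0.5s), cap) seconds.
    Returns (ok, attempts, error_category) as described above."""
    category = None
    for attempt in range(1, max_attempts + 1):
        try:
            do_call()
            return True, attempt, None
        except urllib.error.HTTPError as e:  # HTTPError first: subclass of URLError
            if e.code < 500:
                return False, attempt, "permanent"  # 4xx is never retried
            category = "transient"
        except (urllib.error.URLError, TimeoutError):
            category = "transient"
        if attempt < max_attempts:
            sleep(min(2 ** attempt + random.uniform(0, 0.5), cap))
    return False, max_attempts, category
```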

Layer 3 — DLQ + UI Replay + instrumentation

  • Dead Letter Queue: failed triggers classified by category (permanent / transient / unknown)
  • UI Replay in Triggers.tsx — replay button, status badges, error detail panel
  • Execution logs include attempts, final_status, category fields
  • Tests: dashboard/tests/test_wpp_retry_pr3.py (203 lines)
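The category split driving the DLQ could look roughly like this (assumed heuristics; the PR's `_classify_error` may differ in detail):

```python
import urllib.error

def classify_error(exc):
    """Map an exception to a DLQ category (assumed heuristics)."""
    if isinstance(exc, urllib.error.HTTPError):
        return "transient" if exc.code >= 500 else "permanent"
    if isinstance(exc, (TimeoutError, ConnectionError, urllib.error.URLError)):
        return "transient"
    return "unknown"

def final_status(category):
    """Only transient failures land in the retryable DLQ state."""
    return "failed_retryable" if category == "transient" else "failed"
```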

Production evidence

Active at MTA digital since 11/05/2026, validated via:

  • curl re-delivery of same event ×2 → 1 execution, 1 duplicate skip
  • CF Tunnel flap simulation → backoff fires, message delivered on attempt 2

Risk

  • Silent dedup: backward-compatible (null idempotency_key bypasses dedup) — low risk
  • Backoff: adds latency only on failure paths — low risk
  • DLQ + UI: additive, no changes to existing trigger execution path — low risk
  • 0 breaking changes

In production at the MTA digital agency since 11/05/2026, validated via curl re-delivery ×2 → 1 execution

Summary by Sourcery

Introduce a resilient WhatsApp delivery pipeline with idempotent trigger executions, retry-aware failure handling, and operator tooling for replay and monitoring.

New Features:

  • Add idempotency-key based deduplication for trigger webhooks to avoid duplicate executions.
  • Expose a replay endpoint and UI flow to re-run failed trigger executions with rate limiting and safety checks.
  • Provide operational trigger stats API and dashboard badge for DLQ size, replays, and WhatsApp volume.
  • Add helper to send WhatsApp media via Cloudflare R2 and Evolution Go, including media-type detection and env-based configuration.
  • Introduce Webshare proxy configuration support for the Evolution Go client.

Enhancements:

  • Classify trigger failures as transient or permanent and route transient ones to a retryable DLQ state.
  • Extend trigger execution model and auto-migrations with idempotency, error category, and replay metadata.
  • Add exponential backoff with jitter for WhatsApp sends and Evolution Go API requests, avoiding retries on HTTP 4xx.
  • Improve logging and instrumentation for trigger execution attempts, categories, and circuit-breaker watermark breaches.

Tests:

  • Add WhatsApp backoff tests covering HTTP/URLError retry behaviour and latency bounds.
  • Add dashboard tests for error classification, replay endpoint behaviour, stats shape, and watermark/rate-limit logic.

Marcello Alarcon and others added 4 commits May 11, 2026 15:28
…ps 1+2)

Step 1 — Migration (models.py + app.py):
- TriggerExecution gains 3 nullable columns: idempotency_key, error_category, last_replay_at
- to_dict() exposes the 3 new fields
- Idempotent auto-migrate at startup: ALTER TABLE + IF NOT EXISTS in each block
- Partial unique index uq_trigger_idem (trigger_id, idempotency_key) WHERE NOT NULL
- Basic index ix_trigger_executions_idem_key for lookups by key
- SQLite 3.51 confirmed — native partial index; EXPLAIN QUERY PLAN confirms index usage
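The partial unique index behaviour can be demonstrated against an in-memory SQLite database (schema trimmed to the relevant columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trigger_executions (
    id INTEGER PRIMARY KEY, trigger_id INTEGER, idempotency_key TEXT)""")
# Uniqueness applies only to non-NULL keys, mirroring uq_trigger_idem.
conn.execute("""CREATE UNIQUE INDEX uq_trigger_idem
    ON trigger_executions (trigger_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL""")

conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, 'k')")
# NULL keys are exempt: legacy events can insert freely.
conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, NULL)")
conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, NULL)")
try:
    conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, 'k')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # this is the race the webhook handler rolls back on
```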

Step 2 — Silent dedup (triggers.py):
- webhook_receiver extracts idem_key from idempotency_key / messageId / data.messageId
- If the key already exists: log idempotent_replay + silent 200 OK (pattern F6)
- Race condition: IntegrityError on db.commit() → rollback + silent 200 OK
- Legacy (GitHub, Stripe, Linear): no key → idem_key=None → normal flow unchanged
- Cleanup: current_app moved to a top-level import; inline imports removed

Tests passed: idempotent migration up/down, partial index uniqueness, NULLs unconstrained,
key extraction (6 cases), race condition via IntegrityError, EXPLAIN QUERY PLAN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t (PR-2)

- Add _retry_http_call in ADWs/runner.py: 3 attempts, sleep
  min(2^attempt + jitter(0-0.5), 8s), retry only on HTTP 5xx/URLError/timeout,
  no retry on 4xx, returns (ok, attempts, error_category)
- Refactor send_whatsapp to use the helper; structured log with attempts +
  final_status + category on each call
- Add _retry_http_call_client in evolution_go_client.py: same mechanics
  with structured JSON logs (evt=api_request_retry/failed)
- Refactor api_request to use the helper; remove sys.exit(1) on HTTPError/
  URLError — now raises for library callers; CLI main() catches and calls
  sys.exit(1), keeping behavior identical to before
- Worst-case latency (3 attempts, all 5xx): ~4-7s total sleep <= 8s
- Synthetic tests: 9/9 passing (5 runner + 4 evolution_go_client)
  covering 500x3, 502x2+200, 400 without retry, URLError, sleep budget

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…teps 4+5+6)

Step 4: _classify_error helper in triggers.py; _execute_trigger populates
error_category (transient | permanent) and uses status failed_retryable
for timeouts and HTTP 5xx, status failed for permanent errors.

Step 5: POST /api/triggers/executions/<id>/replay with a 60s rate limit,
confirmation modal on the frontend (preview: recipient/command/timestamp),
Replay button shown only for status=failed_retryable, status replayed set on
the original execution once the replay starts.

Step 6: GET /api/triggers/stats returns 8 metrics (total, by_status,
dlq_size, idempotent_replays, wpp_command_count, retries_observed,
circuit_breaker_watermark_hit). Badge in the /triggers UI with a yellow alert
when watermark_hit=true (>50 WPP/day). WARNING log when it flips to True.
Structured evt= logs in webhook_receiver, _execute_trigger, replay.

Synthetic tests: 24/24 pass (unit: _classify_error 13 scenarios,
watermark 5 scenarios, rate-limit 4 scenarios; integration: auth guard +
stats shape against the real dashboard.db).
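The 60s replay rate limit from Step 5 can be sketched with an in-memory map standing in for last_replay_at (names here are illustrative):

```python
import time

# execution_id -> timestamp of the last allowed replay (stand-in for last_replay_at).
_last_replay = {}

def replay_allowed(execution_id, now=None, window=60.0):
    """Allow at most one replay per execution per `window` seconds."""
    now = time.time() if now is None else now
    last = _last_replay.get(execution_id)
    if last is not None and now - last < window:
        return False  # still inside the rate-limit window
    _last_replay[execution_id] = now
    return True
```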

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug: _parse_webhook_event moves the payload down to the event_data['data'] level
(default else branch), but the silent dedup only looked for 'idempotency_key'
at the top level — it never found the key in real N8N payloads.

Symptom: 2 identical POSTs from N8N created 2 TriggerExecution rows with
idempotency_key=NULL; dedup was a complete no-op.

Fix: add a data.idempotency_key fallback to the lookup chain
(keeping the order top-level > data.* for payloads that differ).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
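The fixed lookup chain might be sketched like this (the relative order of the two nested fallbacks is an assumption, not taken verbatim from the diff):

```python
def extract_idem_key(event_data):
    """Idempotency-key lookup: top-level fields first, then the nested
    event_data['data'] level where _parse_webhook_event places N8N payloads."""
    data = event_data.get("data") or {}
    return (event_data.get("idempotency_key")
            or event_data.get("messageId")
            or data.get("idempotency_key")
            or data.get("messageId"))
```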

sourcery-ai Bot commented May 11, 2026

Reviewer's Guide

Implements a three-layer WhatsApp reliability pattern: database-backed idempotency for trigger executions, exponential backoff retries for WhatsApp/Evolution Go HTTP calls, and a DLQ with replay UI and operational stats, including tests for retry logic and pipeline instrumentation.

Sequence diagram for webhook idempotency and replay pipeline

sequenceDiagram
    actor CF_Tunnel
    participant webhook_receiver
    participant TriggerExecution_DB as TriggerExecution
    participant replay_execution
    participant _execute_trigger

    CF_Tunnel->>webhook_receiver: POST /api/triggers/<id>/webhook
    webhook_receiver->>webhook_receiver: extract idem_key
    alt idem_key exists and duplicate
        webhook_receiver->>TriggerExecution_DB: query.filter_by(trigger_id, idempotency_key)
        TriggerExecution_DB-->>webhook_receiver: existing execution
        webhook_receiver-->>CF_Tunnel: 200 {status: ok}
    else new or no idem_key
        webhook_receiver->>TriggerExecution_DB: INSERT TriggerExecution(idempotency_key)
        alt IntegrityError (unique index)
            webhook_receiver->>TriggerExecution_DB: rollback()
            webhook_receiver-->>CF_Tunnel: 200 {status: ok}
        else committed
            webhook_receiver-->>CF_Tunnel: 200 {status: ok}
            webhook_receiver->>_execute_trigger: _execute_trigger(trigger_id, execution_id, event_data)
        end
    end

    Note over _execute_trigger,TriggerExecution_DB: classify result via _classify_error
    _execute_trigger->>TriggerExecution_DB: update status=completed|failed_retryable|failed
    _execute_trigger-->>TriggerExecution_DB: set error_category

    actor Operator
    Operator->>replay_execution: POST /api/triggers/executions/<exec_id>/replay
    replay_execution->>TriggerExecution_DB: get(exec_id)
    alt status in failed_retryable|failed and not rate_limited
        replay_execution->>TriggerExecution_DB: INSERT new TriggerExecution(pending, same idempotency_key)
        replay_execution->>TriggerExecution_DB: update original status=replayed, last_replay_at
        alt IntegrityError (idempotency_key already completed)
            replay_execution->>TriggerExecution_DB: rollback()
            replay_execution-->>Operator: 200 {status: ok, note: idempotent_skip}
        else committed
            replay_execution-->>Operator: 200 {status: ok, new_execution_id}
            replay_execution->>_execute_trigger: _execute_trigger(trigger_id, new_execution_id, original_event_data)
        end
    else not_replayable or rate_limited
        replay_execution-->>Operator: 4xx error
    end

Sequence diagram for WhatsApp and Evolution Go HTTP retry backoff

sequenceDiagram
    participant send_whatsapp
    participant _retry_http_call
    participant Evolution_Go_API as Evolution_Go

    send_whatsapp->>_retry_http_call: _retry_http_call(_do_call)
    loop attempts <= max_attempts
        _retry_http_call->>Evolution_Go_API: _do_call() /send/text
        alt HTTP 2xx
            Evolution_Go_API-->>_retry_http_call: ok=True
            _retry_http_call-->>send_whatsapp: (ok=True, attempts, category=None)
            send_whatsapp-->>send_whatsapp: log attempts, final_status
        else HTTP 4xx
            Evolution_Go_API-->>_retry_http_call: raise HTTPError(code<500)
            _retry_http_call-->>send_whatsapp: ok=False, attempts, category=permanent
        else HTTP 5xx
            Evolution_Go_API-->>_retry_http_call: raise HTTPError(code>=500)
            alt attempt < max_attempts
                _retry_http_call-->>_retry_http_call: sleep backoff+jitter
            else last attempt
                _retry_http_call-->>send_whatsapp: ok=False, attempts, category=transient
            end
        else URLError or timeout
            Evolution_Go_API-->>_retry_http_call: raise URLError/timeout
            alt attempt < max_attempts
                _retry_http_call-->>_retry_http_call: sleep backoff+jitter
            else last attempt
                _retry_http_call-->>send_whatsapp: ok=False, attempts, category=transient
            end
        end
    end

    participant api_request
    participant _retry_http_call_client

    api_request->>_retry_http_call_client: _retry_http_call_client(_do_call)
    _retry_http_call_client->>Evolution_Go_API: _do_call() /instance/*
    alt success within retries
        Evolution_Go_API-->>_retry_http_call_client: JSON response
        _retry_http_call_client-->>api_request: result
    else persistent HTTP 5xx/URLError
        Evolution_Go_API-->>_retry_http_call_client: repeated errors
        _retry_http_call_client-->>api_request: raise last exception
    end

File-Level Changes

Change Details Files
Add idempotent trigger executions with DB-level uniqueness and silent deduplication of webhook replays.
  • Extend TriggerExecution model and auto-migration to include idempotency_key, error_category, and last_replay_at columns plus indices and partial unique constraint on (trigger_id, idempotency_key).
  • Extract idempotency key from webhook payloads (various WhatsApp/N8N-compatible fields), check for existing executions, and short-circuit duplicate requests with HTTP 200.
  • Handle race conditions on concurrent inserts via IntegrityError, rolling back and treating the second request as a deduplicated replay without executing again.
dashboard/backend/app.py
dashboard/backend/models.py
dashboard/backend/routes/triggers.py
Introduce error classification, DLQ semantics, replay endpoint, and trigger stats API for monitoring and replaying failed executions.
  • Add _classify_error helper and use it in _execute_trigger to distinguish transient vs permanent failures, setting status failed_retryable and error_category accordingly and logging structured events.
  • Add /api/triggers/executions/<exec_id>/replay endpoint with auth, status gating, 60s per-execution rate limiting, creation of new TriggerExecution rows, and silent success when idempotency uniqueness is violated.
  • Add /api/triggers/stats endpoint computing by_status aggregates, DLQ size, idempotent replay count, WhatsApp command volume, retry observations, and watermark flag with logging when thresholds are exceeded.
dashboard/backend/routes/triggers.py
Add WhatsApp sending helpers and generalized HTTP retry with exponential backoff + jitter for WhatsApp notifications and Evolution Go client API calls.
  • Implement send_whatsapp_file to upload media to Cloudflare R2 via boto3, generate presigned URLs, infer media type, and send via Evolution Go /send/media with configuration from env.
  • Implement _retry_http_call in runner.py to wrap HTTP calls with up to 3 attempts, exponential backoff plus jitter, and selective retry on HTTP 5xx and network errors while avoiding retries for HTTP 4xx.
  • Update send_whatsapp to use _retry_http_call and emit structured logs with attempts and error category, and adjust Evolution Go client api_request to use a similar _retry_http_call_client with structured JSON logging and non-exiting error propagation; move HTTPError/URLError sys.exit handling into main().
ADWs/runner.py
.claude/skills/int-evolution-go/scripts/evolution_go_client.py
Enhance Triggers UI with DLQ/retry visibility, replay UX, and stats badge.
  • Extend Execution type to surface error_category, idempotency_key, and last_replay_at fields from backend JSON, and add Stats/ReplayPreview types and state.
  • Fetch /triggers/stats to display DLQ size, replay counts, WhatsApp command volume per day, and a circuit breaker watermark badge, refreshing when executions modal opens.
  • Add per-execution replay button for failed_retryable statuses, a replay confirmation modal that previews target recipient and command parsed from event_data, and wire it to the replay endpoint with toast feedback and executions refresh; also show error_category label alongside status badge.
dashboard/frontend/src/pages/Triggers.tsx
Add automated tests covering retry backoff logic, error classification, replay and stats endpoints, and watermark/rate-limit helpers.
  • Add tests/whatsapp/test_retry_backoff.py to unit-test _retry_http_call in runner.py and _retry_http_call_client/api_request in evolution_go_client, including latency budget and retry behaviour for HTTP 4xx/5xx and URLError cases.
  • Add dashboard/tests/test_wpp_retry_pr3.py to test _classify_error markers, validate stats endpoint JSON shape (when DB present), and unit-test watermark and 60s replay rate-limit logic; include guarded integration tests for the replay endpoint.
  • Ensure tests are import-safe given Python version differences by selectively loading helper code and mocking environment/config as needed.
tests/whatsapp/test_retry_backoff.py
dashboard/tests/test_wpp_retry_pr3.py


@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The new trigger_stats endpoint builds SQL via string interpolation (e.g. since_clause and the IN list for wpp_ids); even though current inputs are constrained, it would be safer and more maintainable to use SQLAlchemy parameters / the ORM for these aggregates instead of composing raw SQL strings.
  • There are now two separate retry implementations (_retry_http_call in runner.py and _retry_http_call_client in evolution_go_client.py) that share very similar logic; consider extracting a shared helper or at least consolidating behavior/logging to avoid divergence over time.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new `trigger_stats` endpoint builds SQL via string interpolation (e.g. `since_clause` and the `IN` list for `wpp_ids`); even though current inputs are constrained, it would be safer and more maintainable to use SQLAlchemy parameters / the ORM for these aggregates instead of composing raw SQL strings.
- There are now two separate retry implementations (`_retry_http_call` in `runner.py` and `_retry_http_call_client` in `evolution_go_client.py`) that share very similar logic; consider extracting a shared helper or at least consolidating behavior/logging to avoid divergence over time.

## Individual Comments

### Comment 1
<location path="dashboard/tests/test_wpp_retry_pr3.py" line_range="87-88" />
<code_context>
+    _DB_EXISTS = False
+
+
+@pytest.mark.skipif(not _DB_EXISTS, reason="dashboard.db not found — integration tests skipped")
+class TestReplayEndpoint:
+    """Integration tests for POST /api/triggers/executions/<id>/replay."""
+
</code_context>
<issue_to_address>
**issue (testing):** Integration tests being skipped when dashboard.db is absent may hide regressions in CI; consider using a dedicated test database instead

These integration tests currently depend on an existing on-disk dashboard.db, so in a clean CI environment (or if the DB path changes), the whole class will silently skip and replay/stats won’t be covered.

Instead, consider wiring the app to a dedicated test database (e.g., in-memory/temporary SQLite) via test config, creating the minimal schema and a seed Trigger in a fixture, and removing the dashboard.db existence check (or replacing it with an explicit opt-out like an env var for local runs). This keeps the tests consistently exercising replay and stats in CI.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


