
feat(whatsapp): retry pattern with silent dedup, backoff, DLQ + UI replay#74

Open
mt-alarcon wants to merge 4 commits into evolution-foundation:main from mt-alarcon:pr/whatsapp-retry-pattern

Conversation


@mt-alarcon mt-alarcon commented May 11, 2026

Problem

EvoNexus delivers webhook events via Cloudflare Tunnel. Under flaky connectivity, CF Tunnel occasionally re-delivers the same event, causing duplicate WhatsApp message executions — no dedup layer existed.

Additionally, transient HTTP 5xx and network errors during send_whatsapp / api_request caused immediate failures with no retry, leading to silent message loss.

Solution — 3 layers

Layer 1 — Idempotency key + silent dedup

  • Migration adds idempotency_key column (unique, nullable) to triggers table
  • Same idempotency_key → HTTP 200 + {"status":"duplicate","skipped":true} (silent)
  • Lookup checks both top-level payload and event_data.data
  • Backward-compatible: events without idempotency_key pass through unchanged
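The dedup contract above can be sketched with an in-memory set standing in for the unique-indexed idempotency_key column (`handle_event` and `_seen` are illustrative names, not the PR's code):

```python
# In-memory stand-in for the (trigger_id, idempotency_key) unique index.
_seen = set()

def handle_event(trigger_id, idem_key):
    """Return (status_code, body). Events without a key pass through unchanged."""
    if idem_key is None:
        return 200, {"status": "ok"}  # legacy webhooks: no dedup
    if (trigger_id, idem_key) in _seen:
        # Silent dedup: the duplicate gets the same 200 a fresh event would.
        return 200, {"status": "duplicate", "skipped": True}
    _seen.add((trigger_id, idem_key))
    return 200, {"status": "ok"}
```

In the real implementation the membership test is the DB unique index, so a concurrent duplicate surfaces as an IntegrityError at commit time and is rolled back to the same silent 200.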

Layer 2 — Exponential backoff + jitter

  • ADWs/runner.py: _retry_http_call() — up to 3 attempts, backoff 2^n + jitter(0–0.5s), max 8s, never retries HTTP 4xx
  • evolution_go_client.py: _retry_http_call_client() — same policy
  • send_whatsapp_file(): uploads to Cloudflare R2, sends via /send/media
  • Tests: tests/whatsapp/test_retry_backoff.py (289 lines)
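The retry policy can be sketched as follows (`retry_http_call` is an illustrative stand-in for `_retry_http_call`; the injectable `sleep` parameter is added here only to keep the sketch testable):

```python
import random
import time
import urllib.error

def retry_http_call(do_call, max_attempts=3, cap=8.0, sleep=time.sleep):
    """Retry only on HTTP 5xx / network errors, never on 4xx; between
    attempts sleep min(2**attempt + jitter(0-0.5s), cap) seconds.
    Returns (ok, attempts, error_category) as described above."""
    category = None
    for attempt in range(1, max_attempts + 1):
        try:
            do_call()
            return True, attempt, None
        except urllib.error.HTTPError as e:  # HTTPError first: subclass of URLError
            if e.code < 500:
                return False, attempt, "permanent"  # 4xx is never retried
            category = "transient"
        except (urllib.error.URLError, TimeoutError):
            category = "transient"
        if attempt < max_attempts:
            sleep(min(2 ** attempt + random.uniform(0, 0.5), cap))
    return False, max_attempts, category
```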

Layer 3 — DLQ + UI Replay + instrumentation

  • Dead Letter Queue: failed triggers classified by category (permanent / transient / unknown)
  • UI Replay in Triggers.tsx — replay button, status badges, error detail panel
  • Execution logs include attempts, final_status, category fields
  • Tests: dashboard/tests/test_wpp_retry_pr3.py (203 lines)
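The category split driving the DLQ could look roughly like this (assumed heuristics; the PR's `_classify_error` may differ in detail):

```python
import urllib.error

def classify_error(exc):
    """Map an exception to a DLQ category (assumed heuristics)."""
    if isinstance(exc, urllib.error.HTTPError):
        return "transient" if exc.code >= 500 else "permanent"
    if isinstance(exc, (TimeoutError, ConnectionError, urllib.error.URLError)):
        return "transient"
    return "unknown"

def final_status(category):
    """Only transient failures land in the retryable DLQ state."""
    return "failed_retryable" if category == "transient" else "failed"
```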

Production evidence

Active at MTA digital since 11/05/2026, validated via:

  • curl re-delivery of same event ×2 → 1 execution, 1 duplicate skip
  • CF Tunnel flap simulation → backoff fires, message delivered on attempt 2

Risk

  • Silent dedup: backward-compatible (null idempotency_key bypasses dedup) — low risk
  • Backoff: adds latency only on failure paths — low risk
  • DLQ + UI: additive, no changes to existing trigger execution path — low risk
  • 0 breaking changes

In production at the MTA digital agency since 11/05/2026, validated via curl re-delivery ×2 → 1 execution

Summary by Sourcery

Introduce a resilient WhatsApp delivery pipeline with idempotent trigger executions, retry-aware failure handling, and operator tooling for replay and monitoring.

New Features:

  • Add idempotency-key based deduplication for trigger webhooks to avoid duplicate executions.
  • Expose a replay endpoint and UI flow to re-run failed trigger executions with rate limiting and safety checks.
  • Provide operational trigger stats API and dashboard badge for DLQ size, replays, and WhatsApp volume.
  • Add helper to send WhatsApp media via Cloudflare R2 and Evolution Go, including media-type detection and env-based configuration.
  • Introduce Webshare proxy configuration support for the Evolution Go client.

Enhancements:

  • Classify trigger failures as transient or permanent and route transient ones to a retryable DLQ state.
  • Extend trigger execution model and auto-migrations with idempotency, error category, and replay metadata.
  • Add exponential backoff with jitter for WhatsApp sends and Evolution Go API requests, avoiding retries on HTTP 4xx.
  • Improve logging and instrumentation for trigger execution attempts, categories, and circuit-breaker watermark breaches.

Tests:

  • Add WhatsApp backoff tests covering HTTP/URLError retry behaviour and latency bounds.
  • Add dashboard tests for error classification, replay endpoint behaviour, stats shape, and watermark/rate-limit logic.

Marcello Alarcon and others added 4 commits May 11, 2026 15:28
…ps 1+2)

Step 1 — Migration (models.py + app.py):
- TriggerExecution gains 3 nullable columns: idempotency_key, error_category, last_replay_at
- to_dict() exposes the 3 new fields
- Idempotent auto-migrate at startup: ALTER TABLE + IF NOT EXISTS in each block
- Partial unique index uq_trigger_idem (trigger_id, idempotency_key) WHERE NOT NULL
- Basic index ix_trigger_executions_idem_key for lookups by key
- SQLite 3.51 confirmed — native partial index; EXPLAIN QUERY PLAN confirms index usage
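The partial unique index behaviour can be demonstrated against an in-memory SQLite database (schema trimmed to the relevant columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trigger_executions (
    id INTEGER PRIMARY KEY, trigger_id INTEGER, idempotency_key TEXT)""")
# Uniqueness applies only to non-NULL keys, mirroring uq_trigger_idem.
conn.execute("""CREATE UNIQUE INDEX uq_trigger_idem
    ON trigger_executions (trigger_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL""")

conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, 'k')")
# NULL keys are exempt: legacy events can insert freely.
conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, NULL)")
conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, NULL)")
try:
    conn.execute("INSERT INTO trigger_executions (trigger_id, idempotency_key) VALUES (1, 'k')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # this is the race the webhook handler rolls back on
```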

Step 2 — Silent dedup (triggers.py):
- webhook_receiver extracts idem_key from idempotency_key / messageId / data.messageId
- If the key already exists: log idempotent_replay + silent 200 OK (pattern F6)
- Race condition: IntegrityError on db.commit() → rollback + silent 200 OK
- Legacy (GitHub, Stripe, Linear): no key → idem_key=None → normal flow unchanged
- Cleanup: current_app moved to a top-level import; inline imports removed

Tests passed: idempotent migration up/down, partial index uniqueness, NULLs unconstrained,
key extraction (6 cases), race condition via IntegrityError, EXPLAIN QUERY PLAN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t (PR-2)

- Add _retry_http_call in ADWs/runner.py: 3 attempts, sleep
  min(2^attempt + jitter(0-0.5), 8s), retry only on HTTP 5xx/URLError/timeout,
  no retry on 4xx, returns (ok, attempts, error_category)
- Refactor send_whatsapp to use the helper; structured log with attempts +
  final_status + category on each call
- Add _retry_http_call_client in evolution_go_client.py: same mechanics
  with structured JSON logs (evt=api_request_retry/failed)
- Refactor api_request to use the helper; remove sys.exit(1) on HTTPError/
  URLError — now raises for library callers; CLI main() catches and calls
  sys.exit(1), keeping behavior identical to before
- Worst-case latency (3 attempts, all 5xx): ~4-7s total sleep <= 8s
- Synthetic tests: 9/9 passing (5 runner + 4 evolution_go_client)
  covering 500x3, 502x2+200, 400 without retry, URLError, sleep budget

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…teps 4+5+6)

Step 4: _classify_error helper in triggers.py; _execute_trigger populates
error_category (transient | permanent) and uses status failed_retryable
for timeouts and HTTP 5xx, status failed for permanent errors.

Step 5: POST /api/triggers/executions/<id>/replay with a 60s rate limit,
confirmation modal on the frontend (preview: recipient/command/timestamp),
Replay button shown only for status=failed_retryable, status replayed set on
the original execution once the replay starts.

Step 6: GET /api/triggers/stats returns 8 metrics (total, by_status,
dlq_size, idempotent_replays, wpp_command_count, retries_observed,
circuit_breaker_watermark_hit). Badge in the /triggers UI with a yellow alert
when watermark_hit=true (>50 WPP/day). WARNING log when it flips to True.
Structured evt= logs in webhook_receiver, _execute_trigger, replay.

Synthetic tests: 24/24 pass (unit: _classify_error 13 scenarios,
watermark 5 scenarios, rate-limit 4 scenarios; integration: auth guard +
stats shape against the real dashboard.db).
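The 60s replay rate limit from Step 5 can be sketched with an in-memory map standing in for last_replay_at (names here are illustrative):

```python
import time

# execution_id -> timestamp of the last allowed replay (stand-in for last_replay_at).
_last_replay = {}

def replay_allowed(execution_id, now=None, window=60.0):
    """Allow at most one replay per execution per `window` seconds."""
    now = time.time() if now is None else now
    last = _last_replay.get(execution_id)
    if last is not None and now - last < window:
        return False  # still inside the rate-limit window
    _last_replay[execution_id] = now
    return True
```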

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug: _parse_webhook_event moves the payload down to the event_data['data'] level
(default else branch), but the silent dedup only looked for 'idempotency_key'
at the top level — it never found the key in real N8N payloads.

Symptom: 2 identical POSTs from N8N created 2 TriggerExecution rows with
idempotency_key=NULL; dedup was a complete no-op.

Fix: add a data.idempotency_key fallback to the lookup chain
(keeping the order top-level > data.* for payloads that differ).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
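The fixed lookup chain might be sketched like this (the relative order of the two nested fallbacks is an assumption, not taken verbatim from the diff):

```python
def extract_idem_key(event_data):
    """Idempotency-key lookup: top-level fields first, then the nested
    event_data['data'] level where _parse_webhook_event places N8N payloads."""
    data = event_data.get("data") or {}
    return (event_data.get("idempotency_key")
            or event_data.get("messageId")
            or data.get("idempotency_key")
            or data.get("messageId"))
```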

sourcery-ai Bot commented May 11, 2026

Reviewer's Guide

Implements a three-layer WhatsApp reliability pattern: database-backed idempotency for trigger executions, exponential backoff retries for WhatsApp/Evolution Go HTTP calls, and a DLQ with replay UI and operational stats, including tests for retry logic and pipeline instrumentation.

Sequence diagram for webhook idempotency and replay pipeline

sequenceDiagram
    actor CF_Tunnel
    participant webhook_receiver
    participant TriggerExecution_DB as TriggerExecution
    participant replay_execution
    participant _execute_trigger

    CF_Tunnel->>webhook_receiver: POST /api/triggers/<id>/webhook
    webhook_receiver->>webhook_receiver: extract idem_key
    alt idem_key exists and duplicate
        webhook_receiver->>TriggerExecution_DB: query.filter_by(trigger_id, idempotency_key)
        TriggerExecution_DB-->>webhook_receiver: existing execution
        webhook_receiver-->>CF_Tunnel: 200 {status: ok}
    else new or no idem_key
        webhook_receiver->>TriggerExecution_DB: INSERT TriggerExecution(idempotency_key)
        alt IntegrityError (unique index)
            webhook_receiver->>TriggerExecution_DB: rollback()
            webhook_receiver-->>CF_Tunnel: 200 {status: ok}
        else committed
            webhook_receiver-->>CF_Tunnel: 200 {status: ok}
            webhook_receiver->>_execute_trigger: _execute_trigger(trigger_id, execution_id, event_data)
        end
    end

    Note over _execute_trigger,TriggerExecution_DB: classify result via _classify_error
    _execute_trigger->>TriggerExecution_DB: update status=completed|failed_retryable|failed
    _execute_trigger-->>TriggerExecution_DB: set error_category

    actor Operator
    Operator->>replay_execution: POST /api/triggers/executions/<exec_id>/replay
    replay_execution->>TriggerExecution_DB: get(exec_id)
    alt status in failed_retryable|failed and not rate_limited
        replay_execution->>TriggerExecution_DB: INSERT new TriggerExecution(pending, same idempotency_key)
        replay_execution->>TriggerExecution_DB: update original status=replayed, last_replay_at
        alt IntegrityError (idempotency_key already completed)
            replay_execution->>TriggerExecution_DB: rollback()
            replay_execution-->>Operator: 200 {status: ok, note: idempotent_skip}
        else committed
            replay_execution-->>Operator: 200 {status: ok, new_execution_id}
            replay_execution->>_execute_trigger: _execute_trigger(trigger_id, new_execution_id, original_event_data)
        end
    else not_replayable or rate_limited
        replay_execution-->>Operator: 4xx error
    end

Sequence diagram for WhatsApp and Evolution Go HTTP retry backoff

sequenceDiagram
    participant send_whatsapp
    participant _retry_http_call
    participant Evolution_Go_API as Evolution_Go

    send_whatsapp->>_retry_http_call: _retry_http_call(_do_call)
    loop attempts <= max_attempts
        _retry_http_call->>Evolution_Go_API: _do_call() /send/text
        alt HTTP 2xx
            Evolution_Go_API-->>_retry_http_call: ok=True
            _retry_http_call-->>send_whatsapp: (ok=True, attempts, category=None)
            send_whatsapp-->>send_whatsapp: log attempts, final_status
        else HTTP 4xx
            Evolution_Go_API-->>_retry_http_call: raise HTTPError(code<500)
            _retry_http_call-->>send_whatsapp: ok=False, attempts, category=permanent
        else HTTP 5xx
            Evolution_Go_API-->>_retry_http_call: raise HTTPError(code>=500)
            alt attempt < max_attempts
                _retry_http_call-->>_retry_http_call: sleep backoff+jitter
            else last attempt
                _retry_http_call-->>send_whatsapp: ok=False, attempts, category=transient
            end
        else URLError or timeout
            Evolution_Go_API-->>_retry_http_call: raise URLError/timeout
            alt attempt < max_attempts
                _retry_http_call-->>_retry_http_call: sleep backoff+jitter
            else last attempt
                _retry_http_call-->>send_whatsapp: ok=False, attempts, category=transient
            end
        end
    end

    participant api_request
    participant _retry_http_call_client

    api_request->>_retry_http_call_client: _retry_http_call_client(_do_call)
    _retry_http_call_client->>Evolution_Go_API: _do_call() /instance/*
    alt success within retries
        Evolution_Go_API-->>_retry_http_call_client: JSON response
        _retry_http_call_client-->>api_request: result
    else persistent HTTP 5xx/URLError
        Evolution_Go_API-->>_retry_http_call_client: repeated errors
        _retry_http_call_client-->>api_request: raise last exception
    end

File-Level Changes

Change Details Files
Add idempotent trigger executions with DB-level uniqueness and silent deduplication of webhook replays.
  • Extend TriggerExecution model and auto-migration to include idempotency_key, error_category, and last_replay_at columns plus indices and partial unique constraint on (trigger_id, idempotency_key).
  • Extract idempotency key from webhook payloads (various WhatsApp/N8N-compatible fields), check for existing executions, and short-circuit duplicate requests with HTTP 200.
  • Handle race conditions on concurrent inserts via IntegrityError, rolling back and treating the second request as a deduplicated replay without executing again.
dashboard/backend/app.py
dashboard/backend/models.py
dashboard/backend/routes/triggers.py
Introduce error classification, DLQ semantics, replay endpoint, and trigger stats API for monitoring and replaying failed executions.
  • Add _classify_error helper and use it in _execute_trigger to distinguish transient vs permanent failures, setting status failed_retryable and error_category accordingly and logging structured events.
  • Add /api/triggers/executions/<exec_id>/replay endpoint with auth, status gating, 60s per-execution rate limiting, creation of new TriggerExecution rows, and silent success when idempotency uniqueness is violated.
  • Add /api/triggers/stats endpoint computing by_status aggregates, DLQ size, idempotent replay count, WhatsApp command volume, retry observations, and watermark flag with logging when thresholds are exceeded.
dashboard/backend/routes/triggers.py
Add WhatsApp sending helpers and generalized HTTP retry with exponential backoff + jitter for WhatsApp notifications and Evolution Go client API calls.
  • Implement send_whatsapp_file to upload media to Cloudflare R2 via boto3, generate presigned URLs, infer media type, and send via Evolution Go /send/media with configuration from env.
  • Implement _retry_http_call in runner.py to wrap HTTP calls with up to 3 attempts, exponential backoff plus jitter, and selective retry on HTTP 5xx and network errors while avoiding retries for HTTP 4xx.
  • Update send_whatsapp to use _retry_http_call and emit structured logs with attempts and error category, and adjust Evolution Go client api_request to use a similar _retry_http_call_client with structured JSON logging and non-exiting error propagation; move HTTPError/URLError sys.exit handling into main().
ADWs/runner.py
.claude/skills/int-evolution-go/scripts/evolution_go_client.py
Enhance Triggers UI with DLQ/retry visibility, replay UX, and stats badge.
  • Extend Execution type to surface error_category, idempotency_key, and last_replay_at fields from backend JSON, and add Stats/ReplayPreview types and state.
  • Fetch /triggers/stats to display DLQ size, replay counts, WhatsApp command volume per day, and a circuit breaker watermark badge, refreshing when executions modal opens.
  • Add per-execution replay button for failed_retryable statuses, a replay confirmation modal that previews target recipient and command parsed from event_data, and wire it to the replay endpoint with toast feedback and executions refresh; also show error_category label alongside status badge.
dashboard/frontend/src/pages/Triggers.tsx
Add automated tests covering retry backoff logic, error classification, replay and stats endpoints, and watermark/rate-limit helpers.
  • Add tests/whatsapp/test_retry_backoff.py to unit-test _retry_http_call in runner.py and _retry_http_call_client/api_request in evolution_go_client, including latency budget and retry behaviour for HTTP 4xx/5xx and URLError cases.
  • Add dashboard/tests/test_wpp_retry_pr3.py to test _classify_error markers, validate stats endpoint JSON shape (when DB present), and unit-test watermark and 60s replay rate-limit logic; include guarded integration tests for the replay endpoint.
  • Ensure tests are import-safe given Python version differences by selectively loading helper code and mocking environment/config as needed.
tests/whatsapp/test_retry_backoff.py
dashboard/tests/test_wpp_retry_pr3.py


@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The new trigger_stats endpoint builds SQL via string interpolation (e.g. since_clause and the IN list for wpp_ids); even though current inputs are constrained, it would be safer and more maintainable to use SQLAlchemy parameters / the ORM for these aggregates instead of composing raw SQL strings.
  • There are now two separate retry implementations (_retry_http_call in runner.py and _retry_http_call_client in evolution_go_client.py) that share very similar logic; consider extracting a shared helper or at least consolidating behavior/logging to avoid divergence over time.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new `trigger_stats` endpoint builds SQL via string interpolation (e.g. `since_clause` and the `IN` list for `wpp_ids`); even though current inputs are constrained, it would be safer and more maintainable to use SQLAlchemy parameters / the ORM for these aggregates instead of composing raw SQL strings.
- There are now two separate retry implementations (`_retry_http_call` in `runner.py` and `_retry_http_call_client` in `evolution_go_client.py`) that share very similar logic; consider extracting a shared helper or at least consolidating behavior/logging to avoid divergence over time.

## Individual Comments

### Comment 1
<location path="dashboard/tests/test_wpp_retry_pr3.py" line_range="87-88" />
<code_context>
+    _DB_EXISTS = False
+
+
+@pytest.mark.skipif(not _DB_EXISTS, reason="dashboard.db not found — integration tests skipped")
+class TestReplayEndpoint:
+    """Integration tests for POST /api/triggers/executions/<id>/replay."""
+
</code_context>
<issue_to_address>
**issue (testing):** Integration tests being skipped when dashboard.db is absent may hide regressions in CI; consider using a dedicated test database instead

These integration tests currently depend on an existing on-disk dashboard.db, so in a clean CI environment (or if the DB path changes), the whole class will silently skip and replay/stats won’t be covered.

Instead, consider wiring the app to a dedicated test database (e.g., in-memory/temporary SQLite) via test config, creating the minimal schema and a seed Trigger in a fixture, and removing the dashboard.db existence check (or replacing it with an explicit opt-out like an env var for local runs). This keeps the tests consistently exercising replay and stats in CI.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


