feat(whatsapp): retry pattern with silent dedup, backoff, DLQ + UI replay#74
mt-alarcon wants to merge 4 commits into
Conversation
…ps 1+2)

Step 1 — Migration (models.py + app.py):
- TriggerExecution gains 3 nullable columns: idempotency_key, error_category, last_replay_at
- to_dict() exposes the 3 new fields
- Idempotent auto-migrate on startup: ALTER TABLE + IF NOT EXISTS in each block
- Partial unique index uq_trigger_idem (trigger_id, idempotency_key) WHERE NOT NULL
- Basic index ix_trigger_executions_idem_key for lookups by key
- SQLite 3.51 confirmed — native partial-index support; EXPLAIN QUERY PLAN confirms the index is used

Step 2 — Silent dedup (triggers.py):
- webhook_receiver extracts idem_key from idempotency_key / messageId / data.messageId
- If the key already exists: log idempotent_replay + silent 200 OK (pattern F6)
- Race condition: IntegrityError on db.commit() → rollback + silent 200 OK
- Legacy sources (GitHub, Stripe, Linear): no key → idem_key=None → normal flow unchanged
- Cleanup: current_app moved to a top-of-file import; inline imports removed

Tests passed: idempotent migration up/down, partial-index uniqueness, unrestricted NULLs, key extraction (6 cases), race condition via IntegrityError, EXPLAIN QUERY PLAN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
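The idempotent startup migration described above can be sketched roughly like this, using raw sqlite3 instead of the app's SQLAlchemy setup (table and index names come from the commit message; the rest is an assumption). Note that SQLite has no IF NOT EXISTS for ALTER TABLE ADD COLUMN, so the column check goes through PRAGMA table_info:

```python
import sqlite3

def migrate(conn: sqlite3.Connection) -> None:
    # SQLite has no "ADD COLUMN IF NOT EXISTS", so inspect PRAGMA table_info first
    cols = {row[1] for row in conn.execute("PRAGMA table_info(trigger_executions)")}
    for col in ("idempotency_key", "error_category", "last_replay_at"):
        if col not in cols:
            conn.execute(f"ALTER TABLE trigger_executions ADD COLUMN {col} TEXT")
    # Partial unique index: one execution per (trigger, key),
    # while legacy rows with a NULL key stay unrestricted
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS uq_trigger_idem "
        "ON trigger_executions (trigger_id, idempotency_key) "
        "WHERE idempotency_key IS NOT NULL"
    )
    # Plain index for single-key lookups in the dedup path
    conn.execute(
        "CREATE INDEX IF NOT EXISTS ix_trigger_executions_idem_key "
        "ON trigger_executions (idempotency_key)"
    )
    conn.commit()
```

Running it twice is a no-op, which is the property the "migration up/down idempotent" test exercises.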
…t (PR-2)

- Adds _retry_http_call in ADWs/runner.py: 3 attempts, sleep min(2^attempt + jitter(0-0.5), 8s), retries only on HTTP 5xx/URLError/timeout, no retry on 4xx, returns (ok, attempts, error_category)
- Refactors send_whatsapp to use the helper; structured log with attempts + final_status + category on every call
- Adds _retry_http_call_client in evolution_go_client.py: same mechanics, with structured JSON logs (evt=api_request_retry/failed)
- Refactors api_request to use the helper; removes sys.exit(1) on HTTPError/URLError — it now raises for library callers; the CLI main() catches and calls sys.exit(1), keeping behavior identical to before
- Worst-case latency (3 attempts, all 5xx): ~4-7s of total sleep, <= 8s
- Synthetic tests: 9/9 passing (5 runner + 4 evolution_go_client), covering 500x3, 502x2+200, 400 without retry, URLError, and the sleep budget

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…teps 4+5+6)

Step 4: _classify_error helper in triggers.py; _execute_trigger populates error_category (transient | permanent) and uses status failed_retryable for timeouts and HTTP 5xx, status failed for permanent errors.

Step 5: POST /api/triggers/executions/<id>/replay with a 60s rate limit, confirmation modal in the frontend (preview: recipient/command/timestamp), Replay button shown only when status=failed_retryable, and status replayed set on the original execution once the replay starts.

Step 6: GET /api/triggers/stats returns 8 metrics (total, by_status, dlq_size, idempotent_replays, wpp_command_count, retries_observed, circuit_breaker_watermark_hit). Badge in the /triggers UI with a yellow alert when watermark_hit=true (>50 WPP/day); a WARNING is logged when it flips to True. Structured evt= logs in webhook_receiver, _execute_trigger, and replay.

Synthetic tests: 24/24 passing (unit: _classify_error 13 scenarios, watermark 5 scenarios, rate limit 4 scenarios; integration: auth guard + stats shape against the real dashboard.db).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug: _parse_webhook_event moves the payload down to event_data['data'] (default else branch), but silent dedup only looked up 'idempotency_key' at the top level — it never found the key in real N8N payloads. Symptom: 2 identical POSTs from N8N created 2 TriggerExecution rows with idempotency_key=NULL; dedup was a complete no-op. Fix: add a data.idempotency_key fallback to the lookup chain (keeping the order top-level > data.* for payloads that differ).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer's Guide

Implements a three-layer WhatsApp reliability pattern: database-backed idempotency for trigger executions, exponential backoff retries for WhatsApp/Evolution Go HTTP calls, and a DLQ with replay UI and operational stats, including tests for retry logic and pipeline instrumentation.

Sequence diagram for the webhook idempotency and replay pipeline:

```mermaid
sequenceDiagram
    actor CF_Tunnel
    participant webhook_receiver
    participant TriggerExecution_DB as TriggerExecution
    participant replay_execution
    participant _execute_trigger
    CF_Tunnel->>webhook_receiver: POST /api/triggers/<id>/webhook
    webhook_receiver->>webhook_receiver: extract idem_key
    alt idem_key exists and duplicate
        webhook_receiver->>TriggerExecution_DB: query.filter_by(trigger_id, idempotency_key)
        TriggerExecution_DB-->>webhook_receiver: existing execution
        webhook_receiver-->>CF_Tunnel: 200 {status: ok}
    else new or no idem_key
        webhook_receiver->>TriggerExecution_DB: INSERT TriggerExecution(idempotency_key)
        alt IntegrityError (unique index)
            webhook_receiver->>TriggerExecution_DB: rollback()
            webhook_receiver-->>CF_Tunnel: 200 {status: ok}
        else committed
            webhook_receiver-->>CF_Tunnel: 200 {status: ok}
            webhook_receiver->>_execute_trigger: _execute_trigger(trigger_id, execution_id, event_data)
        end
    end
    Note over _execute_trigger,TriggerExecution_DB: classify result via _classify_error
    _execute_trigger->>TriggerExecution_DB: update status=completed|failed_retryable|failed
    _execute_trigger-->>TriggerExecution_DB: set error_category
    actor Operator
    Operator->>replay_execution: POST /api/triggers/executions/<exec_id>/replay
    replay_execution->>TriggerExecution_DB: get(exec_id)
    alt status in failed_retryable|failed and not rate_limited
        replay_execution->>TriggerExecution_DB: INSERT new TriggerExecution(pending, same idempotency_key)
        replay_execution->>TriggerExecution_DB: update original status=replayed, last_replay_at
        alt IntegrityError (idempotency_key already completed)
            replay_execution->>TriggerExecution_DB: rollback()
            replay_execution-->>Operator: 200 {status: ok, note: idempotent_skip}
        else committed
            replay_execution-->>Operator: 200 {status: ok, new_execution_id}
            replay_execution->>_execute_trigger: _execute_trigger(trigger_id, new_execution_id, original_event_data)
        end
    else not_replayable or rate_limited
        replay_execution-->>Operator: 4xx error
    end
```
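The INSERT-then-rollback dedup branch in the diagram can be illustrated with raw sqlite3 (an assumption: the app uses a SQLAlchemy session, but the mechanics are the same; uq_trigger_idem is the partial unique index added in this PR):

```python
import sqlite3

def insert_execution(conn, trigger_id, idem_key):
    """Try to record an execution; report False when a duplicate raced us."""
    try:
        conn.execute(
            "INSERT INTO trigger_executions (trigger_id, idempotency_key) "
            "VALUES (?, ?)",
            (trigger_id, idem_key),
        )
        conn.commit()
        return True   # committed: safe to kick off _execute_trigger
    except sqlite3.IntegrityError:
        conn.rollback()  # a concurrent request won the race: answer 200 silently
        return False
```

Letting the unique index arbitrate the race, instead of a check-then-insert, is what makes the dedup safe under concurrent CF Tunnel re-deliveries.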
Sequence diagram for the WhatsApp and Evolution Go HTTP retry backoff:

```mermaid
sequenceDiagram
    participant send_whatsapp
    participant _retry_http_call
    participant Evolution_Go_API as Evolution_Go
    send_whatsapp->>_retry_http_call: _retry_http_call(_do_call)
    loop attempts <= max_attempts
        _retry_http_call->>Evolution_Go_API: _do_call() /send/text
        alt HTTP 2xx
            Evolution_Go_API-->>_retry_http_call: ok=True
            _retry_http_call-->>send_whatsapp: (ok=True, attempts, category=None)
            send_whatsapp-->>send_whatsapp: log attempts, final_status
        else HTTP 4xx
            Evolution_Go_API-->>_retry_http_call: raise HTTPError(code<500)
            _retry_http_call-->>send_whatsapp: ok=False, attempts, category=permanent
        else HTTP 5xx
            Evolution_Go_API-->>_retry_http_call: raise HTTPError(code>=500)
            alt attempt < max_attempts
                _retry_http_call-->>_retry_http_call: sleep backoff+jitter
            else last attempt
                _retry_http_call-->>send_whatsapp: ok=False, attempts, category=transient
            end
        else URLError or timeout
            Evolution_Go_API-->>_retry_http_call: raise URLError/timeout
            alt attempt < max_attempts
                _retry_http_call-->>_retry_http_call: sleep backoff+jitter
            else last attempt
                _retry_http_call-->>send_whatsapp: ok=False, attempts, category=transient
            end
        end
    end
    participant api_request
    participant _retry_http_call_client
    api_request->>_retry_http_call_client: _retry_http_call_client(_do_call)
    _retry_http_call_client->>Evolution_Go_API: _do_call() /instance/*
    alt success within retries
        Evolution_Go_API-->>_retry_http_call_client: JSON response
        _retry_http_call_client-->>api_request: result
    else persistent HTTP 5xx/URLError
        Evolution_Go_API-->>_retry_http_call_client: repeated errors
        _retry_http_call_client-->>api_request: raise last exception
    end
```
File-Level Changes
Hey - I've found 1 issue, and left some high level feedback:
- The new `trigger_stats` endpoint builds SQL via string interpolation (e.g. `since_clause` and the `IN` list for `wpp_ids`); even though current inputs are constrained, it would be safer and more maintainable to use SQLAlchemy parameters / the ORM for these aggregates instead of composing raw SQL strings.
- There are now two separate retry implementations (`_retry_http_call` in `runner.py` and `_retry_http_call_client` in `evolution_go_client.py`) that share very similar logic; consider extracting a shared helper or at least consolidating behavior/logging to avoid divergence over time.
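A sketch of the parameterized alternative the first comment suggests, shown with sqlite3 named parameters rather than SQLAlchemy (the function name, query shape, and `wpp_ids` usage are assumptions about the endpoint):

```python
import sqlite3

def count_wpp_executions(conn, wpp_ids, since=None):
    """Aggregate executions for the given trigger IDs without string-interpolating
    values into the SQL: each ID is bound as a named parameter."""
    params = {f"id{i}": v for i, v in enumerate(wpp_ids)}
    placeholders = ", ".join(f":id{i}" for i in range(len(wpp_ids)))
    sql = (
        "SELECT COUNT(*) FROM trigger_executions "
        f"WHERE trigger_id IN ({placeholders})"
    )
    if since is not None:
        sql += " AND created_at >= :since"
        params["since"] = since
    return conn.execute(sql, params).fetchone()[0]
```

Only the placeholder names are built by string formatting; every value travels through the parameter dict, so the query stays safe even if the inputs stop being constrained.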
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `trigger_stats` endpoint builds SQL via string interpolation (e.g. `since_clause` and the `IN` list for `wpp_ids`); even though current inputs are constrained, it would be safer and more maintainable to use SQLAlchemy parameters / the ORM for these aggregates instead of composing raw SQL strings.
- There are now two separate retry implementations (`_retry_http_call` in `runner.py` and `_retry_http_call_client` in `evolution_go_client.py`) that share very similar logic; consider extracting a shared helper or at least consolidating behavior/logging to avoid divergence over time.
## Individual Comments
### Comment 1
<location path="dashboard/tests/test_wpp_retry_pr3.py" line_range="87-88" />
<code_context>
+ _DB_EXISTS = False
+
+
+@pytest.mark.skipif(not _DB_EXISTS, reason="dashboard.db not found — integration tests skipped")
+class TestReplayEndpoint:
+ """Integration tests for POST /api/triggers/executions/<id>/replay."""
+
</code_context>
<issue_to_address>
**issue (testing):** Integration tests being skipped when dashboard.db is absent may hide regressions in CI; consider using a dedicated test database instead
These integration tests currently depend on an existing on-disk dashboard.db, so in a clean CI environment (or if the DB path changes), the whole class will silently skip and replay/stats won’t be covered.
Instead, consider wiring the app to a dedicated test database (e.g., in-memory/temporary SQLite) via test config, creating the minimal schema and a seed Trigger in a fixture, and removing the dashboard.db existence check (or replacing it with an explicit opt-out like an env var for local runs). This keeps the tests consistently exercising replay and stats in CI.
</issue_to_address>
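One way to follow this suggestion is a plain helper that a pytest fixture could wrap, building the minimal schema plus a seed Trigger in a throwaway database (schema details here are assumptions based on this PR's migration; the real app would populate it through its models):

```python
import sqlite3

def make_test_db(path=":memory:"):
    """Create a dedicated test database so integration tests never depend on
    an on-disk dashboard.db. Returns an open connection with one seed trigger."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS triggers (
            id INTEGER PRIMARY KEY, name TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS trigger_executions (
            id INTEGER PRIMARY KEY,
            trigger_id INTEGER REFERENCES triggers(id),
            status TEXT,
            idempotency_key TEXT,
            error_category TEXT,
            last_replay_at TEXT
        );
        CREATE UNIQUE INDEX IF NOT EXISTS uq_trigger_idem
            ON trigger_executions (trigger_id, idempotency_key)
            WHERE idempotency_key IS NOT NULL;
        INSERT INTO triggers (name) VALUES ('wpp-seed');
    """)
    conn.commit()
    return conn
```

A fixture would call this with a tmp_path-based file (or :memory:), yield the connection, and close it on teardown, replacing the `_DB_EXISTS` skip entirely.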
Problem
EvoNexus delivers webhook events via Cloudflare Tunnel. Under flaky connectivity, CF Tunnel occasionally re-delivers the same event, causing duplicate WhatsApp message executions — no dedup layer existed.
Additionally, transient HTTP 5xx and network errors during `send_whatsapp` / `api_request` caused immediate failures with no retry, leading to silent message loss.

Solution — 3 layers
Layer 1 — Idempotency key + silent dedup
- `idempotency_key` column (unique, nullable) on the `trigger_executions` table
- Duplicate `idempotency_key` → HTTP 200 + `{"status":"duplicate","skipped":true}` (silent)
- Key lookup falls back to `event_data.data` for nested payloads
- Legacy events without an `idempotency_key` pass through unchanged

Layer 2 — Exponential backoff + jitter
- `ADWs/runner.py`: `_retry_http_call()` — up to 3 attempts, backoff `2^n + jitter(0–0.5s)`, max 8s, never retries HTTP 4xx
- `evolution_go_client.py`: `_retry_http_call_client()` — same policy
- `send_whatsapp_file()`: uploads to Cloudflare R2, sends via `/send/media`
- `tests/whatsapp/test_retry_backoff.py` (289 lines)

Layer 3 — DLQ + UI Replay + instrumentation
- Error classification (`permanent`/`transient`/`unknown`)
- `Triggers.tsx` — replay button, status badges, error detail panel
- Structured logs with `attempts`, `final_status`, `category` fields
- `dashboard/tests/test_wpp_retry_pr3.py` (203 lines)

Production evidence
Active at MTA digital since 11/05/2026, validated via:
- `duplicate` / `skip` responses observed for re-delivered events

Risk
Summary by Sourcery
Introduce a resilient WhatsApp delivery pipeline with idempotent trigger executions, retry-aware failure handling, and operator tooling for replay and monitoring.
New Features:
Enhancements:
Tests: