Long-term Planning for Webhook Captures #3923
Replies: 2 comments 4 replies
Thanks for digging in and evaluating the challenges & potential options for building more complex push-based captures, Nicolas! We've talked about these details in our 1:1s, but to also mention it here, I'm in favor of option D (expanding the Python CDK).
There's mention of minimizing connector downtime being an objective due to the failed delivery semantics of different sources: I think that's going to be difficult to make reliable. Connectors also restart during deployments, shard rebalancing, and for various other reasons. If correctness requires near-100% availability to receive webhooks, I don't think we're going to be able to do a very good job with a system like that. Some amount of downtime is inevitable. To me this suggests that having a separate connector for the webhook "push" side vs. the traditional API pull side wouldn't be motivated, and that's generally good for simplicity of use too. I think there'd be quite a bit of friction if users needed to create 2 separate captures for a source. It does make sense to me that webhook capability be built into the CDK, along the lines of Option D. I'm thinking it would need to be incorporated into the "main" connector, with explicitly defined bindings.
Long-term Planning for Webhook Captures
Introduction
Estuary currently offers a `source-http-ingest` connector to collect webhook event data. Unlike other SaaS connectors that poll source servers for data, it runs an Axum HTTP server and waits for incoming requests. It was developed early in the company's history, before the Python CDK cemented its place as the de facto tool for building integrations. It is one of the most used connectors and works to wide user satisfaction.

Recently, the integrations team learned that some users prefer each event topic to be written to its own collection. This could easily be addressed by writing a transformation that copies data entered into a centralised input collection, or by forking the existing `source-http-ingest` code. The main issue is that the first connector to include this requirement is for AppsFlyer, which requires a mix of REST API and webhook-based resources. This opens the discussion on how to integrate webhook data into the connector development flow long-term.

Objective
This design document discusses how to more naturally integrate push API data sources into new and existing connectors.
Our MVP targets AppsFlyer's push-API-exclusive endpoints, while also leaving room for future expansion.
Overview
Unlike most REST APIs, webhook event handling is relatively straightforward: rather than adapting to another organisation's design and requirements, we passively wait for messages to be sent to us. There is little variety in how different sources send webhook events:
This indicates that supporting new vendors would usually involve little code — a spec definition should be enough¹. The only exception would be REST replay or backfill endpoints. Few sources offer these, and there is no short- or medium-term need to support them.
Sources we currently have connectors for that also support webhooks
User experience
Webhook event type discrimination systems
Sources are not consistent about where they specify which topic an event belongs to.
Discriminator fields are the most common practice.
Ideally, users would not have to specify the discriminator's location: doing so introduces the possibility of human error, and since the location is source-specific, the research only needs to be done once per source.
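The per-source nature of this research can be sketched as a small registry baked into connector code rather than user configuration. The helper names below are illustrative, not an existing CDK API; the `X-GitHub-Event` header and Stripe-style `type` field are real examples of the two most common locations:

```python
from typing import Any, Callable

# Each extractor maps (headers, body) to the event's topic. The research
# into where a given source puts its discriminator happens exactly once,
# by the connector developer, and is captured here.
Extractor = Callable[[dict, dict], str]

def from_body_field(field: str) -> Extractor:
    """Discriminator lives in the JSON payload, e.g. {"type": "order.created"}."""
    return lambda headers, body: body[field]

def from_header(name: str) -> Extractor:
    """Discriminator lives in an HTTP header, e.g. X-GitHub-Event."""
    return lambda headers, body: headers[name]

# Hypothetical per-source registry, invisible to end users.
DISCRIMINATORS: dict[str, Extractor] = {
    "github": from_header("X-GitHub-Event"),
    "stripe": from_body_field("type"),
}

def event_topic(source: str, headers: dict, body: dict) -> str:
    return DISCRIMINATORS[source](headers, body)
```

Users then never see the discriminator location at all; it travels with the connector definition.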
Event type discriminator values
The same goes for discriminator values. These are used to distinguish which collection an event should be routed to.
Example 1:

```json
{ "type": "order.created", "order_id": "ord_482", "amount_cents": 4999, "currency": "USD", "created_at": "2025-04-10T14:22:00Z" }
```

Example 2:

```json
{ "type": "refund.issued", "refund_id": "ref_71", "order_id": "ord_482", "amount_cents": 4999, "reason": "customer_request", "issued_at": "2025-04-10T15:03:00Z" }
```

If one user uses the `type` field as the discriminator, it is safe to assume the same applies to all other connector users. But how can we determine all the required collections before the capture is published and receiving messages?

Manual definition
Known discriminator values could be hardcoded into the connector's code and presented to users through the standard resource UI flow, but that carries some unfortunate implications:
Connector developers could hardcode their known set of discriminator values while also allowing users to add their own. Any incoming events with discriminator values that are neither collected nor explicitly disabled could be logged as warnings. This would require expanding our existing protocol spec to allow for an "expected webhook event types" list.
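The developer-hardcoded set plus user extensions could look roughly like the sketch below. The config field names (`extra_event_types`, `disabled_event_types`) are hypothetical stand-ins for whatever the expanded protocol spec would actually define:

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("webhook-capture")

# Hardcoded by the connector developer for this source.
KNOWN_EVENT_TYPES = {"order.created", "refund.issued"}

@dataclass
class WebhookBindingConfig:
    extra_event_types: set = field(default_factory=set)     # user additions
    disabled_event_types: set = field(default_factory=set)  # explicit opt-outs

def classify(event_type: str, cfg: WebhookBindingConfig) -> str:
    """Return 'capture', 'skip', or 'unexpected' for an incoming event."""
    if event_type in cfg.disabled_event_types:
        return "skip"
    if event_type in KNOWN_EVENT_TYPES or event_type in cfg.extra_event_types:
        return "capture"
    # Neither collected nor explicitly disabled: surface as a warning.
    logger.warning("unexpected webhook event type: %s", event_type)
    return "unexpected"
```

The warning path gives operators a signal that the hardcoded set has drifted from what the source actually sends, without dropping data silently into a crash.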
Automatic discovery
Another alternative, though, could be to dynamically create new collections when a new event type is discovered. CDK connectors can emit config updates, and the scheduled auto-discovery mechanism (which triggers every two hours) would then create the corresponding collection. The control plane could be updated to immediately trigger a discovery when a config contains webhook bindings.
The downside of this approach is the possibility of dropping at least one message per event type, namely the one that triggers the creation of the newly discovered collection², which is especially wasteful given that most discriminator values are the same across all captures for that source.
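The mitigation mentioned in footnote 2, stashing the triggering message in connector state until discovery catches up, could be sketched as follows. All names here are illustrative; none of this is a real CDK API:

```python
# Topics that already have a bound collection.
known_collections = {"order.created"}

# In a real connector this would be persisted via checkpoints.
state = {"pending": []}

def handle_event(event: dict) -> str:
    topic = event["type"]
    if topic in known_collections:
        return "emitted"            # write the document to its collection
    state["pending"].append(event)  # buffer until discovery adds a binding
    return "deferred"               # and emit a config update to trigger discovery

def on_discovery_complete(new_topics: set) -> list:
    """After auto-discovery adds collections, drain matching buffered events."""
    known_collections.update(new_topics)
    replay = [e for e in state["pending"] if e["type"] in known_collections]
    state["pending"] = [e for e in state["pending"] if e["type"] not in known_collections]
    return replay
```

This avoids the outright drop, but as the footnote notes, the restart between deferral and replay still counts against uptime.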
Hybrid approach
The manual definition solution could naturally develop into a hybrid alternative, where most collections are human-defined but newly discovered ones get automatically added. This approach combines the flexibility of automatic discovery with the reduced downtime of source-specific presets.
Service reliability
Beyond user-facing configuration, webhook captures also raise infrastructure concerns around reliability. Despite our best efforts, it is inevitable that at some point messages will be missed. Sources may offer different recovery systems in case of a 4xx/5xx HTTP response:
Handling automatic replays should be enough for our MVP³. The last option indicates that some webhook bindings may require cursor management.
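For sources that do expose a REST replay or backfill endpoint, cursor management could follow the usual checkpointing shape. The `fetch_page` callable and its `(events, next_cursor)` return shape are invented for illustration; the point is only that the replay cursor is persisted alongside documents so a restart resumes the replay rather than losing it:

```python
def replay_missed_events(fetch_page, checkpoint: dict) -> list:
    """Drain a hypothetical replay endpoint page by page.

    fetch_page(cursor) -> (events, next_cursor or None when exhausted).
    The checkpoint dict stands in for connector state emitted with each page.
    """
    events = []
    cursor = checkpoint.get("replay_cursor")
    while True:
        page, cursor = fetch_page(cursor)
        events.extend(page)
        checkpoint["replay_cursor"] = cursor  # persist with the documents
        if cursor is None:
            break
    return events
```

Because the cursor is checkpointed per page, a crash mid-replay resumes from the last acknowledged page instead of restarting from the beginning.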
To minimise replays and the risk of delayed or lost messages, webhook captures should maximise uptime. This goes against one of the core principles behind the Python CDK's design: that connector tasks should be able to shut down at any point, for any reason, and reliably pick up where they left off.
At present, regular Python connectors restart for many reasons:
Most of these are unrelated to webhook captures, reinforcing the need to keep them decoupled. A REST API capture might stay in a crash loop, holding up all bindings in that config until the bug is fixed — at which point they resume from where they left off. Webhooks may not enjoy that same privilege. They'll need to run as a separate container and likely be defined as a different connector.
Implementation
With these constraints in mind, the following options explore different implementation approaches.
Option A: Expanding the existing `source-http-ingest` connector

The recent signature verification mechanism merged into `source-http-ingest` could provide a model for expanding to accommodate this new feature. As with discriminator fields, signing systems are not standardised across sources and require significant research from the user.

As with that feature, all parameters could be exposed to users (discriminator field location, discriminator values to collect events for, signature location, signature algorithm, etc.) to support as many different sources as possible. Extra setup flows could then be created for known vendors, where a connector developer would hardcode the known parameters and only expose input fields for user-specific credentials.
The problem with this approach is that we would effectively be building a two-tiered connector listing system, with webhook configurations existing as second-class citizens. A user setting up an AppsFlyer capture would have to select `source-http-ingest` from N connectors, then navigate a list of M vendors to find their actual source.

This solution is optimal if we're aiming for minimal development time.
Option B: Establishing a discrete webhook connector development framework
There is currently no straightforward way to turn a webhook connector variant into a first-class citizen. The implementation could be replicated in a different codebase, but each vendor would share 90% of the same source code and each new feature or fix would have to be implemented across many connectors.
There is, however, a precedent for a legacy, non-Python specialised connector type: filesource connectors have a minimal Go module dedicated to the replication of file-like objects. `source-http-ingest` could be turned into a thin wrapper with all parameters exposed, with the actual business logic extracted into a shared Rust crate. Creating new webhook connectors would be trivial from that point.

While a more natural fit for our platform, this solution would still require separate connectors for sources that offer both push- and pull-based APIs. Though slightly more involved than option A, it still fits into our existing ecosystem and requires no feature rewrites.
Option C: Allowing the CDK to deploy `source-http-ingest` instances as a sidecar

Rather than replacing `source-http-ingest` with a native implementation, a `WebhookCaptureSpec` covering all required configuration could be implemented in Python. Once the CDK finds an instance of that spec in a capture configuration, it would spin up a `source-http-ingest` container to receive and buffer incoming webhook events. This would effectively decouple the lifetimes of the more downtime-prone CDK connectors from the listening HTTP server. The CDK would then read the output Flow protocol messages and either forward them or adapt them to the larger context as necessary.

Parsing and manipulating a protocol originally meant for another consumer is far from ideal, but it would allow us to more spontaneously accept incoming network requests. A good example would be Zoho's bulk read jobs: instead of polling for minutes waiting for the operation to end, we could suspend the task until a callback URL was hit.
The main issue here lies in how to respond to event submitters upon receiving messages. The current connector implementation only acknowledges incoming HTTP requests once the document has been saved and checkpointed. If this proposed solution were to wait for that confirmation, there would be delays when the connector container is restarting, effectively undoing the benefits of keeping both processes separate in the first place. If we were to respond with 200 codes on messages being buffered, data could accumulate indefinitely and potentially crash the system while waiting for a processor to take them — all while the source assumes everything has been properly persisted.
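A possible middle path between the two acknowledgement strategies above is a bounded buffer: acknowledge with 200 only while there is room, and shed load with 503 once full so the source's retry machinery takes over instead of data piling up unacknowledged. This is a sketch of the idea, not the current connector's behaviour:

```python
from collections import deque

class BoundedWebhookBuffer:
    """Buffer events between the HTTP listener and a slower consumer."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, event: dict) -> int:
        """Return the HTTP status to send back to the event submitter."""
        if len(self.queue) >= self.capacity:
            return 503  # ask the source to retry later rather than drop data
        self.queue.append(event)
        return 200

    def drain(self, n: int) -> list:
        """Called by the consumer (the CDK side) when it is ready to checkpoint."""
        out = []
        while self.queue and len(out) < n:
            out.append(self.queue.popleft())
        return out
```

The 200 still arrives before a checkpoint, so this only softens the durability gap rather than closing it, but it caps memory growth and keeps the source's replay semantics in the loop.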
Option C would leverage the proven Rust code while making that functionality available within the CDK ecosystem.
Option D: Implementing HTTP server deployments for the CDK
Like B, this option would require maintaining separate connectors for push- and pull-based APIs. Rather than enforcing a difference between webhook and REST connectors at the architecture level, we could simply implement an `aiohttp` server for the CDK and let connector developers determine which resources should go to which connector version. Revisiting the usual reboot causes:

- The CDK could suppress programmed shutdown signals if at least one webhook binding is found.
- Connector developers could attempt to exhaustively define models to minimise inferred schema updates.
- Rather than crashing, webhook captures could log an error and suppress the validation exception.
- The remaining causes could be minimised or completely eliminated by keeping REST resources in a different connector.
This solution brings push-based API handling to a language the integrations team is more comfortable supporting, and widens future hiring options.
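For a sense of scale, the listener side of option D could be wired up in a few lines of `aiohttp`. Binding lookup, checkpointing, and the discriminator location are stubbed out (the body `type` field and the `emit` callback are placeholders); only the `aiohttp` wiring is real:

```python
from aiohttp import web

async def handle_webhook(request: web.Request) -> web.Response:
    body = await request.json()
    topic = body.get("type")  # placeholder: discriminator location is source-specific
    if topic is None:
        return web.Response(status=400, text="missing discriminator")
    # In a real connector: route to the matching binding and only
    # acknowledge after the document has been checkpointed.
    await request.app["emit"](topic, body)
    return web.Response(status=200)

def make_app(emit) -> web.Application:
    app = web.Application()
    app["emit"] = emit  # async callback into the capture machinery
    app.router.add_post("/webhook/{tail:.*}", handle_webhook)
    return app
```

Running it would be `web.run_app(make_app(emit), port=8080)`; the interesting design work is entirely in what `emit` does before the 200 is returned.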
Option E: CDK HTTP servers, transparent dual-container setup
Expanding on the previous solution, we could present what would internally be two connector definitions as a single option with all resources merged. The viability of this solution is yet to be discussed with the UI and control plane teams.
Hiding our internal implementation would make the user experience seamless and would avoid confusion as to which connector is needed for which data type.
Comparison
Footnotes
1. This would make webhook connectors trivial to AI-generate from start to end. ↩
2. Though the raw message could be stored in the connector's state and reprocessed upon restart, the restart itself still affects uptime. ↩
3. Users should be able to create captures for any vendor, regardless of whether the replay mechanism is yet supported. Our documentation could notify them that we cannot guarantee data completeness at the time. ↩