Long-term Planning for Webhook Captures #3923
Replies: 2 comments 4 replies
Thanks for digging in and evaluating the challenges & potential options for building more complex push-based captures, Nicolas! We've talked about these details in our 1:1s, but to also mention it here, I'm in favor of option D (expanding the Python CDK).
There's mention of minimizing connector downtime being an objective due to the failed delivery semantics of different sources: I think that's going to be difficult to make reliable. Connectors also restart during deployments, shard rebalancing, and for various other reasons. If correctness requires near-100% availability to receive webhooks, I don't think we're going to be able to do a very good job with a system like that. Some amount of downtime is inevitable. To me this suggests that having a separate connector for the webhook "push" side vs. the traditional API pull side wouldn't be motivated, and that's generally good for simplicity of use too. I think there'd be quite a bit of friction if users needed to create 2 separate captures for a source. It does make sense to me that webhook capability be built into the CDK, along the lines of Option D. I'm thinking it would need to be incorporated into the "main" connector, with explicitly defined bindings.
Long-term Planning for Webhook Captures
Introduction
Estuary currently offers a `source-http-ingest` connector to collect webhook event data. Unlike other SaaS connectors that poll source servers for data, it runs an Axum HTTP server and waits for incoming requests. It was developed early in the company's history, before the Python CDK cemented its place as the de facto tool for building integrations. It is one of the most used connectors and works to wide user satisfaction.

Recently, the integrations team learned that some users prefer each event topic to be written to its own collection. This could easily be addressed by writing a transformation that copies data entered into a centralised input collection, or by forking the existing `source-http-ingest` code. The main issue is that the first connector to include this requirement is for AppsFlyer, which requires a mix of REST API and webhook-based resources. This opens the discussion on how to integrate webhook data into the connector development flow long-term.

Objective
This design document discusses how to more naturally integrate push API data sources into new and existing connectors.
Our MVP targets AppsFlyer's push-API-exclusive endpoints, while also leaving room for future expansion.
Overview
Unlike most REST APIs, webhook event handling is relatively straightforward: rather than adapting to another organisation's design and requirements, we passively wait for messages to be sent to us. There is little variety in how different sources send webhook events:
This indicates that supporting new vendors would usually involve little code — a spec definition should be enough¹. The only exception would be REST replay or backfill endpoints. Few sources offer these, and there is no short- or medium-term need to support them.
Sources we currently have connectors for that also support webhooks
User experience
Webhook event type discrimination systems
Sources are not consistent about where they specify which topic an event belongs to.
Discriminator fields are the most common practice.
Ideally, users would not have to specify the discriminator's location: doing so introduces the possibility of human error, and since the location is source-specific, the research only needs to be done once per source.
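The per-source nature of this research can be sketched as a small registry baked into connector code rather than user configuration. The helper names below are illustrative, not an existing CDK API; the `X-GitHub-Event` header and Stripe-style `type` field are real examples of the two most common locations:

```python
from typing import Any, Callable

# Each extractor maps (headers, body) to the event's topic. The research
# into where a given source puts its discriminator happens exactly once,
# by the connector developer, and is captured here.
Extractor = Callable[[dict, dict], str]

def from_body_field(field: str) -> Extractor:
    """Discriminator lives in the JSON payload, e.g. {"type": "order.created"}."""
    return lambda headers, body: body[field]

def from_header(name: str) -> Extractor:
    """Discriminator lives in an HTTP header, e.g. X-GitHub-Event."""
    return lambda headers, body: headers[name]

# Hypothetical per-source registry, invisible to end users.
DISCRIMINATORS: dict[str, Extractor] = {
    "github": from_header("X-GitHub-Event"),
    "stripe": from_body_field("type"),
}

def event_topic(source: str, headers: dict, body: dict) -> str:
    return DISCRIMINATORS[source](headers, body)
```

Users then never see the discriminator location at all; it travels with the connector definition.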
Event type discriminator values
The same goes for discriminator values. These are used to distinguish which collection an event should be routed to.
Example 1:

```json
{ "type": "order.created", "order_id": "ord_482", "amount_cents": 4999, "currency": "USD", "created_at": "2025-04-10T14:22:00Z" }
```

Example 2:

```json
{ "type": "refund.issued", "refund_id": "ref_71", "order_id": "ord_482", "amount_cents": 4999, "reason": "customer_request", "issued_at": "2025-04-10T15:03:00Z" }
```

If one user uses the `type` field as the discriminator, it is safe to assume the same applies to all other connector users. But how can we determine all the required collections before the capture is published and receiving messages?

Manual definition
Known discriminator values could be hardcoded into the connector's code and presented to users through the standard resource UI flow, but that carries some unfortunate implications:
Connector developers could hardcode their known set of discriminator values while also allowing users to add their own. Any incoming events with discriminator values that are neither collected nor explicitly disabled could be logged as warnings. This would require expanding our existing protocol spec to allow for an "expected webhook event types" list.
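The developer-hardcoded set plus user extensions could look roughly like the sketch below. The config field names (`extra_event_types`, `disabled_event_types`) are hypothetical stand-ins for whatever the expanded protocol spec would actually define:

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("webhook-capture")

# Hardcoded by the connector developer for this source.
KNOWN_EVENT_TYPES = {"order.created", "refund.issued"}

@dataclass
class WebhookBindingConfig:
    extra_event_types: set = field(default_factory=set)     # user additions
    disabled_event_types: set = field(default_factory=set)  # explicit opt-outs

def classify(event_type: str, cfg: WebhookBindingConfig) -> str:
    """Return 'capture', 'skip', or 'unexpected' for an incoming event."""
    if event_type in cfg.disabled_event_types:
        return "skip"
    if event_type in KNOWN_EVENT_TYPES or event_type in cfg.extra_event_types:
        return "capture"
    # Neither collected nor explicitly disabled: surface as a warning.
    logger.warning("unexpected webhook event type: %s", event_type)
    return "unexpected"
```

The warning path gives operators a signal that the hardcoded set has drifted from what the source actually sends, without dropping data silently into a crash.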
Automatic discovery
Another alternative, though, could be to dynamically create new collections when a new event type is discovered. CDK connectors can emit config updates, and the scheduled auto-discovery mechanism (which triggers every two hours) would then create the corresponding collection. The control plane could be updated to immediately trigger a discovery when a config contains webhook bindings.
The downside of this approach is the possibility of dropping at least one message per event type, namely the one that triggers the creation of the newly discovered collection², which is especially wasteful given that most discriminator values are the same across all captures for that source.
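The mitigation mentioned in footnote 2, stashing the triggering message in connector state until discovery catches up, could be sketched as follows. All names here are illustrative; none of this is a real CDK API:

```python
# Topics that already have a bound collection.
known_collections = {"order.created"}

# In a real connector this would be persisted via checkpoints.
state = {"pending": []}

def handle_event(event: dict) -> str:
    topic = event["type"]
    if topic in known_collections:
        return "emitted"            # write the document to its collection
    state["pending"].append(event)  # buffer until discovery adds a binding
    return "deferred"               # and emit a config update to trigger discovery

def on_discovery_complete(new_topics: set) -> list:
    """After auto-discovery adds collections, drain matching buffered events."""
    known_collections.update(new_topics)
    replay = [e for e in state["pending"] if e["type"] in known_collections]
    state["pending"] = [e for e in state["pending"] if e["type"] not in known_collections]
    return replay
```

This avoids the outright drop, but as the footnote notes, the restart between deferral and replay still counts against uptime.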
Hybrid approach
The manual definition solution could naturally develop into a hybrid alternative, where most collections are human-defined but newly discovered ones get automatically added. This approach combines the flexibility of automatic discovery with the reduced downtime of source-specific presets.
Service reliability
Beyond user-facing configuration, webhook captures also raise infrastructure concerns around reliability. Despite our best efforts, it is inevitable that at some point messages will be missed. Sources may offer different recovery systems in case of a 4xx/5xx HTTP response:
Handling automatic replays should be enough for our MVP³. The last option indicates that some webhook bindings may require cursor management.
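For sources that do expose a REST replay or backfill endpoint, cursor management could follow the usual checkpointing shape. The `fetch_page` callable and its `(events, next_cursor)` return shape are invented for illustration; the point is only that the replay cursor is persisted alongside documents so a restart resumes the replay rather than losing it:

```python
def replay_missed_events(fetch_page, checkpoint: dict) -> list:
    """Drain a hypothetical replay endpoint page by page.

    fetch_page(cursor) -> (events, next_cursor or None when exhausted).
    The checkpoint dict stands in for connector state emitted with each page.
    """
    events = []
    cursor = checkpoint.get("replay_cursor")
    while True:
        page, cursor = fetch_page(cursor)
        events.extend(page)
        checkpoint["replay_cursor"] = cursor  # persist with the documents
        if cursor is None:
            break
    return events
```

Because the cursor is checkpointed per page, a crash mid-replay resumes from the last acknowledged page instead of restarting from the beginning.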
To minimise replays and the risk of delayed or lost messages, webhook captures should maximise uptime. This goes against one of the core principles behind the Python CDK's design: that connector tasks should be able to shut down at any point, for any reason, and reliably pick up where they left off.
At present, regular Python connectors restart for many reasons:
Most of these are unrelated to webhook captures, reinforcing the need to keep them decoupled. A REST API capture might stay in a crash loop, holding up all bindings in that config until the bug is fixed — at which point they resume from where they left off. Webhooks may not enjoy that same privilege. They'll need to run as a separate container and likely be defined as a different connector.
Implementation
With these constraints in mind, the following options explore different implementation approaches.
Option A: Expanding the existing `source-http-ingest` connector

The recent signature verification mechanism merged into `source-http-ingest` could provide a model for expanding to accommodate this new feature. As with discriminator fields, signing systems are not standardised across sources and require significant research from the user.

As with that feature, all parameters could be exposed to users (discriminator field location, discriminator values to collect events for, signature location, signature algorithm, etc.) to support as many different sources as possible. Extra setup flows could then be created for known vendors, where a connector developer would hardcode the known parameters and only expose input fields for user-specific credentials.
The problem with this approach is that we would effectively be building a two-tiered connector listing system, with webhook configurations existing as second-class citizens. A user setting up an AppsFlyer capture would have to select `source-http-ingest` from N connectors, then navigate a list of M vendors to find their actual source.

This solution is optimal if we're aiming for minimal development time.
Option B: Establishing a discrete webhook connector development framework
There is currently no straightforward way to turn a webhook connector variant into a first-class citizen. The implementation could be replicated in a different codebase, but each vendor would share 90% of the same source code and each new feature or fix would have to be implemented across many connectors.
There is, however, a precedent for a legacy, non-Python specialised connector type: filesource connectors have a minimal Go module dedicated to the replication of file-like objects. `source-http-ingest` could be turned into a thin wrapper with all parameters exposed, with the actual business logic extracted into a shared Rust crate. Creating new webhook connectors would be trivial from that point.

While a more natural fit for our platform, this solution would still require separate connectors for sources that offer both push- and pull-based APIs. Though slightly more involved than option A, it still fits into our existing ecosystem and requires no feature rewrites.
Option C: Allowing the CDK to deploy `source-http-ingest` instances as a sidecar

Rather than replacing `source-http-ingest` with a native implementation, a `WebhookCaptureSpec` covering all required configuration could be implemented in Python. Once the CDK finds an instance of that spec in a capture configuration, it would spin up a `source-http-ingest` container to receive and buffer incoming webhook events. This would effectively decouple the lifetimes of the more downtime-prone CDK connectors from the listening HTTP server. The CDK would then read the output Flow protocol messages and either forward them or adapt them to the larger context as necessary.

Parsing and manipulating a protocol originally meant for another consumer is far from ideal, but it would allow us to more spontaneously accept incoming network requests. A good example would be Zoho's bulk read jobs: instead of polling for minutes waiting for the operation to end, we could suspend the task until a callback URL was hit.
The main issue here lies in how to respond to event submitters upon receiving messages. The current connector implementation only acknowledges incoming HTTP requests once the document has been saved and checkpointed. If this proposed solution were to wait for that confirmation, there would be delays when the connector container is restarting, effectively undoing the benefits of keeping both processes separate in the first place. If we were to respond with 200 codes on messages being buffered, data could accumulate indefinitely and potentially crash the system while waiting for a processor to take them — all while the source assumes everything has been properly persisted.
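A possible middle path between the two acknowledgement strategies above is a bounded buffer: acknowledge with 200 only while there is room, and shed load with 503 once full so the source's retry machinery takes over instead of data piling up unacknowledged. This is a sketch of the idea, not the current connector's behaviour:

```python
from collections import deque

class BoundedWebhookBuffer:
    """Buffer events between the HTTP listener and a slower consumer."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, event: dict) -> int:
        """Return the HTTP status to send back to the event submitter."""
        if len(self.queue) >= self.capacity:
            return 503  # ask the source to retry later rather than drop data
        self.queue.append(event)
        return 200

    def drain(self, n: int) -> list:
        """Called by the consumer (the CDK side) when it is ready to checkpoint."""
        out = []
        while self.queue and len(out) < n:
            out.append(self.queue.popleft())
        return out
```

The 200 still arrives before a checkpoint, so this only softens the durability gap rather than closing it, but it caps memory growth and keeps the source's replay semantics in the loop.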
Option C would leverage the proven Rust code while making that functionality available within the CDK ecosystem.
Option D: Implementing HTTP server deployments for the CDK
Like B, this option would require maintaining separate connectors for push- and pull-based APIs. Rather than enforcing a difference between webhook and REST connectors at the architecture level, we could simply implement an `aiohttp` server for the CDK and let connector developers determine which resources should go to which connector version. Revisiting the usual reboot causes:

- The CDK could suppress programmed shutdown signals if at least one webhook binding is found.
- Connector developers could attempt to exhaustively define models to minimise inferred schema updates.
- Rather than crashing, webhook captures could log an error and suppress the validation exception.
- The remaining causes could be minimised or completely eliminated by keeping REST resources in a different connector.
This solution brings push-based API handling to a language the integrations team is more comfortable supporting, and widens future hiring options.
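For a sense of scale, the listener side of option D could be wired up in a few lines of `aiohttp`. Binding lookup, checkpointing, and the discriminator location are stubbed out (the body `type` field and the `emit` callback are placeholders); only the `aiohttp` wiring is real:

```python
from aiohttp import web

async def handle_webhook(request: web.Request) -> web.Response:
    body = await request.json()
    topic = body.get("type")  # placeholder: discriminator location is source-specific
    if topic is None:
        return web.Response(status=400, text="missing discriminator")
    # In a real connector: route to the matching binding and only
    # acknowledge after the document has been checkpointed.
    await request.app["emit"](topic, body)
    return web.Response(status=200)

def make_app(emit) -> web.Application:
    app = web.Application()
    app["emit"] = emit  # async callback into the capture machinery
    app.router.add_post("/webhook/{tail:.*}", handle_webhook)
    return app
```

Running it would be `web.run_app(make_app(emit), port=8080)`; the interesting design work is entirely in what `emit` does before the 200 is returned.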
Option E: CDK HTTP servers, transparent dual-container setup
Expanding on the previous solution, we could present what would internally be two connector definitions as a single option with all resources merged. The viability of this solution is yet to be discussed with the UI and control plane teams.
Hiding our internal implementation would make the user experience seamless and would avoid confusion as to which connector is needed for which data type.
Comparison
Footnotes
1. This would make webhook connectors trivial to AI-generate from start to end. ↩
2. Though the raw message could be stored in the connector's state and reprocessed upon restart, the restart itself still affects uptime. ↩
3. Users should be able to create captures for any vendor, regardless of whether the replay mechanism is yet supported. Our documentation could notify them that we cannot guarantee data completeness at the time. ↩