Replies: 6 comments
-
This is inherently a tradeoff. Ultimately, it's impossible to have both: you cannot be 100% sure you persisted the logs without also risking sending the event multiple times. So there are two possible paths here with different tradeoffs:
In both cases, you can add degrees of complexity to reduce the likelihood of failure. The first, most obvious thing is to retry the post-delivery logic in-process a few times with a small exponential back-off. For option 2, you can add a fallback, in which case both persistence methods would need to fail: save the delivery output to Redis and have some process (a cron job or similar) check for entries and, if any exist, re-execute the post-delivery logic and purge them from Redis. I would tend to advocate for option 2 without implementing a fallback (yet).
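A minimal sketch of that retry-then-fallback idea (hypothetical names; `persist` stands in for the logmq write and `fallback` for the Redis stash, not the actual implementation):

```python
import time

def persist_with_fallback(entry, persist, fallback, retries=3, base_delay=0.05):
    """Try the primary log write a few times with exponential back-off;
    if every attempt fails, stash the entry in a fallback store (e.g.
    Redis) for a cron-like process to replay and purge later."""
    delay = base_delay
    for attempt in range(retries):
        try:
            persist(entry)
            return True
        except Exception:
            if attempt < retries - 1:
                time.sleep(delay)
                delay *= 2  # small exponential back-off
    fallback(entry)  # only reached when every retry failed
    return False
```

The log is lost only if both the retried primary write and the fallback store fail, which is the "both persistence methods would need to fail" property described above.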
-
In @alexluong's first scenario, the logmq failure results in re-publishing the events, so we end up with duplicate events. Although logging is very important, I'd say it's less important than trying to publish events only once. This doesn't address everything; however, consistent publishing/delivery behavior seems to be the most important thing to me.
-
I'm sympathetic to this, but to play devil's advocate: the system is already "at-least-once", so duplicates are a condition the receiving system already needs to account for, whereas a missing delivery log is a new failure condition.
-
FWIW, I do agree that losing a delivery log is a serious concern that we should minimize if possible. Ideally, we should not spam the destination, so we should come up with a solution to prevent that from happening, but the occasional duplicate redelivery is acceptable, I think.
This is probably a good thing to try regardless. It can probably help with weird transient issues, but ultimately it still relies on the logmq infrastructure being up. If anything, we should be extra careful: we don't want to DDoS our own logmq infra with the extra requests generated by the retry logic.

As for the Redis fallback, my only concern is whether it could put stress on Redis itself. The data we currently store in Redis (tenants, destinations, idempotency, retry queue) is a lot smaller, whereas the event data is large both in quantity and in size (semi-unbounded; not sure what we're storing yet, but are we supposed to store the webhook request/response?).

Here's some back-of-the-napkin math for the amount of memory we'd need in Redis, assuming 10,000 requests per minute at 1 KB per request:

1 min of storage = 10,000 requests * 1 KB/request = 10,000 KB = 10 MB

I think 1 KB is probably an underestimate of the actual event data, so the exact number could be meaningfully higher. I also don't know whether we should expect significantly more than 10,000 rpm, so that's something to think about.

I think the fallback idea makes sense, but I wonder if we can come up with safer solutions. We can default to Redis, but for serious deployments, maybe we can consider supporting an S3-compatible fallback storage? Should we also consider supporting a healthcheck mechanism, so that when something is down, the service stops delivery until things recover?
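The estimate scales linearly with throughput, payload size, and outage duration; a quick helper to sanity-check it (the numbers are the assumptions stated above, not measurements):

```python
def fallback_memory_mb(rpm: int, kb_per_request: float, minutes: float) -> float:
    """Rough Redis memory needed to buffer delivery logs during an outage."""
    return rpm * kb_per_request * minutes / 1000  # decimal: 1000 KB per MB

# 1 minute at 10,000 rpm and 1 KB/request gives the 10 MB figure above;
# an hour-long logmq outage at the same rate would need ~600 MB.
```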
-
@alexluong, can we clarify the current behavior, make a call on what we feel the ALPHA behavior should be, and then move this discussion to an issue?
-
Current behavior is a NACK.
-
Problem
This is a simplistic pseudo-code of the deliverymq flow, with emphasis on some interesting cases:
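Roughly, a minimal sketch of such a flow (hypothetical names, not the actual implementation):

```python
def handle_delivery(msg, resolve_destination, publish, log_delivery):
    """Sketch of a deliverymq consumer, with the three phases called out."""
    # Pre-publish ops: failures here are safe to NACK and retry,
    # since nothing has reached the destination yet.
    destination = resolve_destination(msg)

    # Publish step: the destination receives the event here.
    response = publish(destination, msg)

    # Post-publish ops: the event is already delivered. If logging
    # fails now, a NACK re-delivers the event; an ACK loses the log.
    log_delivery(msg, response)
```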
We have clear error handling around pre-publish ops & the publish step. The post-publish ops error handling is a bit trickier.
Currently, we don't have any special error handling to differentiate pre vs post publish ops. Is this something we should consider?
For example:
As you can see, logmq essentially becomes a very critical piece of infrastructure: if it fails, we will spam all destinations with however many retries we can until the message ends up in the DLQ.
It's not super clear to me if this is an expected problem of distributed systems, or if there's a way to limit the impact.
Another scenario is when publish fails & the log also fails.