Replies: 6 comments
-
This is inherently a tradeoff. Ultimately, it's impossible to have both: you cannot be 100% sure you persisted the logs without also risking sending the event multiple times. So there are two possible paths here with different tradeoffs:
In both cases, you can add degrees of complexity to reduce the likelihood of failure. The first, most obvious thing is to retry the post-delivery logic in-process a few times with a small exponential back-off. For option 2, you can add a fallback, in which case both persistence methods would need to fail: save the delivery output to Redis and have some process (a cron job or similar) check for entries and, if any exist, re-execute the post-delivery logic and purge them from Redis. I would tend to advocate for option 2 without implementing a fallback (yet).
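A minimal sketch of that retry-then-fallback idea (hypothetical names; `persist` stands in for the logmq write and `fallback` for the Redis stash, not the actual implementation):

```python
import time

def persist_with_fallback(entry, persist, fallback, retries=3, base_delay=0.05):
    """Try the primary log write a few times with exponential back-off;
    if every attempt fails, stash the entry in a fallback store (e.g.
    Redis) for a cron-like process to replay and purge later."""
    delay = base_delay
    for attempt in range(retries):
        try:
            persist(entry)
            return True
        except Exception:
            if attempt < retries - 1:
                time.sleep(delay)
                delay *= 2  # small exponential back-off
    fallback(entry)  # only reached when every retry failed
    return False
```

The log is lost only if both the retried primary write and the fallback store fail, which is the "both persistence methods would need to fail" property described above.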
-
In @alexluong's first scenario, the logmq failure results in re-publishing the events, so we end up with duplicate events. Although logging is very important, I'd say it's less important than trying to publish events only once. This doesn't address everything; however, consistent publishing/delivery behavior seems to be the most important thing to me.
-
I'm sympathetic to this, but to play devil's advocate: the system is already "at-least-once", so duplicates are a condition the receiving system already needs to account for, whereas a missing delivery log is a new failure condition.
-
FWIW, I do agree that losing a delivery log is a serious concern that we should minimize if possible. Ideally, we should not spam the destination, so we should come up with a solution to prevent that from happening, but the occasional duplicate redelivery is acceptable, I think.
This is probably a good thing to try regardless. It can probably help with weird transient issues, but ultimately it still relies on the logmq infrastructure being up. If anything, we should be extra careful: we don't want to DDoS our own logmq infra with the extra requests generated by the retry logic.

As for the Redis fallback, my only concern is whether it could put stress on Redis itself. The data we currently store in Redis (tenants, destinations, idempotency, retry queue) is a lot smaller, whereas the event data is large both in quantity and in size (semi-unbounded; not sure what we're storing yet, but are we supposed to store the webhook request/response?).

Here's some back-of-the-napkin math for the amount of memory we'd need in Redis, assuming 10,000 requests per minute at 1 KB per request:

1 min of storage = 10,000 requests * 1 KB/request = 10,000 KB = 10 MB

I think 1 KB is probably an underestimate of the actual event data, so the exact number could be meaningfully higher. I also don't know whether we should expect significantly more than 10,000 rpm, so that's something to think about.

I think the fallback idea makes sense, but I wonder if we can come up with safer solutions. We can default to Redis, but for serious deployments, maybe we can consider supporting an S3-compatible fallback storage? Should we also consider supporting a healthcheck mechanism, so that when something is down, the service stops delivery until things recover?
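The estimate scales linearly with throughput, payload size, and outage duration; a quick helper to sanity-check it (the numbers are the assumptions stated above, not measurements):

```python
def fallback_memory_mb(rpm: int, kb_per_request: float, minutes: float) -> float:
    """Rough Redis memory needed to buffer delivery logs during an outage."""
    return rpm * kb_per_request * minutes / 1000  # decimal: 1000 KB per MB

# 1 minute at 10,000 rpm and 1 KB/request gives the 10 MB figure above;
# an hour-long logmq outage at the same rate would need ~600 MB.
```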
-
@alexluong, can we clarify the current behavior, make a call on what we feel the ALPHA behavior should be, and then move this discussion to an issue?
-
Current behavior is a NACK.
-
Problem
This is a simplistic pseudo-code of the deliverymq flow, with emphasis on some interesting cases:
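Roughly, a minimal sketch of such a flow (hypothetical names, not the actual implementation):

```python
def handle_delivery(msg, resolve_destination, publish, log_delivery):
    """Sketch of a deliverymq consumer, with the three phases called out."""
    # Pre-publish ops: failures here are safe to NACK and retry,
    # since nothing has reached the destination yet.
    destination = resolve_destination(msg)

    # Publish step: the destination receives the event here.
    response = publish(destination, msg)

    # Post-publish ops: the event is already delivered. If logging
    # fails now, a NACK re-delivers the event; an ACK loses the log.
    log_delivery(msg, response)
```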
We have clear error handling around pre-publish ops & the publish step. The post-publish ops error handling is a bit trickier.
Currently, we don't have any special error handling to differentiate pre vs post publish ops. Is this something we should consider?
For example:
As you can see, logmq essentially becomes a very critical piece of infrastructure: if it fails, we will spam all destinations with however many retries we can until the message ends up in the DLQ.
It's not super clear to me if this is an expected problem of distributed systems, or if there's a way to limit the impact.
Another scenario is when publish fails & the log also fails.