Description
Is your feature request related to a problem? Please describe.
A variety of issues can result in decision tasks repeatedly failing, such as specifying the wrong Workflow type, non-determinism, or potentially invalid input. Retrying the decision task forever can be convenient as the workflows will automatically resume once the issue is fixed, but it doesn't effectively convey to the user that some action is required.
These retries also create unnecessary load on the server and can conflict with users' workloads as they compete for resources, scheduling, and rate limiting. There have been a number of attempts to address this problem, such as adding backoff or eventually abandoning dispatching the task altogether. Both deliver a poor user experience.
If a workflow were instead failed after a certain number of attempts for a given decision task, that would give the user a clear signal that it will not complete without intervention. The user could then reset the workflow to resume execution once the problem has been addressed.
Proposed Solution
Similar to restrictions on workflow size, or concurrent activity execution, we should add configuration options to warn about and ultimately terminate workflows that exceed a certain number of attempts for a given decision task.
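As a rough illustration of the warn-then-terminate idea, here is a minimal sketch in Python. The names `DecisionAttemptLimits`, `warn_after`, `terminate_after`, and `evaluate`, as well as the default values, are hypothetical and chosen only for this example; they are not existing server configuration keys or APIs.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"          # keep redispatching the decision task
    WARN = "warn"            # keep retrying, but surface a warning
    TERMINATE = "terminate"  # stop retrying and end the workflow


@dataclass(frozen=True)
class DecisionAttemptLimits:
    # Hypothetical option names and defaults, for illustration only.
    warn_after: int = 10
    terminate_after: int = 100

    def __post_init__(self):
        # The warning threshold must not exceed the termination threshold.
        if not 0 < self.warn_after <= self.terminate_after:
            raise ValueError("expected 0 < warn_after <= terminate_after")


def evaluate(attempt: int, limits: DecisionAttemptLimits) -> Action:
    """Pick an action after `attempt` failed attempts of one decision task."""
    if attempt >= limits.terminate_after:
        return Action.TERMINATE
    if attempt >= limits.warn_after:
        return Action.WARN
    return Action.RETRY
```

Keeping two separate thresholds would let operators notice a stuck workflow well before it is terminated. How the attempt count is tracked and where the limits are configured are left open here.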
Ideally we would add some sort of search attribute or metadata to the affected workflows to make them easy to find, and provide a clear mechanism for users to reset them in bulk to that specific point in the workflow history.
We should document the existing behavior around retries, backoff, and abandoning tasks as well.
Additional context
One additional nuance is that the first decision task for a Workflow has a TTL equivalent to the Workflow's overall TTL. This is an optimization to avoid redispatching it over and over. As a result, the solution described here will not fail workflows started with the wrong TaskList.