Description
Background:
Quartz workflow jobs (such as RunWorkflowJob and ResumeWorkflowJob) can currently fail permanently when they hit transient infrastructure disruptions, such as temporary PostgreSQL connection issues (e.g., Connection refused, EndOfStreamException, stream read errors). Such disruptions are common in distributed and containerized environments. The job execution logic has no retry or resilience mechanism, so jobs fail that would otherwise succeed once the infrastructure recovers.
Proposal:
- Introduce a Polly-based, configurable retry policy for all Quartz job executions related to workflows.
- The retry policy should:
  - Handle known transient failures (e.g., NpgsqlException caused by I/O errors, EndOfStreamException, SocketException, TimeoutException).
  - Wrap the entire job execution to prevent partial execution or inconsistent states.
  - Use exponential backoff with jitter to avoid concentrated retry attempts during outages.
  - Log each retry at warning level, including job name, attempt number, delay, and exception details for diagnostics.
- Configuration options (via application settings):
  - Enable/disable retries
  - Maximum retry count
  - Initial backoff delay and backoff strategy
  - Exception filtering (with safe defaults)
  - Custom strategy support
- Default values should enable retries with safe, production-sensible settings: 3–5 retries, exponential backoff starting at several hundred milliseconds.
- Advisory lock handling: When releasing advisory locks, failures caused by broken database connections should not fail the job, since PostgreSQL releases locks on session termination automatically.
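
The retry behavior proposed above can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the Polly API or the eventual implementation. Every name in it (`RetryOptions`, `backoff_delay`, `run_with_retry`, `release_lock_quietly`) is hypothetical, the Python built-in exceptions stand in for the .NET exception types listed above, and "full jitter" (a uniformly random delay up to the exponential ceiling) is one assumed interpretation of "exponential backoff with jitter".

```python
import logging
import random
import time
from dataclasses import dataclass

logger = logging.getLogger("workflow.jobs")


@dataclass
class RetryOptions:
    """Hypothetical settings mirroring the proposed configuration options."""
    enabled: bool = True
    max_retries: int = 3            # the proposal suggests 3-5
    initial_delay_s: float = 0.5    # "several hundred milliseconds"
    max_delay_s: float = 30.0       # assumed cap, not part of the proposal
    # Stand-ins for the transient .NET exceptions named in the proposal.
    transient: tuple = (ConnectionError, TimeoutError, EOFError)


def backoff_delay(attempt, opts):
    """Full-jitter exponential backoff for a 1-based retry attempt."""
    ceiling = min(opts.max_delay_s, opts.initial_delay_s * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)


def run_with_retry(job_name, job, opts, sleep=time.sleep):
    """Run the whole job callable, retrying only on transient failures."""
    attempt = 0
    while True:
        try:
            return job()
        except opts.transient as exc:
            attempt += 1
            if not opts.enabled or attempt > opts.max_retries:
                raise  # retries disabled or exhausted: surface the failure
            delay = backoff_delay(attempt, opts)
            logger.warning(
                "Job %s failed, retry %d of %d in %.3fs: %r",
                job_name, attempt, opts.max_retries, delay, exc,
            )
            sleep(delay)


def release_lock_quietly(release, opts):
    """Release an advisory lock; a broken connection is logged, not raised,
    because PostgreSQL drops session-level locks when the session ends."""
    try:
        release()
    except opts.transient as exc:
        logger.warning("Advisory lock release skipped: %r", exc)
```

Because the sketch retries the entire job callable, it assumes the job is safe to run again from the start. Exceptions outside the `transient` tuple propagate immediately, which matches the intent of exception filtering with safe defaults.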
Benefits:
- Improved resilience of workflow job execution against transient failures.
- Avoids permanent failures due to short-lived infrastructure issues, without changing workflow execution semantics.
- Aligns with best practices for reliability and cloud-native resilience.
This improvement focuses on generalized Quartz resilience; no customer-specific or proprietary context is included.