Introduce Resilient Retry Policy for Quartz Workflow Jobs Using Polly #102

@sfmskywalker

Background:
Quartz workflow jobs such as RunWorkflowJob and ResumeWorkflowJob currently fail permanently when they hit transient infrastructure disruptions, for example temporary PostgreSQL connection issues (Connection refused, EndOfStreamException, stream read errors). Such disruptions are common in distributed and containerized environments. The current job execution logic has no retry or resilience mechanism, so jobs fail that would otherwise succeed once the infrastructure recovers.

Proposal:

  • Introduce a Polly-based, configurable retry policy for all Quartz job executions related to workflows.
  • The retry policy should:
    • Handle known transient failures (e.g., NpgsqlException caused by I/O errors, EndOfStreamException, SocketException, TimeoutException).
    • Wrap the entire job execution to prevent partial execution or inconsistent states.
    • Use exponential backoff with jitter to avoid concentrated retry attempts during outages.
    • Log each retry at warning level, including job name, attempt number, delay, and exception details for diagnostics.
  • Configuration options (via application settings):
    • Enable/disable retries
    • Maximum retry count
    • Initial backoff delay and backoff strategy
    • Exception filtering (with safe defaults)
    • Custom strategy support
  • Default values should enable retries with safe, production-sensible settings: 3–5 retries, exponential backoff starting at several hundred milliseconds.
  • Advisory lock handling: When releasing an advisory lock fails because the database connection is broken, that failure should not fail the job, since PostgreSQL automatically releases session-level advisory locks when the session ends.
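
To make the proposal concrete, here is a minimal sketch of what such a policy could look like. It assumes the Polly v7 API (Policy.Handle, WaitAndRetryAsync, Policy.NoOpAsync) and .NET 6 or later. The names WorkflowJobRetryOptions, WorkflowJobRetryPolicy, IsTransient and ComputeDelay are hypothetical and do not exist in the codebase, and the defaults shown are only one reading of "3–5 retries" and "several hundred milliseconds". NpgsqlException is left out of the type check to keep the sketch dependency-free; walking the InnerException chain still catches the I/O errors it typically wraps.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using Polly;

// Hypothetical options class; property names and defaults are illustrative only.
public sealed class WorkflowJobRetryOptions
{
    public bool Enabled { get; set; } = true;
    public int MaxRetries { get; set; } = 3;
    public TimeSpan InitialDelay { get; set; } = TimeSpan.FromMilliseconds(500);
}

public static class WorkflowJobRetryPolicy
{
    // Returns true if the exception, or any exception in its InnerException
    // chain, is one of the transient types named in the proposal.
    public static bool IsTransient(Exception? exception)
    {
        for (var e = exception; e != null; e = e.InnerException)
        {
            if (e is EndOfStreamException or SocketException or TimeoutException)
                return true;
        }
        return false;
    }

    // Exponential backoff with full jitter: a random delay between zero and
    // initialDelay * 2^(attempt - 1), where attempt starts at 1.
    public static TimeSpan ComputeDelay(int attempt, TimeSpan initialDelay)
    {
        var ceilingMs = initialDelay.TotalMilliseconds * Math.Pow(2, attempt - 1);
        return TimeSpan.FromMilliseconds(Random.Shared.NextDouble() * ceilingMs);
    }

    // Builds the policy. logWarning stands in for whatever logger the job uses.
    public static IAsyncPolicy Create(
        WorkflowJobRetryOptions options, string jobName, Action<string> logWarning)
    {
        if (!options.Enabled)
            return Policy.NoOpAsync(); // retries disabled: execute once, no handling

        return Policy
            .Handle<Exception>(IsTransient)
            .WaitAndRetryAsync(
                options.MaxRetries,
                attempt => ComputeDelay(attempt, options.InitialDelay),
                (exception, delay, attempt, _) => logWarning(
                    $"Job {jobName}: retry {attempt} in {delay.TotalMilliseconds:F0} ms " +
                    $"after {exception.GetType().Name}: {exception.Message}"));
    }
}
```

A job would then run its whole body through the policy, for example `await policy.ExecuteAsync(() => RunAsync(context))`, where RunAsync is a placeholder for the existing job logic. Full jitter is one of several reasonable backoff strategies; a decorrelated-jitter variant could be swapped in behind the same options.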

Benefits:

  • Improved resilience of workflow job execution against transient failures.
  • Avoids permanent failures due to short-lived infrastructure issues, without changing workflow execution semantics.
  • Aligns with best practices for reliability and cloud-native resilience.

This improvement focuses on generalized Quartz resilience; no customer-specific or proprietary context is included.

Project status: In Progress

Relationships

None yet

Development

No branches or pull requests

Issue actions