The Problem
When a job fails due to infrastructure reasons (preemption, node eviction, OOMKill), Armada permanently fails the job. Users must manually resubmit, even though the failure wasn't their fault.
The existing retry logic only retries jobs in narrow cases:
- Executor heartbeat timeout
- Executor explicitly returns the lease
- Armada preemption marks the run as "Returned"
Any other failure, including OOMKill, node eviction, or application-level transient errors, results in permanent failure.
```go
// Current logic: only "Returned" runs can retry
requeueJob = !failFast && lastRun.Returned() && job.NumAttempts() < s.maxAttemptedRuns
```

The Solution
A configurable retry policy system. Operators define policies with rules that match failures by exit code, condition (OOMKilled, Evicted, etc.), termination message, or error category (from #4713). The scheduler evaluates failed jobs against these rules and retries when appropriate.
- User code failures still fail by default. Operators explicitly configure what to retry.
- Feature-flagged, with `retryPolicy.enabled: false` by default. When disabled, pod names use the legacy format.
- RetryPolicy is a CRUD resource (like Queue), managed via `armadactl` without scheduler restarts.
- Queue policies always apply. Job annotations can add more policies but cannot remove queue-level ones.
RetryPolicy Resource
Managed the same way as Queue: stored in the database, cached by the scheduler.
```
armadactl create -f ./retry-policy.yaml
armadactl get retrypolicy infrastructure
armadactl update -f ./retry-policy.yaml
armadactl delete retrypolicy infrastructure
```

Policy YAML
```yaml
apiVersion: armadaproject.io/v1
kind: RetryPolicy
metadata:
  name: ml-training
spec:
  retryLimit: 5
  defaultAction: Fail
  backoff:
    initialDelay: 0s
  antiAffinity:
    mode: none
  rules:
    - action: Retry
      onConditions:
        - Preempted
        - Evicted
      antiAffinity:
        mode: node
    - action: Retry
      onConditions:
        - OOMKilled
      retryLimit: 3
    - action: Retry
      onExitCodes:
        operator: In
        values: [137]
    - action: Retry
      onFailureCategory: [cuda_error, infiniband_error]
```

Scheduler Config
Global settings only, not policy definitions:
```yaml
scheduling:
  retryPolicy:
    enabled: true
    globalMaxRetries: 20
    defaultPolicyName: default
    defaultBackoff:
      initialDelay: 0s
      maxDelay: 10m
      multiplier: 2.0
```

Policy Resolution and Composition
Policies are additive. Queue-level policies always apply. Job annotations can add more but cannot remove or replace queue-level ones.
- Start with queue policies: `armadactl create queue ml-queue --retry-policy infra,ml-training`
- Job annotation adds additional policies: `armadaproject.io/retry-policy: extra-retry`
- If neither specifies anything, the default policy applies: `scheduling.retryPolicy.defaultPolicyName`
```yaml
# Job adds additional policies (cannot remove queue-level ones)
jobs:
  - namespace: default
    annotations:
      armadaproject.io/retry-policy: extra-retry
    podSpec:
      containers:
        - name: training
          image: my-image:latest
```

Effective policies for this job: `[infra, ml-training, extra-retry]`.
When multiple policies apply, their rules are concatenated in order (first policy's rules first) and evaluated first-match-wins. Each rule tracks its own retry count separately. A rule's effective limit is:
```
effectiveLimit = rule.retryLimit ?? policy.retryLimit ?? globalMaxRetries
```
A retry is allowed when the rule's count is under its effective limit AND total retries are under globalMaxRetries.
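The limit resolution and the allow check above can be sketched as follows. This is illustrative only: the types and field names are assumptions, not Armada's actual scheduler structs.

```go
package main

import "fmt"

// Rule and Policy carry optional retry limits; nil means "inherit".
type Rule struct{ RetryLimit *int }
type Policy struct{ RetryLimit *int }

// effectiveLimit resolves rule.retryLimit ?? policy.retryLimit ?? globalMaxRetries.
func effectiveLimit(rule Rule, policy Policy, globalMax int) int {
	if rule.RetryLimit != nil {
		return *rule.RetryLimit
	}
	if policy.RetryLimit != nil {
		return *policy.RetryLimit
	}
	return globalMax
}

// retryAllowed: the matched rule's own count must be under its effective
// limit AND total retries across all rules must be under globalMaxRetries.
func retryAllowed(ruleCount, totalCount int, rule Rule, policy Policy, globalMax int) bool {
	return ruleCount < effectiveLimit(rule, policy, globalMax) && totalCount < globalMax
}

func main() {
	three, five := 3, 5
	// OOM rule with retryLimit 3 inside a policy with retryLimit 5, global 20.
	fmt.Println(retryAllowed(2, 10, Rule{RetryLimit: &three}, Policy{RetryLimit: &five}, 20)) // true
	fmt.Println(retryAllowed(3, 10, Rule{RetryLimit: &three}, Policy{RetryLimit: &five}, 20)) // false: rule limit hit
}
```

Lowering `globalMaxRetries` at runtime immediately shrinks every rule's ceiling, which is what makes it usable as a kill switch.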
globalMaxRetries can be lowered at runtime (scheduler config reload) as an emergency kill switch. Jobs already over the new limit fail on next evaluation.
Example
Queue has policies infra (retryLimit: 10, rules for Preempted/Evicted) and ml-training (retryLimit: 5, rule for OOMKilled with retryLimit: 3).
| Failure | Matched rule | Effective limit | Check |
|---|---|---|---|
| Preempted | infra/Preempted | 10 (from policy) | preemption count < 10 AND total < 20 |
| OOMKilled | ml/OOMKilled | 3 (from rule) | OOM count < 3 AND total < 20 |
The job can be preempted up to 10 times AND OOMKilled up to 3 times independently. globalMaxRetries: 20 caps total retries regardless.
Rule Types
Rules are evaluated in order. First match wins. No match = defaultAction (see below).
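A minimal sketch of the first-match-wins evaluation over the concatenated rule list (condition-matching only; types and names are illustrative, and real matching also covers exit codes, termination messages, and failure categories):

```go
package main

import "fmt"

type Action string

// Failure is a simplified view of a failed run.
type Failure struct{ Condition string }

type Rule struct {
	Action       Action
	OnConditions []string
}

// evaluate walks the rules in order and returns the first matching rule's
// action; if no rule matches, the policy's defaultAction applies.
func evaluate(rules []Rule, f Failure, defaultAction Action) Action {
	for _, r := range rules {
		for _, c := range r.OnConditions {
			if c == f.Condition {
				return r.Action
			}
		}
	}
	return defaultAction
}

func main() {
	rules := []Rule{
		{Action: "Retry", OnConditions: []string{"Preempted", "Evicted"}},
		{Action: "Retry", OnConditions: []string{"OOMKilled"}},
	}
	fmt.Println(evaluate(rules, Failure{Condition: "OOMKilled"}, "Fail"))     // Retry
	fmt.Println(evaluate(rules, Failure{Condition: "StartupError"}, "Fail")) // Fail (defaultAction)
}
```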
Container Targeting
Every rule can optionally specify containerName to restrict matching to a specific container. This matters for multi-container pods (sidecars, init containers) where different containers fail for different reasons.
```yaml
rules:
  - action: Retry
    containerName: "main"   # only match failures from this container
    onExitCodes:
      operator: In
      values: [137]
  - action: Fail
    containerName: "log-shipper"   # sidecar OOM is not worth retrying
    onConditions:
      - OOMKilled
```

When `containerName` is omitted, the rule matches against the first failed container (current behavior). When set, the rule only considers that container's exit code, termination reason, and message.
includeInitContainers: false (default) skips init containers. Set to true to also consider init container failures.
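One way the targeting semantics above might be implemented, as a sketch; the container-status shape and function name are assumptions, not Armada's actual types:

```go
package main

import "fmt"

// ContainerFailure is a simplified view of a terminated container.
type ContainerFailure struct {
	Name string
	Init bool // true for init containers
}

// selectContainer picks the container a rule should evaluate:
//   - containerName set: only that container is considered
//   - containerName empty: the first failed container (skipping init
//     containers unless includeInitContainers is true)
func selectContainer(failed []ContainerFailure, containerName string, includeInit bool) *ContainerFailure {
	for i := range failed {
		c := &failed[i]
		if c.Init && !includeInit {
			continue // includeInitContainers: false (default) skips init containers
		}
		if containerName == "" || c.Name == containerName {
			return c
		}
	}
	return nil // rule cannot match this failure at all
}

func main() {
	failed := []ContainerFailure{
		{Name: "init-fetch", Init: true},
		{Name: "log-shipper"},
		{Name: "main"},
	}
	fmt.Println(selectContainer(failed, "", false).Name)     // log-shipper (first non-init)
	fmt.Println(selectContainer(failed, "main", false).Name) // main
}
```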
Default Action
Each policy has a defaultAction that applies when no rule matches a failure:
```yaml
spec:
  defaultAction: Fail   # Fail (default) or Retry
  rules: [...]
```

Fail is the safe default: unrecognized failures stop the job. Operators running fault-tolerant batch work can set `defaultAction: Retry` to catch unexpected infrastructure failures without writing catch-all rules.
Exit Code Matching
```yaml
rules:
  - action: Retry
    onExitCodes:
      operator: In   # or NotIn
      values: [137, 143]
```

Exit code 0 is the proto3 default ("not set") and never matches any exit code rule. Combine with `containerName` to match a specific container's exit code.
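A sketch of the In/NotIn semantics including the exit-code-0 exclusion (illustrative, not the actual implementation):

```go
package main

import "fmt"

// matchExitCodes implements In/NotIn over a value list. Exit code 0 is the
// proto3 default ("not set") and never matches, regardless of operator.
func matchExitCodes(operator string, values []int32, exitCode int32) bool {
	if exitCode == 0 {
		return false
	}
	in := false
	for _, v := range values {
		if v == exitCode {
			in = true
			break
		}
	}
	if operator == "NotIn" {
		return !in
	}
	return in // operator == "In"
}

func main() {
	fmt.Println(matchExitCodes("In", []int32{137, 143}, 137)) // true
	fmt.Println(matchExitCodes("NotIn", []int32{137}, 1))     // true
	fmt.Println(matchExitCodes("NotIn", []int32{137}, 0))     // false: 0 never matches, even for NotIn
}
```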
Condition Matching
| Condition | When it's set |
|---|---|
| `OOMKilled` | Container exceeded its memory limit |
| `Evicted` | Pod evicted due to node pressure |
| `Preempted` | Pod preempted by the scheduler |
| `DeadlineExceeded` | Pod exceeded `activeDeadlineSeconds` |
| `Unschedulable` | Pod couldn't be scheduled |
```yaml
rules:
  - action: Retry
    onConditions:
      - OOMKilled
      - Evicted
```

Termination Message Matching
Jobs can signal retry intent by writing to /dev/termination-log.
```yaml
rules:
  - action: Retry
    containerName: "main"
    onTerminationMessage:
      pattern: ".*TRANSIENT.*"
```

`containerName` works the same as on other rule types. When omitted, all non-init container messages are checked.
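Matching presumably compiles the rule's regex and tests each candidate container's termination message; a minimal sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

// matchTerminationMessage reports whether any candidate message matches the
// rule's pattern. Compiling on every call is wasteful; a real implementation
// would compile once when the policy loads and reject invalid patterns at
// policy validation time.
func matchTerminationMessage(pattern string, messages []string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	for _, m := range messages {
		if re.MatchString(m) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, _ := matchTerminationMessage(".*TRANSIENT.*", []string{"TRANSIENT: lost NFS mount, safe to retry"})
	fmt.Println(ok) // true
}
```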
Error Category Matching
Uses categories from #4713. The executor classifies failures into named categories; retry policies can match on them.
```yaml
rules:
  - action: Retry
    onFailureCategory: [cuda_error, infiniband_error]
```

Per-Rule Overrides
Each rule can override policy-level retryLimit, antiAffinity, and backoff:
```yaml
rules:
  - action: Retry
    retryLimit: 3
    antiAffinity:
      mode: node
    backoff:
      initialDelay: 30s
      maxDelay: 5m
      multiplier: 2.0
    onConditions:
      - OOMKilled
```

Backoff
Exponential backoff between retries. Configured at three levels (global default, policy, rule) with the most specific taking precedence.
```yaml
# Policy-level with a rule-level override
spec:
  backoff:
    initialDelay: 10s
    maxDelay: 5m
    multiplier: 2.0
  rules:
    - action: Retry
      onConditions: [Evicted]
      backoff:
        initialDelay: 30s
        maxDelay: 10m
        multiplier: 3.0
```

Implementation:
- Add a `retryAfter` field to the Job (in-memory + DB column `retry_after`)
- On retry, compute delay: `min(initialDelay * multiplier^(failureCount-1), maxDelay)`
- Set `retryAfter = now + delay`
- Scheduling loop skips jobs where `retryAfter > now`
Anti-Affinity
Configurable per-policy and per-rule. No anti-affinity by default.
| Mode | Behavior |
|---|---|
| `none` | No avoidance (default) |
| `node` | Avoid the node where the most recent failed run executed |
Only the most recent failed run's node is avoided, not all previous nodes. This prevents accumulating constraints that make the job unschedulable after several retries.
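Under `mode: node` the constraint derives only from the latest failed run; a sketch (the run shape and helper name are assumptions):

```go
package main

import "fmt"

// Run is a simplified view of a job run.
type Run struct {
	Node   string
	Failed bool
}

// nodesToAvoid returns at most one node: the one that hosted the most recent
// failed run. Earlier failures are deliberately forgotten so constraints
// don't accumulate and make the job unschedulable after several retries.
func nodesToAvoid(mode string, runs []Run) []string {
	if mode != "node" {
		return nil
	}
	for i := len(runs) - 1; i >= 0; i-- {
		if runs[i].Failed {
			return []string{runs[i].Node}
		}
	}
	return nil
}

func main() {
	runs := []Run{{Node: "node-a", Failed: true}, {Node: "node-b", Failed: true}}
	fmt.Println(nodesToAvoid("node", runs)) // [node-b] — only the latest failure's node
	fmt.Println(nodesToAvoid("none", runs)) // []
}
```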
```yaml
spec:
  antiAffinity:
    mode: none
  rules:
    - action: Retry
      onConditions: [Preempted, Evicted]
      antiAffinity:
        mode: node   # avoid same node for infra failures
    - action: Retry
      onConditions: [OOMKilled]
      # inherits policy-level: none
```

Pod Naming
When `retryPolicy.enabled: true`:

- First run (`runIndex=0`): `armada-<jobId>-0` (legacy format)
- Retry runs: `armada-<jobId>-0-<runIndex>`, e.g. `armada-<jobId>-0-1`
When disabled, all runs use legacy format. Services and ingresses follow the same pattern.
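The scheme is simple enough to sketch directly (a hypothetical helper, not Armada's actual naming code):

```go
package main

import "fmt"

// podName builds the pod name for a given run attempt. With the feature
// disabled, or for the first run, the legacy "armada-<jobId>-0" format is
// kept so existing tooling and dashboards keep working.
func podName(jobId string, runIndex int, retryPolicyEnabled bool) string {
	if !retryPolicyEnabled || runIndex == 0 {
		return fmt.Sprintf("armada-%s-0", jobId)
	}
	return fmt.Sprintf("armada-%s-0-%d", jobId, runIndex)
}

func main() {
	fmt.Println(podName("01hxyz", 0, true))  // armada-01hxyz-0 (legacy format)
	fmt.Println(podName("01hxyz", 1, true))  // armada-01hxyz-0-1
	fmt.Println(podName("01hxyz", 1, false)) // armada-01hxyz-0 (feature disabled)
}
```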
Graceful Shutdown on Retry
When a preempted job is retried, the previous run must have terminated first. Otherwise both old and new pods run simultaneously, potentially corrupting shared state (GPU memory, file locks, checkpoints).
The race: the scheduler marks a run as preempted and requeues immediately, but the old pod is still in graceful shutdown on the executor.
Solution: set retryAfter = now + terminationGracePeriodSeconds. If the executor confirms the old run terminated before the timer expires, clear retryAfter immediately. The retry becomes eligible at whichever comes first: termination confirmation or timer expiry.
This shares the retryAfter mechanism with backoff. For preemption retries, the effective delay is max(backoffDelay, terminationGracePeriodSeconds), reduced once termination is confirmed.
Breaking Changes
**Pod and service naming:** Retry runs get `armada-<jobId>-0-<runIndex>`. The first run keeps the legacy name. Disabling the feature restores legacy naming entirely.

**Gang scheduling:** Retries create multiple "generations" of gang members. The scheduler filters by generation, keyed by `gangGenerationKey{gangId, NumAttempts}`, when counting active gang jobs to avoid double-counting.

**Queue ordering:** Requeued jobs are scheduled before new jobs at the same priority. Under high retry rates this could starve new submissions.

**JobRequeued event:** A new event type not yet exposed via the public API. Consumers that exhaustively match on event types will need updating.
Blockers and Related
- Error Categorization #4713: the failure classification system that retry policies build on. `onFailureCategory` references categories defined there.
- Continues work from Native support for preemption retries #4340 by @Sovietaced.