The Problem
When a job fails due to infrastructure reasons (preemption, node eviction, OOMKill), Armada permanently fails the job. Users must manually resubmit, even though the failure wasn't their fault.
The existing retry logic only retries jobs in narrow cases:
- Executor heartbeat timeout
- Executor explicitly returns the lease
- Armada preemption marks the run as "Returned"
Any other failure, including OOMKill, node eviction, or application-level transient errors, results in permanent failure.
```go
// Current logic: only "Returned" runs can retry
requeueJob = !failFast && lastRun.Returned() && job.NumAttempts() < s.maxAttemptedRuns
```

The Solution
A configurable retry policy system. Operators define policies with rules that match failures by exit code, condition (OOMKilled, Evicted, etc.), termination message, or error category (from #4713). The scheduler evaluates failed jobs against these rules and retries when appropriate.
- User code failures still fail by default. Operators explicitly configure what to retry.
- Feature-flagged, with `retryPolicy.enabled: false` by default. When disabled, pod names use the legacy format.
- RetryPolicy is a CRUD resource (like Queue), managed via `armadactl` without scheduler restarts.
- Queue policies always apply. Job annotations can add more policies but cannot remove queue-level ones.
RetryPolicy Resource
Managed the same way as Queue: stored in the database, cached by the scheduler.
```
armadactl create -f ./retry-policy.yaml
armadactl get retrypolicy infrastructure
armadactl update -f ./retry-policy.yaml
armadactl delete retrypolicy infrastructure
```

Policy YAML
```yaml
apiVersion: armadaproject.io/v1
kind: RetryPolicy
metadata:
  name: ml-training
spec:
  retryLimit: 5
  defaultAction: Fail
  backoff:
    initialDelay: 0s
  antiAffinity:
    mode: none
  rules:
    - action: Retry
      onConditions:
        - Preempted
        - Evicted
      antiAffinity:
        mode: node
    - action: Retry
      onConditions:
        - OOMKilled
      retryLimit: 3
    - action: Retry
      onExitCodes:
        operator: In
        values: [137]
    - action: Retry
      onFailureCategory: [cuda_error, infiniband_error]
```

Scheduler Config
Global settings only, not policy definitions:
```yaml
scheduling:
  retryPolicy:
    enabled: true
    globalMaxRetries: 20
    defaultPolicyName: default
    defaultBackoff:
      initialDelay: 0s
      maxDelay: 10m
      multiplier: 2.0
```

Policy Resolution and Composition
Policies are additive. Queue-level policies always apply. Job annotations can add more but cannot remove or replace queue-level ones.
- Start with queue policies: `armadactl create queue ml-queue --retry-policy infra,ml-training`
- Job annotation adds additional policies: `armadaproject.io/retry-policy: extra-retry`
- If neither specifies anything, the default policy applies: `scheduling.retryPolicy.defaultPolicyName`
```yaml
# Job adds additional policies (cannot remove queue-level ones)
jobs:
  - namespace: default
    annotations:
      armadaproject.io/retry-policy: extra-retry
    podSpec:
      containers:
        - name: training
          image: my-image:latest
```

Effective policies for this job: `[infra, ml-training, extra-retry]`.
When multiple policies apply, their rules are concatenated in order (first policy's rules first) and evaluated first-match-wins. Each rule tracks its own retry count separately. A rule's effective limit is:
```
effectiveLimit = rule.retryLimit ?? policy.retryLimit ?? globalMaxRetries
```
A retry is allowed when the rule's count is under its effective limit AND total retries are under globalMaxRetries.
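The limit resolution and the allow check above can be sketched as follows. This is illustrative only: the types and field names are assumptions, not Armada's actual scheduler structs.

```go
package main

import "fmt"

// Rule and Policy carry optional retry limits; nil means "inherit".
type Rule struct{ RetryLimit *int }
type Policy struct{ RetryLimit *int }

// effectiveLimit resolves rule.retryLimit ?? policy.retryLimit ?? globalMaxRetries.
func effectiveLimit(rule Rule, policy Policy, globalMax int) int {
	if rule.RetryLimit != nil {
		return *rule.RetryLimit
	}
	if policy.RetryLimit != nil {
		return *policy.RetryLimit
	}
	return globalMax
}

// retryAllowed: the matched rule's own count must be under its effective
// limit AND total retries across all rules must be under globalMaxRetries.
func retryAllowed(ruleCount, totalCount int, rule Rule, policy Policy, globalMax int) bool {
	return ruleCount < effectiveLimit(rule, policy, globalMax) && totalCount < globalMax
}

func main() {
	three, five := 3, 5
	// OOM rule with retryLimit 3 inside a policy with retryLimit 5, global 20.
	fmt.Println(retryAllowed(2, 10, Rule{RetryLimit: &three}, Policy{RetryLimit: &five}, 20)) // true
	fmt.Println(retryAllowed(3, 10, Rule{RetryLimit: &three}, Policy{RetryLimit: &five}, 20)) // false: rule limit hit
}
```

Lowering `globalMaxRetries` at runtime immediately shrinks every rule's ceiling, which is what makes it usable as a kill switch.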
globalMaxRetries can be lowered at runtime (scheduler config reload) as an emergency kill switch. Jobs already over the new limit fail on next evaluation.
Example
Queue has policies infra (retryLimit: 10, rules for Preempted/Evicted) and ml-training (retryLimit: 5, rule for OOMKilled with retryLimit: 3).
| Failure | Matched rule | Effective limit | Check |
|---|---|---|---|
| Preempted | infra/Preempted | 10 (from policy) | preemption count < 10 AND total < 20 |
| OOMKilled | ml/OOMKilled | 3 (from rule) | OOM count < 3 AND total < 20 |
The job can be preempted up to 10 times AND OOMKilled up to 3 times independently. globalMaxRetries: 20 caps total retries regardless.
Rule Types
Rules are evaluated in order. First match wins. No match = defaultAction (see below).
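A minimal sketch of the first-match-wins evaluation over the concatenated rule list (condition-matching only; types and names are illustrative, and real matching also covers exit codes, termination messages, and failure categories):

```go
package main

import "fmt"

type Action string

// Failure is a simplified view of a failed run.
type Failure struct{ Condition string }

type Rule struct {
	Action       Action
	OnConditions []string
}

// evaluate walks the rules in order and returns the first matching rule's
// action; if no rule matches, the policy's defaultAction applies.
func evaluate(rules []Rule, f Failure, defaultAction Action) Action {
	for _, r := range rules {
		for _, c := range r.OnConditions {
			if c == f.Condition {
				return r.Action
			}
		}
	}
	return defaultAction
}

func main() {
	rules := []Rule{
		{Action: "Retry", OnConditions: []string{"Preempted", "Evicted"}},
		{Action: "Retry", OnConditions: []string{"OOMKilled"}},
	}
	fmt.Println(evaluate(rules, Failure{Condition: "OOMKilled"}, "Fail"))     // Retry
	fmt.Println(evaluate(rules, Failure{Condition: "StartupError"}, "Fail")) // Fail (defaultAction)
}
```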
Container Targeting
Every rule can optionally specify containerName to restrict matching to a specific container. This matters for multi-container pods (sidecars, init containers) where different containers fail for different reasons.
```yaml
rules:
  - action: Retry
    containerName: "main"   # only match failures from this container
    onExitCodes:
      operator: In
      values: [137]
  - action: Fail
    containerName: "log-shipper"   # sidecar OOM is not worth retrying
    onConditions:
      - OOMKilled
```

When `containerName` is omitted, the rule matches against the first failed container (current behavior). When set, the rule only considers that container's exit code, termination reason, and message.
includeInitContainers: false (default) skips init containers. Set to true to also consider init container failures.
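One way the targeting semantics above might be implemented, as a sketch; the container-status shape and function name are assumptions, not Armada's actual types:

```go
package main

import "fmt"

// ContainerFailure is a simplified view of a terminated container.
type ContainerFailure struct {
	Name string
	Init bool // true for init containers
}

// selectContainer picks the container a rule should evaluate:
//   - containerName set: only that container is considered
//   - containerName empty: the first failed container (skipping init
//     containers unless includeInitContainers is true)
func selectContainer(failed []ContainerFailure, containerName string, includeInit bool) *ContainerFailure {
	for i := range failed {
		c := &failed[i]
		if c.Init && !includeInit {
			continue // includeInitContainers: false (default) skips init containers
		}
		if containerName == "" || c.Name == containerName {
			return c
		}
	}
	return nil // rule cannot match this failure at all
}

func main() {
	failed := []ContainerFailure{
		{Name: "init-fetch", Init: true},
		{Name: "log-shipper"},
		{Name: "main"},
	}
	fmt.Println(selectContainer(failed, "", false).Name)     // log-shipper (first non-init)
	fmt.Println(selectContainer(failed, "main", false).Name) // main
}
```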
Default Action
Each policy has a defaultAction that applies when no rule matches a failure:
```yaml
spec:
  defaultAction: Fail   # Fail (default) or Retry
  rules: [...]
```

Fail is the safe default: unrecognized failures stop the job. Operators running fault-tolerant batch work can set `defaultAction: Retry` to catch unexpected infrastructure failures without writing catch-all rules.
Exit Code Matching
```yaml
rules:
  - action: Retry
    onExitCodes:
      operator: In   # or NotIn
      values: [137, 143]
```

Exit code 0 is the proto3 default ("not set") and never matches any exit code rule. Combine with `containerName` to match a specific container's exit code.
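A sketch of the In/NotIn semantics including the exit-code-0 exclusion (illustrative, not the actual implementation):

```go
package main

import "fmt"

// matchExitCodes implements In/NotIn over a value list. Exit code 0 is the
// proto3 default ("not set") and never matches, regardless of operator.
func matchExitCodes(operator string, values []int32, exitCode int32) bool {
	if exitCode == 0 {
		return false
	}
	in := false
	for _, v := range values {
		if v == exitCode {
			in = true
			break
		}
	}
	if operator == "NotIn" {
		return !in
	}
	return in // operator == "In"
}

func main() {
	fmt.Println(matchExitCodes("In", []int32{137, 143}, 137)) // true
	fmt.Println(matchExitCodes("NotIn", []int32{137}, 1))     // true
	fmt.Println(matchExitCodes("NotIn", []int32{137}, 0))     // false: 0 never matches, even for NotIn
}
```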
Condition Matching
| Condition | When it's set |
|---|---|
| `OOMKilled` | Container exceeded its memory limit |
| `Evicted` | Pod evicted due to node pressure |
| `Preempted` | Pod preempted by the scheduler |
| `DeadlineExceeded` | Pod exceeded `activeDeadlineSeconds` |
| `Unschedulable` | Pod couldn't be scheduled |
```yaml
rules:
  - action: Retry
    onConditions:
      - OOMKilled
      - Evicted
```

Termination Message Matching
Jobs can signal retry intent by writing to /dev/termination-log.
```yaml
rules:
  - action: Retry
    containerName: "main"
    onTerminationMessage:
      pattern: ".*TRANSIENT.*"
```

`containerName` works the same as on other rule types. When omitted, all non-init container messages are checked.
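Matching presumably compiles the rule's regex and tests each candidate container's termination message; a minimal sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

// matchTerminationMessage reports whether any candidate message matches the
// rule's pattern. Compiling on every call is wasteful; a real implementation
// would compile once when the policy loads and reject invalid patterns at
// policy validation time.
func matchTerminationMessage(pattern string, messages []string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	for _, m := range messages {
		if re.MatchString(m) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, _ := matchTerminationMessage(".*TRANSIENT.*", []string{"TRANSIENT: lost NFS mount, safe to retry"})
	fmt.Println(ok) // true
}
```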
Error Category Matching
Uses categories from #4713. The executor classifies failures into named categories; retry policies can match on them.
```yaml
rules:
  - action: Retry
    onFailureCategory: [cuda_error, infiniband_error]
```

Per-Rule Overrides
Each rule can override policy-level retryLimit, antiAffinity, and backoff:
```yaml
rules:
  - action: Retry
    retryLimit: 3
    antiAffinity:
      mode: node
    backoff:
      initialDelay: 30s
      maxDelay: 5m
      multiplier: 2.0
    onConditions:
      - OOMKilled
```

Backoff
Exponential backoff between retries. Configured at three levels (global default, policy, rule) with the most specific taking precedence.
```yaml
# Policy-level with a rule-level override
spec:
  backoff:
    initialDelay: 10s
    maxDelay: 5m
    multiplier: 2.0
  rules:
    - action: Retry
      onConditions: [Evicted]
      backoff:
        initialDelay: 30s
        maxDelay: 10m
        multiplier: 3.0
```

Implementation:
- Add a `retryAfter` field to the Job (in-memory + DB column `retry_after`)
- On retry, compute delay: `min(initialDelay * multiplier^(failureCount-1), maxDelay)`
- Set `retryAfter = now + delay`
- Scheduling loop skips jobs where `retryAfter > now`
Anti-Affinity
Configurable per-policy and per-rule. No anti-affinity by default.
| Mode | Behavior |
|---|---|
| `none` | No avoidance (default) |
| `node` | Avoid the node where the most recent failed run executed |
Only the most recent failed run's node is avoided, not all previous nodes. This prevents accumulating constraints that make the job unschedulable after several retries.
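Under `mode: node` the constraint derives only from the latest failed run; a sketch (the run shape and helper name are assumptions):

```go
package main

import "fmt"

// Run is a simplified view of a job run.
type Run struct {
	Node   string
	Failed bool
}

// nodesToAvoid returns at most one node: the one that hosted the most recent
// failed run. Earlier failures are deliberately forgotten so constraints
// don't accumulate and make the job unschedulable after several retries.
func nodesToAvoid(mode string, runs []Run) []string {
	if mode != "node" {
		return nil
	}
	for i := len(runs) - 1; i >= 0; i-- {
		if runs[i].Failed {
			return []string{runs[i].Node}
		}
	}
	return nil
}

func main() {
	runs := []Run{{Node: "node-a", Failed: true}, {Node: "node-b", Failed: true}}
	fmt.Println(nodesToAvoid("node", runs)) // [node-b] — only the latest failure's node
	fmt.Println(nodesToAvoid("none", runs)) // []
}
```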
```yaml
spec:
  antiAffinity:
    mode: none
  rules:
    - action: Retry
      onConditions: [Preempted, Evicted]
      antiAffinity:
        mode: node   # avoid same node for infra failures
    - action: Retry
      onConditions: [OOMKilled]
      # inherits policy-level: none
```

Pod Naming
When `retryPolicy.enabled: true`:

- First run (`runIndex=0`): `armada-<jobId>-0` (legacy format)
- Retry runs: `armada-<jobId>-0-<runIndex>`, e.g. `armada-<jobId>-0-1`
When disabled, all runs use legacy format. Services and ingresses follow the same pattern.
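The scheme is simple enough to sketch directly (a hypothetical helper, not Armada's actual naming code):

```go
package main

import "fmt"

// podName builds the pod name for a given run attempt. With the feature
// disabled, or for the first run, the legacy "armada-<jobId>-0" format is
// kept so existing tooling and dashboards keep working.
func podName(jobId string, runIndex int, retryPolicyEnabled bool) string {
	if !retryPolicyEnabled || runIndex == 0 {
		return fmt.Sprintf("armada-%s-0", jobId)
	}
	return fmt.Sprintf("armada-%s-0-%d", jobId, runIndex)
}

func main() {
	fmt.Println(podName("01hxyz", 0, true))  // armada-01hxyz-0 (legacy format)
	fmt.Println(podName("01hxyz", 1, true))  // armada-01hxyz-0-1
	fmt.Println(podName("01hxyz", 1, false)) // armada-01hxyz-0 (feature disabled)
}
```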
Graceful Shutdown on Retry
When a preempted job is retried, the previous run must have terminated first. Otherwise both old and new pods run simultaneously, potentially corrupting shared state (GPU memory, file locks, checkpoints).
The race: the scheduler marks a run as preempted and requeues immediately, but the old pod is still in graceful shutdown on the executor.
Solution: set retryAfter = now + terminationGracePeriodSeconds. If the executor confirms the old run terminated before the timer expires, clear retryAfter immediately. The retry becomes eligible at whichever comes first: termination confirmation or timer expiry.
This shares the retryAfter mechanism with backoff. For preemption retries, the effective delay is max(backoffDelay, terminationGracePeriodSeconds), reduced once termination is confirmed.
Breaking Changes
**Pod and service naming:** Retry runs get `armada-<jobId>-0-<runIndex>`. The first run keeps the legacy name. Disabling the feature restores legacy naming entirely.

**Gang scheduling:** Retries create multiple "generations" of gang members. The scheduler filters by generation, keyed by `gangGenerationKey{gangId, NumAttempts}`, when counting active gang jobs to avoid double-counting.

**Queue ordering:** Requeued jobs are scheduled before new jobs at the same priority. Under high retry rates this could starve new submissions.

**JobRequeued event:** A new event type not yet exposed via the public API. Consumers that exhaustively match on event types will need updating.
Blockers and Related
- Error Categorization #4713: the failure classification system that retry policies build on. `onFailureCategory` references categories defined there.
- Continues work from Native support for preemption retries #4340 by @Sovietaced.