Wire error categories into executor event reporting by dejanzele · Pull Request #4745 · armadaproject/armada

dejanzele · 2026-03-06T14:12:39Z

What type of PR is this?

This is the second PR in the error categorization series. It wires the classifier and FailureInfo into the executor's event reporting path.

What this PR does / why we need it

Constructs the Classifier from config at executor startup and passes it to the pod issue handler and job state reporter.
Calls ExtractFailureInfo() + classifier.Classify() on every pod failure, attaching structured FailureInfo to the Error events sent through Pulsar.
After this PR, every failed pod event flowing into Pulsar carries condition, exit code, termination message, and matched category names.

Which issue(s) this PR fixes

Part of #4713 (Error Categorization) and #4683 (Native support for retry policies)

Special notes for your reviewer

This is PR 2 of 3: Proto + classifier (Add error categorization proto schema and executor classifier #4741) -> Wire into executor (this) -> Store in lookout DB (Store FailureInfo in lookout DB and expose via API #4755)
Depends on Add error categorization proto schema and executor classifier #4741
The classifier is nil-safe: when no errorCategories are configured, FailureInfo is still populated (condition + exit code) but categories will be empty
Changes are in application.go (startup wiring), pod_issue_handler.go, and job_state_reporter.go

greptile-apps · 2026-03-09T13:08:45Z

Greptile Summary

This PR wires a new Classifier (introduced in PR #4741) into the executor's event reporting path, so that every failed pod event sent to Pulsar now carries structured FailureInfo — condition, exit code, termination message, and matched category names. The Classifier is constructed from config at startup and passed to both PodIssueHandler and JobStateReporter. A new ExtractFailureInfo helper in pod_status.go builds the FailureInfo proto, mapping Kubernetes signals to the FailureCondition enum via a new mapReasonToCondition function (which also performs preemption detection via isPreempted).

Key points:

The new errormatch and categorizer packages are well-structured with full upfront validation and comprehensive tests.
The Classifier and Classify are nil-safe throughout, so operators who don't configure errorCategories get a zero-overhead code path.
One coverage gap: RunPreemptedProcessor.reportPodPreempted (preempt_runs.go) emits a JobFailed event with nil failureInfo — preempted pods via this path won't carry structured metadata, contradicting the PR's stated goal of populating FailureInfo on every failed pod event.
Several known design trade-offs (noted in prior review threads) remain open: PodCheckRetryable is always false at current call sites; FailureInfo.Condition and PodError.KubernetesReason can diverge for preempted pods in the PodFailed branch; and ExitCode in FailureInfo may not correspond to the container that triggered a category match in multi-container pods.

Confidence Score: 4/5

Safe to merge with the understanding that preempted pods via RunPreemptedProcessor will not carry FailureInfo in their failed events.
The core wiring is correct and nil-safe throughout. New packages (errormatch, categorizer) are well-validated and thoroughly tested. The main concern is a coverage gap in preempt_runs.go where nil failureInfo is passed, meaning preempted pods on that code path won't have structured metadata in their failed events — inconsistent with the PR's stated goal but not a runtime crash risk. Pre-existing design trade-offs flagged in prior threads remain open but are acknowledged by the author.
internal/executor/job/processors/preempt_runs.go — passes nil failureInfo to CreateSimpleJobFailedEvent, leaving preempted pods without structured metadata.

Important Files Changed

Filename	Overview
internal/executor/job/processors/preempt_runs.go	Updated `CreateSimpleJobFailedEvent` call signature but passes `nil` failureInfo, leaving preempted pods without structured failure metadata in their failed events — inconsistent with PR goal of populating FailureInfo on every failed pod event.
internal/executor/categorizer/classifier.go	New package: well-structured `Classifier` with full upfront config validation (unknown conditions, invalid operators, empty values, bad regexes, duplicate names). Nil receiver on `Classify` is safe. All matchers (conditions, exit codes, termination messages) are correctly OR'd within categories.
internal/executor/util/pod_status.go	Adds `ExtractFailureInfo`, `mapReasonToCondition`, and `isPreempted`. Correctly nil-guards the pod. `ExitCode` and `TerminationMessage` come from the first failed container in `GetPodContainerStatuses` (regular before init), which can diverge from whichever container triggered a category match in the classifier (already flagged in a prior review thread).
internal/executor/reporter/event.go	Threads `*categorizer.Classifier` through `CreateEventForCurrentState`, `CreateSimpleJobFailedEvent`, and `CreateJobFailedEvent`. Nil classifier is handled safely. `FailureInfo` is now attached to `Error` messages for the `PodFailed` path.
internal/executor/service/pod_issue_handler.go	Wires classifier into `PodIssueHandler`; calls `ExtractFailureInfo` + `classifier.Classify` in `handleNonRetryableJobIssue`. Nil classifier is safe via nil-receiver check in `Classify`. Only the non-retryable issue path emits `FailureInfo`.
internal/executor/application.go	Constructs classifier from config at startup only when `ErrorCategories` is non-empty; passes it to both `PodIssueHandler` and `JobStateReporter`. Error is fatal at startup. Clean startup wiring.

Sequence Diagram

sequenceDiagram
    participant App as application.go
    participant Cfg as Config<br/>(ErrorCategories)
    participant C as categorizer.Classifier
    participant PIH as PodIssueHandler
    participant JSR as JobStateReporter
    participant Ev as reporter/event.go
    participant PS as util/pod_status.go
    participant Pulsar as Pulsar (EventSequence)

    App->>Cfg: read ErrorCategories
    App->>C: NewClassifier(ErrorCategories)
    App->>PIH: NewPodIssuerHandler(..., classifier)
    App->>JSR: NewJobStateReporter(..., classifier)

    Note over PIH: handleNonRetryableJobIssue
    PIH->>C: classifier.Classify(pod)
    C-->>PIH: []string{categories}
    PIH->>PS: ExtractFailureInfo(pod, retryable, msg, categories)
    PS-->>PIH: *FailureInfo
    PIH->>Ev: CreateSimpleJobFailedEvent(..., failureInfo)
    Ev-->>Pulsar: Error{PodError, FailureInfo}

    Note over JSR: reportCurrentStatus (PodFailed)
    JSR->>Ev: CreateEventForCurrentState(pod, clusterId, classifier)
    Ev->>C: classifier.Classify(pod)
    C-->>Ev: []string{categories}
    Ev->>PS: ExtractFailureInfo(pod, false, "", categories)
    PS-->>Ev: *FailureInfo
    Ev-->>Pulsar: Error{PodError, FailureInfo}

Comments Outside Diff (3)

internal/executor/job/processors/preempt_runs.go, line 95 (link)

Preempted pods emit JobFailed with nil FailureInfo

reportPodPreempted creates a JobFailed event but passes nil for failureInfo, so preempted pods reported through this path will have no structured failure metadata in Pulsar. This is inconsistent with the PR's stated goal: "every failed pod event flowing into Pulsar carries condition, exit code, termination message, and matched category names."

The isPreempted helper was added specifically to detect this condition — ExtractFailureInfo on this pod would return FAILURE_CONDITION_PREEMPTED. Passing nil here means downstream consumers (e.g., the Lookout DB writer in PR 3) will see a JobFailed event with no FailureInfo for preempted pods that go through this code path.

Consider constructing failureInfo the same way handleNonRetryableJobIssue does:
```
failureInfo := util.ExtractFailureInfo(pod, false, "Run preempted", nil /* no categories for preemption */)
failedEvent, err := reporter.CreateSimpleJobFailedEvent(pod, "Run preempted", "", j.clusterContext.GetClusterId(), armadaevents.KubernetesReason_AppError, failureInfo)
```
internal/executor/categorizer/classifier.go, line 30-49 (link)

NotIn with empty Values always matches — but only partially validated

The ExitCodeOperatorNotIn branch only validates that the Values slice is non-empty at construction time via buildRule. However, if the validation path is somehow bypassed (e.g., a Classifier is constructed directly without NewClassifier), an empty NotIn matcher would silently match every non-zero exit code:
```
case ExitCodeOperatorNotIn:
    for _, v := range matcher.Values {  // empty loop, never runs
        if exitCode == v {
            return false
        }
    }
    return true  // always true for any non-zero exit code
```
The NewClassifier validation prevents this today, but a defensive check in MatchExitCode itself would make the function safer in isolation. Consider adding:
internal/executor/job/processors/preempt_runs.go, line 895 (link)

nil FailureInfo leaves preempted pods without structured metadata

nil is passed for failureInfo here, meaning the Error event emitted for a preempted pod via RunPreemptedProcessor will never have a FailureInfo field — no exit code, no condition, and no categories. This contradicts the PR description's claim that "every failed pod event flowing into Pulsar carries condition, exit code, termination message, and matched category names."

Even without classifier access in this package, calling util.ExtractFailureInfo(pod, false, "", nil) would at minimum populate Condition = FAILURE_CONDITION_PREEMPTED (since mapReasonToCondition calls isPreempted) along with the exit code and termination message from the container statuses. Downstream consumers such as the retry-policy layer in PR 3 that read FailureInfo to make retry decisions will see a nil struct for all pods preempted through this code path.

If the intent is to rely on the companion JobPreempted event for condition detection and intentionally omit FailureInfo from the failed event, that design decision should be documented clearly (e.g., a comment above the failedEvent creation noting that FailureInfo is deliberately omitted because the PreemptedEvent carries the canonical metadata for this path).

_{Last reviewed commit: 5ab3dda}

internal/executor/util/pod_status.go

internal/executor/reporter/event.go

internal/executor/util/pod_status.go

dejanzele · 2026-03-10T12:46:50Z

@greptile

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele changed the title ~~Wire error categories executor~~ Wire error categories into executor event reporting Mar 6, 2026

dejanzele force-pushed the wire-error-categories-executor branch 2 times, most recently from 5a399ca to 8a49600 Compare March 9, 2026 10:11

dejanzele mentioned this pull request Mar 9, 2026

Store FailureInfo in lookout DB and expose via API #4755

Open

dejanzele force-pushed the wire-error-categories-executor branch from 8a49600 to b828caf Compare March 9, 2026 10:21

dejanzele mentioned this pull request Mar 9, 2026

Add error categorization proto schema and executor classifier #4741

Open

dejanzele force-pushed the wire-error-categories-executor branch from b828caf to 30aa1ca Compare March 9, 2026 13:03

greptile-apps bot reviewed Mar 9, 2026

View reviewed changes

internal/executor/util/pod_status.go Show resolved Hide resolved

dejanzele force-pushed the wire-error-categories-executor branch from 30aa1ca to ef4e22d Compare March 9, 2026 13:20

greptile-apps bot reviewed Mar 9, 2026

View reviewed changes

internal/executor/reporter/event.go Show resolved Hide resolved

internal/executor/util/pod_status.go Show resolved Hide resolved

dejanzele force-pushed the wire-error-categories-executor branch 2 times, most recently from 3d469d0 to 20f2fd3 Compare March 10, 2026 12:34

dejanzele force-pushed the wire-error-categories-executor branch from 20f2fd3 to 71d2cd3 Compare March 10, 2026 16:51

dejanzele mentioned this pull request Mar 11, 2026

Add e2e test cases for error categorization #4760

Open

dejanzele added 3 commits March 11, 2026 16:31

Add FailureCondition enum and FailureInfo message to Error proto

f40fa51

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

Add executor error category classifier and ExtractFailureInfo

67989ad

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

Wire error categories into executor event reporting

5ab3dda

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele force-pushed the wire-error-categories-executor branch from 71d2cd3 to 5ab3dda Compare March 11, 2026 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wire error categories into executor event reporting#4745

Wire error categories into executor event reporting#4745
dejanzele wants to merge 3 commits intoarmadaproject:masterfrom
dejanzele:wire-error-categories-executor

dejanzele commented Mar 6, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 9, 2026 •

edited

Loading

Comments Outside Diff (3)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dejanzele commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dejanzele commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it

Which issue(s) this PR fixes

Special notes for your reviewer

Uh oh!

greptile-apps bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (3)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dejanzele commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dejanzele commented Mar 6, 2026 •

edited

Loading

greptile-apps bot commented Mar 9, 2026 •

edited

Loading