Skip to content

Wire error categories into executor event reporting#4745

Open
dejanzele wants to merge 3 commits intoarmadaproject:masterfrom
dejanzele:wire-error-categories-executor
Open

Wire error categories into executor event reporting#4745
dejanzele wants to merge 3 commits intoarmadaproject:masterfrom
dejanzele:wire-error-categories-executor

Conversation

@dejanzele
Copy link
Member

@dejanzele dejanzele commented Mar 6, 2026

What type of PR is this?

This is the second PR in the error categorization series. It wires the classifier and FailureInfo into the executor's event reporting path.

What this PR does / why we need it

  • Constructs the Classifier from config at executor startup and passes it to the pod issue handler and job state reporter.
  • Calls ExtractFailureInfo() + classifier.Classify() on every pod failure, attaching structured FailureInfo to the Error events sent through Pulsar.
  • After this PR, every failed pod event flowing into Pulsar carries condition, exit code, termination message, and matched category names.

Which issue(s) this PR fixes

Part of #4713 (Error Categorization) and #4683 (Native support for retry policies)

Special notes for your reviewer

@dejanzele dejanzele changed the title Wire error categories executor Wire error categories into executor event reporting Mar 6, 2026
@dejanzele dejanzele force-pushed the wire-error-categories-executor branch 2 times, most recently from 5a399ca to 8a49600 Compare March 9, 2026 10:11
@dejanzele dejanzele force-pushed the wire-error-categories-executor branch from 8a49600 to b828caf Compare March 9, 2026 10:21
@dejanzele dejanzele force-pushed the wire-error-categories-executor branch from b828caf to 30aa1ca Compare March 9, 2026 13:03
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR wires a new Classifier (introduced in PR #4741) into the executor's event reporting path, so that every failed pod event sent to Pulsar now carries structured FailureInfo — condition, exit code, termination message, and matched category names. The Classifier is constructed from config at startup and passed to both PodIssueHandler and JobStateReporter. A new ExtractFailureInfo helper in pod_status.go builds the FailureInfo proto, mapping Kubernetes signals to the FailureCondition enum via a new mapReasonToCondition function (which also performs preemption detection via isPreempted).

Key points:

  • The new errormatch and categorizer packages are well-structured with full upfront validation and comprehensive tests.
  • The Classifier and Classify are nil-safe throughout, so operators who don't configure errorCategories get a zero-overhead code path.
  • One coverage gap: RunPreemptedProcessor.reportPodPreempted (preempt_runs.go) emits a JobFailed event with nil failureInfo — preempted pods via this path won't carry structured metadata, contradicting the PR's stated goal of populating FailureInfo on every failed pod event.
  • Several known design trade-offs (noted in prior review threads) remain open: PodCheckRetryable is always false at current call sites; FailureInfo.Condition and PodError.KubernetesReason can diverge for preempted pods in the PodFailed branch; and ExitCode in FailureInfo may not correspond to the container that triggered a category match in multi-container pods.

Confidence Score: 4/5

  • Safe to merge with the understanding that preempted pods via RunPreemptedProcessor will not carry FailureInfo in their failed events.
  • The core wiring is correct and nil-safe throughout. New packages (errormatch, categorizer) are well-validated and thoroughly tested. The main concern is a coverage gap in preempt_runs.go where nil failureInfo is passed, meaning preempted pods on that code path won't have structured metadata in their failed events — inconsistent with the PR's stated goal but not a runtime crash risk. Pre-existing design trade-offs flagged in prior threads remain open but are acknowledged by the author.
  • internal/executor/job/processors/preempt_runs.go — passes nil failureInfo to CreateSimpleJobFailedEvent, leaving preempted pods without structured metadata.

Important Files Changed

Filename Overview
internal/executor/job/processors/preempt_runs.go Updated CreateSimpleJobFailedEvent call signature but passes nil failureInfo, leaving preempted pods without structured failure metadata in their failed events — inconsistent with PR goal of populating FailureInfo on every failed pod event.
internal/executor/categorizer/classifier.go New package: well-structured Classifier with full upfront config validation (unknown conditions, invalid operators, empty values, bad regexes, duplicate names). Nil receiver on Classify is safe. All matchers (conditions, exit codes, termination messages) are correctly OR'd within categories.
internal/executor/util/pod_status.go Adds ExtractFailureInfo, mapReasonToCondition, and isPreempted. Correctly nil-guards the pod. ExitCode and TerminationMessage come from the first failed container in GetPodContainerStatuses (regular before init), which can diverge from whichever container triggered a category match in the classifier (already flagged in a prior review thread).
internal/executor/reporter/event.go Threads *categorizer.Classifier through CreateEventForCurrentState, CreateSimpleJobFailedEvent, and CreateJobFailedEvent. Nil classifier is handled safely. FailureInfo is now attached to Error messages for the PodFailed path.
internal/executor/service/pod_issue_handler.go Wires classifier into PodIssueHandler; calls ExtractFailureInfo + classifier.Classify in handleNonRetryableJobIssue. Nil classifier is safe via nil-receiver check in Classify. Only the non-retryable issue path emits FailureInfo.
internal/executor/application.go Constructs classifier from config at startup only when ErrorCategories is non-empty; passes it to both PodIssueHandler and JobStateReporter. Error is fatal at startup. Clean startup wiring.

Sequence Diagram

sequenceDiagram
    participant App as application.go
    participant Cfg as Config<br/>(ErrorCategories)
    participant C as categorizer.Classifier
    participant PIH as PodIssueHandler
    participant JSR as JobStateReporter
    participant Ev as reporter/event.go
    participant PS as util/pod_status.go
    participant Pulsar as Pulsar (EventSequence)

    App->>Cfg: read ErrorCategories
    App->>C: NewClassifier(ErrorCategories)
    App->>PIH: NewPodIssuerHandler(..., classifier)
    App->>JSR: NewJobStateReporter(..., classifier)

    Note over PIH: handleNonRetryableJobIssue
    PIH->>C: classifier.Classify(pod)
    C-->>PIH: []string{categories}
    PIH->>PS: ExtractFailureInfo(pod, retryable, msg, categories)
    PS-->>PIH: *FailureInfo
    PIH->>Ev: CreateSimpleJobFailedEvent(..., failureInfo)
    Ev-->>Pulsar: Error{PodError, FailureInfo}

    Note over JSR: reportCurrentStatus (PodFailed)
    JSR->>Ev: CreateEventForCurrentState(pod, clusterId, classifier)
    Ev->>C: classifier.Classify(pod)
    C-->>Ev: []string{categories}
    Ev->>PS: ExtractFailureInfo(pod, false, "", categories)
    PS-->>Ev: *FailureInfo
    Ev-->>Pulsar: Error{PodError, FailureInfo}
Loading

Comments Outside Diff (3)

  1. internal/executor/job/processors/preempt_runs.go, line 95 (link)

    Preempted pods emit JobFailed with nil FailureInfo

    reportPodPreempted creates a JobFailed event but passes nil for failureInfo, so preempted pods reported through this path will have no structured failure metadata in Pulsar. This is inconsistent with the PR's stated goal: "every failed pod event flowing into Pulsar carries condition, exit code, termination message, and matched category names."

    The isPreempted helper was added specifically to detect this condition — ExtractFailureInfo on this pod would return FAILURE_CONDITION_PREEMPTED. Passing nil here means downstream consumers (e.g., the Lookout DB writer in PR 3) will see a JobFailed event with no FailureInfo for preempted pods that go through this code path.

    Consider constructing failureInfo the same way handleNonRetryableJobIssue does:

    failureInfo := util.ExtractFailureInfo(pod, false, "Run preempted", nil /* no categories for preemption */)
    failedEvent, err := reporter.CreateSimpleJobFailedEvent(pod, "Run preempted", "", j.clusterContext.GetClusterId(), armadaevents.KubernetesReason_AppError, failureInfo)
  2. internal/executor/categorizer/classifier.go, line 30-49 (link)

    NotIn with empty Values always matches — but only partially validated

    The ExitCodeOperatorNotIn branch only validates that the Values slice is non-empty at construction time via buildRule. However, if the validation path is somehow bypassed (e.g., a Classifier is constructed directly without NewClassifier), an empty NotIn matcher would silently match every non-zero exit code:

    case ExitCodeOperatorNotIn:
        for _, v := range matcher.Values {  // empty loop, never runs
            if exitCode == v {
                return false
            }
        }
        return true  // always true for any non-zero exit code

    The NewClassifier validation prevents this today, but a defensive check in MatchExitCode itself would make the function safer in isolation. Consider adding:

  3. internal/executor/job/processors/preempt_runs.go, line 895 (link)

    nil FailureInfo leaves preempted pods without structured metadata

    nil is passed for failureInfo here, meaning the Error event emitted for a preempted pod via RunPreemptedProcessor will never have a FailureInfo field — no exit code, no condition, and no categories. This contradicts the PR description's claim that "every failed pod event flowing into Pulsar carries condition, exit code, termination message, and matched category names."

    Even without classifier access in this package, calling util.ExtractFailureInfo(pod, false, "", nil) would at minimum populate Condition = FAILURE_CONDITION_PREEMPTED (since mapReasonToCondition calls isPreempted) along with the exit code and termination message from the container statuses. Downstream consumers such as the retry-policy layer in PR 3 that read FailureInfo to make retry decisions will see a nil struct for all pods preempted through this code path.

    If the intent is to rely on the companion JobPreempted event for condition detection and intentionally omit FailureInfo from the failed event, that design decision should be documented clearly (e.g., a comment above the failedEvent creation noting that FailureInfo is deliberately omitted because the PreemptedEvent carries the canonical metadata for this path).

Last reviewed commit: 5ab3dda

@dejanzele dejanzele force-pushed the wire-error-categories-executor branch from 30aa1ca to ef4e22d Compare March 9, 2026 13:20
@dejanzele dejanzele force-pushed the wire-error-categories-executor branch 2 times, most recently from 3d469d0 to 20f2fd3 Compare March 10, 2026 12:34
@dejanzele
Copy link
Member Author

@greptile

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the wire-error-categories-executor branch from 71d2cd3 to 5ab3dda Compare March 11, 2026 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant