Skip to content

Add error categorization proto schema and executor classifier#4741

Open
dejanzele wants to merge 2 commits intoarmadaproject:masterfrom
dejanzele:error-categorization-proto
Open

Add error categorization proto schema and executor classifier#4741
dejanzele wants to merge 2 commits intoarmadaproject:masterfrom
dejanzele:error-categorization-proto

Conversation

@dejanzele
Copy link
Member

@dejanzele dejanzele commented Mar 5, 2026

What type of PR is this?

This is the first PR in a series that adds error categorization to Armada. It introduces the proto schema and shared building blocks (executor classifier, failure info extraction).

What this PR does / why we need it

  • Adds FailureCondition enum and FailureInfo message to the Error proto, allowing structured failure metadata (condition, exit code, termination message, categories) to flow through Pulsar events.
  • Adds the categorizer package in the executor: a configurable classifier that matches pod failures against rules (exit codes, termination messages, Kubernetes conditions like OOMKilled/Evicted) and assigns named categories.
  • Adds ExtractFailureInfo() in pod_status.go to pull structured failure signals from Kubernetes pod status into the FailureInfo proto.
  • Adds ErrorCategories config field under ApplicationConfiguration for defining category rules.

Nothing is wired yet - this PR provides the building blocks that PR #4745 connects.

Which issue(s) this PR fixes

Part of #4713 (Error Categorization) and #4683 (Native support for retry policies)

Special notes for your reviewer

@dejanzele dejanzele changed the title Add FailureCondition enum and FailureInfo message to Error proto Add error categorization proto schema and executor classifier Mar 5, 2026
@dejanzele dejanzele force-pushed the error-categorization-proto branch 4 times, most recently from 0303967 to 8d209d5 Compare March 6, 2026 13:43
@dejanzele dejanzele force-pushed the error-categorization-proto branch 2 times, most recently from 371113a to da1d029 Compare March 9, 2026 10:21
@dejanzele dejanzele force-pushed the error-categorization-proto branch from da1d029 to 052ced0 Compare March 9, 2026 13:03
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR introduces the foundational building blocks for error categorization in Armada: a FailureCondition enum and FailureInfo proto message on Error events, a configurable Classifier in the executor that matches pod failures against named category rules (exit codes, termination messages, Kubernetes conditions), and ExtractFailureInfo() to bridge Kubernetes pod status into the new proto. Nothing is wired end-to-end yet; this is PR 1 of 3.

Key observations:

  • The errormatch and categorizer packages are well-structured, tested, and handle edge cases correctly (nil guards, exit-code-0 filtering, upfront config validation).
  • mapReasonToCondition in pod_status.go maps KubernetesReason_NodeLost, _Unknown, _Unschedulable, and _PodUnschedulable to FAILURE_CONDITION_USER_ERROR via the default branch. These are infrastructure-side failures, not user errors; the misclassification could cause incorrect retry-policy decisions in PR Wire error categories into executor event reporting #4745.
  • buildRule in classifier.go does not reject an empty OnTerminationMessage.Pattern; regexp.Compile("") succeeds and creates a match-all rule, silently miscategorizing any pod that produces a termination message.
  • The exit_code proto field remains a plain int32, preserving the previously noted proto3 zero-value ambiguity for eviction/deadline pods (no failing container → field is indistinguishable from unset).
  • GetPodContainerStatuses returns regular containers before init containers, so ExtractFailureInfo may report a stale regular-container exit code rather than a current init-container failure (see prior review thread).

Confidence Score: 3/5

  • Safe to merge as an unwired building block, but the NodeLost→USER_ERROR misclassification should be fixed before PR Wire error categories into executor event reporting #4745 wires this into retry logic.
  • The new packages are well-tested and the classifier logic is sound. However, mapReasonToCondition contains a logic error where infrastructure failures (NodeLost, Unknown) are labelled USER_ERROR, which will propagate incorrect retry decisions once wired. The empty-pattern validation gap is a misconfiguration risk. Two issues from prior threads (init-container ordering, proto3 exit_code ambiguity) remain unaddressed.
  • internal/executor/util/pod_status.go (NodeLost mapping) and internal/executor/categorizer/classifier.go (empty pattern validation).

Important Files Changed

Filename Overview
internal/executor/categorizer/classifier.go Core classifier implementation is sound (exit-code-0 filtering, empty-rules validation, regex compilation, nil-safe Classify), but an empty OnTerminationMessage.Pattern bypasses validation and creates a match-all rule.
internal/executor/categorizer/classifier_test.go Comprehensive table-driven tests covering OOM, exit-code In/NotIn, termination message regex, multi-category matching, init containers, eviction, deadline, OR semantics, exit-code-0 skip, and all validation error paths.
internal/executor/util/pod_status.go Adds ExtractFailureInfo and mapReasonToCondition; however the default branch of mapReasonToCondition incorrectly maps NodeLost, Unknown, Unschedulable, and PodUnschedulable to FAILURE_CONDITION_USER_ERROR, which are infrastructure failures rather than user errors.
internal/executor/util/pod_status_test.go Good coverage of ExtractFailureInfo: OOM, eviction, deadline exceeded, custom error, nil pod, preempted, and preempted+OOM. Missing a test for the NodeLost/Unknown condition mapping.
pkg/armadaevents/events.proto Adds FailureCondition enum and FailureInfo message to Error; enum values use explicit numbering. The exit_code field is a plain int32, making it impossible to distinguish an unset field from a container that happened to exit with code 0 (e.g., eviction/deadline scenarios).

Sequence Diagram

sequenceDiagram
    participant Executor
    participant Classifier
    participant pod_status
    participant Proto as Pulsar Event

    Executor->>Classifier: NewClassifier(config.ErrorCategories)
    Note over Classifier: Validates rules, compiles regexes

    Executor->>Classifier: Classify(pod)
    Classifier->>Classifier: failedContainers(pod)<br/>(skip exitCode==0)
    Classifier->>Classifier: categoryMatches() → ruleMatches()<br/>onConditions | onExitCodes | onTerminationMessage
    Classifier-->>Executor: []string{categoryNames}

    Executor->>pod_status: ExtractFailureInfo(pod, retryable, msg, categories)
    pod_status->>pod_status: isPreempted(pod)?
    pod_status->>pod_status: ExtractPodFailureCause(pod) → mapReasonToCondition()
    pod_status->>pod_status: GetPodContainerStatuses(pod)<br/>ContainerStatuses + InitContainerStatuses<br/>(first non-zero exit code wins)
    pod_status-->>Executor: FailureInfo{ExitCode, Condition, Categories, ...}

    Executor->>Proto: Error{FailureInfo: ...}
    Proto-->>Executor: Published to Pulsar
Loading

Comments Outside Diff (2)

  1. internal/executor/util/pod_status.go, line 147-156 (link)

    NodeLost/Unknown incorrectly mapped to FAILURE_CONDITION_USER_ERROR

    The default branch maps all unhandled KubernetesReason values — including KubernetesReason_NodeLost, KubernetesReason_Unknown, KubernetesReason_Unschedulable, and KubernetesReason_PodUnschedulable — to FAILURE_CONDITION_USER_ERROR. Node-loss and unknown failures are infrastructure-side events, not user errors. This misclassification will flow into the retry-policy engine in PR Wire error categories into executor event reporting #4745 and could incorrectly suppress retries for infra-side failures while treating them as user mistakes in Lookout.

    KubernetesReason_AppError is the only value that should truly fall through to USER_ERROR. The others need explicit handling (or at minimum a separate "infrastructure error" condition).

    switch ExtractPodFailureCause(pod) {
    case armadaevents.KubernetesReason_Evicted:
        return armadaevents.FailureCondition_FAILURE_CONDITION_EVICTED
    case armadaevents.KubernetesReason_OOM:
        return armadaevents.FailureCondition_FAILURE_CONDITION_OOM_KILLED
    case armadaevents.KubernetesReason_DeadlineExceeded:
        return armadaevents.FailureCondition_FAILURE_CONDITION_DEADLINE_EXCEEDED
    case armadaevents.KubernetesReason_AppError:
        return armadaevents.FailureCondition_FAILURE_CONDITION_USER_ERROR
    default:
        // NodeLost, Unknown, Unschedulable, PodUnschedulable — not user errors
        return armadaevents.FailureCondition_FAILURE_CONDITION_UNSPECIFIED
    }

    Alternatively, add a FAILURE_CONDITION_INFRASTRUCTURE_ERROR enum value for these cases so that downstream consumers can distinguish them from user errors.

  2. internal/executor/categorizer/classifier.go, line 94-101 (link)

    Empty OnTerminationMessage pattern silently becomes a catch-all

    buildRule validates that cfg.OnTerminationMessage != nil, but does not check that cfg.OnTerminationMessage.Pattern is non-empty. regexp.Compile("") succeeds and produces a regex that matches every string. Combined with the value != "" guard in MatchPattern, this means an empty pattern would match any pod that has any non-empty termination message — silently creating a catch-all category that is almost certainly a misconfiguration.

    Consider adding a guard before compilation:

    if cfg.OnTerminationMessage != nil {
        if cfg.OnTerminationMessage.Pattern == "" {
            return rule{}, fmt.Errorf("termination message matcher requires a non-empty pattern")
        }
        re, err := regexp.Compile(cfg.OnTerminationMessage.Pattern)
        if err != nil {
            return rule{}, fmt.Errorf("invalid regex %q: %w", cfg.OnTerminationMessage.Pattern, err)
        }
        compiledRegex = re
    }

Last reviewed commit: 67989ad

@dejanzele dejanzele force-pushed the error-categorization-proto branch from 052ced0 to 7af38d0 Compare March 9, 2026 13:19
@dejanzele dejanzele force-pushed the error-categorization-proto branch 2 times, most recently from 22d10e0 to f8a4e3e Compare March 10, 2026 12:34
@dejanzele
Copy link
Member Author

@greptile

@dejanzele dejanzele force-pushed the error-categorization-proto branch from f8a4e3e to def537a Compare March 10, 2026 16:50
@dejanzele
Copy link
Member Author

@greptile

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the error-categorization-proto branch from def537a to 67989ad Compare March 11, 2026 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant