Add error categorization proto schema and executor classifier #4741
dejanzele wants to merge 2 commits into armadaproject:master
Greptile Summary
This PR introduces the foundational building blocks for error categorization in Armada.
Confidence Score: 3/5
Sequence Diagram
sequenceDiagram
participant Executor
participant Classifier
participant pod_status
participant Proto as Pulsar Event
Executor->>Classifier: NewClassifier(config.ErrorCategories)
Note over Classifier: Validates rules, compiles regexes
Executor->>Classifier: Classify(pod)
Classifier->>Classifier: failedContainers(pod)<br/>(skip exitCode==0)
Classifier->>Classifier: categoryMatches() → ruleMatches()<br/>onConditions | onExitCodes | onTerminationMessage
Classifier-->>Executor: []string{categoryNames}
Executor->>pod_status: ExtractFailureInfo(pod, retryable, msg, categories)
pod_status->>pod_status: isPreempted(pod)?
pod_status->>pod_status: ExtractPodFailureCause(pod) → mapReasonToCondition()
pod_status->>pod_status: GetPodContainerStatuses(pod)<br/>ContainerStatuses + InitContainerStatuses<br/>(first non-zero exit code wins)
pod_status-->>Executor: FailureInfo{ExitCode, Condition, Categories, ...}
Executor->>Proto: Error{FailureInfo: ...}
Proto-->>Executor: Published to Pulsar
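The classification half of the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Pod`, `ContainerState`, `Rule`, and `Category` types are hypothetical stand-ins for the Kubernetes and config types, and the matching semantics (a category matches if any of its rules matches, and a rule matches if any of its populated fields matches) are an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// ContainerState is a hypothetical, simplified stand-in for the Kubernetes
// container status fields a classifier would inspect.
type ContainerState struct {
	Name               string
	ExitCode           int32
	TerminationMessage string
}

// Pod is a hypothetical, simplified pod: one failure condition plus containers.
type Pod struct {
	Condition  string // e.g. "OOMKilled" or "Evicted"
	Containers []ContainerState
}

// Rule mirrors the three matchers named in the diagram. The field layout is
// an assumption made for this sketch.
type Rule struct {
	OnConditions         []string
	OnExitCodes          []int32
	OnTerminationMessage *regexp.Regexp // compiled once, up front
}

// Category is a named group of rules.
type Category struct {
	Name  string
	Rules []Rule
}

// failedContainers drops containers that exited with code 0.
func failedContainers(p Pod) []ContainerState {
	var failed []ContainerState
	for _, c := range p.Containers {
		if c.ExitCode != 0 {
			failed = append(failed, c)
		}
	}
	return failed
}

// ruleMatches reports whether any populated matcher in the rule fires.
func ruleMatches(r Rule, p Pod, failed []ContainerState) bool {
	for _, cond := range r.OnConditions {
		if cond == p.Condition {
			return true
		}
	}
	for _, c := range failed {
		for _, code := range r.OnExitCodes {
			if code == c.ExitCode {
				return true
			}
		}
		if r.OnTerminationMessage != nil && r.OnTerminationMessage.MatchString(c.TerminationMessage) {
			return true
		}
	}
	return false
}

// Classify returns the names of all categories with at least one matching rule.
func Classify(categories []Category, p Pod) []string {
	failed := failedContainers(p)
	var names []string
	for _, cat := range categories {
		for _, r := range cat.Rules {
			if ruleMatches(r, p, failed) {
				names = append(names, cat.Name)
				break
			}
		}
	}
	return names
}

func main() {
	categories := []Category{
		{Name: "oom", Rules: []Rule{{OnConditions: []string{"OOMKilled"}}}},
		{Name: "user-error", Rules: []Rule{{OnExitCodes: []int32{1, 2}}}},
		{Name: "cuda", Rules: []Rule{{OnTerminationMessage: regexp.MustCompile(`CUDA error`)}}},
	}
	pod := Pod{Containers: []ContainerState{
		// Exit code 0, so this container is ignored even though its message matches "cuda".
		{Name: "sidecar", ExitCode: 0, TerminationMessage: "CUDA error"},
		{Name: "main", ExitCode: 2, TerminationMessage: "bad input"},
	}}
	fmt.Println(Classify(categories, pod))
}
```

Compiling the regexes when the rules are built, rather than per pod, matches the "Validates rules, compiles regexes" note on `NewClassifier` in the diagram.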
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
What type of PR is this?
This is the first PR in a series that adds error categorization to Armada. It introduces the proto schema and shared building blocks (executor classifier, failure info extraction).
What this PR does / why we need it
- Adds the `FailureCondition` enum and `FailureInfo` message to the `Error` proto, allowing structured failure metadata (condition, exit code, termination message, categories) to flow through Pulsar events.
- Adds the `categorizer` package in the executor: a configurable classifier that matches pod failures against rules (exit codes, termination messages, Kubernetes conditions like OOMKilled/Evicted) and assigns named categories.
- Adds `ExtractFailureInfo()` in `pod_status.go` to pull structured failure signals from Kubernetes pod status into the `FailureInfo` proto.
- Adds the `ErrorCategories` config field under `ApplicationConfiguration` for defining category rules.

Nothing is wired yet - this PR provides the building blocks that PR #4745 connects.
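To make the failure-info extraction concrete, here is a rough sketch of the idea. It is not the code in `pod_status.go`: all types are simplified hypothetical stand-ins for the Kubernetes and proto types, the condition names returned by `mapReasonToCondition` are invented for illustration, and falling back from the pod-level reason to the failing container's reason is an assumption. Preemption handling is left out.

```go
package main

import "fmt"

// ContainerStatus is a hypothetical, simplified view of a terminated container.
type ContainerStatus struct {
	Name     string
	ExitCode int32
	Reason   string // e.g. "OOMKilled"
	Message  string // termination message
}

// PodStatus is a hypothetical, simplified view of a pod's status.
type PodStatus struct {
	Reason                string // pod-level reason, e.g. "Evicted"
	ContainerStatuses     []ContainerStatus
	InitContainerStatuses []ContainerStatus
}

// FailureInfo carries the fields listed in the PR description.
type FailureInfo struct {
	ExitCode           int32
	TerminationMessage string
	Condition          string
	Categories         []string
}

// mapReasonToCondition translates a Kubernetes reason string into a condition
// name. The names returned here are placeholders, not the real enum values.
func mapReasonToCondition(reason string) string {
	switch reason {
	case "OOMKilled":
		return "OOM_KILLED"
	case "Evicted":
		return "EVICTED"
	default:
		return "UNSPECIFIED"
	}
}

// extractFailureInfo scans regular containers, then init containers, and takes
// the first non-zero exit code it finds.
func extractFailureInfo(status PodStatus, categories []string) FailureInfo {
	// Copy into a fresh slice so the caller's slices are never modified.
	all := make([]ContainerStatus, 0, len(status.ContainerStatuses)+len(status.InitContainerStatuses))
	all = append(all, status.ContainerStatuses...)
	all = append(all, status.InitContainerStatuses...)

	info := FailureInfo{Categories: categories}
	reason := status.Reason
	for _, c := range all {
		if c.ExitCode != 0 {
			info.ExitCode = c.ExitCode
			info.TerminationMessage = c.Message
			if reason == "" {
				reason = c.Reason // assumed fallback when the pod has no reason
			}
			break // first non-zero exit code wins
		}
	}
	info.Condition = mapReasonToCondition(reason)
	return info
}

func main() {
	status := PodStatus{
		ContainerStatuses: []ContainerStatus{
			{Name: "sidecar", ExitCode: 0},
			{Name: "main", ExitCode: 137, Reason: "OOMKilled", Message: "out of memory"},
		},
		InitContainerStatuses: []ContainerStatus{{Name: "init", ExitCode: 0}},
	}
	fmt.Printf("%+v\n", extractFailureInfo(status, []string{"oom"}))
}
```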
Which issue(s) this PR fixes
Part of #4713 (Error Categorization) and #4683 (Native support for retry policies)
Special notes for your reviewer
- `categorizer` package has thorough `doc.go` explaining config format and validation
- `FailureCondition` enum uses explicit proto numbering to stay wire-compatible
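As a small illustration of why explicit numbering matters, the sketch below models an enum in Go whose values are pinned to fixed numbers instead of being derived from declaration order. The value names and numbers are hypothetical and are not taken from the proto in this PR. Because only the number travels on the wire, pinned values can be reordered or extended without changing what existing readers decode, and a reader that receives a number it does not know can still handle it.

```go
package main

import "fmt"

// FailureCondition is a hypothetical enum; the number is what is serialized.
type FailureCondition int32

// Values are pinned explicitly rather than generated with iota, so inserting
// or reordering entries cannot silently renumber existing ones.
const (
	ConditionUnspecified FailureCondition = 0
	ConditionOOMKilled   FailureCondition = 1
	ConditionEvicted     FailureCondition = 2
)

var conditionNames = map[FailureCondition]string{
	ConditionUnspecified: "UNSPECIFIED",
	ConditionOOMKilled:   "OOM_KILLED",
	ConditionEvicted:     "EVICTED",
}

// String returns the name for known values and a placeholder for numbers
// introduced by a newer writer that this reader does not know about.
func (c FailureCondition) String() string {
	if name, ok := conditionNames[c]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN(%d)", int32(c))
}

func main() {
	fmt.Println(ConditionEvicted)    // known value
	fmt.Println(FailureCondition(7)) // value from a hypothetical newer schema
}
```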