| 1 | +# ADR-027: Kubernetes Data Store (CRD) for HealthEvent |
| 2 | + |
| 3 | +Authors: Avinash Reddy Yeddula, Igor Velichkovich |
| 4 | + |
| 5 | +## Context |
| 6 | + |
| 7 | +NVSentinel currently persists and watches health events using MongoDB through a datastore client. While MongoDB provides rich query capabilities and watch semantics, reliance on an external database makes NVSentinel harder to deploy in Kubernetes-native or constrained environments and prevents exposing health events through standard Kubernetes APIs. |
| 8 | + |
| 9 | +A broader datastore abstraction and multi-backend strategy is defined in the NVSentinel Datastore Abstraction Design document. This ADR narrows the scope to the Kubernetes Custom Resource (CR) datastore option and formalizes the design of the Kubernetes HealthEvent API and its operational constraints. |
| 10 | + |
| 11 | +## Decision |
| 12 | + |
| 13 | +We will introduce a Kubernetes-native HealthEvent API and implement a Kubernetes Custom Resource–backed datastore for NVSentinel that supports core event ingestion, status updates, and watch-driven remediation workflows. |
| 14 | + |
| 15 | +In the initial phase, Kubernetes CRs are considered suitable only for NVSentinel’s core, watch-oriented workflows and not for analytics or historical querying. |
| 16 | +## Motivations |
| 17 | + |
| 18 | +- **Avoid new database dependencies:** Deploying NVSentinel currently requires provisioning and managing an external datastore (MongoDB, PostgreSQL, etc.). A Kubernetes CR backend allows operators to run NVSentinel without introducing additional database systems. |
| 19 | +- **Stay Kubernetes-native:** NVSentinel should leverage Kubernetes as the primary platform. CRs make health events first-class Kubernetes resources, observable through standard Kubernetes APIs and tools. |
| 20 | +- **Preserve multi-backend flexibility:** This design maintains the existing datastore abstraction, allowing operators to choose their preferred backend based on their needs. |
| 21 | +## Implementation |
| 22 | + |
| 23 | +### API Definition |
| 24 | + |
| 25 | +- API Group/Version: `nvsentinel.dgxc.nvidia.com/v1alpha1` |
| 26 | +- Kind: `HealthEvent` |
| 27 | +- `spec`: contains the immutable event data (the `HealthEvent` protobuf message) |
| 28 | +- `status`: tracks mutable workflow state (the `HealthEventStatus` type) |
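
To make the `spec`/`status` split concrete, here is a minimal sketch that assembles a `HealthEvent` object as a plain dictionary. Only the API group/version and kind come from this ADR. The helper `build_health_event`, the object name, and every field inside `spec` are hypothetical placeholders, not the real `HealthEvent` proto fields.

```python
import json

API_VERSION = "nvsentinel.dgxc.nvidia.com/v1alpha1"  # group/version from this ADR
KIND = "HealthEvent"


def build_health_event(name, spec, status=None):
    """Assemble a HealthEvent manifest as a plain dict (hypothetical helper).

    `spec` carries the immutable event data; `status` carries mutable
    workflow state and starts empty until a controller reports progress.
    """
    return {
        "apiVersion": API_VERSION,
        "kind": KIND,
        "metadata": {"name": name},
        "spec": dict(spec),
        "status": dict(status or {}),
    }


# The spec field names below are illustrative assumptions only.
example = build_health_event(
    "gpu-xid-node-a-0001",
    {"nodeName": "node-a", "checkName": "gpu-xid", "isFatal": True},
)
print(json.dumps(example, indent=2))
```

The printed JSON has the same top-level layout a manifest for this kind would have; the actual schema is whatever the generated CRD defines.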
| 29 | + |
| 30 | +### Directory Structure and CRD Generation |
| 31 | + |
| 32 | +- CRD generation strategy: we intend to use a proto-first generation approach (for example, [`protoc-gen-crd`](https://github.com/yandex/protoc-gen-crd)) to produce Kubernetes CRD YAML and the corresponding Go types for the `HealthEvent` `spec` directly from the `.proto` definitions. This keeps the authoritative shape of `spec` in the `.proto` files and reduces drift between proto, generated types, and handwritten types. |
| 33 | + |
| 34 | +- Spec is the `HealthEvent` proto: the `spec` of the `HealthEvent` CR should be the `HealthEvent` protobuf message. To keep generation coherent, the `status` subresource (`HealthEventStatus`) should also be defined in the `.proto` if it is expected to be generated into the CRD and accompanying types. |
| 35 | + |
| 36 | +- Move `HealthEventStatus` into proto: currently `HealthEventStatus` exists as a handwritten Go struct in `data-models/pkg/model/health_event_extentions.go`. For proto-first CRD generation, move the `HealthEventStatus` definition into the `.proto` file so the generator emits a single, consistent CRD with both `spec` and `status` schemas. If `HealthEventStatus` remains only a handwritten Go type, the `spec` types will be generated from proto while `status` stays handwritten. That split generally forces an extra wrapper or conversion layer to assemble a complete CR type for controllers. |
| 37 | + |
| 38 | + - Implementation note: after moving `HealthEventStatus` into the `.proto`, run the proto-to-CRD generator to produce the CRD and generated Go types. Existing conversion helpers should be adapted to map between the protobuf messages, generated CRD types, and any internal model types. The event flow and responsibilities (health monitors send events to platform-connectors, which persist to the configured backend) remain unchanged. |
| 39 | + |
| 40 | +- **Future consideration — API versioning:** Generating CRDs directly from proto objects means the proto types will eventually need to follow Kubernetes API versioning rules (e.g., storage versions, conversion webhooks, backward compatibility guarantees). This is acceptable for `v1alpha1` where we can iterate freely, but as the API stabilizes toward `v1`, we should plan for more formal API versioning practices. This does not block the current proto-first approach but should be part of the v1 graduation plan. |
| 41 | + |
| 42 | +> See **Alternatives Considered** for the approach where CRD structs and conversion tests are hand-managed in Go instead of generated from proto. |
| 43 | + |
| 44 | +### Operational Workflow |
| 45 | + |
| 46 | +**Event Flow and CRD Integration:** |
| 47 | + |
| 48 | +1. Kubernetes CR as a pluggable datastore backend: The Kubernetes CR backend is one datastore implementation option alongside MongoDB, PostgreSQL, and others. Operators select which backend to use via Helm configuration at deployment time. Only one backend is active per deployment. |
| 49 | + |
| 50 | +2. Responsibilities and data flow with Kubernetes CR backend: |
| 51 | + - **Health Monitors → Platform Connectors → Datastore:** Health monitors detect events and forward them over gRPC to the `platform-connectors` component, which validates them and persists them to the configured datastore backend. When the Kubernetes CR backend is selected, `platform-connectors` creates and updates `HealthEvent` CRs. This is the current data flow and responsibility model, and the CR backend does not change it: health monitors never write directly to the datastore. |
| 52 | + - **FaultRemediationReconciler and controllers:** Consume HealthEvent CRs via watches. Controllers use the CR `spec` as immutable event data and update the CR `status` subresource to reflect mutable workflow state (node quarantine status, pod eviction progress, remediation timestamps). |
| 53 | + |
| 54 | +3. Operators or additional controllers can act on the CR data to determine remediation actions independently, providing flexibility and separation between event ingestion and execution. |
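
The contract in step 2 (write `spec` once, then report progress only through `status`) can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not the store-client interface: the class `InMemoryHealthEventStore`, its method names, and the `quarantined` status field are all hypothetical, and nothing here talks to an API server.

```python
import copy


class SpecImmutableError(Exception):
    """Raised when a caller tries to change the spec of a stored event."""


class InMemoryHealthEventStore:
    """Toy stand-in for the CR backend (hypothetical, in-memory only)."""

    def __init__(self):
        self._events = {}

    def create(self, name, spec):
        # The ingestion side writes spec exactly once, at creation.
        if name in self._events:
            raise KeyError(f"HealthEvent {name!r} already exists")
        self._events[name] = {"spec": copy.deepcopy(spec), "status": {}}

    def update_spec(self, name, spec):
        # Any attempt to alter the stored spec is rejected.
        if spec != self._events[name]["spec"]:
            raise SpecImmutableError(name)

    def update_status(self, name, **fields):
        # Controllers record workflow progress through status only.
        self._events[name]["status"].update(fields)

    def get(self, name):
        # Return a copy so callers cannot mutate stored state by accident.
        return copy.deepcopy(self._events[name])
```

A controller-side call would look like `store.update_status("evt-1", quarantined=True)`, while `store.update_spec("evt-1", {...})` with different data raises `SpecImmutableError`.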
| 55 | + |
| 56 | +### Integration with Existing Tooling and Processes |
| 57 | + |
| 58 | +- **Kubernetes CR as a datastore backend:** The CR backend is a pluggable datastore option. Operators choose which backend to deploy via Helm configuration — either Kubernetes CRs or MongoDB (or another backend). Only one backend is active per deployment. |
| 59 | +- **Existing datastore abstraction pattern:** The multi-datastore pattern is already established in NVSentinel (e.g., PostgreSQL and MongoDB backends). The Kubernetes CR backend follows the same abstraction approach. Adding it requires implementing the datastore interface (see [#456](https://github.com/NVIDIA/NVSentinel/pull/456) as a reference for adding a new datastore backend). |
| 60 | +- **No MongoDB-specific changes:** The existing MongoDB-backed implementation remains unchanged and fully functional. |
| 61 | + - **Health Monitors and Controllers:** When using the Kubernetes CR backend, health monitors send events to `platform-connectors`, which creates CRs with an immutable `spec`. Controllers consume CRs via watches and update the CR `status` to track workflow state. When using MongoDB, the existing flow is unchanged. |
| 62 | +- **Existing remediation logic:** Log collection, state management, and remediation remain functionally the same regardless of which backend is deployed. |
| 63 | +- **Proto migration coordination:** Moving `HealthEventStatus` from `data-models/pkg/model/health_event_extentions.go` into the `.proto` definition requires regenerating the proto files and updating all components (health monitors, controllers, store-client, converters, etc.) that currently depend on the old struct. This change must be coordinated across all affected modules to ensure they consume the new generated types. |
| 64 | + |
| 65 | +### CR Cleanup / Garbage Collection |
| 66 | +To prevent uncontrolled growth of HealthEvent CRs and reduce load on the Kubernetes API server, the Kubernetes store implementation includes: |
| 67 | +- Optional TTL / age-based deletion: CRs older than a configured threshold can be removed automatically even if they are still unresolved, preventing indefinite growth. |
| 68 | +- Enforcement of configurable per-node or per-cluster CR limits – ensures the number of CRs remains manageable even during bursty failure periods or in large clusters. |
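
The two cleanup rules above can be sketched as a pure selection function that decides which CRs to delete. This is a minimal illustration, not the planned implementation: the function name, the parameter names, and the event keys (`name`, `node`, `created`) are assumptions, and only the per-node form of the limit is shown.

```python
from collections import defaultdict


def select_for_deletion(events, now, max_age_seconds=None, max_per_node=None):
    """Return the sorted names of events that should be deleted.

    `events` is a list of dicts with hypothetical keys:
    name (str), node (str), created (seconds since epoch).
    Both limits are optional; a limit of None disables that rule.
    """
    doomed = set()

    # Rule 1: age-based expiry, applied only when a TTL is configured.
    if max_age_seconds is not None:
        for ev in events:
            if now - ev["created"] > max_age_seconds:
                doomed.add(ev["name"])

    # Rule 2: per-node cap, evicting the oldest surviving events first.
    if max_per_node is not None:
        by_node = defaultdict(list)
        for ev in events:
            if ev["name"] not in doomed:
                by_node[ev["node"]].append(ev)
        for node_events in by_node.values():
            node_events.sort(key=lambda ev: ev["created"])
            excess = max(len(node_events) - max_per_node, 0)
            for ev in node_events[:excess]:
                doomed.add(ev["name"])

    return sorted(doomed)
```

Keeping the policy free of API calls makes it easy to unit test; a real cleanup loop would list the CRs, call a function like this one, and then issue the deletes.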
| 69 | + |
| 70 | +## Rationale |
| 71 | + |
| 72 | +- **Operational simplicity:** Using Kubernetes CRs leverages the existing control plane and standard Kubernetes mechanisms, eliminating the need for a separate database deployment and management. |
| 73 | +- **No additional dependencies:** Without a Kubernetes CR backend, consumers must deploy and manage an external datastore (MongoDB, PostgreSQL, etc.). Using Kubernetes CRs as the backing store removes this external dependency burden. |
| 74 | +- **Native observability and integration:** CRs can be watched, queried (to a limited extent), and annotated using Kubernetes-native APIs, making integration with controllers, dashboards, and CI/CD pipelines easier. |
| 75 | +- **Developer experience and flexibility:** The CRD abstraction allows developers to work with Kubernetes-native resources while keeping the underlying datastore pluggable, enabling future enhancements or alternative datastore implementations. |
| 76 | + |
| 77 | +## Consequences |
| 78 | + |
| 79 | +### Positive |
| 80 | + |
| 81 | +- Kubernetes-native storage: Events are persisted directly in the cluster, reducing external dependencies. |
| 82 | +- Core workflows supported: Watch-oriented health event processing (quarantine, remediation, basic updates) works without modification. |
| 83 | +- Historical view via kube-state-metrics: Health events stored as CRs can be exposed as metrics through Kubernetes-native monitoring tools such as kube-state-metrics, so an existing metrics stack can provide basic historical analysis and observability without additional infrastructure. |
| 84 | + |
| 85 | +### Negative |
| 86 | + |
| 87 | +- Limited querying capabilities: Complex ad-hoc queries and full-featured historical analysis capabilities (beyond basic time-series metrics) are not feasible using only CRs. |
| 88 | +- Control plane load: High-frequency or bursty events can increase the number of CR objects, potentially impacting the API server in large clusters. |
| 89 | + |
| 90 | +### Mitigations |
| 91 | + |
| 92 | +- Event deduplication and rate limiting: Only create CRs for unique unhealthy events or updates to existing events. |
| 93 | +- Resource cleanup and garbage collection: Automatically remove CRs when events are remediated or when cluster/node limits are reached. |
| 94 | +- Phased approach: Core watch-based workflows are enabled first; advanced features can be added later. |
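
The deduplication mitigation can be sketched as follows. This is only an illustration of the idea: which fields identify a "unique" event is an assumption here (`node` and `check`), the function names are hypothetical, and neither rate limiting nor updates to existing CRs are covered.

```python
def dedup_key(event):
    """Identity of an event for deduplication (hypothetical field choice)."""
    return (event["node"], event["check"])


def plan_writes(existing_keys, incoming):
    """Split incoming events into CRs to create and events to skip.

    In this sketch only unhealthy events whose key has not been seen
    before result in a new CR; everything else is skipped.
    """
    seen = set(existing_keys)
    to_create, skipped = [], []
    for event in incoming:
        key = dedup_key(event)
        if event["healthy"] or key in seen:
            skipped.append(event)
        else:
            seen.add(key)
            to_create.append(event)
    return to_create, skipped
```

In a burst of identical failures from one node, only the first event would produce a CR, which is the behavior that keeps the object count bounded.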
| 95 | + |
| 96 | +## Alternatives Considered |
| 97 | + |
| 98 | +### Hand-Managed CRD Structs and Conversion Tests |
| 99 | +**Under Consideration** — needs further review: |
| 100 | +- Data Models: All new structs for the Kubernetes HealthEvent CRD (HealthEventSpec, HealthEventStatus, BehaviourOverrides, Entity, etc.) are added to `./data-models` as handwritten Go types. |
| 101 | +- CRD Generation: `controller-gen` is used to generate the Kubernetes CRD YAML manifests from these Go structs (spec) while `HealthEventStatus` remains in `data-models/pkg/model/health_event_extentions.go`. |
| 102 | +- Conversion APIs: The conversion functions between internal NVSentinel models and CRD types are implemented in `pkg/conversion/healthevent_conversion.go` with corresponding unit tests in `pkg/conversion/healthevent_conversion_test.go`. These tests ensure that the mapping between internal models, protobufs, and CRD structs is correct and that there is no drift between the data models. |
| 103 | +- Helm Chart Packaging: The generated CRD manifests are packaged in a new Helm chart at `./distros/kubernetes/nvsentinel/charts/kubernetes-store`. |
| 104 | +- **Trade-off:** This approach keeps `HealthEventStatus` as a handwritten struct, which requires an additional wrapper/conversion layer to compose a complete CR type for controllers. However, it may offer more flexibility in certain scenarios and should be evaluated further before a final decision. |
| 105 | + |
| 106 | +## Notes |
| 107 | + |
| 108 | +- The initial phase only supports core, watch-driven workflows; analytics and historical queries are deferred. |
| 109 | +- The system is designed to allow multiple operators or controllers to act independently on CR data. |
| 110 | + |
| 111 | +## References |
| 112 | + |
| 113 | +- [NVSentinel Datastore Abstraction Design Document](https://docs.google.com/document/d/1iD6qhWDapfb7CMCAY7sr3FB6WngdlwmSMewUDQ8gxJo/edit?tab=t.0#heading=h.lolo8bj6u6g2) |
| 114 | +- [K8s API Data store WIP PR](https://github.com/NVIDIA/NVSentinel/pull/640) |