Skip to content

Conversation

@tanishagoyal2
Copy link
Contributor

@tanishagoyal2 tanishagoyal2 commented Dec 22, 2025

Summary

Type of Change

  • πŸ› Bug fix
  • ✨ New feature
  • πŸ’₯ Breaking change
  • πŸ“š Documentation
  • πŸ”§ Refactoring
  • πŸ”¨ Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Issue: #390

Testing

  1. Completed testing on dev cluster 'nvs-dgxc-k8s-oci-lhr-dev3'
  2. I have updated image of platform-connector and fault-quarantine, that means all health monitors were running in old mode where no processingStrategy field was applied by health-monitors
  3. Inserted XID event 13 on GPU node using kernal log message
  4. Saw that platform connector automatically updated processingStrategy to EXECUTE_REMEDIATION and same was stored in db
Screenshot 2026-01-08 at 9 43 18β€―PM

Platform connector logs

Screenshot 2026-01-08 at 9 44 30β€―PM
  1. Fault quarantine also received the event and started processing the events
  2. Updated the circuit breaker cm to mark the breaker as TRIPPED
  3. Inserted some event without processingStrategy
Screenshot 2026-01-08 at 9 54 45β€―PM
  1. Updated the circuit breaker to CLOSED and restarted fault -quarantine
  2. It picked up old inserted events without processingStrategy and processed them
Screenshot 2026-01-08 at 9 57 05β€―PM

Summary by CodeRabbit

Release Notes

  • New Features

    • Health events now support processing strategies: execute remediation actions or store for monitoring only.
    • System intelligently filters and routes events based on their assigned strategy.
  • Tests

    • Added comprehensive test coverage for processing strategy handling across components.
  • Documentation

    • Updated design documentation covering processing strategy implementation.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 22, 2025

πŸ“ Walkthrough

Walkthrough

Introduces a new ProcessingStrategy enum (UNSPECIFIED, EXECUTE_REMEDIATION, STORE_ONLY) to the HealthEvent protobuf definition and propagates it across event processing, storage, and platform connector components. STORE_ONLY events are filtered out during remediation flows; EXECUTE_REMEDIATION events trigger system actions. The platform connector normalizes UNSPECIFIED to EXECUTE_REMEDIATION for backward compatibility.

Changes

Cohort / File(s) Summary
Protocol Buffer Definitions
data-models/protobufs/health_event.proto, health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py, health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
Added new ProcessingStrategy enum with members UNSPECIFIED (0), EXECUTE_REMEDIATION (1), and STORE_ONLY (2). Extended HealthEvent message with new field processingStrategy (field number 16). Updated generated Python protobuf files with serialization offsets and type stubs.
Event Transformation & Exporting
event-exporter/pkg/transformer/cloudevents.go, event-exporter/pkg/transformer/cloudevents_test.go
Added processingStrategy field to CloudEvents data payload. Updated test fixtures to set STORE_ONLY strategy and validate serialization into CloudEvent.
Kubernetes Platform Connector
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go, platform-connectors/pkg/connectors/kubernetes/process_node_events.go
Introduced filterProcessableEvents() to skip STORE_ONLY events from remediation processing. Added createK8sEvent() helper. Refactored processHealthEvents() to build node conditions only from processable (EXECUTE_REMEDIATION) events. Comprehensive test coverage for mixed strategies and node condition/event generation behavior.
Storage Pipeline Builders
store-client/pkg/client/pipeline_builder.go, store-client/pkg/client/mongodb_pipeline_builder.go, store-client/pkg/client/postgresql_pipeline_builder.go, store-client/pkg/client/pipeline_builder_test.go
Added BuildProcessableHealthEventInsertsPipeline() and BuildProcessableNonFatalUnhealthyInsertsPipeline() interface methods and implementations for MongoDB and PostgreSQL. Pipelines filter for EXECUTE_REMEDIATION strategy or missing/absent field for backward compatibility.
Data Mapping
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
Added "processingstrategy" ↔ "processingStrategy" field mapping for SQL queries.
Health Monitors
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go, health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
mapMaintenanceEventToHealthEvent() now sets ProcessingStrategy to EXECUTE_REMEDIATION by default (TODO: make configurable). Updated test expectations to validate strategy assignment.
Platform Connector Server
platform-connectors/pkg/server/platform_connector_server.go, platform-connectors/pkg/server/platform_connector_server_test.go
Added pre-processing logic in HealthEventOccurredV1() to normalize UNSPECIFIED strategy to EXECUTE_REMEDIATION for each event. New test validates normalization for UNSPECIFIED, EXECUTE_REMEDIATION, and STORE_ONLY cases.
Test Helpers & Utilities
tests/helpers/healthevent.go, tests/helpers/kube.go, tests/helpers/fault_quarantine.go, tests/fault_quarantine_test.go, fault-quarantine/pkg/evaluator/rule_evaluator_test.go
Extended HealthEventTemplate with ProcessingStrategy field and WithProcessingStrategy() method. Added EnsureNodeEventNotPresent() helper for negative event assertions. Updated test fixtures to include strategy field. Made config map application conditional in quarantine setup.
Pipeline Initialization
fault-quarantine/pkg/initializer/init.go
Changed pipeline construction from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() to filter events by processing strategy.
Documentation
docs/designs/025-processing-strategy-for-health-checks.md
Design document covering ProcessingStrategy enum expansion, public method signatures, pipeline builder additions, and backward compatibility strategy.

Sequence Diagram(s)

sequenceDiagram
    participant HM as Health Monitor
    participant PCS as Platform Connector Server
    participant KC as K8s Connector
    participant Store as Data Store

    HM->>HM: Create HealthEvent<br/>(processingStrategy set)
    HM->>PCS: HealthEventOccurredV1(event)
    
    PCS->>PCS: Normalize UNSPECIFIED<br/>β†’ EXECUTE_REMEDIATION
    PCS->>KC: Process via Pipeline
    PCS->>Store: Enqueue to Ring Buffer
    
    alt processingStrategy = EXECUTE_REMEDIATION
        KC->>KC: filterProcessableEvents()
        KC->>KC: Create K8s node conditions
        KC->>KC: Create K8s events
        KC->>Store: Write conditions & events
    else processingStrategy = STORE_ONLY
        KC->>KC: Skip (observability only)
        Note over KC: Event stored but<br/>no remediation
    end
    
    Store->>Store: Insert/Update with strategy
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR introduces a multi-service feature spanning protocol buffers, event transformation, filtering logic, and storage layer with heterogeneous changes across ~20 files. While individual changes are moderately complex, the breadth of affected components and diverse patterns (filters, builders, normalizations) require careful verification of strategy propagation and backward compatibility guarantees.

Poem

🐰 A strategy emerges, splitting the stream wide,
Some events for action, some for the tide.
STORE_ONLY whispers "just watch and record,"
While EXECUTE_REMEDIATION swings the sword.
Filtered with care through each service's gate,
The bunny ensures they all coordinate! πŸ₯•βœ¨

πŸš₯ Pre-merge checks | βœ… 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
βœ… Passed checks (2 passed)
Check name Status Explanation
Title check βœ… Passed The title accurately describes the main change: adding a processingStrategy field to the health events structure, which is the core feature across all modified files.
Description Check βœ… Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • πŸ“ Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

325-343: Consider log level and function signature simplification.

Two optional refinements:

  1. Log level: If STORE_ONLY events are frequent, INFO-level logging (line 331) may be verbose. Consider DEBUG level for routine filtering operations.

  2. Function signature: The function takes *protos.HealthEvents but only uses the Events field. Consider accepting []*protos.HealthEvent directly for a clearer contract.

πŸ”Ž Proposed signature simplification
-func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent {
+func filterProcessableEvents(healthEvents []*protos.HealthEvent) []*protos.HealthEvent {
 	var processableEvents []*protos.HealthEvent
 
-	for _, healthEvent := range healthEvents.Events {
+	for _, healthEvent := range healthEvents {
 		if healthEvent.ProcessingStrategy == protos.ProcessingStrategy_STORE_ONLY {
 			slog.Info("Skipping STORE_ONLY health event (no node conditions / node events)",
 				"node", healthEvent.NodeName,
 				"checkName", healthEvent.CheckName,
 				"agent", healthEvent.Agent)
 
 			continue
 		}
 
 		processableEvents = append(processableEvents, healthEvent)
 	}
 
 	return processableEvents
 }

Then update the call site:

 func (r *K8sConnector) processHealthEvents(ctx context.Context, healthEvents *protos.HealthEvents) error {
-	processableEvents := filterProcessableEvents(healthEvents)
+	processableEvents := filterProcessableEvents(healthEvents.Events)
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 61f47cb and a44b5a6.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (11)
  • data-models/protobufs/health_event.proto
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/pipeline_builder_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
🧰 Additional context used
πŸ““ Path-based instructions (4)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • store-client/pkg/client/pipeline_builder.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • event-exporter/pkg/transformer/cloudevents.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/pipeline_builder_test.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • event-exporter/pkg/transformer/cloudevents_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • store-client/pkg/client/pipeline_builder_test.go
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🧠 Learnings (2)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • store-client/pkg/client/mongodb_pipeline_builder.go
🧬 Code graph analysis (7)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (11)
  • HealthEvents (156-162)
  • HealthEvents (175-175)
  • HealthEvents (190-192)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
  • PipelineBuilder (26-47)
store-client/pkg/client/mongodb_pipeline_builder.go (1)
  • NewMongoDBPipelineBuilder (29-31)
store-client/pkg/client/postgresql_pipeline_builder.go (1)
  • NewPostgreSQLPipelineBuilder (29-31)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
πŸ”‡ Additional comments (13)
event-exporter/pkg/transformer/cloudevents_test.go (1)

69-69: LGTM - Good test coverage for the new field.

The test correctly validates that ProcessingStrategy_STORE_ONLY is serialized as "STORE_ONLY" in the CloudEvent payload. Consider adding an explicit test case for the default value (EXECUTE_REMEDIATION) to ensure both enum values are covered.

Also applies to: 106-108

store-client/pkg/client/pipeline_builder_test.go (1)

69-86: LGTM - Test follows established patterns.

The test is well-structured, table-driven, and consistent with the existing pipeline builder tests. It correctly validates both MongoDB and PostgreSQL implementations.

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: LGTM - Consistent enum serialization.

The processingStrategy field is added correctly, following the same pattern as recommendedAction using the .String() method for enum-to-string conversion.

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM - Well-documented interface method.

The new BuildProcessableHealthEventInsertsPipeline method is properly documented with clear comments explaining its purpose and use case. The placement in the interface is logical.

store-client/pkg/client/postgresql_pipeline_builder.go (1)

119-132: LGTM - Consistent with MongoDB implementation.

The PostgreSQL implementation correctly mirrors the MongoDB version. Since both use client-side pipeline filtering, the identical implementation is appropriate.

data-models/protobufs/health_event.proto (2)

32-38: LGTM - Well-designed enum with sensible defaults.

The ProcessingStrategy enum is properly defined with:

  • EXECUTE_REMEDIATION = 0 as the default (preserves existing behavior for events without this field)
  • STORE_ONLY = 1 for observability-only events
  • Comprehensive comments explaining each value's semantics

As per coding guidelines, the comments adequately describe the field behavior.


77-77: LGTM - Additive field addition.

Field number 16 is correctly assigned following the sequence. This is a backward-compatible change.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Auto-generated protobuf code - no manual review needed.

This file is generated by the protocol buffer compiler. The updated descriptor and offsets correctly reflect the new ProcessingStrategy enum and HealthEvent.processingStrategy field from the proto definition.

store-client/pkg/client/mongodb_pipeline_builder.go (1)

87-100: LGTM - Correct implementation of processable events filter.

The pipeline correctly filters for insert operations where processingStrategy equals EXECUTE_REMEDIATION. The use of int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION) is appropriate since protobuf enums are stored as integers. The field path fullDocument.healthevent.processingstrategy is correctβ€”the codebase stores all MongoDB document fields in lowercase, as enforced by the unmarshaller which converts JSON tags to lowercase BSON tags.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1391-1589: Well-structured test for processing strategy feature.

The test comprehensively validates the STORE_ONLY vs EXECUTE_REMEDIATION strategy behavior using table-driven tests as recommended in the coding guidelines. Good test isolation with dedicated clientSet and ring buffer per test case, and proper use of testify assertions.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (2)

345-370: Good refactoring to extract event creation logic.

Extracting the Kubernetes event creation into a dedicated function improves code organization and reusability. The implementation correctly uses the receiver methods for message and reason formatting.


372-416: Clean integration of processing strategy filtering.

The function correctly filters events by processing strategy before applying node condition and Kubernetes event logic. The flow is clear: filter processable events, update node conditions for processable events, and write node events for processable non-fatal events. This ensures STORE_ONLY events are consistently excluded from remediation actions.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

14-17: Generated type stubs correctly reflect protobuf schema changes.

The ProcessingStrategy enum and processingStrategy field additions follow the established pattern for protobuf-generated Python type stubs. The type annotations and scaffolding are consistent with the existing code structure.

Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138

@tanishagoyal2 tanishagoyal2 changed the title feat: add processing strategy field in health events feat: add processingStrategy field in health events structure Dec 22, 2025
@tanishagoyal2 tanishagoyal2 self-assigned this Dec 23, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/platform-connector_test.go (1)

53-96: Consider renaming the feature to match test intent.

The test function is named TestPlatformConnectorWithProcessingStrategy but the feature on line 35 is named "TestPlatformConnector". For consistency and clarity, consider naming the feature to include "WithProcessingStrategy" to better reflect what's being tested.

πŸ”Ž Suggested change
-	feature := features.New("TestPlatformConnector").
+	feature := features.New("TestPlatformConnectorWithProcessingStrategy").
 		WithLabel("suite", "platform-connector")
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between a44b5a6 and 9445cfd.

πŸ“’ Files selected for processing (6)
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
  • docs/postgresql-schema.sql
  • tests/fault_quarantine_test.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
  • tests/platform-connector_test.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/kube.go
  • tests/platform-connector_test.go
  • tests/helpers/healthevent.go
  • tests/fault_quarantine_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/platform-connector_test.go
  • tests/fault_quarantine_test.go
🧠 Learnings (7)
πŸ“š Learning: 2025-12-23T10:34:08.727Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:08.727Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • tests/helpers/kube.go
  • tests/platform-connector_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/platform-connector_test.go
πŸ“š Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • tests/platform-connector_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/platform-connector_test.go
  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/platform-connector_test.go
  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/fault_quarantine_test.go
🧬 Code graph analysis (2)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/fault_quarantine_test.go (4)
tests/helpers/fault_quarantine.go (4)
  • QuarantineTestContext (51-54)
  • SetupQuarantineTest (107-112)
  • AssertQuarantineState (315-382)
  • QuarantineAssertion (56-60)
tests/helpers/kube.go (1)
  • SetNodeManagedByNVSentinel (1389-1408)
tests/helpers/healthevent.go (3)
  • NewHealthEvent (60-76)
  • SendHealthEvent (263-275)
  • SendHealthyEvent (277-287)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
πŸ”‡ Additional comments (4)
tests/helpers/kube.go (1)

387-409: LGTM! Well-designed negative assertion helper.

The function correctly uses require.Never to assert the absence of events and complements the existing WaitForNodeEvent function. The implementation follows established patterns in this file.

tests/fault_quarantine_test.go (1)

233-295: LGTM! Test correctly validates processing strategy behavior.

The test properly exercises both STORE_ONLY and EXECUTE_REMEDIATION strategies with appropriate assertions for quarantine state. The structure follows e2e-framework conventions and test naming guidelines.

tests/helpers/healthevent.go (2)

48-48: LGTM! Field addition follows struct conventions.

The ProcessingStrategy field is properly tagged with omitempty and uses int type to align with protobuf enum values.


153-156: LGTM! Fluent setter follows established patterns.

The WithProcessingStrategy method correctly implements the fluent builder pattern consistent with other setters in this struct.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch from 9445cfd to 272b090 Compare December 23, 2025 16:15
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/helpers/healthevent.go (1)

48-48: Consider using the proper enum type for type safety.

The ProcessingStrategy field is declared as int, but the protobuf-generated code defines ProcessingStrategy as a proper enum type in data-models/pkg/protos/health_event.pb.go. Using the enum type would provide compile-time validation of valid values and make the code more self-documenting.

πŸ”Ž Proposed refactor to use the enum type

First, import the protobuf package at the top of the file:

 import (
 	"context"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"io"
 	"log"
 	"net/http"
 	"os"
 	"strings"
 	"sync"
 	"testing"
 	"time"
 
+	protos "github.com/NVIDIA/NVSentinel/data-models/pkg/protos"
 	"github.com/stretchr/testify/require"
 )

Then update the field declaration:

-	ProcessingStrategy  int                  `json:"processingStrategy,omitempty"`
+	ProcessingStrategy  protos.ProcessingStrategy `json:"processingStrategy,omitempty"`
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 9445cfd and 272b090.

πŸ“’ Files selected for processing (6)
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
  • docs/postgresql-schema.sql
  • tests/fault_quarantine_test.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
  • tests/platform-connector_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • docs/postgresql-schema.sql
  • tests/platform-connector_test.go
  • tests/helpers/kube.go
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
  • tests/fault_quarantine_test.go
🧰 Additional context used
πŸ““ Path-based instructions (1)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/healthevent.go
🧬 Code graph analysis (1)
tests/helpers/healthevent.go (2)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
πŸ”‡ Additional comments (1)
tests/helpers/healthevent.go (1)

153-156: Implementation follows the fluent builder pattern correctly.

The method correctly implements the fluent builder pattern consistent with other methods in this file. If the field type is updated to use the proper enum type (as suggested for line 48), remember to update the parameter type accordingly.

Signed-off-by: Tanisha goyal <[email protected]>
@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch from 272b090 to 69a01e2 Compare December 23, 2025 16:24
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

218-271: LGTM! Test correctly validates the new processingStrategy field.

The test properly expects processingStrategy with value float64(0) (EXECUTE_REMEDIATION) when not explicitly set on the HealthEvent. This aligns with the PR's addition of the ProcessingStrategy enum and validates the default serialization behavior.

Consider adding a test case that explicitly sets processingStrategy to protos.ProcessingStrategy_STORE_ONLY to verify the RoundTrip function handles non-default values correctly. However, if other test files (e.g., cloudevents_test.go) already cover different processing strategies, this may be unnecessary.

Optional: Example test case for STORE_ONLY strategy
func TestRoundTrip_WithStoreOnlyStrategy(t *testing.T) {
	eventTime := timestamppb.New(time.Now())
	event := &protos.HealthEvent{
		Version:            1,
		Agent:              "test-agent",
		CheckName:          "test-check",
		ProcessingStrategy: protos.ProcessingStrategy_STORE_ONLY,
		GeneratedTimestamp: eventTime,
		NodeName:           "test-node",
	}

	result, err := RoundTrip(event)
	if err != nil {
		t.Fatalf("Failed to roundtrip event: %v", err)
	}

	if ps, ok := result["processingStrategy"].(float64); !ok || ps != float64(1) {
		t.Errorf("Expected processingStrategy to be 1 (STORE_ONLY), got %v", result["processingStrategy"])
	}
}
tests/helpers/kube.go (1)

387-389: Consider adding godoc comment for exported function.

The coding guidelines specify that function comments are required for all exported Go functions. While this is consistent with other test helpers in the file, adding a brief godoc comment would improve clarity and align with the guidelines.

Example godoc
// EnsureNodeEventNotPresent asserts that a specific node event with the given type and reason
// does not exist within NeverWaitTimeout. This is the negative counterpart to WaitForNodeEvent.
func EnsureNodeEventNotPresent(ctx context.Context, t *testing.T,
	c klient.Client, nodeName string, eventType, eventReason string) {

Based on coding guidelines, function comments are required for exported Go functions.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 272b090 and 69a01e2.

πŸ“’ Files selected for processing (7)
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
  • docs/postgresql-schema.sql
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • tests/fault_quarantine_test.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
  • tests/platform-connector_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • docs/postgresql-schema.sql
  • tests/helpers/healthevent.go
  • tests/platform-connector_test.go
  • tests/fault_quarantine_test.go
  • distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/kube.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
🧠 Learnings (3)
πŸ“š Learning: 2025-12-23T10:34:08.727Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:08.727Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • tests/helpers/kube.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
πŸ“š Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • tests/helpers/kube.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch from 404bf55 to b5af994 Compare January 5, 2026 13:56
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Fix all issues with AI Agents πŸ€–
In @tests/fault_quarantine_test.go:
- Around line 298-299: The test currently calls helpers.CheckNodeConditionExists
and helpers.CheckNodeEventExists but ignores their boolean return values; change
both calls to assert their results by wrapping them in require.Eventually (or
another assertion) so the test fails if the condition/event never appearsβ€”e.g.,
call CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError",
"SysLogsXIDErrorIsNotHealthy") inside a require.Eventually loop with a timeout
and poll interval and do the same for CheckNodeEventExists(ctx, client,
testCtx.NodeName, "SysLogsXIDError", "XID error found"); ensure you reference
the returned bool from each helper and assert it is true rather than ignoring
it.
🧹 Nitpick comments (1)
tests/fault_quarantine_test.go (1)

324-328: Redundant healthy event send in teardown.

The SendHealthyEvent call at line 325 is redundant because TeardownQuarantineTest (line 327) already sends a healthy event internally as part of its cleanup routine.

πŸ”Ž Proposed simplification
 	feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
-		helpers.SendHealthyEvent(ctx, t, testCtx.NodeName)
-
 		return helpers.TeardownQuarantineTest(ctx, t, c)
 	})
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 69a01e2 and b5af994.

πŸ“’ Files selected for processing (4)
  • fault-quarantine/pkg/initializer/init.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/fault_quarantine_test.go
  • tests/helpers/kube.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/helpers/kube.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/fault_quarantine_test.go
  • fault-quarantine/pkg/initializer/init.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/fault_quarantine_test.go
🧠 Learnings (5)
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • fault-quarantine/pkg/initializer/init.go
🧬 Code graph analysis (1)
tests/fault_quarantine_test.go (4)
tests/helpers/fault_quarantine.go (5)
  • QuarantineTestContext (51-54)
  • SetupQuarantineTest (107-112)
  • AssertQuarantineState (315-382)
  • QuarantineAssertion (56-60)
  • TeardownQuarantineTest (181-242)
tests/helpers/kube.go (5)
  • SetNodeManagedByNVSentinel (1389-1408)
  • EnsureNodeConditionNotPresent (1797-1818)
  • EnsureNodeEventNotPresent (387-409)
  • CheckNodeConditionExists (1283-1300)
  • CheckNodeEventExists (1325-1353)
tests/helpers/healthevent.go (3)
  • NewHealthEvent (60-76)
  • SendHealthEvent (263-275)
  • SendHealthyEvent (277-287)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
  • GitHub Check: ko-build-test (janitor, .)
  • GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
  • GitHub Check: modules-lint-test (health-events-analyzer)
  • GitHub Check: modules-lint-test (fault-remediation)
  • GitHub Check: modules-lint-test (fault-quarantine)
  • GitHub Check: modules-lint-test (labeler)
  • GitHub Check: modules-lint-test (platform-connectors)
  • GitHub Check: modules-lint-test (store-client)
  • GitHub Check: modules-lint-test (event-exporter)
  • GitHub Check: modules-lint-test (node-drainer)
  • GitHub Check: modules-lint-test (janitor)
  • GitHub Check: health-monitors-lint-test (csp-health-monitor)
  • GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
  • GitHub Check: health-monitors-lint-test (syslog-health-monitor)
  • GitHub Check: CodeQL PR Analysis
  • GitHub Check: E2E Tests (ARM64 + MongoDB)
  • GitHub Check: E2E Tests (ARM64 + PostgreSQL)
  • GitHub Check: E2E Tests (AMD64 + MongoDB)
  • GitHub Check: E2E Tests (AMD64 + PostgreSQL)
πŸ”‡ Additional comments (2)
fault-quarantine/pkg/initializer/init.go (1)

66-66: Change is correct; default processingStrategy ensures backward compatibility.

The switch to BuildProcessableHealthEventInsertsPipeline() correctly filters to only process events with processingStrategy == EXECUTE_REMEDIATION. This is backward compatible because EXECUTE_REMEDIATION is the default value (0) in the proto definition, so existing events without an explicit processingStrategy field will be processed normally. Events explicitly marked STORE_ONLY are intentionally ignored by fault-quarantine, as shown in the test TestFaultQuarantineWithProcessingStrategy. The change is scoped to fault-quarantine; event-exporter still uses BuildAllHealthEventInsertsPipeline() to handle all events.

tests/fault_quarantine_test.go (1)

26-26: LGTM!

The protos import is necessary for accessing the ProcessingStrategy constants used in the new test.

@github-actions
Copy link

github-actions bot commented Jan 5, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes 82.99% (+0.21%) πŸ‘
github.com/nvidia/nvsentinel/tests 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ΓΈ)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer/init.go 0.00% (ΓΈ) 280 0 280
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ΓΈ) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • github.com/nvidia/nvsentinel/tests/fault_quarantine_test.go
  • github.com/nvidia/nvsentinel/tests/platform-connector_test.go

Signed-off-by: Tanisha goyal <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Fix all issues with AI Agents πŸ€–
In @tests/fault_quarantine_test.go:
- Around line 334-343: The teardown is missing removal of the
ManagedByNVSentinel label set in Setup; inside the Teardown func (the anonymous
function with signature func(ctx context.Context, t *testing.T, c
*envconf.Config)), obtain a client via c.NewClient(), assert no error with
require.NoError(t, err), call helpers.RemoveNodeManagedByNVSentinelLabel(ctx,
client, testCtx.NodeName) and assert its error with require.NoError(t, err)
before returning helpers.TeardownQuarantineTest(ctx, t, c) so the label is
cleaned up prior to final teardown.
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between b5af994 and 642a966.

πŸ“’ Files selected for processing (2)
  • tests/data/managed-by-nvsentinel-configmap.yaml
  • tests/fault_quarantine_test.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/fault_quarantine_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/fault_quarantine_test.go
🧠 Learnings (9)
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-12-12T07:41:27.339Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:27.339Z
Learning: In tests/data/health-events-analyzer-config.yaml, the XID74Reg2Bit13Set rule intentionally omits the time window filter; tests should verify only the register bit pattern (bit 13 in REG2) on the incoming XID 74 event and should not rely on historical events or counts of repeats. If adding similar rules elsewhere, apply the same pattern and document that the time window filter is unnecessary for single-event bit checks.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-11-27T15:42:48.142Z
Learnt from: lalitadithya
Repo: NVIDIA/NVSentinel PR: 459
File: docs/cancelling-breakfix.md:58-90
Timestamp: 2025-11-27T15:42:48.142Z
Learning: The label `k8saas.nvidia.com/ManagedByNVSentinel` exists in NVSentinel and is used to opt nodes out of automated break-fix workflows. The prefix `k8saas.nvidia.com/` is configurable via the `labelPrefix` setting in the fault-quarantine Helm values, and the documentation correctly uses the default value.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-12-12T07:36:26.109Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:26.109Z
Learning: In NVSentinel health-events analyzer config files (e.g., tests/data/health-events-analyzer-config.yaml and similar), XID 74 errors are guaranteed to include a GPU_UUID entry within healthevent.entitiesimpacted. Therefore, when writing XID 74-specific rules that filter for GPU_UUID entities, you can skip null checks (no $ifNull) for GPU_UUID. Ensure tests relying on this assumption are updated accordingly and document the guarantee in the rule descriptions.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-12-12T07:38:37.023Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:37.023Z
Learning: In NVSentinel, XID 74 errors always include an NVLINK entry in healthevent.entitiesimpacted, so null-checking with $ifNull is unnecessary when filtering for NVLINK entities in XID 74-specific rules. Apply this rule to YAML test fixtures under tests/ data (e.g., tests/data/health-events-analyzer-config.yaml) and any similar health-event configuration tests. If applying in code, ensure downstream filters rely on the presence of NVLINK in entitiesimpacted for XID 74 only, but continue to guard other fields and XIDs with appropriate null checks.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/fault_quarantine_test.go
πŸ”‡ Additional comments (5)
tests/data/managed-by-nvsentinel-configmap.yaml (1)

43-58: Verified: The syslog-health-monitor agent and SysLogsXIDError checkName are valid identifiers in the system and are actively used across test fixtures, e2e tests, and health monitor implementations. The new rule-set correctly applies the same remediation pattern as the existing ManagedByNVSentinel-Rule for handling syslog-based XID errors.

tests/fault_quarantine_test.go (4)

26-32: LGTM!

The import for the protos package is necessary for accessing the ProcessingStrategy enum values used in the test assertions.


240-251: LGTM!

Setup correctly initializes the test context and enables NVSentinel management for the node under test.


253-289: LGTM!

The test correctly validates that health events with STORE_ONLY processing strategy do not trigger node conditions, events, or quarantine actions. The use of Ensure* helpers with built-in assertions is appropriate.


291-332: LGTM!

The test correctly validates that health events with EXECUTE_REMEDIATION processing strategy trigger the expected node conditions, events, and quarantine actions. This provides good contrast with the STORE_ONLY phase above.

@github-actions
Copy link

github-actions bot commented Jan 5, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/janitor/pkg/csp 24.39% (ΓΈ)
github.com/nvidia/nvsentinel/janitor/pkg/csp/nebius 0.39% (ΓΈ)
github.com/nvidia/nvsentinel/store-client/pkg/client 6.11% (+0.03%) πŸ‘

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/janitor/pkg/csp/client.go 24.39% (ΓΈ) 164 40 124
github.com/nvidia/nvsentinel/janitor/pkg/csp/nebius/nebius.go 0.39% (ΓΈ) 514 2 512
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_client.go 3.47% (ΓΈ) 2966 103 2863

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor/pkg/csp/client_test.go

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch from 89ca1dd to 2056205 Compare January 7, 2026 07:38
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

πŸ€– Fix all issues with AI agents
In @data-models/protobufs/health_event.proto:
- Line 78: Add a field-level comment for the processingStrategy field in the
HealthEvent message: describe what processingStrategy (of enum type
ProcessingStrategy) controls for a HealthEvent (e.g., how the event should be
processed, prioritization/ordering implications, default behavior when unset,
and any effects on downstream consumers). Update the comment immediately above
the line declaring "ProcessingStrategy processingStrategy = 16;" so it clearly
documents the field's purpose, expected values/semantics, and any default or
compatibility considerations.
🧹 Nitpick comments (1)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)

362-364: Hardcoded processing strategy is appropriate; consider issue reference in TODO.

The hardcoded EXECUTE_REMEDIATION value is reasonable for this trigger engine component, as its purpose is to forward maintenance events that should initiate remediation workflows. The TODO comment clearly documents the intent to make this configurable.

However, per coding guidelines, TODO comments should reference issues rather than PRs. Consider creating or referencing an issue for the configurability work.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 642a966 and 89ca1dd.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (9)
  • data-models/protobufs/health_event.proto
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • tests/data/fatal-health-event.json
  • tests/data/healthy-event.json
  • tests/data/unsupported-health-event.json
  • tests/helpers/healthevent.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/helpers/healthevent.go
🧰 Additional context used
πŸ““ Path-based instructions (3)
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🧠 Learnings (2)
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
πŸ“š Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.

Applied to files:

  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
🧬 Code graph analysis (2)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
  • ProcessingStrategy_EXECUTE_REMEDIATION (47-47)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
πŸ”‡ Additional comments (7)
data-models/protobufs/health_event.proto (1)

32-39: Well-designed enum with clear documentation.

The ProcessingStrategy enum follows protobuf best practices with UNSPECIFIED = 0 as the default value. The comments clearly explain the behavioral differences between strategies, which will help downstream consumers understand when to use each value.

tests/data/unsupported-health-event.json (1)

19-20: LGTM!

The JSON is valid and the processingStrategy value of 1 (EXECUTE_REMEDIATION) is appropriate for test data representing an unsupported health event that should trigger remediation.

tests/data/fatal-health-event.json (1)

19-20: LGTM!

The addition of processingStrategy: 1 (EXECUTE_REMEDIATION) is appropriate for a fatal health event that requires remediation action.

store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go (1)

404-404: LGTM!

The field name mapping correctly follows the established pattern: lowercase for MongoDB (per the retrieved learnings) and camelCase for PostgreSQL JSONB queries. This ensures that filters on processingStrategy will work correctly across both database providers.

Based on learnings.

tests/data/healthy-event.json (1)

19-20: Verify that EXECUTE_REMEDIATION is the intended strategy for healthy events.

While the JSON is valid, using processingStrategy: 1 (EXECUTE_REMEDIATION) for a healthy event warrants verification. Depending on the system design:

  • If healthy events should clear quarantines or update cluster state, EXECUTE_REMEDIATION is appropriate.
  • If healthy events are purely observability signals, STORE_ONLY might be more suitable.

Please confirm this aligns with the intended behavior for healthy events in your system.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Generated protobuf file – verify check-in policy.

This is auto-generated protobuf code. Line 3 states "NO CHECKED-IN PROTOBUF GENCODE", yet this file is being committed to version control. Please verify whether this aligns with the project's policy for handling generated protobuf files. Some projects regenerate these files during the build process rather than committing them.

If the policy is to commit generated files, the changes look correct and reflect the new ProcessingStrategy enum and processingStrategy field additions.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

14-18: Generated type stubs look consistent.

This auto-generated type stub file correctly reflects the new ProcessingStrategy enum and processingStrategy field additions. The type annotations follow standard protobuf Python patterns:

  • Enum class with proper metaclass and __slots__
  • Module-level constants exported
  • HealthEvent extended with field, constant, and parameter
  • Type hints correctly use _Optional[_Union[ProcessingStrategy, str]]

No issues identified with the generated code.

Also applies to: 32-34, 80-80, 106-106, 122-122, 140-140

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/helpers/healthevent.go (1)

155-158: LGTM! Consider adding godoc (optional).

The WithProcessingStrategy method correctly implements the fluent builder pattern used throughout this file.

As per coding guidelines, exported functions should have godoc comments. However, this is consistent with all other With* methods in the file (lines 80-153). Consider adding documentation to all exported methods in a future refactor.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 89ca1dd and 2056205.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (9)
  • data-models/protobufs/health_event.proto
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • tests/data/fatal-health-event.json
  • tests/data/healthy-event.json
  • tests/data/unsupported-health-event.json
  • tests/helpers/healthevent.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/data/unsupported-health-event.json
  • tests/data/healthy-event.json
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • data-models/protobufs/health_event.proto
🧰 Additional context used
πŸ““ Path-based instructions (1)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/healthevent.go
🧠 Learnings (2)
πŸ“š Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • tests/helpers/healthevent.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/helpers/healthevent.go
🧬 Code graph analysis (2)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
  • ProcessingStrategy_EXECUTE_REMEDIATION (47-47)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
πŸ”‡ Additional comments (7)
tests/helpers/healthevent.go (3)

31-31: LGTM!

The import of the protos package is necessary to access the ProcessingStrategy enum for the default initialization.


49-49: LGTM!

The ProcessingStrategy field follows the established pattern in this struct, using int type consistent with RecommendedAction (line 43). The omitempty tag appropriately handles zero-value serialization.


76-76: LGTM!

Initializing ProcessingStrategy to EXECUTE_REMEDIATION is a sensible default that maintains backward compatibilityβ€”existing tests that don't explicitly set this field will continue to expect remediation to execute.

tests/data/fatal-health-event.json (1)

19-20: LGTM. The addition of processingStrategy with value 1 correctly represents EXECUTE_REMEDIATION per the proto definition, and is appropriate for this fatal health event test fixture. JSON syntax is valid.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (3)

32-34: LGTM!

Module-level enum constants follow the standard protobuf pattern and are consistent with the existing RecommendedAction enum exports.


80-80: LGTM!

The processingStrategy field is correctly integrated into HealthEvent:

  • Added to __slots__ (line 80)
  • Field number constant declared (line 106)
  • Field typed appropriately (line 122)
  • Parameter signature matches the pattern used for recommendedAction (line 140)

Also applies to: 106-106, 122-122, 140-140


14-18: No action required. The type stub is correctly auto-generated from the protobuf definition. The processingStrategy field and ProcessingStrategy enum are properly defined in both the proto source and the generated type stub. The proto documentation clearly explains the intended behavior: EXECUTE_REMEDIATION for normal operations and STORE_ONLY for observability-only events, with UNSPECIFIED as the default. Since no filtering logic on processingStrategy currently exists in the Python codebase, backward compatibility is not an issue at this time. Once processing logic is implemented, developers will have the proto documentation to guide proper handling of the enum values.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch 2 times, most recently from d432594 to a4f8335 Compare January 7, 2026 08:03
Signed-off-by: Tanisha goyal <[email protected]>
@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch from a4f8335 to 786239c Compare January 7, 2026 08:14
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between d432594 and 786239c.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (10)
  • data-models/protobufs/health_event.proto
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • tests/data/fatal-health-event.json
  • tests/data/healthy-event.json
  • tests/data/unsupported-health-event.json
  • tests/helpers/healthevent.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • tests/data/fatal-health-event.json
  • tests/helpers/healthevent.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • tests/data/unsupported-health-event.json
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
🧠 Learnings (1)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages

Applied to files:

  • data-models/protobufs/health_event.proto
πŸ”‡ Additional comments (3)
data-models/protobufs/health_event.proto (1)

32-39: Well-documented enum definition with clear semantics.

The ProcessingStrategy enum is properly defined with:

  • Comprehensive comments explaining each member's behavior
  • UNSPECIFIED = 0 following proto3 conventions for default values
  • Clear distinction between EXECUTE_REMEDIATION (normal) and STORE_ONLY (observability-only) modes
tests/data/healthy-event.json (1)

19-20: Test fixture correctly updated with new field.

The addition of processingStrategy: 1 (EXECUTE_REMEDIATION) properly reflects the new protobuf field in the test data. The JSON syntax is valid and the value correctly corresponds to the enum definition.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

24-50: Auto-generated protobuf codeβ€”no manual review required.

This file is auto-generated by the protocol buffer compiler (line 2). The changes to serialized descriptors and offset positions are mechanical reflections of the proto schema updates. No manual modifications or detailed review needed.

Copy link
Collaborator

@lalitadithya lalitadithya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some details on how are we going to handle backwards compatibility? What about monitors that do not send the field, how will they be processed?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
docs/designs/025-processing-strategy-for-health-checks.md (3)

415-435: Backward compatibility inconsistency: pipeline builders don't handle missing processingStrategy field.

The MongoDB and PostgreSQL pipeline builders (lines 408-435) use strict equality checks for processingStrategy == "EXECUTE_REMEDIATION", which will exclude historical events that lack this field entirely. However, the reconciler (lines 555–557) correctly handles backward compatibility using an $or condition that treats missing fields as EXECUTE_REMEDIATION.

This inconsistency means old events without processingStrategy will be silently filtered out by BuildProcessableHealthEventInsertsPipeline() during fault-quarantine initialization, breaking processing for historical data.

Both pipeline builders should match the reconciler's backward-compatible approach:

♻️ Proposed fix for pipeline builders
// For MongoDB:
datastore.E("$or", datastore.A(
  datastore.D(
    datastore.E("fullDocument.healthevent.processingstrategy", "EXECUTE_REMEDIATION"),
  ),
  datastore.D(
    datastore.E("fullDocument.healthevent.processingstrategy", datastore.D(
      datastore.E("$exists", false),
    )),
  ),
))

// Similar for PostgreSQLβ€”use $or with existence check for missing field

398-398: Fix duplicate step numbering: Step 5 appears twice.

Step 5 is labeled at both line 398 ("Store Client") and line 477 ("Fault Quarantine Module"). The fault-quarantine section should be labeled "Step 6" and subsequent sections renumbered accordingly.

Also applies to: 477-477


323-327: Clarify GPU Health Monitor implementation: document all 4 creation sites.

Line 327 notes that GPU Health Monitor creates HealthEvent objects in 4 different places but only shows implementation for one. The design should either:

  1. Show all 4 locations with the required processingStrategy=self._processing_strategy addition, or
  2. Reference a specific file path and line range where developers can find and update all instances.

This reduces ambiguity during implementation.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 786239c and 6376148.

πŸ“’ Files selected for processing (1)
  • docs/designs/025-processing-strategy-for-health-checks.md
🧰 Additional context used
🧠 Learnings (3)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
πŸͺ› markdownlint-cli2 (0.18.1)
docs/designs/025-processing-strategy-for-health-checks.md

547-547: Hard tabs
Column: 1

(MD010, no-hard-tabs)


548-548: Hard tabs
Column: 1

(MD010, no-hard-tabs)


549-549: Hard tabs
Column: 1

(MD010, no-hard-tabs)


550-550: Hard tabs
Column: 1

(MD010, no-hard-tabs)


551-551: Hard tabs
Column: 1

(MD010, no-hard-tabs)


552-552: Hard tabs
Column: 1

(MD010, no-hard-tabs)


553-553: Hard tabs
Column: 1

(MD010, no-hard-tabs)


554-554: Hard tabs
Column: 1

(MD010, no-hard-tabs)


555-555: Hard tabs
Column: 1

(MD010, no-hard-tabs)


556-556: Hard tabs
Column: 1

(MD010, no-hard-tabs)


557-557: Hard tabs
Column: 1

(MD010, no-hard-tabs)


558-558: Hard tabs
Column: 1

(MD010, no-hard-tabs)

πŸ”‡ Additional comments (2)
docs/designs/025-processing-strategy-for-health-checks.md (2)

101-104: Clarify enum definition: AI summary mentions UNSPECIFIED but design omits it.

The protobuf enum in Step 1 defines only EXECUTE_REMEDIATION = 0 and STORE_ONLY = 1, but the AI summary references a three-value enum: (UNSPECIFIED, EXECUTE_REMEDIATION, STORE_ONLY). Protobuf conventions typically reserve value 0 for an UNSPECIFIED/default state.

Confirm whether the enum should include an explicit UNSPECIFIED = 0 value, with EXECUTE_REMEDIATION shifted to 1, or whether the current design intentionally defaults to EXECUTE_REMEDIATION.


547-558: Replace hard tabs with spaces in code block.

Lines 547–558 contain hard tabs instead of spaces, flagged by markdownlint (MD010). Convert to spaces for consistency with markdown standards.

πŸ› Proposed fix
-	// Backward Compatibility: Use $or to include events where processingstrategy is either
-	// EXECUTE_REMEDIATION or missing (old events created before this feature was added).
-	// Old events without this field should be treated as EXECUTE_REMEDIATION.
+    // Backward Compatibility: Use $or to include events where processingstrategy is either
+    // EXECUTE_REMEDIATION or missing (old events created before this feature was added).
+    // Old events without this field should be treated as EXECUTE_REMEDIATION.
β›” Skipped due to learnings
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:143-155
Timestamp: 2025-12-01T17:52:43.853Z
Learning: Linting violations (such as hard tabs, MD010) in code stubs within design documents (docs/designs/) in the NVSentinel repository can be ignored, as these are illustrative examples rather than production code.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch from 6376148 to ab8c6d4 Compare January 8, 2026 04:57
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

πŸ€– Fix all issues with AI agents
In @data-models/protobufs/health_event.proto:
- Line 79: Add a field-level comment for the processingStrategy field on the
HealthEvent message that explains its purpose for routing/filtering and
remediation decisions; state that ProcessingStrategy processingStrategy = 16
indicates how downstream systems should handle the event (e.g., whether to
EXECUTE_REMEDIATION, LOG_ONLY, or IGNORE), document the default behavior when
the field is unset (treated as EXECUTE_REMEDIATION by downstream code), and note
how consumers should interpret it together with other HealthEvent fields such as
isHealthy and isFatal (e.g., certain strategies may be ignored or escalated if
isFatal is true or if isHealthy toggles), so implementers can make consistent
routing/remediation choices.
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 6376148 and ab8c6d4.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (2)
  • data-models/protobufs/health_event.proto
  • docs/designs/025-processing-strategy-for-health-checks.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/designs/025-processing-strategy-for-health-checks.md
🧰 Additional context used
πŸ““ Path-based instructions (1)
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
🧠 Learnings (3)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • data-models/protobufs/health_event.proto
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages

Applied to files:

  • data-models/protobufs/health_event.proto
πŸ”‡ Additional comments (1)
data-models/protobufs/health_event.proto (1)

32-40: Well-documented enum definition.

The ProcessingStrategy enum is well-structured with comprehensive documentation that clearly explains:

  • The overall purpose of the enum
  • The role of each value (UNSPECIFIED, EXECUTE_REMEDIATION, STORE_ONLY)
  • The distinction between observability-only and cluster-modifying behaviors

The UNSPECIFIED documentation addresses previous concerns by explicitly stating it's reserved as the zero value and not used in the code.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch from ab8c6d4 to 765ec61 Compare January 8, 2026 07:34
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

πŸ€– Fix all issues with AI agents
In @data-models/protobufs/health_event.proto:
- Line 79: Add a descriptive comment above the processingStrategy field in the
HealthEvent protobuf message: explain that ProcessingStrategy (the enum)
determines how the event is handled (e.g., synchronous vs asynchronous
processing, priority or routing implications), when to set it, and any default
behavior or constraints; update the comment immediately above the line
"ProcessingStrategy processingStrategy = 16;" so the field-level documentation
mirrors the enum documentation and clarifies intended usage.
🧹 Nitpick comments (1)
docs/designs/025-processing-strategy-for-health-checks.md (1)

349-362: Fix hard tab indentation in code blocks.

The markdown linter flagged hard tabs (U+0009) in several code blocks. Replace them with spaces for consistency with standard markdown formatting conventions.

Example fix for lines 349-362

Replace hard tabs with spaces (or adjust your editor to use spaces):

 func (p *PlatformConnectorServer) HealthEventOccurredV1(ctx context.Context,
-	he *pb.HealthEvents) (*empty.Empty, error) {
-	slog.Info("Health events received", "events", he)
+    he *pb.HealthEvents) (*empty.Empty, error) {
+    slog.Info("Health events received", "events", he)

Apply similar changes to all flagged code blocks.

Also applies to: 371-382, 435-448, 456-469, 483-496, 504-517, 593-607

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between ab8c6d4 and 765ec61.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (6)
  • data-models/protobufs/health_event.proto
  • docs/designs/025-processing-strategy-for-health-checks.md
  • platform-connectors/pkg/server/platform_connector_server.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/helpers/healthevent.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/helpers/healthevent.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • platform-connectors/pkg/server/platform_connector_server.go
🧠 Learnings (7)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • data-models/protobufs/health_event.proto
  • platform-connectors/pkg/server/platform_connector_server.go
  • docs/designs/025-processing-strategy-for-health-checks.md
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages

Applied to files:

  • data-models/protobufs/health_event.proto
πŸ“š Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.

Applied to files:

  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • platform-connectors/pkg/server/platform_connector_server.go
  • docs/designs/025-processing-strategy-for-health-checks.md
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
πŸ“š Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
🧬 Code graph analysis (1)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
πŸͺ› markdownlint-cli2 (0.18.1)
docs/designs/025-processing-strategy-for-health-checks.md

349-349: Hard tabs
Column: 1

(MD010, no-hard-tabs)


350-350: Hard tabs
Column: 1

(MD010, no-hard-tabs)


352-352: Hard tabs
Column: 1

(MD010, no-hard-tabs)


354-354: Hard tabs
Column: 1

(MD010, no-hard-tabs)


355-355: Hard tabs
Column: 1

(MD010, no-hard-tabs)


356-356: Hard tabs
Column: 1

(MD010, no-hard-tabs)


357-357: Hard tabs
Column: 1

(MD010, no-hard-tabs)


358-358: Hard tabs
Column: 1

(MD010, no-hard-tabs)


359-359: Hard tabs
Column: 1

(MD010, no-hard-tabs)


361-361: Hard tabs
Column: 1

(MD010, no-hard-tabs)


371-371: Hard tabs
Column: 1

(MD010, no-hard-tabs)


372-372: Hard tabs
Column: 1

(MD010, no-hard-tabs)


373-373: Hard tabs
Column: 1

(MD010, no-hard-tabs)


382-382: Hard tabs
Column: 1

(MD010, no-hard-tabs)


435-435: Hard tabs
Column: 1

(MD010, no-hard-tabs)


436-436: Hard tabs
Column: 1

(MD010, no-hard-tabs)


437-437: Hard tabs
Column: 1

(MD010, no-hard-tabs)


438-438: Hard tabs
Column: 1

(MD010, no-hard-tabs)


439-439: Hard tabs
Column: 1

(MD010, no-hard-tabs)


440-440: Hard tabs
Column: 1

(MD010, no-hard-tabs)


441-441: Hard tabs
Column: 1

(MD010, no-hard-tabs)


442-442: Hard tabs
Column: 1

(MD010, no-hard-tabs)


443-443: Hard tabs
Column: 1

(MD010, no-hard-tabs)


444-444: Hard tabs
Column: 1

(MD010, no-hard-tabs)


445-445: Hard tabs
Column: 1

(MD010, no-hard-tabs)


446-446: Hard tabs
Column: 1

(MD010, no-hard-tabs)


447-447: Hard tabs
Column: 1

(MD010, no-hard-tabs)


448-448: Hard tabs
Column: 1

(MD010, no-hard-tabs)


456-456: Hard tabs
Column: 1

(MD010, no-hard-tabs)


457-457: Hard tabs
Column: 1

(MD010, no-hard-tabs)


458-458: Hard tabs
Column: 1

(MD010, no-hard-tabs)


459-459: Hard tabs
Column: 1

(MD010, no-hard-tabs)


460-460: Hard tabs
Column: 1

(MD010, no-hard-tabs)


461-461: Hard tabs
Column: 1

(MD010, no-hard-tabs)


462-462: Hard tabs
Column: 1

(MD010, no-hard-tabs)


463-463: Hard tabs
Column: 1

(MD010, no-hard-tabs)


464-464: Hard tabs
Column: 1

(MD010, no-hard-tabs)


465-465: Hard tabs
Column: 1

(MD010, no-hard-tabs)


466-466: Hard tabs
Column: 1

(MD010, no-hard-tabs)


483-483: Hard tabs
Column: 1

(MD010, no-hard-tabs)


484-484: Hard tabs
Column: 1

(MD010, no-hard-tabs)


485-485: Hard tabs
Column: 1

(MD010, no-hard-tabs)


486-486: Hard tabs
Column: 1

(MD010, no-hard-tabs)


487-487: Hard tabs
Column: 1

(MD010, no-hard-tabs)


488-488: Hard tabs
Column: 1

(MD010, no-hard-tabs)


489-489: Hard tabs
Column: 1

(MD010, no-hard-tabs)


490-490: Hard tabs
Column: 1

(MD010, no-hard-tabs)


491-491: Hard tabs
Column: 1

(MD010, no-hard-tabs)


492-492: Hard tabs
Column: 1

(MD010, no-hard-tabs)


493-493: Hard tabs
Column: 1

(MD010, no-hard-tabs)


494-494: Hard tabs
Column: 1

(MD010, no-hard-tabs)


495-495: Hard tabs
Column: 1

(MD010, no-hard-tabs)


496-496: Hard tabs
Column: 1

(MD010, no-hard-tabs)


504-504: Hard tabs
Column: 1

(MD010, no-hard-tabs)


505-505: Hard tabs
Column: 1

(MD010, no-hard-tabs)


506-506: Hard tabs
Column: 1

(MD010, no-hard-tabs)


507-507: Hard tabs
Column: 1

(MD010, no-hard-tabs)


508-508: Hard tabs
Column: 1

(MD010, no-hard-tabs)


509-509: Hard tabs
Column: 1

(MD010, no-hard-tabs)


510-510: Hard tabs
Column: 1

(MD010, no-hard-tabs)


511-511: Hard tabs
Column: 1

(MD010, no-hard-tabs)


512-512: Hard tabs
Column: 1

(MD010, no-hard-tabs)


513-513: Hard tabs
Column: 1

(MD010, no-hard-tabs)


514-514: Hard tabs
Column: 1

(MD010, no-hard-tabs)


593-593: Hard tabs
Column: 1

(MD010, no-hard-tabs)


594-594: Hard tabs
Column: 1

(MD010, no-hard-tabs)


595-595: Hard tabs
Column: 1

(MD010, no-hard-tabs)


596-596: Hard tabs
Column: 1

(MD010, no-hard-tabs)


597-597: Hard tabs
Column: 1

(MD010, no-hard-tabs)


598-598: Hard tabs
Column: 1

(MD010, no-hard-tabs)


599-599: Hard tabs
Column: 1

(MD010, no-hard-tabs)


600-600: Hard tabs
Column: 1

(MD010, no-hard-tabs)


601-601: Hard tabs
Column: 1

(MD010, no-hard-tabs)


602-602: Hard tabs
Column: 1

(MD010, no-hard-tabs)


603-603: Hard tabs
Column: 1

(MD010, no-hard-tabs)


604-604: Hard tabs
Column: 1

(MD010, no-hard-tabs)

πŸ”‡ Additional comments (5)
platform-connectors/pkg/server/platform_connector_server.go (1)

57-62: LGTM! Normalization logic correctly defaults UNSPECIFIED to EXECUTE_REMEDIATION.

The pre-processing mutation ensures all downstream components receive events with an explicit processingStrategy, providing consistent behavior across modules. The placement before pipeline processing and ring buffer enqueuing is correct.

store-client/pkg/client/postgresql_pipeline_builder.go (2)

119-140: LGTM! Pipeline correctly filters EXECUTE_REMEDIATION events with backward compatibility.

The implementation properly uses the $or condition to match both explicitly-set EXECUTE_REMEDIATION and missing processingStrategy fields, ensuring old events (created before this feature) are still processed. Field name "processingstrategy" correctly uses lowercase for database access.


157-180: LGTM! Non-fatal unhealthy pipeline correctly handles processingStrategy filtering.

The backward compatibility approach with $or matching is appropriate, and the handling of both INSERT and UPDATE operations is correct for PostgreSQL's trigger-based change detection. The implementation aligns with the MongoDB counterpart.

store-client/pkg/client/mongodb_pipeline_builder.go (2)

87-107: LGTM! Pipeline correctly filters EXECUTE_REMEDIATION events with backward compatibility.

The $or condition appropriately matches both EXECUTE_REMEDIATION and missing processingStrategy fields, ensuring historical events are still processed. The use of lowercase "processingstrategy" is correct for database field access, and the INSERT-only operation type is appropriate for MongoDB change streams.

Based on learnings, protobuf fields in MongoDB use lowercase field names.


123-144: LGTM! Non-fatal unhealthy pipeline correctly handles processingStrategy filtering.

The implementation properly excludes STORE_ONLY events while maintaining backward compatibility for events without the processingStrategy field. The logic is consistent with the PostgreSQL implementation and appropriate for the health-events-analyzer use case.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-api-changes branch from 765ec61 to 9ef3351 Compare January 8, 2026 08:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/designs/025-processing-strategy-for-health-checks.md (1)

349-604: Consider replacing hard tabs with spaces.

The markdown linter detects hard tabs throughout the code blocks. While this doesn't affect functionality, replacing tabs with spaces would improve consistency with markdown best practices.

Optional: Convert tabs to spaces

You can convert hard tabs to spaces using:

#!/bin/bash
# Convert hard tabs to spaces in the markdown file
fd -e md -x sed -i 's/\t/    /g' {}
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 765ec61 and 9ef3351.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (6)
  • data-models/protobufs/health_event.proto
  • docs/designs/025-processing-strategy-for-health-checks.md
  • platform-connectors/pkg/server/platform_connector_server.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/helpers/healthevent.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • data-models/protobufs/health_event.proto
  • platform-connectors/pkg/server/platform_connector_server.go
🧰 Additional context used
πŸ““ Path-based instructions (1)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/healthevent.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
🧠 Learnings (6)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.

Applied to files:

  • store-client/pkg/client/postgresql_pipeline_builder.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
πŸ“š Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
πŸ“š Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • docs/designs/025-processing-strategy-for-health-checks.md
🧬 Code graph analysis (3)
tests/helpers/healthevent.go (2)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (48-48)
πŸͺ› markdownlint-cli2 (0.18.1)
docs/designs/025-processing-strategy-for-health-checks.md

349-349: Hard tabs
Column: 1

(MD010, no-hard-tabs)


350-350: Hard tabs
Column: 1

(MD010, no-hard-tabs)


352-352: Hard tabs
Column: 1

(MD010, no-hard-tabs)


354-354: Hard tabs
Column: 1

(MD010, no-hard-tabs)


355-355: Hard tabs
Column: 1

(MD010, no-hard-tabs)


356-356: Hard tabs
Column: 1

(MD010, no-hard-tabs)


357-357: Hard tabs
Column: 1

(MD010, no-hard-tabs)


358-358: Hard tabs
Column: 1

(MD010, no-hard-tabs)


359-359: Hard tabs
Column: 1

(MD010, no-hard-tabs)


361-361: Hard tabs
Column: 1

(MD010, no-hard-tabs)


371-371: Hard tabs
Column: 1

(MD010, no-hard-tabs)


372-372: Hard tabs
Column: 1

(MD010, no-hard-tabs)


373-373: Hard tabs
Column: 1

(MD010, no-hard-tabs)


382-382: Hard tabs
Column: 1

(MD010, no-hard-tabs)


435-435: Hard tabs
Column: 1

(MD010, no-hard-tabs)


436-436: Hard tabs
Column: 1

(MD010, no-hard-tabs)


437-437: Hard tabs
Column: 1

(MD010, no-hard-tabs)


438-438: Hard tabs
Column: 1

(MD010, no-hard-tabs)


439-439: Hard tabs
Column: 1

(MD010, no-hard-tabs)


440-440: Hard tabs
Column: 1

(MD010, no-hard-tabs)


441-441: Hard tabs
Column: 1

(MD010, no-hard-tabs)


442-442: Hard tabs
Column: 1

(MD010, no-hard-tabs)


443-443: Hard tabs
Column: 1

(MD010, no-hard-tabs)


444-444: Hard tabs
Column: 1

(MD010, no-hard-tabs)


445-445: Hard tabs
Column: 1

(MD010, no-hard-tabs)


446-446: Hard tabs
Column: 1

(MD010, no-hard-tabs)


447-447: Hard tabs
Column: 1

(MD010, no-hard-tabs)


448-448: Hard tabs
Column: 1

(MD010, no-hard-tabs)


456-456: Hard tabs
Column: 1

(MD010, no-hard-tabs)


457-457: Hard tabs
Column: 1

(MD010, no-hard-tabs)


458-458: Hard tabs
Column: 1

(MD010, no-hard-tabs)


459-459: Hard tabs
Column: 1

(MD010, no-hard-tabs)


460-460: Hard tabs
Column: 1

(MD010, no-hard-tabs)


461-461: Hard tabs
Column: 1

(MD010, no-hard-tabs)


462-462: Hard tabs
Column: 1

(MD010, no-hard-tabs)


463-463: Hard tabs
Column: 1

(MD010, no-hard-tabs)


464-464: Hard tabs
Column: 1

(MD010, no-hard-tabs)


465-465: Hard tabs
Column: 1

(MD010, no-hard-tabs)


466-466: Hard tabs
Column: 1

(MD010, no-hard-tabs)


483-483: Hard tabs
Column: 1

(MD010, no-hard-tabs)


484-484: Hard tabs
Column: 1

(MD010, no-hard-tabs)


485-485: Hard tabs
Column: 1

(MD010, no-hard-tabs)


486-486: Hard tabs
Column: 1

(MD010, no-hard-tabs)


487-487: Hard tabs
Column: 1

(MD010, no-hard-tabs)


488-488: Hard tabs
Column: 1

(MD010, no-hard-tabs)


489-489: Hard tabs
Column: 1

(MD010, no-hard-tabs)


490-490: Hard tabs
Column: 1

(MD010, no-hard-tabs)


491-491: Hard tabs
Column: 1

(MD010, no-hard-tabs)


492-492: Hard tabs
Column: 1

(MD010, no-hard-tabs)


493-493: Hard tabs
Column: 1

(MD010, no-hard-tabs)


494-494: Hard tabs
Column: 1

(MD010, no-hard-tabs)


495-495: Hard tabs
Column: 1

(MD010, no-hard-tabs)


496-496: Hard tabs
Column: 1

(MD010, no-hard-tabs)


504-504: Hard tabs
Column: 1

(MD010, no-hard-tabs)


505-505: Hard tabs
Column: 1

(MD010, no-hard-tabs)


506-506: Hard tabs
Column: 1

(MD010, no-hard-tabs)


507-507: Hard tabs
Column: 1

(MD010, no-hard-tabs)


508-508: Hard tabs
Column: 1

(MD010, no-hard-tabs)


509-509: Hard tabs
Column: 1

(MD010, no-hard-tabs)


510-510: Hard tabs
Column: 1

(MD010, no-hard-tabs)


511-511: Hard tabs
Column: 1

(MD010, no-hard-tabs)


512-512: Hard tabs
Column: 1

(MD010, no-hard-tabs)


513-513: Hard tabs
Column: 1

(MD010, no-hard-tabs)


514-514: Hard tabs
Column: 1

(MD010, no-hard-tabs)


593-593: Hard tabs
Column: 1

(MD010, no-hard-tabs)


594-594: Hard tabs
Column: 1

(MD010, no-hard-tabs)


595-595: Hard tabs
Column: 1

(MD010, no-hard-tabs)


596-596: Hard tabs
Column: 1

(MD010, no-hard-tabs)


597-597: Hard tabs
Column: 1

(MD010, no-hard-tabs)


598-598: Hard tabs
Column: 1

(MD010, no-hard-tabs)


599-599: Hard tabs
Column: 1

(MD010, no-hard-tabs)


600-600: Hard tabs
Column: 1

(MD010, no-hard-tabs)


601-601: Hard tabs
Column: 1

(MD010, no-hard-tabs)


602-602: Hard tabs
Column: 1

(MD010, no-hard-tabs)


603-603: Hard tabs
Column: 1

(MD010, no-hard-tabs)


604-604: Hard tabs
Column: 1

(MD010, no-hard-tabs)

πŸ”‡ Additional comments (7)
tests/helpers/healthevent.go (1)

48-48: LGTM! ProcessingStrategy field and setter follow established patterns.

The new field and fluent setter method integrate cleanly with the existing HealthEventTemplate structure. The omitempty JSON tag ensures backward compatibility with existing tests.

Also applies to: 153-156

store-client/pkg/client/postgresql_pipeline_builder.go (2)

119-146: LGTM! Backward-compatible pipeline correctly filters by processing strategy.

The pipeline properly uses lowercase processingstrategy field name (consistent with protobuf storage conventions) and implements backward compatibility by matching both EXECUTE_REMEDIATION and missing fields.


163-192: LGTM! Correctly handles PostgreSQL-specific INSERT/UPDATE semantics.

The pipeline appropriately watches both insert and update operations to handle PostgreSQL's changelog trigger behavior, consistent with the existing BuildNonFatalUnhealthyInsertsPipeline method. The backward compatibility logic for processingStrategy is correctly implemented.

store-client/pkg/client/mongodb_pipeline_builder.go (2)

87-113: LGTM! MongoDB-specific pipeline correctly filters processable events.

The pipeline properly handles MongoDB change stream semantics (INSERT operations only) and implements backward compatibility by matching both EXECUTE_REMEDIATION and missing processingStrategy fields. Field naming follows protobuf storage conventions.


129-156: LGTM! Consistent with MongoDB change stream behavior.

The pipeline correctly watches only insert operations (appropriate for MongoDB, unlike PostgreSQL which needs insert+update). The backward compatibility logic for legacy events without processingStrategy is properly implemented.

docs/designs/025-processing-strategy-for-health-checks.md (2)

341-363: Excellent documentation of normalization strategy.

The explicit documentation of how UNSPECIFIED processingStrategy defaults to EXECUTE_REMEDIATION at the gRPC server level is crucial for understanding the system's behavior. This ensures consistent handling across all modules and provides clear backward compatibility guarantees.


581-609: Clear backward compatibility explanation.

The documentation clearly explains why the $or pattern is needed for historical events and how the analyzer handles events without processingStrategy. This will help future maintainers understand the design rationale.

@github-actions
Copy link

github-actions bot commented Jan 8, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes 83.06% (+0.21%) πŸ‘
github.com/nvidia/nvsentinel/tests 0.00% (ΓΈ)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/process_node_events.go 93.02% (+0.10%) 215 (+3) 200 (+3) 15 πŸ‘

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/tests/gpu_health_monitor_test.go
  • github.com/nvidia/nvsentinel/tests/health_events_analyzer_test.go

Signed-off-by: Tanisha goyal <[email protected]>
@github-actions
Copy link

github-actions bot commented Jan 8, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/data-models/pkg/protos 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/platform-connectors/pkg/server 66.67% (+66.67%) 🌟

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/data-models/pkg/protos/health_event.pb.go 0.00% (ΓΈ) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/platform-connectors/pkg/server/platform_connector_server_test.go

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/helpers/fault_quarantine.go (1)

141-145: LGTM! Conditional ConfigMap application improves test flexibility.

The change correctly makes ConfigMap application optional when configMapPath is empty, allowing tests to use default configurations or modify settings only via opts. The backup is still properly taken for restore operations, and error handling remains appropriate.

πŸ“ Optional: Document the empty string behavior

Consider updating the function comment (lines 124-126) to document that configMapPath can be an empty string to skip ConfigMap application:

 // SetupQuarantineTestWithOptions sets up a quarantine test with additional configuration options.
 // This allows combining multiple deployment modifications into a single rollout.
+// Pass an empty configMapPath to skip ConfigMap application and use the existing configuration.
 // Returns (context, testContext, originalDeployment) - originalDeployment is nil if no deployment changes were made.
tests/fault_quarantine_test.go (1)

327-336: Consider sending healthy events for all checks used in the test.

The Teardown only sends a healthy event for SysLogsXIDError but not for GpuPowerWatch. While TeardownQuarantineTest may handle full cleanup, explicitly sending healthy events for both checks would ensure complete state cleanup and make the teardown more robust.

♻️ Proposed enhancement for complete cleanup
 feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+	// Clear SysLogsXIDError
 	event := helpers.NewHealthEvent(testCtx.NodeName).
 		WithErrorCode("79").
 		WithHealthy(true).
 		WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
 		WithCheckName("SysLogsXIDError")
 	helpers.SendHealthEvent(ctx, t, event)
+
+	// Clear GpuPowerWatch
+	event = helpers.NewHealthEvent(testCtx.NodeName).
+		WithErrorCode("DCGM_FR_CLOCK_THROTTLE_POWER").
+		WithHealthy(true).
+		WithCheckName("GpuPowerWatch")
+	helpers.SendHealthEvent(ctx, t, event)
 
 	return helpers.TeardownQuarantineTest(ctx, t, c)
 })
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 52788c4 and bf29a24.

πŸ“’ Files selected for processing (3)
  • store-client/pkg/client/pipeline_builder.go
  • tests/fault_quarantine_test.go
  • tests/helpers/fault_quarantine.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • store-client/pkg/client/pipeline_builder.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/fault_quarantine.go
  • tests/fault_quarantine_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/fault_quarantine_test.go
🧠 Learnings (8)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/helpers/fault_quarantine.go
  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-27T15:42:48.142Z
Learnt from: lalitadithya
Repo: NVIDIA/NVSentinel PR: 459
File: docs/cancelling-breakfix.md:58-90
Timestamp: 2025-11-27T15:42:48.142Z
Learning: The label `k8saas.nvidia.com/ManagedByNVSentinel` exists in NVSentinel and is used to opt nodes out of automated break-fix workflows. The prefix `k8saas.nvidia.com/` is configurable via the `labelPrefix` setting in the fault-quarantine Helm values, and the documentation correctly uses the default value.

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/fault_quarantine_test.go
🧬 Code graph analysis (1)
tests/helpers/fault_quarantine.go (1)
tests/helpers/kube.go (1)
  • NVSentinelNamespace (64-64)
πŸ”‡ Additional comments (3)
tests/fault_quarantine_test.go (3)

26-26: LGTM!

The protos import is necessary for accessing the ProcessingStrategy enum values used in the test.


246-282: LGTM!

The test correctly verifies that STORE_ONLY events do not trigger remediation actions (no node conditions, events, cordoning, or annotations). Testing both fatal and non-fatal event types provides good coverage.


284-325: LGTM!

The test correctly verifies that EXECUTE_REMEDIATION events trigger remediation actions (node conditions, events, cordoning, and annotations). The parallel structure with the first Assess block provides clear contrast between the two processing strategies.

@github-actions
Copy link

github-actions bot commented Jan 8, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/store-client/pkg/client 6.09% (+0.01%) πŸ‘
github.com/nvidia/nvsentinel/tests 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ΓΈ)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/store-client/pkg/client/pipeline_builder.go 17.02% (ΓΈ) 47 8 39
github.com/nvidia/nvsentinel/tests/helpers/fault_quarantine.go 0.00% (ΓΈ) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/tests/fault_quarantine_test.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants