Skip to content

Conversation

@natherz97
Copy link
Contributor

@natherz97 natherz97 commented Jan 9, 2026

Summary

Related design doc for GPU reset: https://github.com/NVIDIA/NVSentinel/blob/main/docs/designs/020-nvsentinel-gpu-reset.md.

This PR adds functionality to the syslog-health-monitor to send unhealthy HealthEvents that will result in a GPU reset as asll we to send healthy HealthEvents after a GPU reset has successfully occurred. In addition to the PCI, we will always add the GPU UUID as an impacted entity if it is available from either dmesg or from the metadata-collector. The COMPONENT_RESET remediation action requires the PCI and GPU UUID are available in the initial unhealthy event we're sending. Additionally, the corresponding healthy event triggered after the COMPONENT_RESET requires the same PCI and GPU UUID impacted entities are included as the initial event. As a result, we will only permit the COMPONENT_RESET action if the GPU UUID was sourced from the metadata-collector to ensure that the same impacted entities can be fetched after a reset occurs. If the GPU UUID does not exist or is sourced from dmesg, we will still include it as an impacted entity but override the remediation action from COMPONENT_RESET to RESTART_VM.

Unhealthy event generation:

  1. XID 48 error occurs in syslog which includes the PCI 0000:03:00:
Xid (PCI:0000:03:00): 48, pid=91237, name=nv-hostengine, Ch 00000076, errorString CTX SWITCH TIMEOUT, Info 0x3c046
  1. Using the metadata-collector, look up the corresponding GPU UUID for PCI 0000:03:00 which is GPU-455d8f70-2051-db6c-0430-ffc457bff834
  2. Include this PCI and GPU UUID in the list of impacted entities in our unhealthy HealthEvent with the COMPONENT_RESET remediation action.

Healthy event generation:

  1. GPU reset occurs in syslog which includes the GPU UUID:
GPU reset executed: GPU-455d8f70-2051-db6c-0430-ffc457bff834
  1. Using the metadata-collector, look up the corresponding PCI for the given GPU UUID.
  2. Include this PCI and GPU UUID in the list of impacted entities in our healthy HealthEvent.

Implementation notes:

  • As part of this change, we are persisting the normalized PCI in all HealthEvents. Currently, the normalized PCI is used as the key to lookup the corresponding GPU info from either dmesg or from the metadata-collector. However, the raw value from NVRM is persisted on HealthEvents. As a result, we will store the normalized PCI on impacted HealthEvent impacted entities to ensure that GPU reset HealthEvents have an exact match for the PCI (along with the GPU UUID) on both the unhealthy and healthy event. For sending the healthy HealthEvent, we need to look up the PCI from the GPU UUID. Note that the metadata-collector doesn't normalize the PCI by removing any trailing ".0" from the PCI in metadata.json. As a result, we also need to normalize this value prior to generating the healthy HealthEvent.
  • The xid-handler will take care of overriding the remediation action from COMPONENT_RESET to RESTART_VM if the GPU UUID is not available in the HealthEvent. This prevents either a healthEventOverrides from being required or from each future module needing to derive whether to proceed with a COMPONENT_RESET or RESTART_VM based on if the GPU UUID is present in impacted entities (specifically node-drainer needs this determine if we do a partial drain and fault-remediation needs this for the maintenance resource selection).
  • Note that it would be possible to not include the PCI as an impacted entity in COMPONENT_RESET health events which would allow us to always do a GPU reset if the GPU UUID could be fetched from any source (metadata-collector or dmegs). Recall that the GPU UUID itself is provided in the syslog GPU reset log line (whereas the PCI needs to be dynamically looked up from the metadata-collector because Janitor does not accept the PCI as input nor does it look up the PCI before writing the syslog event). However, we do not want to conditionally add entity impact depending on the needs of healthy event generation nor do we want to add custom logic to allow the fault-quarantine-module to clear conditions on a subset of impacted entities recovering.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed -> manual testing will be completed as part of end-to-end testing after node-drainer + fault-remediation changes are merged
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Added GPU reset event monitoring; GPU reset completions now generate health events for better visibility.
  • Documentation

    • Updated docs to include GPU reset events, remediation expectations, and health-event lifecycle details.
  • Tests

    • Expanded and added tests covering GPU reset scenarios, metadata-driven mappings, and end-to-end monitor behavior; improved test teardown and verification for reliability.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 9, 2026

📝 Walkthrough

Walkthrough

Documentation updated to include GPU reset events. Metadata Reader now supports GPU UUID lookup. XID handler detects "GPU reset executed" lines, maps GPU UUID → PCI via metadata, and emits healthy HealthEvents. Tests expanded (metadata-backed, GPU reset scenarios) and test helper DeletePod gained a *testing.T parameter.

Changes

Cohort / File(s) Summary
Documentation
docs/configuration/syslog-health-monitor.md, docs/syslog-health-monitor.md
Added GPU reset mentions in Overview, How It Works, Configuration, and monitoring/remediation sections; notes about gRPC health event delivery preserved.
Metadata Infrastructure
health-monitors/syslog-health-monitor/pkg/metadata/reader.go, .../reader_test.go
Added uuidToInfo map and GetInfoByUUID(uuid string) (*model.GPUInfo, error) to enable GPU UUID → GPUInfo lookup; tests added to verify exact-match retrieval and not-found behavior.
GPU Reset Pattern
health-monitors/syslog-health-monitor/pkg/xid/types.go
Added gpuResetMap regex to detect "GPU reset executed" lines.
XID Handler & Health Events
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
Added parseGPUResetLine, new GPU-reset handling path in ProcessLine, changed getGPUUUID to return (uuid, fromMetadata), added createHealthEventGPUResetEvent, introduced getDefaultImpactedEntities, and logic to override remediation to RESTART_VM when UUID not from metadata. Added healthyHealthEventMessage constant.
XID Tests
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
Added filesystem-backed metadata test scaffolding, TestParseGPUResetLine, expanded TestProcessLine scenarios for metadata-driven events, adjusted PCI normalization expectations, and introduced mockParser.
Test Helpers
tests/helpers/kube.go, tests/gpu_health_monitor_test.go, tests/node_drainer_test.go
DeletePod signature changed to accept t *testing.T; added post-deletion verification (require.Eventually); all call sites updated to pass t.
Integration Tests
tests/syslog_health_monitor_test.go
Teardown enhanced with metadata cleanup and waits for SysLogsXIDError to clear; new TestSyslogHealthMonitorGPUReset test added; extra logging for teardown visibility.

Sequence Diagram(s)

sequenceDiagram
    participant Parser as Syslog Parser
    participant XID as XID Handler
    participant Meta as Metadata Reader
    participant Event as Health Event Creator

    Parser->>XID: ProcessLine(message)
    alt GPU reset line ("GPU reset executed")
        XID->>XID: parseGPUResetLine(message) -> uuid
        XID->>Meta: GetInfoByUUID(uuid)
        Meta-->>XID: GPUInfo (includes PCI)
        XID->>Event: createHealthEventGPUResetEvent(uuid)
        Event->>Event: Compose impacted entities (PCI + UUID), set action NONE (healthy)
        Event-->>Parser: HealthEvent (healthy)
    else XID/SXID line
        XID->>XID: Standard XID parsing -> getGPUUUID(normPCI)
        alt uuid from metadata == false and recommended action COMPONENT_RESET
            XID->>XID: override action -> RESTART_VM
        end
        XID->>Event: createHealthEventFromResponse(...)
        Event-->>Parser: HealthEvent (unhealthy or healthy)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Poem

🐰 hop-hop, I found a reset in the log,
UUIDs matched and metadata cleared the fog,
A healthy ping sent when the kernel did mend,
Tests hopped along, helpers gained a friend,
a jubilant rabbit, twitching whiskers

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main feature addition: GPU reset support to the syslog-health-monitor, which is the primary objective documented in the PR summary.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @health-monitors/syslog-health-monitor/pkg/metadata/reader.go:
- Around line 108-119: Add a godoc comment above the exported
Reader.GetInfoByUUID function that briefly describes what the method does, its
input (uuid string), and its return values (*model.GPUInfo, error), including
behavior when the metadata must be loaded (calls ensureLoaded) and when a UUID
is not found; reference the Reader type and the model.GPUInfo return type in the
comment and ensure the comment starts with "GetInfoByUUID" to satisfy Go doc
conventions.
- Around line 108-119: GetInfoByUUID does exact string matching on r.uuidToInfo
without normalizing case; ensure lookups are case-insensitive by normalizing
UUIDs to lowercase: when populating r.uuidToInfo in ensureLoaded (or wherever
UUIDs are parsed) store keys as strings.ToLower(uuid) and in GetInfoByUUID call
uuid = strings.ToLower(uuid) before the map lookup (similar to normalizePCI for
PCI addresses) so stored and lookup keys are consistently lowercase.

In @health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go:
- Around line 254-286: In createHealthEventGPUResetEvent the PCI-address-empty
branch incorrectly wraps the earlier variable err (which is nil) causing
messages like "<nil>"; change that branch to return a new error that does not
reference err (e.g., fmt.Errorf("failed to look up PCI info using UUID %s: no
PCI address found", uuid)) so the error message is meaningful and not wrapping a
nil error; update the return in the if len(gpuInfo.PCIAddress) == 0 block to
construct and return this new error instead of using err.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4038245 and fecc08e.

📒 Files selected for processing (7)
  • docs/configuration/syslog-health-monitor.md
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • docs/configuration/syslog-health-monitor.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • docs/configuration/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.

Applied to files:

  • docs/configuration/syslog-health-monitor.md
🧬 Code graph analysis (2)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
  • NewReader (46-50)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
data-models/pkg/model/gpu_metadata.go (1)
  • GPUInfo (26-33)
🔇 Additional comments (17)
docs/configuration/syslog-health-monitor.md (1)

5-5: LGTM! Documentation accurately reflects GPU reset monitoring.

The documentation updates appropriately describe the new GPU reset event monitoring capability alongside existing XID/SXID error detection.

Also applies to: 58-58

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

24-27: LGTM! GPU reset regex pattern is correct.

The gpuResetMap regex correctly captures GPU UUIDs from reset log lines in the expected format.

docs/syslog-health-monitor.md (1)

5-5: LGTM! Comprehensive documentation updates for GPU reset support.

The documentation thoroughly describes the new GPU reset monitoring capability, including its role in tracking remediation completion. The updates are well-integrated throughout the document and provide clear context for users.

Also applies to: 11-11, 16-16, 18-18, 25-25, 29-29, 42-42, 79-82

health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)

176-221: LGTM! Comprehensive test coverage for UUID lookup.

The table-driven test appropriately validates the new GetInfoByUUID method with both successful lookups (GPU 0 and GPU 1) and error handling (unknown UUID). The test structure follows Go best practices and is consistent with existing tests in the file.

health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)

37-37: LGTM! UUID lookup implementation is well-structured.

The UUID-based lookup functionality is correctly implemented:

  • The uuidToInfo map is properly initialized and populated in buildMaps
  • Thread safety is maintained through the existing sync.Once pattern
  • The implementation mirrors the established GetGPUByPCI pattern for consistency

Also applies to: 84-84, 91-91

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (5)

30-86: Well-structured test fixtures for metadata scenarios.

The JSON constants provide clear test data for both complete metadata and missing PCI scenarios, enabling thorough coverage of metadata-driven paths.


120-151: Good coverage for GPU reset line parsing.

The table-driven test follows Go conventions and covers valid reset lines, invalid/empty cases, and ensures XID log lines don't match the GPU reset pattern.


222-227: Clean test setup using temporary directories.

Using t.TempDir() ensures proper cleanup and isolation between test runs. The metadata file setup enables filesystem-backed tests without polluting shared state.


295-363: Comprehensive GPU reset test coverage.

The test cases appropriately cover:

  • Valid GPU reset with metadata lookup
  • Metadata collector not initialized (error case)
  • Missing GPU UUID in metadata (error case)
  • Missing PCI in metadata (error case)

The validateEvent function correctly verifies GeneratedTimestamp is set before nullifying it for equality comparison.


364-436: RESET_GPU → RESTART_VM override logic well tested.

Good distinction between:

  • Line 365-400: UUID from NVRM (dmesg) triggers override to RESTART_VM
  • Line 401-436: UUID from metadata-collector preserves COMPONENT_RESET

This aligns with the PR objectives for remediation action behavior.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (7)

33-35: LGTM!

The constant provides a single source of truth for the healthy event message, improving maintainability.


79-82: LGTM!

Clean integration of GPU reset handling into the processing flow. The early return pattern matches the existing NVRM GPU map line handling above.


101-108: Consider checking for empty string match result.

While len(m) >= 2 guards against out-of-bounds access, if the regex capture group matches an empty string, this would return an empty UUID. The test cases suggest this is expected behavior (returning "" for invalid lines).


133-148: Clean refactor to return metadata source information.

Returning (uuid, fromMetadata) tuple enables the caller to make informed decisions about remediation actions. The fallback chain (metadata → dmesg cache → empty) is clear.


150-190: Excellent design documentation.

This detailed comment block explains the rationale for impacted entities design, remediation action behavior, and the unhealthy/healthy event flow. This aids future maintainers in understanding the constraints and trade-offs.


224-230: Override logic correctly implements the design.

The COMPONENT_RESET → RESTART_VM fallback when GPU UUID isn't from metadata aligns with the documented behavior. The structured log provides good observability for troubleshooting.


332-348: Clean helper function for consistent entity construction.

getDefaultImpactedEntities ensures PCI is always first and GPU_UUID is conditionally appended, matching the documented behavior and test expectations.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go:
- Around line 150-190: Fix the minor typo in the large documentation comment in
xid_handler.go: replace "dmegs" with "dmesg" inside the block comment describing
healthy/unhealthy event generation and remediation behavior (the comment
associated with the xid-handler logic in xid_handler.go).
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fecc08e and 82d6105.

📒 Files selected for processing (7)
  • docs/configuration/syslog-health-monitor.md
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
✅ Files skipped from review due to trivial changes (1)
  • docs/configuration/syslog-health-monitor.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
🧠 Learnings (6)
📓 Common learnings
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.

Applied to files:

  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
🧬 Code graph analysis (3)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
data-models/pkg/model/gpu_metadata.go (1)
  • GPUInfo (26-33)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-38)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
  • NewXIDHandler (37-59)
health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (1)
  • XIDDetails (36-48)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
  • NewReader (46-50)
🔇 Additional comments (17)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)

176-221: LGTM! Well-structured table-driven test for UUID lookup.

The test follows the established pattern from TestGetGPUByPCI and covers the key scenarios: exact matches for both GPUs and the not-found case. As per coding guidelines, table-driven tests are used appropriately here.

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

24-27: LGTM! Regex pattern correctly captures GPU UUID from reset lines.

The gpuResetMap regex appropriately captures the GPU UUID format (GPU-<hex-uuid>) from syslog reset messages, complementing the existing reNvrmMap for PCI-to-UUID mapping.

docs/syslog-health-monitor.md (2)

5-18: LGTM! Documentation clearly explains the new GPU reset monitoring capability.

The overview and rationale sections are updated to accurately describe the dual purpose: detecting failures and monitoring recovery events like GPU resets.


79-82: LGTM! New GPU Reset section provides clear context.

The subsection succinctly explains what a GPU reset event signifies in the context of remediation workflows.

health-monitors/syslog-health-monitor/pkg/metadata/reader.go (2)

37-38: LGTM! UUID-to-GPUInfo map enables efficient reverse lookup.

The new uuidToInfo field mirrors the existing pciToGPU pattern, providing O(1) lookup by UUID.


108-119: LGTM! GetInfoByUUID follows established patterns.

The method correctly uses lazy loading via ensureLoaded() and provides proper error wrapping with context. Based on learnings, this aligns with the metadata Reader's design where the file may not exist at initialization time.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)

30-86: LGTM! Test fixtures provide good coverage for metadata scenarios.

The two JSON constants (testMetadataJSON and testMetadataMissingPCIJSON) enable testing both happy path and edge cases where PCI data is absent from metadata.


120-151: LGTM! GPU reset line parsing test covers key scenarios.

Good coverage of valid reset lines, malformed lines (missing UUID), and ensuring non-reset lines (like NVRM map lines) return empty strings.


296-363: LGTM! Comprehensive GPU reset event generation tests.

The test cases appropriately cover:

  • Successful reset with full metadata
  • Missing metadata file (expect error)
  • UUID not found in metadata (expect error)
  • Missing PCI in metadata (expect error)

This validates the behavior described in the PR objectives where healthy events require both PCI and UUID.


364-436: LGTM! Tests validate the remediation action override logic.

Two critical scenarios are covered:

  1. UUID from NVRM/dmesg → RESET_GPU overridden to RESTART_VM
  2. UUID from metadata-collector → RESET_GPU kept as COMPONENT_RESET

This aligns with the PR design where COMPONENT_RESET is only permitted when UUID is sourced from metadata to ensure matching between unhealthy and healthy events.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (7)

33-35: LGTM! Clear constant for healthy event messaging.

The constant provides a consistent message for GPU reset healthy events.


79-82: LGTM! GPU reset detection integrated correctly in processing pipeline.

The reset line parsing is checked before standard XID parsing, which is the correct order since reset lines have a distinct format and should generate healthy (not unhealthy) events.


101-108: LGTM! Simple and correct GPU reset line parsing.

The implementation correctly uses the gpuResetMap regex and returns the captured UUID or empty string.


133-148: LGTM! getGPUUUID now indicates UUID provenance.

Returning fromMetadata boolean enables the caller to make informed decisions about remediation actions. The fallback to pciToGPUUUID (dmesg-sourced) when metadata lookup fails is correct.


224-230: LGTM! Remediation action override logic is correct.

When UUID is not from metadata, COMPONENT_RESET is downgraded to RESTART_VM. This ensures consistent entity matching between unhealthy and healthy events, as documented in the PR objectives.


254-286: LGTM! Healthy event generation with proper validation.

The implementation correctly:

  1. Requires metadata lookup to succeed (returns error otherwise)
  2. Validates PCI address is present
  3. Uses getDefaultImpactedEntities for consistent entity formatting
  4. Sets IsHealthy: true and RecommendedAction: NONE

333-348: LGTM! Helper function for consistent entity construction.

getDefaultImpactedEntities ensures PCI is always included and GPU_UUID is conditionally added only when available. This maintains consistency between unhealthy and healthy events.

@github-actions
Copy link

github-actions bot commented Jan 9, 2026

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)

82-106: Guard uuidToInfo against empty/duplicate UUIDs (avoid silent overwrite).
Today r.uuidToInfo[gpu.UUID] = gpu will (a) create a "" key if UUID missing and (b) silently overwrite on duplicates. Consider skipping empty UUIDs and warning on duplicates.

Proposed fix
 func (r *Reader) buildMaps() {
 	r.pciToGPU = make(map[string]*model.GPUInfo)
 	r.uuidToInfo = make(map[string]*model.GPUInfo)
 	r.nvswitchLinks = make(map[string]map[int]*gpuLinkInfo)

 	for i := range r.metadata.GPUs {
 		gpu := &r.metadata.GPUs[i]
 		normPCI := normalizePCI(gpu.PCIAddress)
 		r.pciToGPU[normPCI] = gpu
-		r.uuidToInfo[gpu.UUID] = gpu
+		if uuid := strings.TrimSpace(gpu.UUID); uuid != "" {
+			if _, exists := r.uuidToInfo[uuid]; exists {
+				slog.Warn("Duplicate GPU UUID in metadata; overwriting", "uuid", uuid)
+			}
+			r.uuidToInfo[uuid] = gpu
+		}

 		for _, link := range gpu.NVLinks {
 			remotePCI := normalizePCI(link.RemotePCIAddress)
tests/helpers/kube.go (1)

453-475: Add t.Helper() to DeletePod (improves error attribution).

Proposed fix
 func DeletePod(ctx context.Context, t *testing.T, c klient.Client, namespace, podName string) error {
+	t.Helper()
 	pod := &v1.Pod{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:      podName,
 			Namespace: namespace,
 		},
 	}
tests/syslog_health_monitor_test.go (1)

204-225: Teardown “condition cleared” checks should treat “condition removed” as success (avoid flakes).
Right now you require condition != nil && Status=False; if the controller removes the condition entirely, teardown will fail.

Proposed fix (apply to each similar Eventually block)
 require.Eventually(t, func() bool {
 	condition, err := helpers.CheckNodeConditionExists(ctx, client, nodeName,
 		"SysLogsXIDError", "SysLogsXIDErrorIsHealthy")
 	if err != nil {
 		return false
 	}
-	return condition != nil && condition.Status == v1.ConditionFalse
+	// Consider both "removed" and "explicitly healthy" as cleared.
+	if condition == nil {
+		return true
+	}
+	return condition.Status == v1.ConditionFalse
 }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "SysLogsXIDError condition should be cleared")

Also applies to: 350-371, 601-621

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

119-125: Fix PCI normalization to guarantee exact matching between unhealthy and healthy events.
Current xid-handler normalization only strips .0 and doesn’t lowercase/domain-normalize, while metadata’s normalization is stricter; this can break the “same impacted entities” contract.

Proposed fix (make xid-handler normalization match metadata normalization semantics)
 func (xidHandler *XIDHandler) normalizePCI(pci string) string {
-	if idx := strings.Index(pci, "."); idx != -1 {
-		return pci[:idx]
-	}
-
-	return pci
+	pci = strings.TrimSpace(pci)
+	parts := strings.Split(pci, ":")
+	if len(parts) != 3 {
+		// Best-effort fallback: strip function if present and lowercase
+		if idx := strings.Index(pci, "."); idx != -1 {
+			pci = pci[:idx]
+		}
+		return strings.ToLower(pci)
+	}
+
+	domain := parts[0]
+	if len(domain) > 4 {
+		domain = domain[len(domain)-4:]
+	}
+
+	busDeviceFunc := parts[2]
+	if idx := strings.Index(busDeviceFunc, "."); idx != -1 {
+		busDeviceFunc = busDeviceFunc[:idx]
+	}
+
+	return fmt.Sprintf("%s:%s:%s",
+		strings.ToLower(domain),
+		strings.ToLower(parts[1]),
+		strings.ToLower(busDeviceFunc))
 }
 func (xidHandler *XIDHandler) createHealthEventFromResponse(
 	xidResp *parser.Response,
 	message string,
 ) *pb.HealthEvents {
-	normPCI := xidHandler.normalizePCI(xidResp.Result.PCIE)
+	normPCI := xidHandler.normalizePCI(xidResp.Result.PCIE)
 	uuid, fromMetadata := xidHandler.getGPUUUID(normPCI)

 	entities := getDefaultImpactedEntities(normPCI, uuid)
 func (xidHandler *XIDHandler) createHealthEventGPUResetEvent(uuid string) (*pb.HealthEvents, error) {
 	gpuInfo, err := xidHandler.metadataReader.GetInfoByUUID(uuid)
 	// ...
 	if len(gpuInfo.PCIAddress) == 0 {
 		return nil, fmt.Errorf("failed to look up PCI info using UUID %s", uuid)
 	}

-	entities := getDefaultImpactedEntities(gpuInfo.PCIAddress, uuid)
+	normPCI := xidHandler.normalizePCI(gpuInfo.PCIAddress)
+	entities := getDefaultImpactedEntities(normPCI, uuid)

Also applies to: 195-199, 254-280, 333-348

🤖 Fix all issues with AI agents
In @health-monitors/syslog-health-monitor/pkg/metadata/reader.go:
- Around line 108-119: In Reader.GetInfoByUUID normalize the incoming uuid by
calling strings.TrimSpace on the uuid parameter, validate that the trimmed uuid
is not empty (return a clear error like "empty UUID" or similar), then use the
trimmed value for the lookup against r.uuidToInfo and in the error messages;
reference the Reader.GetInfoByUUID function and the r.uuidToInfo map when making
these edits so you normalize input before the existence check to avoid false
"not found" results from whitespace/newline.

In @health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go:
- Around line 79-82: The parsed GPU UUID from xidHandler.parseGPUResetLine may
include trailing whitespace/newlines; update the call sites (the block using
uuid := xidHandler.parseGPUResetLine(message) and the analogous block around
lines handling createHealthEventGPUResetEvent) to trim the returned string
(e.g., assign trimmed := strings.TrimSpace(uuid) and use trimmed for the
nil/empty check and when invoking xidHandler.createHealthEventGPUResetEvent) so
trailing whitespace/newlines are removed before further processing.
- Around line 133-148: In getGPUUUID, avoid returning fromMetadata=true when the
metadata lookup yields an empty UUID and reduce noisy error logging: after
calling xidHandler.metadataReader.GetGPUByPCI(normPCI) verify both err==nil and
gpuInfo.UUID != "" before returning (return gpuInfo.UUID, true only when
non-empty); if err!=nil log at a lower level (e.g., Debug or Info) instead of
Error to avoid spamming when metadata is absent; keep the existing fallback to
xidHandler.pciToGPUUUID[normPCI] when metadata gave no UUID. Apply the same
non-empty-UUID check and lowered log level to the analogous code block at the
other location referenced (lines ~223-230) that also calls GetGPUByPCI and
returns fromMetadata.

In @tests/syslog_health_monitor_test.go:
- Around line 312-317: The injected GPU reset syslog message in xidMessages
includes a trailing newline which can break deterministic parsing; update the
test so the string in xidMessages (used by helpers.InjectSyslogMessages) does
not end with "\n" (e.g., "kernel: [16450076.435595] GPU reset executed:
GPU-22222222-2222-2222-2222-222222222222") so the parser receives a stable
message format.
🧹 Nitpick comments (2)
tests/syslog_health_monitor_test.go (1)

41-633: Consider extracting shared syslog test setup/teardown into helpers (reduces duplication).
Port-forward + metadata inject + ManagedByNVSentinel label wiring is repeated across several tests; a small helper returning {nodeName, podName, stopChan} would simplify future additions.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)

222-507: Good scenario coverage for GPU reset and remediation override (metadata vs dmesg).
One nit: consider asserting the PCI entity value in the override tests too (currently only type is asserted).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82d6105 and 98542bf.

📒 Files selected for processing (12)
  • docs/configuration/syslog-health-monitor.md
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
  • tests/helpers/metadata.go
  • tests/node_drainer_test.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/gpu_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/helpers/metadata.go
  • tests/helpers/kube.go
  • tests/syslog_health_monitor_test.go
  • tests/node_drainer_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/gpu_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/syslog_health_monitor_test.go
  • tests/node_drainer_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/gpu_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/syslog_health_monitor_test.go
  • tests/node_drainer_test.go
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • docs/syslog-health-monitor.md
  • docs/configuration/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • docs/syslog-health-monitor.md
  • docs/configuration/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • docs/configuration/syslog-health-monitor.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧬 Code graph analysis (6)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
  • DeletePod (453-475)
  • NVSentinelNamespace (64-64)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-38)
health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (1)
  • Response (29-33)
data-models/pkg/protos/health_event.pb.go (13)
  • HealthEvents (107-113)
  • HealthEvents (126-126)
  • HealthEvents (141-143)
  • HealthEvent (211-230)
  • HealthEvent (243-243)
  • HealthEvent (258-260)
  • RecommendedAction (40-40)
  • RecommendedAction (90-92)
  • RecommendedAction (94-96)
  • RecommendedAction (103-105)
  • Entity (159-165)
  • Entity (178-178)
  • Entity (193-195)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
data-models/pkg/model/gpu_metadata.go (1)
  • GPUInfo (26-33)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-38)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
  • NewXIDHandler (37-59)
data-models/pkg/protos/health_event.pb.go (13)
  • HealthEvents (107-113)
  • HealthEvents (126-126)
  • HealthEvents (141-143)
  • HealthEvent (211-230)
  • HealthEvent (243-243)
  • HealthEvent (258-260)
  • Entity (159-165)
  • Entity (178-178)
  • Entity (193-195)
  • RecommendedAction (40-40)
  • RecommendedAction (90-92)
  • RecommendedAction (94-96)
  • RecommendedAction (103-105)
health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (2)
  • Response (29-33)
  • XIDDetails (36-48)
tests/syslog_health_monitor_test.go (2)
tests/helpers/metadata.go (3)
  • DeleteMetadata (183-229)
  • CreateTestMetadata (59-108)
  • InjectMetadata (110-181)
tests/helpers/kube.go (8)
  • NVSentinelNamespace (64-64)
  • DeletePod (453-475)
  • GetPodOnWorkerNode (1412-1442)
  • PortForwardPod (1598-1642)
  • SetNodeManagedByNVSentinel (1373-1392)
  • EventuallyWaitTimeout (61-61)
  • WaitInterval (63-63)
  • CheckNodeConditionExists (1267-1284)
tests/node_drainer_test.go (1)
tests/helpers/kube.go (1)
  • DeletePod (453-475)
🔇 Additional comments (9)
docs/configuration/syslog-health-monitor.md (1)

5-5: LGTM! Documentation accurately reflects GPU reset support.

The documentation updates correctly describe the new GPU reset monitoring capability:

  • Overview includes GPU reset events
  • SysLogsXIDError check description clarified to include GPU reset messages
  • New subsection explains GPU reset as a remediation completion signal

The changes align well with the PR objectives and maintain consistency with the existing documentation structure.

Also applies to: 58-58, 79-82

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

25-26: LGTM! GPU reset regex pattern is correct.

The new gpuResetMap regex correctly captures GPU reset events with the expected format GPU reset executed: GPU-<UUID>. The alignment adjustment on line 25 improves code readability.

docs/syslog-health-monitor.md (1)

5-5: LGTM! Comprehensive documentation for GPU reset feature.

The documentation updates effectively communicate the GPU reset monitoring capability:

  • Clarifies that GPU resets are events indicating remediation completion, not just failures
  • Integrates GPU reset into the monitoring workflow description
  • Adds dedicated subsection explaining GPU reset semantics
  • Updates configuration examples to reflect GPU reset support

The changes are well-structured and maintain consistency across the documentation.

Also applies to: 11-18, 25-29, 42-42, 79-82

tests/helpers/metadata.go (1)

71-71: LGTM! PCI address normalization aligns with production code.

The test metadata has been updated to use normalized PCI addresses (removing the trailing ".0" function suffix). This change ensures test data consistency with the production code's PCI normalization logic, as described in the PR objectives where normalized PCI is used as the lookup key for GPU metadata.

Also applies to: 82-82, 92-92, 100-100

tests/node_drainer_test.go (1)

97-97: LGTM! Function call correctly updated for new DeletePod signature.

The DeletePod invocation has been correctly updated to pass t as the second parameter, matching the helper's new signature that enables post-deletion verification within the helper function.

tests/gpu_health_monitor_test.go (1)

62-68: Update to helpers.DeletePod(ctx, t, ...) is consistent with new helper signature.

Also applies to: 302-305

tests/helpers/kube.go (1)

696-742: DrainRunningPodsInNamespace correctly passes t through to DeletePod.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)

30-151: GPU reset parsing tests are clear and table-driven; good coverage of valid/invalid lines.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

61-77: The concurrent map access concern does not apply to this code.

The pciToGPUUUID map in XIDHandler is accessed only sequentially. Within each SyslogMonitor.Run() cycle, all checks (including ProcessLine calls) execute in a sequential for-loop without goroutines. Across multiple Run() cycles, execution is serialized by the polling ticker and the blocking nature of Run(), meaning the same XIDHandler instance cannot be accessed concurrently. The default polling interval is 30 minutes, and only a single SyslogMonitor instance exists in the application.

Likely an incorrect or invalid review comment.

@natherz97 natherz97 force-pushed the syslog-gpu-reset branch 2 times, most recently from 3b50eee to 0cb6782 Compare January 9, 2026 21:37
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/helpers/kube.go (1)

453-475: Replace require.Eventually fatal assertion with error-returning timeout logic.
DeletePod uses require.Eventually which will fatally abort the test on timeout, yet the function signature returns error and callers (like DrainRunningPodsInNamespace) expect to handle errors via t.Errorf. This breaks error handling: if the pod deletion times out, the test fails immediately before the error is returned, making the caller's error handling unreachable and terminating the entire drain operation. Additionally, if the pod is already deleted, treat IsNotFound on the delete call as success (idempotency).

Convert the wait loop to return errors instead of calling require.Eventually, and add t.Helper() to preserve caller context in test output. The pod should be checked for existence before starting the wait (implicit via the initial Delete call), and transient Get errors during polling should be logged but not fatal.

🧹 Nitpick comments (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

26-26: Optional: Consider a stricter GPU UUID regex pattern.

The current pattern [0-9a-fA-F-]+ is permissive and will match malformed UUIDs like GPU--123. If strict validation is desired, consider a pattern that enforces the standard GPU UUID format: GPU-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}.

However, the current permissive pattern is consistent with reNvrmMap and may be intentionally lenient for robustness.

♻️ More strict regex pattern
-	gpuResetMap = regexp.MustCompile(`GPU reset executed: (GPU-[0-9a-fA-F-]+)`)
+	gpuResetMap = regexp.MustCompile(`GPU reset executed: (GPU-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})`)

You may also want to update reNvrmMap for consistency if you choose this approach.

tests/gpu_health_monitor_test.go (1)

65-69: Minor: WaitForPodsDeleted is likely redundant now that helpers.DeletePod waits for API absence.
If DeletePod keeps the “wait until NotFound” behavior, consider removing the extra wait to reduce test duration/flakiness.

tests/syslog_health_monitor_test.go (1)

239-374: GPU reset E2E test: strengthen assertions to match the PR’s key invariants (impacted entities + action gating).
Suggestions:

  • Avoid hard-coding GPU-2222... unless it’s guaranteed by helpers.CreateTestMetadata(...); derive expected UUID/PCI from the same test metadata fixture to prevent brittle failures.
  • Assert the “healthy after reset” artifact also includes the same normalized PCI + GPU UUID (not just that the condition becomes healthy), since that’s a core contract of COMPONENT_RESET matching.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 98542bf and 3b50eee.

📒 Files selected for processing (12)
  • docs/configuration/syslog-health-monitor.md
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
  • tests/helpers/metadata.go
  • tests/node_drainer_test.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • tests/node_drainer_test.go
  • docs/configuration/syslog-health-monitor.md
  • tests/helpers/metadata.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • docs/syslog-health-monitor.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • tests/helpers/kube.go
  • tests/gpu_health_monitor_test.go
  • tests/syslog_health_monitor_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/syslog_health_monitor_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/syslog_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/helpers/kube.go
  • tests/gpu_health_monitor_test.go
🧬 Code graph analysis (2)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-38)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
  • NewXIDHandler (37-59)
data-models/pkg/protos/health_event.pb.go (7)
  • Entity (159-165)
  • Entity (178-178)
  • Entity (193-195)
  • RecommendedAction (40-40)
  • RecommendedAction (90-92)
  • RecommendedAction (94-96)
  • RecommendedAction (103-105)
health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (2)
  • Response (29-33)
  • XIDDetails (36-48)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
  • DeletePod (453-475)
  • NVSentinelNamespace (64-64)
🔇 Additional comments (16)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (5)

30-86: LGTM: Well-structured test metadata.

The test metadata constants provide comprehensive coverage for testing both normal and edge-case scenarios (missing PCI address). The use of realistic GPU metadata structure with normalized PCI addresses aligns well with the normalizePCI behavior.


120-151: LGTM: Comprehensive GPU reset line parsing tests.

The test cases properly cover valid GPU reset lines, invalid formats, and ensure that XID log lines (which contain GPU UUIDs but aren't reset lines) are correctly rejected. The table-driven structure follows coding guidelines.


223-227: LGTM: Proper test file management.

Using t.TempDir() and creating temporary metadata files with secure permissions (0600) is the correct approach for filesystem-backed tests. This ensures test isolation and automatic cleanup.


296-363: LGTM: Thorough GPU reset event generation tests.

The test cases effectively validate:

  • Successful healthy event generation when metadata is available
  • Error handling when metadata collector is not initialized
  • Error handling when GPU UUID is not found in metadata
  • Error handling when PCI address is missing from metadata

These scenarios ensure robust behavior across different metadata collector states.


365-436: LGTM: Critical remediation override logic is well-tested.

These tests validate the core PR behavior:

  • When GPU UUID is sourced from NVRM (dmesg), RESET_GPU is correctly overridden to RESTART_VM
  • When GPU UUID is sourced from metadata collector, COMPONENT_RESET is preserved

This distinction is crucial for ensuring GPUs can only be reset when the same impacted entities can be resolved after the reset completes, as explained in the PR objectives.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (8)

33-35: LGTM: Clear constant for healthy event message.

Using a named constant improves maintainability and ensures consistency across healthy GPU reset events.


79-82: LGTM: GPU reset detection properly positioned.

Checking for GPU reset lines before XID parsing is correct, as reset lines have a different format and should generate healthy events rather than being parsed as XIDs.


101-108: LGTM: Simple and correct GPU reset line parsing.

The implementation correctly extracts the GPU UUID from reset lines and returns an empty string when the pattern doesn't match.


133-148: Verify thread-safety of pciToGPUUUID map access.

The pciToGPUUUID map is written at line 70 and read here at line 143 without synchronization. If ProcessLine can be called concurrently from multiple goroutines, this could cause a data race.

Please verify whether ProcessLine is called:

  1. Sequentially from a single goroutine (no issue)
  2. Concurrently from multiple goroutines (requires synchronization with sync.RWMutex)

If concurrent access is possible, consider protecting the map with a sync.RWMutex:

type XIDHandler struct {
	nodeName              string
	defaultAgentName      string
	defaultComponentClass string
	checkName             string

	mu             sync.RWMutex
	pciToGPUUUID   map[string]string
	parser         parser.Parser
	metadataReader *metadata.Reader
}

Then use xidHandler.mu.Lock()/Unlock() for writes and xidHandler.mu.RLock()/RUnlock() for reads.


150-190: LGTM: Excellent documentation for complex logic.

This comment block thoroughly explains the GPU reset workflow, including:

  • Why both PCI and GPU UUID are required for COMPONENT_RESET
  • The distinction between metadata-sourced vs dmesg-sourced UUIDs
  • The remediation action override behavior
  • Example unhealthy and healthy event flows

This documentation is invaluable for understanding the design decisions and will help future maintainers.


224-230: LGTM: Remediation override logic correctly implements PR requirements.

The override from COMPONENT_RESET to RESTART_VM when the GPU UUID is not from metadata is a critical safety measure. This ensures GPU resets are only attempted when the same impacted entities (PCI + UUID) can be reliably resolved after the reset, as explained in the documentation comment and PR objectives.

The logging at line 227 provides good visibility into when this override occurs.


254-286: LGTM: Robust healthy event generation with proper validation.

The function correctly:

  • Looks up GPU info from metadata using the UUID (line 255)
  • Returns an error if lookup fails or PCI is missing (lines 259-265), ensuring partial events aren't sent
  • Uses the same getDefaultImpactedEntities helper for consistency (line 267)
  • Sets appropriate fields for a healthy event (IsHealthy=true, IsFatal=false, RecommendedAction=NONE)

The comment at lines 256-258 clearly explains why returning an error for incomplete data is necessary (fault-quarantine-module matching requirements).


333-348: LGTM: Consistent impacted entities composition.

This helper ensures that:

  • PCI is always included as the first entity
  • GPU_UUID is conditionally added when available
  • The same entity structure is used for both unhealthy and healthy events

This consistency is critical for the fault-quarantine-module to match events correctly.

tests/gpu_health_monitor_test.go (1)

302-305: LGTM: updated helpers.DeletePod(ctx, t, ...) call site.
No concerns with the signature update here.

tests/syslog_health_monitor_test.go (2)

204-234: Teardown ordering: consider restarting the pod before deleting injected metadata (if startup/clear logic depends on metadata).
Not necessarily wrong, but if the syslog monitor’s “healthy event” generation/impacted-entity matching needs metadata, deleting it first could add flakes.


592-611: LGTM: teardown updated to use helpers.DeletePod(ctx, t, ...).
This aligns with the new helper signature and keeps cleanup consistent.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/helpers/kube.go (1)

453-474: Fix lint + avoid mutating outer err inside require.Eventually closure.
This should address the whitespace lint failure and makes the polling closure side-effect free.

Proposed diff
 func DeletePod(ctx context.Context, t *testing.T, c klient.Client, namespace, podName string) error {
+	t.Helper()
+
 	pod := &v1.Pod{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:      podName,
 			Namespace: namespace,
 		},
 	}
 
 	err := c.Resources().Delete(ctx, pod)
 	if err != nil {
 		return fmt.Errorf("failed to delete pod %s: %w", podName, err)
 	}
 
 	require.Eventually(t, func() bool {
-		err = c.Resources().Get(ctx, podName, namespace, pod)
-		if err != nil {
-			return apierrors.IsNotFound(err)
+		getErr := c.Resources().Get(ctx, podName, namespace, pod)
+		if getErr != nil {
+			return apierrors.IsNotFound(getErr)
 		}
+
 		return false
 	}, EventuallyWaitTimeout, WaitInterval, "pod %s should be removed from API", podName)
 
 	return nil
 }
tests/syslog_health_monitor_test.go (1)

204-234: Teardown “cleared” check may be too strict if conditions are removed instead of set healthy.
If SysLogsXIDError is ever cleared by deleting the condition (nil) rather than flipping it to Status=False with reason SysLogsXIDErrorIsHealthy, this teardown can flake.

Consider accepting either “healthy” or “not present”:

Proposed diff
 require.Eventually(t, func() bool {
 	condition, err := helpers.CheckNodeConditionExists(ctx, client, nodeName,
 		"SysLogsXIDError", "SysLogsXIDErrorIsHealthy")
 	if err != nil {
 		return false
 	}
-	return condition != nil && condition.Status == v1.ConditionFalse
+	return condition == nil || condition.Status == v1.ConditionFalse
 }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "SysLogsXIDError condition should be cleared")
🧹 Nitpick comments (4)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

150-190: Consider relocating detailed design rationale.

While this documentation is valuable, the extensive implementation rationale and alternative approaches might be better suited for package-level godoc or a design document rather than inline comments. This would improve code readability while preserving the context.

As per coding guidelines, inline comments should be reserved for complex logic, and package-level documentation is preferred for broader context.

tests/gpu_health_monitor_test.go (1)

62-69: DeletePod now waits for API removal; WaitForPodsDeleted may be redundant here.
Since helpers.DeletePod() now includes an eventual NotFound check, consider dropping the extra helpers.WaitForPodsDeleted(...) to reduce latency/flakes (unless you’re intentionally waiting for additional downstream cleanup beyond API removal).

tests/syslog_health_monitor_test.go (2)

239-305: GPU reset test is valuable, but hard-coded GPU UUID is brittle.
Since the test already creates/injects metadata, consider pulling the expected UUID from the injected metadata (or using a regex + separately asserting “same UUID before/after”) instead of pinning GPU-2222....


592-611: Same teardown strictness risk in SXID cleanup.
As with XID teardown, consider treating “condition removed” as cleared to avoid flakes if the controller clears by deletion instead of flipping to IsHealthy.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b50eee and 0cb6782.

📒 Files selected for processing (12)
  • docs/configuration/syslog-health-monitor.md
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
  • tests/helpers/metadata.go
  • tests/node_drainer_test.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • tests/node_drainer_test.go
  • docs/configuration/syslog-health-monitor.md
  • docs/syslog-health-monitor.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • tests/helpers/kube.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/syslog_health_monitor_test.go
  • tests/helpers/metadata.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/syslog_health_monitor_test.go
🧠 Learnings (8)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/syslog_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/helpers/kube.go
  • tests/gpu_health_monitor_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧬 Code graph analysis (5)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
  • NewReader (46-50)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-38)
health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (1)
  • Response (29-33)
data-models/pkg/protos/health_event.pb.go (16)
  • HealthEvents (107-113)
  • HealthEvents (126-126)
  • HealthEvents (141-143)
  • RecommendedAction_COMPONENT_RESET (44-44)
  • RecommendedAction_RESTART_VM (47-47)
  • HealthEvent (211-230)
  • HealthEvent (243-243)
  • HealthEvent (258-260)
  • RecommendedAction (40-40)
  • RecommendedAction (90-92)
  • RecommendedAction (94-96)
  • RecommendedAction (103-105)
  • RecommendedAction_NONE (43-43)
  • Entity (159-165)
  • Entity (178-178)
  • Entity (193-195)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-38)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
  • NewXIDHandler (37-59)
health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (2)
  • Response (29-33)
  • XIDDetails (36-48)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
  • DeletePod (453-475)
  • NVSentinelNamespace (64-64)
tests/syslog_health_monitor_test.go (7)
tests/helpers/metadata.go (3)
  • DeleteMetadata (183-229)
  • CreateTestMetadata (59-108)
  • InjectMetadata (110-181)
tests/helpers/kube.go (9)
  • NVSentinelNamespace (64-64)
  • DeletePod (453-475)
  • RemoveNodeManagedByNVSentinelLabel (1395-1409)
  • GetPodOnWorkerNode (1412-1442)
  • PortForwardPod (1598-1642)
  • SetNodeManagedByNVSentinel (1373-1392)
  • InjectSyslogMessages (1804-1827)
  • VerifyNodeConditionMatchesSequence (1830-1867)
  • CheckNodeConditionExists (1267-1284)
platform-connectors/pkg/transformers/metadata/transformer.go (1)
  • New (42-63)
platform-connectors/pkg/pipeline/pipeline.go (1)
  • New (35-37)
platform-connectors/pkg/pipeline/config.go (1)
  • Config (17-21)
platform-connectors/pkg/transformers/overrides/config.go (1)
  • Config (25-28)
commons/pkg/auditlogger/auditlogger.go (1)
  • Log (114-134)
🪛 GitHub Check: modules-lint-test (tests)
tests/helpers/kube.go

[failure] 471-471:
missing whitespace above this line (too many lines above return)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (34)
  • GitHub Check: ko-build-test (event-exporter, .)
  • GitHub Check: ko-build-test (platform-connectors, .)
  • GitHub Check: ko-build-test (labeler, .)
  • GitHub Check: ko-build-test (janitor, .)
  • GitHub Check: ko-build-test (fault-quarantine, .)
  • GitHub Check: ko-build-test (health-monitors/kubernetes-object-monitor, .)
  • GitHub Check: ko-build-test (fault-remediation, .)
  • GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/maintenance-notifier)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/csp-health-monitor)
  • GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
  • GitHub Check: ko-build-test (node-drainer, .)
  • GitHub Check: ko-build-test (health-events-analyzer, .)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
  • GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
  • GitHub Check: modules-lint-test (event-exporter)
  • GitHub Check: modules-lint-test (metadata-collector)
  • GitHub Check: modules-lint-test (fault-quarantine)
  • GitHub Check: modules-lint-test (platform-connectors)
  • GitHub Check: modules-lint-test (health-events-analyzer)
  • GitHub Check: modules-lint-test (commons)
  • GitHub Check: modules-lint-test (janitor)
  • GitHub Check: modules-lint-test (store-client)
  • GitHub Check: modules-lint-test (fault-remediation)
  • GitHub Check: modules-lint-test (labeler)
  • GitHub Check: health-monitors-lint-test (syslog-health-monitor)
  • GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
  • GitHub Check: health-monitors-lint-test (csp-health-monitor)
  • GitHub Check: CodeQL PR Analysis
  • GitHub Check: E2E Tests (AMD64 + MongoDB)
  • GitHub Check: E2E Tests (ARM64 + MongoDB)
  • GitHub Check: E2E Tests (ARM64 + PostgreSQL)
  • GitHub Check: E2E Tests (AMD64 + PostgreSQL)
🔇 Additional comments (20)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

24-27: LGTM!

The new gpuResetMap regex correctly matches the GPU reset line format with GPU UUID capture group. The alignment adjustment to reNvrmMap improves readability.

tests/helpers/metadata.go (1)

67-105: LGTM! Test data aligns with PCI normalization.

The trimmed PCI addresses (removing .0 suffix) correctly match the normalized format used in production code. This ensures metadata lookups work correctly in tests.

health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)

176-221: LGTM! Test follows established patterns.

The table-driven test structure correctly validates GetInfoByUUID for success and error scenarios. The test is consistent with existing test patterns in this file.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (7)

33-35: LGTM!

Good practice to define the healthy event message as a package-level constant for consistency.


79-82: LGTM!

The GPU reset detection is correctly placed before XID parsing to short-circuit processing. The logging provides good observability.


101-108: LGTM!

The parsing logic correctly extracts the GPU UUID using the regex capture group, with appropriate bounds checking.


133-148: LGTM!

The signature change correctly tracks UUID provenance, enabling the remediation override logic. The fallback from metadata to dmesg is handled gracefully with appropriate error logging.


191-252: LGTM!

The remediation override logic correctly implements the requirement that COMPONENT_RESET is only permitted when the GPU UUID is sourced from metadata. The logging provides good observability for the override.


254-286: LGTM!

The healthy event generation correctly requires both UUID and PCI from metadata, failing fast if either is unavailable. This ensures entity matching consistency as documented in the implementation comments.


333-348: LGTM!

Good refactoring to centralize impacted entity construction, ensuring consistent ordering and reducing duplication.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (7)

19-20: LGTM!

The additional imports support filesystem-backed test metadata scenarios.


30-86: LGTM!

The test metadata fixtures correctly use normalized PCI format and cover both valid and missing-PCI scenarios for comprehensive testing.


120-151: LGTM!

The test provides good coverage of the GPU reset line parsing logic with valid, invalid, and non-matching cases. Follows the table-driven test pattern as per coding guidelines.


223-227: LGTM!

The test setup correctly creates temporary metadata files for filesystem-backed test scenarios.


296-363: LGTM! Comprehensive GPU reset test coverage.

The test cases thoroughly validate GPU reset event generation including:

  • Successful healthy event creation with PCI+UUID from metadata
  • Error handling when metadata is unavailable
  • Error handling when GPU UUID is missing from metadata
  • Error handling when PCI address is missing from metadata

All scenarios are correctly validated.


365-436: LGTM! Override logic thoroughly tested.

The test cases correctly validate the remediation override behavior:

  • RESET_GPU → RESTART_VM when UUID is sourced from dmesg
  • RESET_GPU → COMPONENT_RESET (unchanged) when UUID is sourced from metadata

This ensures the PR requirement is properly implemented and tested.


569-580: LGTM!

The updated PCI value expectation correctly reflects the normalized format without the .0 suffix, consistent with the normalization logic.

tests/gpu_health_monitor_test.go (1)

302-305: DeletePod signature update looks correct.
Passing t aligns with the helper’s new behavior (eventual absence check).

tests/helpers/kube.go (1)

696-742: DrainRunningPodsInNamespace callsite update is correct.
Passing t preserves the helper’s improved delete verification behavior.

tests/syslog_health_monitor_test.go (1)

307-329: Nice coverage of the “XID → reset → healthy” flow.
The end assertion matches the PR objective that a reset completion should emit a healthy event that clears the condition.

@github-actions
Copy link

github-actions bot commented Jan 9, 2026

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/helpers/kube.go (1)

453-475: DeletePod has conflicting semantics: it returns error but uses require.Eventually which hard-fails the test.

When the condition in require.Eventually is not met, it calls FailNow() on the *testing.T instance, terminating the test immediately rather than allowing the function to return an error. This conflicts with the function's error return type—callers expecting to handle an error will never receive one because the test fails first.

Additionally, if the context is already canceled before entering the assertion block, the Get calls will continue failing until EventuallyWaitTimeout expires (a long wait with no hope of success).

The proposed fix correctly addresses this by:

  • Replacing require.Eventually with explicit timer/ticker logic
  • Checking ctx.Err() upfront to return immediately if the context is already done
  • Returning errors to callers instead of hard-failing tests
  • Handling the case where a pod doesn't exist after deletion (idempotent)
Proposed fix
 func DeletePod(ctx context.Context, t *testing.T, c klient.Client, namespace, podName string) error {
+	t.Helper()
+
 	pod := &v1.Pod{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:      podName,
 			Namespace: namespace,
 		},
 	}
 
 	err := c.Resources().Delete(ctx, pod)
 	if err != nil {
+		if apierrors.IsNotFound(err) {
+			return nil
+		}
 		return fmt.Errorf("failed to delete pod %s: %w", podName, err)
 	}
 
-	require.Eventually(t, func() bool {
-		err = c.Resources().Get(ctx, podName, namespace, pod)
-		if err != nil {
-			return apierrors.IsNotFound(err)
-		}
-		return false
-	}, EventuallyWaitTimeout, WaitInterval, "pod %s should be removed from API", podName)
+	deadline := time.NewTimer(EventuallyWaitTimeout)
+	ticker := time.NewTicker(WaitInterval)
+	defer deadline.Stop()
+	defer ticker.Stop()
+
+	for {
+		if ctx.Err() != nil {
+			return fmt.Errorf("context done while waiting for pod %s to be removed: %w", podName, ctx.Err())
+		}
+
+		err = c.Resources().Get(ctx, podName, namespace, pod)
+		if err != nil && apierrors.IsNotFound(err) {
+			return nil
+		}
+
+		select {
+		case <-deadline.C:
+			return fmt.Errorf("pod %s was not removed from API within %s", podName, EventuallyWaitTimeout)
+		case <-ticker.C:
+		}
+	}
 
-	return nil
 }
🧹 Nitpick comments (1)
tests/syslog_health_monitor_test.go (1)

353-372: Teardown duplication: consider extracting a shared syslog-monitor cleanup helper.

Multiple tests repeat “stop port-forward → DeleteMetadata → restart syslog-health-monitor → remove label”. A small helper (e.g., helpers.TeardownSyslogHealthMonitorTest(...)) would reduce drift between test suites.

Also applies to: 592-600

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cb6782 and b0e0bcf.

📒 Files selected for processing (11)
  • docs/configuration/syslog-health-monitor.md
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
  • tests/node_drainer_test.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/gpu_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • tests/helpers/kube.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/node_drainer_test.go
  • tests/syslog_health_monitor_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/gpu_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/node_drainer_test.go
  • tests/syslog_health_monitor_test.go
🧠 Learnings (8)
📓 Common learnings
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/gpu_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/node_drainer_test.go
  • tests/syslog_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • docs/configuration/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • docs/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.

Applied to files:

  • docs/configuration/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • docs/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.

Applied to files:

  • docs/configuration/syslog-health-monitor.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧬 Code graph analysis (4)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
  • DeletePod (453-475)
  • NVSentinelNamespace (64-64)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
  • NewReader (46-50)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-38)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
  • NewXIDHandler (37-59)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (3)
  • HealthEvent (54-128)
  • Entity (46-52)
  • RecommendedAction (14-24)
health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (2)
  • Response (29-33)
  • XIDDetails (36-48)
tests/node_drainer_test.go (1)
tests/helpers/kube.go (1)
  • DeletePod (453-475)
🔇 Additional comments (14)
docs/configuration/syslog-health-monitor.md (1)

5-5: Documentation updates properly reflect GPU reset monitoring.

GPU reset support is consistently integrated throughout the configuration documentation:

  • Line 5 correctly lists GPU reset events alongside other monitored events
  • Line 45 accurately reflects that GPU reset monitoring is bundled under the existing SysLogsXIDError check (no new check type)
  • Line 58 appropriately notes that SysLogsXIDError now monitors GPU reset messages in addition to XID errors

The updates maintain the right level of detail for a configuration reference without over-specifying implementation details.

Also applies to: 45-45, 58-58

docs/syslog-health-monitor.md (1)

5-5: Comprehensive and accurate GPU reset documentation.

GPU reset support is well-integrated throughout the main documentation:

  • Line 5 appropriately separates GPU resets from failure events as "other GPU-related events"
  • Lines 11-16 clearly position GPU Reset alongside other notable occurrences
  • Line 25 documents GPU reset log pattern detection as part of the monitoring pipeline
  • Line 29 correctly adds gRPC mention for health event emission
  • Line 42 clarifies that GPU reset monitoring is part of the SysLogsXIDError check
  • Lines 81-82 provide a clear subsection explaining that GPU resets indicate completed remediation

The documentation accurately reflects that the system detects GPU resets in logs and emits corresponding healthy events, which aligns with the feature's remediation-completion semantics described in the PR objectives.

Also applies to: 11-16, 25-25, 29-29, 42-42, 81-82

tests/node_drainer_test.go (1)

96-99: DeletePod call-site update looks correct.

tests/gpu_health_monitor_test.go (2)

65-69: DeletePod call-site update looks correct (setup restart path).


302-305: DeletePod call-site update looks correct (restart to trigger reconnection).

tests/helpers/kube.go (1)

696-742: DrainRunningPodsInNamespace update is consistent with the new DeletePod signature.

tests/syslog_health_monitor_test.go (2)

204-234: Teardown sequencing looks solid; consider idempotency once DeletePod is fixed.

This cleanup (DeleteMetadata → restart pod → wait for healthy condition → remove label) is the right shape; once helpers.DeletePod becomes idempotent / ctx-aware, it should also reduce teardown flakes.


239-329: No action required—this test is not brittle. The UUID GPU-22222222-2222-2222-2222-222222222222 is deterministically assigned to PCI address 0002:00:00 by CreateTestMetadata, and the GPU reset detector uses a flexible regex pattern (GPU reset executed: (GPU-[0-9a-fA-F-]+)) that accepts any valid UUID format. The hard-coded values are intentional and correct.

Likely an incorrect or invalid review comment.

health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)

176-221: LGTM! Well-structured test for UUID lookup.

The test follows best practices with table-driven design and covers the essential scenarios: exact matches for different GPUs and error handling for non-existent UUIDs. The implementation is consistent with existing test patterns in the file.

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

24-27: LGTM! GPU reset pattern detection added.

The new gpuResetMap regex correctly matches GPU reset log lines and extracts the GPU UUID. The pattern aligns with the PR's goal of detecting GPU reset events.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)

30-86: LGTM! Test metadata constants support comprehensive scenarios.

The test metadata constants provide realistic data for testing GPU reset functionality with different configurations (complete metadata vs. missing PCI). The simplified UUID "GPU-123" makes test assertions clearer while still exercising the full code path.


120-151: LGTM! Thorough test coverage for GPU reset line parsing.

The test validates both successful GPU UUID extraction and proper handling of invalid/mismatched patterns. The inclusion of an NVRM line as a negative test case is particularly good for ensuring pattern specificity.


222-363: LGTM! Comprehensive GPU reset test coverage.

The test setup properly uses filesystem-backed metadata files for realistic testing. The GPU reset scenarios thoroughly validate:

  • Successful healthy event creation with normalized PCI and GPU UUID
  • Error handling when metadata is unavailable, UUID not found, or PCI missing

This aligns perfectly with the PR objective that GPU reset events require metadata-collector lookups for PCI resolution.


365-436: LGTM! Remediation action override logic correctly tested.

These test cases validate the critical requirement from the PR objectives:

  • COMPONENT_RESET is only used when GPU UUID is sourced from metadata-collector
  • When UUID is from NVRM/dmesg, the action is overridden to RESTART_VM

The test setup properly distinguishes between the two UUID sources (pciToGPUUUID map vs. metadataFile), and the assertions confirm the correct remediation actions are applied in each scenario.

@github-actions
Copy link

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)

222-507: Make TestProcessLine hermetic: avoid /tmp/metadata.json (can exist on runners and break expectations).

Several cases intentionally model “metadata collector not initialized / missing”, but using a fixed /tmp/metadata.json path makes the test outcome depend on the machine state. Prefer a guaranteed-missing path under t.TempDir() (and use the temp-backed metadataFile when you want it initialized).

Proposed fix (use a temp “missing” path instead of /tmp)
 func TestProcessLine(t *testing.T) {
 	tmpDir := t.TempDir()
 	metadataFile := filepath.Join(tmpDir, "gpu_metadata.json")
 	require.NoError(t, os.WriteFile(metadataFile, []byte(testMetadataJSON), 0600))
 	metadataFileMissingPCI := filepath.Join(tmpDir, "gpu_metadata_missing_pci.json")
 	require.NoError(t, os.WriteFile(metadataFileMissingPCI, []byte(testMetadataMissingPCIJSON), 0600))
+	metadataFileMissing := filepath.Join(tmpDir, "gpu_metadata_missing.json")
 
 	testCases := []struct {
 		name          string
 		message       string
 		setupHandler  func() *XIDHandler
 		expectEvent   bool
 		expectError   bool
 		validateEvent func(t *testing.T, events *pb.HealthEvents)
 	}{
 		{
 			name:    "NVRM GPU Map Line",
 			message: "NVRM: GPU at PCI:0000:00:08.0: GPU-12345678-1234-1234-1234-123456789012",
 			setupHandler: func() *XIDHandler {
-				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json")
+				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing)
 				h.parser = &mockParser{
 					parseFunc: func(msg string) (*parser.Response, error) {
 						return nil, nil
 					},
 				}
 				return h
 			},
 			expectEvent: false,
 			expectError: false,
 		},
 		{
 			name:    "Valid XID Message",
 			message: "NVRM: Xid (PCI:0000:00:08.0): 79, pid=12345, name=test-process",
 			setupHandler: func() *XIDHandler {
-				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json")
+				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing)
 				h.parser = &mockParser{
 					parseFunc: func(msg string) (*parser.Response, error) {
 						return &parser.Response{
 							Success: true,
 							Result: parser.XIDDetails{
 								DecodedXIDStr: "Xid 79",
 								PCIE:          "0000:00:08.0",
 								Mnemonic:      "GPU has fallen off the bus",
 								Resolution:    "CONTACT_SUPPORT",
 								Number:        79,
 							},
 						}, nil
 					},
 				}
 				return h
 			},
 			expectEvent: true,
 			expectError: false,
 			...
 		},
 		{
 			name:    "Valid GPU Reset Message with Metadata Collector not Initialized",
 			message: "GPU reset executed: GPU-123",
 			setupHandler: func() *XIDHandler {
-				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json")
+				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing)
 				return h
 			},
 			expectEvent: false,
 			expectError: true,
 		},
 		{
 			name:    "Valid XID with GPU UUID from NVRM: RESET_GPU overridden to RESTART_VM",
 			message: "NVRM: Xid (PCI:0000:00:08.0): 79, pid=12345, name=test-process",
 			setupHandler: func() *XIDHandler {
-				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json")
+				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing)
 				h.parser = &mockParser{ ... }
 				h.pciToGPUUUID["0000:00:08"] = "GPU-12345678-1234-1234-1234-123456789012"
 				return h
 			},
 			...
 		},
 		{
 			name:    "Parser Returns Error",
 			message: "Some random message",
 			setupHandler: func() *XIDHandler {
-				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json")
+				h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing)
 				h.parser = &mockParser{
 					parseFunc: func(msg string) (*parser.Response, error) {
 						return nil, errors.New("parse error")
 					},
 				}
 				return h
 			},
 			expectEvent: false,
 			expectError: false,
 		},
 		...
 	}
🤖 Fix all issues with AI agents
In @docs/syslog-health-monitor.md:
- Around line 5-26: Update the “Enabled Checks” option list text under the
“Configuration Options”/“Enabled Checks” section to include “GPU reset”
alongside “XID, SXID, fallen-off-bus”, and ensure all related mentions in the
overview/config/check comments are updated to reflect the new set; also remove
any trailing whitespace in the affected paragraphs (the blocks that currently
list enabled checks and the surrounding overview text).
🧹 Nitpick comments (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)

176-221: Strengthen TestGetInfoByUUID assertions (currently only checks GPUID).

Consider also asserting the returned gpu.UUID (and optionally gpu.PCIAddress if present on model.GPUInfo) so the test guards against accidental UUID→wrong-GPU mapping regressions.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0e0bcf and 262c7b9.

📒 Files selected for processing (11)
  • docs/configuration/syslog-health-monitor.md
  • docs/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/helpers/kube.go
  • tests/node_drainer_test.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • docs/configuration/syslog-health-monitor.md
  • health-monitors/syslog-health-monitor/pkg/metadata/reader.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/helpers/kube.go
  • tests/gpu_health_monitor_test.go
  • tests/node_drainer_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • tests/syslog_health_monitor_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/node_drainer_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • tests/syslog_health_monitor_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • docs/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/gpu_health_monitor_test.go
  • tests/node_drainer_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
  • tests/syslog_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • docs/syslog-health-monitor.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/helpers/kube.go
  • tests/gpu_health_monitor_test.go
🧬 Code graph analysis (4)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
  • DeletePod (453-476)
  • NVSentinelNamespace (64-64)
tests/node_drainer_test.go (1)
tests/helpers/kube.go (1)
  • DeletePod (453-476)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
  • NewReader (46-50)
tests/syslog_health_monitor_test.go (7)
tests/helpers/metadata.go (3)
  • DeleteMetadata (183-229)
  • CreateTestMetadata (59-108)
  • InjectMetadata (110-181)
tests/helpers/kube.go (10)
  • NVSentinelNamespace (64-64)
  • DeletePod (453-476)
  • RemoveNodeManagedByNVSentinelLabel (1396-1410)
  • GetPodOnWorkerNode (1413-1443)
  • PortForwardPod (1599-1643)
  • SetNodeManagedByNVSentinel (1374-1393)
  • VerifyNodeConditionMatchesSequence (1831-1868)
  • EventuallyWaitTimeout (61-61)
  • WaitInterval (63-63)
  • CheckNodeConditionExists (1268-1285)
platform-connectors/pkg/transformers/metadata/transformer.go (1)
  • New (42-63)
platform-connectors/pkg/pipeline/pipeline.go (1)
  • New (35-37)
platform-connectors/pkg/pipeline/config.go (1)
  • Config (17-21)
platform-connectors/pkg/transformers/overrides/config.go (1)
  • Config (25-28)
commons/pkg/auditlogger/auditlogger.go (1)
  • Log (114-134)
🔇 Additional comments (7)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)

120-151: TestParseGPUResetLine coverage looks good.

Covers valid/invalid reset lines and ensures it doesn’t false-positive on NVRM XID lines.

tests/node_drainer_test.go (1)

96-99: DeletePod call-site update matches the helper’s new signature.

tests/gpu_health_monitor_test.go (1)

62-69: DeletePod call-sites updated correctly to pass t.

Also applies to: 302-305

tests/helpers/kube.go (1)

453-476: LGTM! Post-deletion verification enhances test reliability.

The addition of the *testing.T parameter and post-deletion verification using require.Eventually is a solid improvement. This ensures that the pod is actually removed from the API server before returning, preventing race conditions in tests that depend on pod absence.

The signature change is consistently applied across all call sites (as noted in the AI summary), maintaining API consistency.

tests/syslog_health_monitor_test.go (3)

204-234: LGTM! Enhanced teardown improves test hygiene.

The addition of metadata cleanup before pod deletion ensures proper test isolation and prevents metadata pollution between tests. The updated DeletePod call with the t parameter aligns with the improved helper signature, and the additional logging improves observability during teardown.


239-374: LGTM! Comprehensive GPU reset test follows established patterns.

The new TestSyslogHealthMonitorGPUReset test provides end-to-end coverage for GPU reset flows:

  • First assess verifies XID error detection with COMPONENT_RESET action
  • Second assess verifies that GPU reset events clear the error condition
  • Teardown correctly omits the condition-clearing wait (line 361) because the last assess already verified it

The test structure follows existing conventions and includes appropriate logging for debugging.


592-599: LGTM! Consistent teardown enhancements.

The metadata cleanup and updated DeletePod call maintain consistency with other test teardowns in this file, ensuring proper cleanup of test resources.

Comment on lines +5 to 26
The Syslog Health Monitor watches system logs for GPU-related errors that may not be caught by DCGM. It monitors journald/syslog for XID errors, SXID errors (NVSwitch/NVLink errors), and GPU fallen-off-bus events - critical failures that indicate serious GPU, NVSwitch, or driver problems. In addition to failures, it monitors system logs for other GPU-related events such as GPU resets to indicate that a required remediation action has completed.

Think of it as a log analyzer that reads between the lines - catching GPU and NVSwitch problems recorded in system logs that other monitoring might miss.

### Why Do You Need This?

Some GPU and NVSwitch failures manifest in system logs before DCGM can detect them:
Some GPU and NVSwitch failures or events manifest in system logs before DCGM can detect them:

- **XID errors**: GPU hardware errors logged by the NVIDIA driver
- **SXID errors**: NVSwitch errors related to NVSwitch and NVLink interconnects
- **GPU fallen off the bus**: GPU became inaccessible to the system
- **GPU Reset**: A GPU reset was executed by nvidia-smi

These errors often appear in system logs first and can indicate imminent GPU or fabric failure, making early detection critical for preventing workload disruptions.
These errors or events often appear in system logs first and can indicate imminent GPU or fabric failure, making early detection critical for preventing workload disruptions or returning GPUs to service.

## How It Works

The Syslog Health Monitor runs as a DaemonSet on GPU nodes:

1. Reads journald logs from the host system
2. Parses log entries for GPU-related error patterns (XID, SXID, fallen-off-bus)
2. Parses log entries for GPU-related error patterns (XID, SXID, fallen-off-bus, GPU reset)
3. Maintains cursor position to avoid re-processing old logs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Docs: update “Configuration Options” to include GPU reset (and consider trimming trailing whitespace).

You added GPU reset to overview/config/check comments, but the “Enabled Checks” option list still says it selects only “(XID, SXID, fallen-off-bus)”. That’s now incomplete given GPU reset monitoring.

Also applies to: 42-44, 58-60, 79-83

🤖 Prompt for AI Agents
In @docs/syslog-health-monitor.md around lines 5 - 26, Update the “Enabled
Checks” option list text under the “Configuration Options”/“Enabled Checks”
section to include “GPU reset” alongside “XID, SXID, fallen-off-bus”, and ensure
all related mentions in the overview/config/check comments are updated to
reflect the new set; also remove any trailing whitespace in the affected
paragraphs (the blocks that currently list enabled checks and the surrounding
overview text).

@github-actions
Copy link

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant