-
Notifications
You must be signed in to change notification settings - Fork 33
feat: add GPU reset support to syslog-health-monitor #669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughDocumentation updated to include GPU reset events. Metadata Reader now supports GPU UUID lookup. XID handler detects "GPU reset executed" lines, maps GPU UUID → PCI via metadata, and emits healthy HealthEvents. Tests expanded (metadata-backed, GPU reset scenarios) and test helper Changes
Sequence Diagram(s)sequenceDiagram
participant Parser as Syslog Parser
participant XID as XID Handler
participant Meta as Metadata Reader
participant Event as Health Event Creator
Parser->>XID: ProcessLine(message)
alt GPU reset line ("GPU reset executed")
XID->>XID: parseGPUResetLine(message) -> uuid
XID->>Meta: GetInfoByUUID(uuid)
Meta-->>XID: GPUInfo (includes PCI)
XID->>Event: createHealthEventGPUResetEvent(uuid)
Event->>Event: Compose impacted entities (PCI + UUID), set action NONE (healthy)
Event-->>Parser: HealthEvent (healthy)
else XID/SXID line
XID->>XID: Standard XID parsing -> getGPUUUID(normPCI)
alt uuid from metadata == false and recommended action COMPONENT_RESET
XID->>XID: override action -> RESTART_VM
end
XID->>Event: createHealthEventFromResponse(...)
Event-->>Parser: HealthEvent (unhealthy or healthy)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @health-monitors/syslog-health-monitor/pkg/metadata/reader.go:
- Around line 108-119: Add a godoc comment above the exported
Reader.GetInfoByUUID function that briefly describes what the method does, its
input (uuid string), and its return values (*model.GPUInfo, error), including
behavior when the metadata must be loaded (calls ensureLoaded) and when a UUID
is not found; reference the Reader type and the model.GPUInfo return type in the
comment and ensure the comment starts with "GetInfoByUUID" to satisfy Go doc
conventions.
- Around line 108-119: GetInfoByUUID does exact string matching on r.uuidToInfo
without normalizing case; ensure lookups are case-insensitive by normalizing
UUIDs to lowercase: when populating r.uuidToInfo in ensureLoaded (or wherever
UUIDs are parsed) store keys as strings.ToLower(uuid) and in GetInfoByUUID call
uuid = strings.ToLower(uuid) before the map lookup (similar to normalizePCI for
PCI addresses) so stored and lookup keys are consistently lowercase.
In @health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go:
- Around line 254-286: In createHealthEventGPUResetEvent the PCI-address-empty
branch incorrectly wraps the earlier variable err (which is nil) causing
messages like "<nil>"; change that branch to return a new error that does not
reference err (e.g., fmt.Errorf("failed to look up PCI info using UUID %s: no
PCI address found", uuid)) so the error message is meaningful and not wrapping a
nil error; update the return in the if len(gpuInfo.PCIAddress) == 0 block to
construct and return this new error instead of using err.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
docs/configuration/syslog-health-monitor.mddocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader.godocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.godocs/configuration/syslog-health-monitor.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.godocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.godocs/configuration/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.
Applied to files:
docs/configuration/syslog-health-monitor.md
🧬 Code graph analysis (2)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
NewReader(46-50)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
data-models/pkg/model/gpu_metadata.go (1)
GPUInfo(26-33)
🔇 Additional comments (17)
docs/configuration/syslog-health-monitor.md (1)
5-5: LGTM! Documentation accurately reflects GPU reset monitoring.The documentation updates appropriately describe the new GPU reset event monitoring capability alongside existing XID/SXID error detection.
Also applies to: 58-58
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
24-27: LGTM! GPU reset regex pattern is correct.The
gpuResetMapregex correctly captures GPU UUIDs from reset log lines in the expected format.docs/syslog-health-monitor.md (1)
5-5: LGTM! Comprehensive documentation updates for GPU reset support.The documentation thoroughly describes the new GPU reset monitoring capability, including its role in tracking remediation completion. The updates are well-integrated throughout the document and provide clear context for users.
Also applies to: 11-11, 16-16, 18-18, 25-25, 29-29, 42-42, 79-82
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
176-221: LGTM! Comprehensive test coverage for UUID lookup.The table-driven test appropriately validates the new
GetInfoByUUIDmethod with both successful lookups (GPU 0 and GPU 1) and error handling (unknown UUID). The test structure follows Go best practices and is consistent with existing tests in the file.health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
37-37: LGTM! UUID lookup implementation is well-structured.The UUID-based lookup functionality is correctly implemented:
- The
uuidToInfomap is properly initialized and populated inbuildMaps- Thread safety is maintained through the existing
sync.Oncepattern- The implementation mirrors the established
GetGPUByPCIpattern for consistencyAlso applies to: 84-84, 91-91
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (5)
30-86: Well-structured test fixtures for metadata scenarios.The JSON constants provide clear test data for both complete metadata and missing PCI scenarios, enabling thorough coverage of metadata-driven paths.
120-151: Good coverage for GPU reset line parsing.The table-driven test follows Go conventions and covers valid reset lines, invalid/empty cases, and ensures XID log lines don't match the GPU reset pattern.
222-227: Clean test setup using temporary directories.Using
t.TempDir()ensures proper cleanup and isolation between test runs. The metadata file setup enables filesystem-backed tests without polluting shared state.
295-363: Comprehensive GPU reset test coverage.The test cases appropriately cover:
- Valid GPU reset with metadata lookup
- Metadata collector not initialized (error case)
- Missing GPU UUID in metadata (error case)
- Missing PCI in metadata (error case)
The
validateEventfunction correctly verifiesGeneratedTimestampis set before nullifying it for equality comparison.
364-436: RESET_GPU → RESTART_VM override logic well tested.Good distinction between:
- Line 365-400: UUID from NVRM (dmesg) triggers override to
RESTART_VM- Line 401-436: UUID from metadata-collector preserves
COMPONENT_RESETThis aligns with the PR objectives for remediation action behavior.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (7)
33-35: LGTM!The constant provides a single source of truth for the healthy event message, improving maintainability.
79-82: LGTM!Clean integration of GPU reset handling into the processing flow. The early return pattern matches the existing NVRM GPU map line handling above.
101-108: Consider checking for empty string match result.While
len(m) >= 2guards against out-of-bounds access, if the regex capture group matches an empty string, this would return an empty UUID. The test cases suggest this is expected behavior (returning""for invalid lines).
133-148: Clean refactor to return metadata source information.Returning
(uuid, fromMetadata)tuple enables the caller to make informed decisions about remediation actions. The fallback chain (metadata → dmesg cache → empty) is clear.
150-190: Excellent design documentation.This detailed comment block explains the rationale for impacted entities design, remediation action behavior, and the unhealthy/healthy event flow. This aids future maintainers in understanding the constraints and trade-offs.
224-230: Override logic correctly implements the design.The
COMPONENT_RESET → RESTART_VMfallback when GPU UUID isn't from metadata aligns with the documented behavior. The structured log provides good observability for troubleshooting.
332-348: Clean helper function for consistent entity construction.
getDefaultImpactedEntitiesensures PCI is always first and GPU_UUID is conditionally appended, matching the documented behavior and test expectations.
fecc08e to
82d6105
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go:
- Around line 150-190: Fix the minor typo in the large documentation comment in
xid_handler.go: replace "dmegs" with "dmesg" inside the block comment describing
healthy/unhealthy event generation and remediation behavior (the comment
associated with the xid-handler logic in xid_handler.go).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
docs/configuration/syslog-health-monitor.mddocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
✅ Files skipped from review due to trivial changes (1)
- docs/configuration/syslog-health-monitor.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
🧠 Learnings (6)
📓 Common learnings
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/pkg/metadata/reader.godocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Applied to files:
docs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
🧬 Code graph analysis (3)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
data-models/pkg/model/gpu_metadata.go (1)
GPUInfo(26-33)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-38)health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
NewXIDHandler(37-59)health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (1)
XIDDetails(36-48)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
NewReader(46-50)
🔇 Additional comments (17)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
176-221: LGTM! Well-structured table-driven test for UUID lookup.The test follows the established pattern from
TestGetGPUByPCIand covers the key scenarios: exact matches for both GPUs and the not-found case. As per coding guidelines, table-driven tests are used appropriately here.health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
24-27: LGTM! Regex pattern correctly captures GPU UUID from reset lines.The
gpuResetMapregex appropriately captures the GPU UUID format (GPU-<hex-uuid>) from syslog reset messages, complementing the existingreNvrmMapfor PCI-to-UUID mapping.docs/syslog-health-monitor.md (2)
5-18: LGTM! Documentation clearly explains the new GPU reset monitoring capability.The overview and rationale sections are updated to accurately describe the dual purpose: detecting failures and monitoring recovery events like GPU resets.
79-82: LGTM! New GPU Reset section provides clear context.The subsection succinctly explains what a GPU reset event signifies in the context of remediation workflows.
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (2)
37-38: LGTM! UUID-to-GPUInfo map enables efficient reverse lookup.The new
uuidToInfofield mirrors the existingpciToGPUpattern, providing O(1) lookup by UUID.
108-119: LGTM! GetInfoByUUID follows established patterns.The method correctly uses lazy loading via
ensureLoaded()and provides proper error wrapping with context. Based on learnings, this aligns with the metadata Reader's design where the file may not exist at initialization time.health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
30-86: LGTM! Test fixtures provide good coverage for metadata scenarios.The two JSON constants (
testMetadataJSONandtestMetadataMissingPCIJSON) enable testing both happy path and edge cases where PCI data is absent from metadata.
120-151: LGTM! GPU reset line parsing test covers key scenarios.Good coverage of valid reset lines, malformed lines (missing UUID), and ensuring non-reset lines (like NVRM map lines) return empty strings.
296-363: LGTM! Comprehensive GPU reset event generation tests.The test cases appropriately cover:
- Successful reset with full metadata
- Missing metadata file (expect error)
- UUID not found in metadata (expect error)
- Missing PCI in metadata (expect error)
This validates the behavior described in the PR objectives where healthy events require both PCI and UUID.
364-436: LGTM! Tests validate the remediation action override logic.Two critical scenarios are covered:
- UUID from NVRM/dmesg →
RESET_GPUoverridden toRESTART_VM- UUID from metadata-collector →
RESET_GPUkept asCOMPONENT_RESETThis aligns with the PR design where
COMPONENT_RESETis only permitted when UUID is sourced from metadata to ensure matching between unhealthy and healthy events.health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (7)
33-35: LGTM! Clear constant for healthy event messaging.The constant provides a consistent message for GPU reset healthy events.
79-82: LGTM! GPU reset detection integrated correctly in processing pipeline.The reset line parsing is checked before standard XID parsing, which is the correct order since reset lines have a distinct format and should generate healthy (not unhealthy) events.
101-108: LGTM! Simple and correct GPU reset line parsing.The implementation correctly uses the
gpuResetMapregex and returns the captured UUID or empty string.
133-148: LGTM! getGPUUUID now indicates UUID provenance.Returning
fromMetadataboolean enables the caller to make informed decisions about remediation actions. The fallback topciToGPUUUID(dmesg-sourced) when metadata lookup fails is correct.
224-230: LGTM! Remediation action override logic is correct.When UUID is not from metadata,
COMPONENT_RESETis downgraded toRESTART_VM. This ensures consistent entity matching between unhealthy and healthy events, as documented in the PR objectives.
254-286: LGTM! Healthy event generation with proper validation.The implementation correctly:
- Requires metadata lookup to succeed (returns error otherwise)
- Validates PCI address is present
- Uses
getDefaultImpactedEntitiesfor consistent entity formatting- Sets
IsHealthy: trueandRecommendedAction: NONE
333-348: LGTM! Helper function for consistent entity construction.
getDefaultImpactedEntitiesensures PCI is always included and GPU_UUID is conditionally added only when available. This maintains consistency between unhealthy and healthy events.
🛡️ CodeQL Analysis🚨 Found 1 security alert(s) |
82d6105 to
98542bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
82-106: GuarduuidToInfoagainst empty/duplicate UUIDs (avoid silent overwrite).
Todayr.uuidToInfo[gpu.UUID] = gpuwill (a) create a""key if UUID missing and (b) silently overwrite on duplicates. Consider skipping empty UUIDs and warning on duplicates.Proposed fix
func (r *Reader) buildMaps() { r.pciToGPU = make(map[string]*model.GPUInfo) r.uuidToInfo = make(map[string]*model.GPUInfo) r.nvswitchLinks = make(map[string]map[int]*gpuLinkInfo) for i := range r.metadata.GPUs { gpu := &r.metadata.GPUs[i] normPCI := normalizePCI(gpu.PCIAddress) r.pciToGPU[normPCI] = gpu - r.uuidToInfo[gpu.UUID] = gpu + if uuid := strings.TrimSpace(gpu.UUID); uuid != "" { + if _, exists := r.uuidToInfo[uuid]; exists { + slog.Warn("Duplicate GPU UUID in metadata; overwriting", "uuid", uuid) + } + r.uuidToInfo[uuid] = gpu + } for _, link := range gpu.NVLinks { remotePCI := normalizePCI(link.RemotePCIAddress)tests/helpers/kube.go (1)
453-475: Addt.Helper()toDeletePod(improves error attribution).Proposed fix
func DeletePod(ctx context.Context, t *testing.T, c klient.Client, namespace, podName string) error { + t.Helper() pod := &v1.Pod{ ObjectMeta: metav1.ObjectMeta{ Name: podName, Namespace: namespace, }, }tests/syslog_health_monitor_test.go (1)
204-225: Teardown “condition cleared” checks should treat “condition removed” as success (avoid flakes).
Right now you requirecondition != nil && Status=False; if the controller removes the condition entirely, teardown will fail.Proposed fix (apply to each similar Eventually block)
require.Eventually(t, func() bool { condition, err := helpers.CheckNodeConditionExists(ctx, client, nodeName, "SysLogsXIDError", "SysLogsXIDErrorIsHealthy") if err != nil { return false } - return condition != nil && condition.Status == v1.ConditionFalse + // Consider both "removed" and "explicitly healthy" as cleared. + if condition == nil { + return true + } + return condition.Status == v1.ConditionFalse }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "SysLogsXIDError condition should be cleared")Also applies to: 350-371, 601-621
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
119-125: Fix PCI normalization to guarantee exact matching between unhealthy and healthy events.
Current xid-handler normalization only strips.0and doesn’t lowercase/domain-normalize, while metadata’s normalization is stricter; this can break the “same impacted entities” contract.Proposed fix (make xid-handler normalization match metadata normalization semantics)
func (xidHandler *XIDHandler) normalizePCI(pci string) string { - if idx := strings.Index(pci, "."); idx != -1 { - return pci[:idx] - } - - return pci + pci = strings.TrimSpace(pci) + parts := strings.Split(pci, ":") + if len(parts) != 3 { + // Best-effort fallback: strip function if present and lowercase + if idx := strings.Index(pci, "."); idx != -1 { + pci = pci[:idx] + } + return strings.ToLower(pci) + } + + domain := parts[0] + if len(domain) > 4 { + domain = domain[len(domain)-4:] + } + + busDeviceFunc := parts[2] + if idx := strings.Index(busDeviceFunc, "."); idx != -1 { + busDeviceFunc = busDeviceFunc[:idx] + } + + return fmt.Sprintf("%s:%s:%s", + strings.ToLower(domain), + strings.ToLower(parts[1]), + strings.ToLower(busDeviceFunc)) }func (xidHandler *XIDHandler) createHealthEventFromResponse( xidResp *parser.Response, message string, ) *pb.HealthEvents { - normPCI := xidHandler.normalizePCI(xidResp.Result.PCIE) + normPCI := xidHandler.normalizePCI(xidResp.Result.PCIE) uuid, fromMetadata := xidHandler.getGPUUUID(normPCI) entities := getDefaultImpactedEntities(normPCI, uuid)func (xidHandler *XIDHandler) createHealthEventGPUResetEvent(uuid string) (*pb.HealthEvents, error) { gpuInfo, err := xidHandler.metadataReader.GetInfoByUUID(uuid) // ... if len(gpuInfo.PCIAddress) == 0 { return nil, fmt.Errorf("failed to look up PCI info using UUID %s", uuid) } - entities := getDefaultImpactedEntities(gpuInfo.PCIAddress, uuid) + normPCI := xidHandler.normalizePCI(gpuInfo.PCIAddress) + entities := getDefaultImpactedEntities(normPCI, uuid)Also applies to: 195-199, 254-280, 333-348
🤖 Fix all issues with AI agents
In @health-monitors/syslog-health-monitor/pkg/metadata/reader.go:
- Around line 108-119: In Reader.GetInfoByUUID normalize the incoming uuid by
calling strings.TrimSpace on the uuid parameter, validate that the trimmed uuid
is not empty (return a clear error like "empty UUID" or similar), then use the
trimmed value for the lookup against r.uuidToInfo and in the error messages;
reference the Reader.GetInfoByUUID function and the r.uuidToInfo map when making
these edits so you normalize input before the existence check to avoid false
"not found" results from whitespace/newline.
In @health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go:
- Around line 79-82: The parsed GPU UUID from xidHandler.parseGPUResetLine may
include trailing whitespace/newlines; update the call sites (the block using
uuid := xidHandler.parseGPUResetLine(message) and the analogous block around
lines handling createHealthEventGPUResetEvent) to trim the returned string
(e.g., assign trimmed := strings.TrimSpace(uuid) and use trimmed for the
nil/empty check and when invoking xidHandler.createHealthEventGPUResetEvent) so
trailing whitespace/newlines are removed before further processing.
- Around line 133-148: In getGPUUUID, avoid returning fromMetadata=true when the
metadata lookup yields an empty UUID and reduce noisy error logging: after
calling xidHandler.metadataReader.GetGPUByPCI(normPCI) verify both err==nil and
gpuInfo.UUID != "" before returning (return gpuInfo.UUID, true only when
non-empty); if err!=nil log at a lower level (e.g., Debug or Info) instead of
Error to avoid spamming when metadata is absent; keep the existing fallback to
xidHandler.pciToGPUUUID[normPCI] when metadata gave no UUID. Apply the same
non-empty-UUID check and lowered log level to the analogous code block at the
other location referenced (lines ~223-230) that also calls GetGPUByPCI and
returns fromMetadata.
In @tests/syslog_health_monitor_test.go:
- Around line 312-317: The injected GPU reset syslog message in xidMessages
includes a trailing newline which can break deterministic parsing; update the
test so the string in xidMessages (used by helpers.InjectSyslogMessages) does
not end with "\n" (e.g., "kernel: [16450076.435595] GPU reset executed:
GPU-22222222-2222-2222-2222-222222222222") so the parser receives a stable
message format.
🧹 Nitpick comments (2)
tests/syslog_health_monitor_test.go (1)
41-633: Consider extracting shared syslog test setup/teardown into helpers (reduces duplication).
Port-forward + metadata inject + ManagedByNVSentinel label wiring is repeated across several tests; a small helper returning{nodeName, podName, stopChan}would simplify future additions.health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
222-507: Good scenario coverage for GPU reset and remediation override (metadata vs dmesg).
One nit: consider asserting the PCI entity value in the override tests too (currently only type is asserted).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
docs/configuration/syslog-health-monitor.mddocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/helpers/kube.gotests/helpers/metadata.gotests/node_drainer_test.gotests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/gpu_health_monitor_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/helpers/metadata.gotests/helpers/kube.gotests/syslog_health_monitor_test.gotests/node_drainer_test.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/gpu_health_monitor_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/syslog_health_monitor_test.gotests/node_drainer_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/gpu_health_monitor_test.gotests/helpers/kube.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/gpu_health_monitor_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/syslog_health_monitor_test.gotests/node_drainer_test.go
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.godocs/syslog-health-monitor.mddocs/configuration/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.godocs/syslog-health-monitor.mddocs/configuration/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.godocs/configuration/syslog-health-monitor.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧬 Code graph analysis (6)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
DeletePod(453-475)NVSentinelNamespace(64-64)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-38)health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (1)
Response(29-33)data-models/pkg/protos/health_event.pb.go (13)
HealthEvents(107-113)HealthEvents(126-126)HealthEvents(141-143)HealthEvent(211-230)HealthEvent(243-243)HealthEvent(258-260)RecommendedAction(40-40)RecommendedAction(90-92)RecommendedAction(94-96)RecommendedAction(103-105)Entity(159-165)Entity(178-178)Entity(193-195)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
data-models/pkg/model/gpu_metadata.go (1)
GPUInfo(26-33)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-38)health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
NewXIDHandler(37-59)data-models/pkg/protos/health_event.pb.go (13)
HealthEvents(107-113)HealthEvents(126-126)HealthEvents(141-143)HealthEvent(211-230)HealthEvent(243-243)HealthEvent(258-260)Entity(159-165)Entity(178-178)Entity(193-195)RecommendedAction(40-40)RecommendedAction(90-92)RecommendedAction(94-96)RecommendedAction(103-105)health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (2)
Response(29-33)XIDDetails(36-48)
tests/syslog_health_monitor_test.go (2)
tests/helpers/metadata.go (3)
DeleteMetadata(183-229)CreateTestMetadata(59-108)InjectMetadata(110-181)tests/helpers/kube.go (8)
NVSentinelNamespace(64-64)DeletePod(453-475)GetPodOnWorkerNode(1412-1442)PortForwardPod(1598-1642)SetNodeManagedByNVSentinel(1373-1392)EventuallyWaitTimeout(61-61)WaitInterval(63-63)CheckNodeConditionExists(1267-1284)
tests/node_drainer_test.go (1)
tests/helpers/kube.go (1)
DeletePod(453-475)
🔇 Additional comments (9)
docs/configuration/syslog-health-monitor.md (1)
5-5: LGTM! Documentation accurately reflects GPU reset support.The documentation updates correctly describe the new GPU reset monitoring capability:
- Overview includes GPU reset events
- SysLogsXIDError check description clarified to include GPU reset messages
- New subsection explains GPU reset as a remediation completion signal
The changes align well with the PR objectives and maintain consistency with the existing documentation structure.
Also applies to: 58-58, 79-82
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
25-26: LGTM! GPU reset regex pattern is correct.The new
gpuResetMapregex correctly captures GPU reset events with the expected formatGPU reset executed: GPU-<UUID>. The alignment adjustment on line 25 improves code readability.docs/syslog-health-monitor.md (1)
5-5: LGTM! Comprehensive documentation for GPU reset feature.The documentation updates effectively communicate the GPU reset monitoring capability:
- Clarifies that GPU resets are events indicating remediation completion, not just failures
- Integrates GPU reset into the monitoring workflow description
- Adds dedicated subsection explaining GPU reset semantics
- Updates configuration examples to reflect GPU reset support
The changes are well-structured and maintain consistency across the documentation.
Also applies to: 11-18, 25-29, 42-42, 79-82
tests/helpers/metadata.go (1)
71-71: LGTM! PCI address normalization aligns with production code.The test metadata has been updated to use normalized PCI addresses (removing the trailing ".0" function suffix). This change ensures test data consistency with the production code's PCI normalization logic, as described in the PR objectives where normalized PCI is used as the lookup key for GPU metadata.
Also applies to: 82-82, 92-92, 100-100
tests/node_drainer_test.go (1)
97-97: LGTM! Function call correctly updated for new DeletePod signature.The DeletePod invocation has been correctly updated to pass
tas the second parameter, matching the helper's new signature that enables post-deletion verification within the helper function.tests/gpu_health_monitor_test.go (1)
62-68: Update tohelpers.DeletePod(ctx, t, ...)is consistent with new helper signature.Also applies to: 302-305
tests/helpers/kube.go (1)
696-742:DrainRunningPodsInNamespacecorrectly passestthrough toDeletePod.health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
30-151: GPU reset parsing tests are clear and table-driven; good coverage of valid/invalid lines.health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
61-77: The concurrent map access concern does not apply to this code.The
pciToGPUUUIDmap in XIDHandler is accessed only sequentially. Within eachSyslogMonitor.Run()cycle, all checks (includingProcessLinecalls) execute in a sequential for-loop without goroutines. Across multipleRun()cycles, execution is serialized by the polling ticker and the blocking nature ofRun(), meaning the same XIDHandler instance cannot be accessed concurrently. The default polling interval is 30 minutes, and only a single SyslogMonitor instance exists in the application.Likely an incorrect or invalid review comment.
3b50eee to
0cb6782
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/helpers/kube.go (1)
453-475: Replacerequire.Eventuallyfatal assertion with error-returning timeout logic.
DeletePodusesrequire.Eventuallywhich will fatally abort the test on timeout, yet the function signature returnserrorand callers (likeDrainRunningPodsInNamespace) expect to handle errors viat.Errorf. This breaks error handling: if the pod deletion times out, the test fails immediately before the error is returned, making the caller's error handling unreachable and terminating the entire drain operation. Additionally, if the pod is already deleted, treatIsNotFoundon the delete call as success (idempotency).Convert the wait loop to return errors instead of calling
require.Eventually, and addt.Helper()to preserve caller context in test output. The pod should be checked for existence before starting the wait (implicit via the initial Delete call), and transient Get errors during polling should be logged but not fatal.
🧹 Nitpick comments (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
26-26: Optional: Consider a stricter GPU UUID regex pattern.The current pattern
[0-9a-fA-F-]+is permissive and will match malformed UUIDs likeGPU--123. If strict validation is desired, consider a pattern that enforces the standard GPU UUID format:GPU-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}.However, the current permissive pattern is consistent with
reNvrmMapand may be intentionally lenient for robustness.♻️ More strict regex pattern
- gpuResetMap = regexp.MustCompile(`GPU reset executed: (GPU-[0-9a-fA-F-]+)`) + gpuResetMap = regexp.MustCompile(`GPU reset executed: (GPU-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})`)You may also want to update
reNvrmMapfor consistency if you choose this approach.tests/gpu_health_monitor_test.go (1)
65-69: Minor:WaitForPodsDeletedis likely redundant now thathelpers.DeletePodwaits for API absence.
IfDeletePodkeeps the “wait until NotFound” behavior, consider removing the extra wait to reduce test duration/flakiness.tests/syslog_health_monitor_test.go (1)
239-374: GPU reset E2E test: strengthen assertions to match the PR’s key invariants (impacted entities + action gating).
Suggestions:
- Avoid hard-coding
GPU-2222...unless it’s guaranteed byhelpers.CreateTestMetadata(...); derive expected UUID/PCI from the same test metadata fixture to prevent brittle failures.- Assert the “healthy after reset” artifact also includes the same normalized PCI + GPU UUID (not just that the condition becomes healthy), since that’s a core contract of COMPONENT_RESET matching.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
docs/configuration/syslog-health-monitor.mddocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/helpers/kube.gotests/helpers/metadata.gotests/node_drainer_test.gotests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- health-monitors/syslog-health-monitor/pkg/metadata/reader.go
- tests/node_drainer_test.go
- docs/configuration/syslog-health-monitor.md
- tests/helpers/metadata.go
- health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
- docs/syslog-health-monitor.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gotests/helpers/kube.gotests/gpu_health_monitor_test.gotests/syslog_health_monitor_test.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/syslog_health_monitor_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/syslog_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/helpers/kube.gotests/gpu_health_monitor_test.go
🧬 Code graph analysis (2)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-38)health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
NewXIDHandler(37-59)data-models/pkg/protos/health_event.pb.go (7)
Entity(159-165)Entity(178-178)Entity(193-195)RecommendedAction(40-40)RecommendedAction(90-92)RecommendedAction(94-96)RecommendedAction(103-105)health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (2)
Response(29-33)XIDDetails(36-48)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
DeletePod(453-475)NVSentinelNamespace(64-64)
🔇 Additional comments (16)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (5)
30-86: LGTM: Well-structured test metadata.The test metadata constants provide comprehensive coverage for testing both normal and edge-case scenarios (missing PCI address). The use of realistic GPU metadata structure with normalized PCI addresses aligns well with the normalizePCI behavior.
120-151: LGTM: Comprehensive GPU reset line parsing tests.The test cases properly cover valid GPU reset lines, invalid formats, and ensure that XID log lines (which contain GPU UUIDs but aren't reset lines) are correctly rejected. The table-driven structure follows coding guidelines.
223-227: LGTM: Proper test file management.Using
t.TempDir()and creating temporary metadata files with secure permissions (0600) is the correct approach for filesystem-backed tests. This ensures test isolation and automatic cleanup.
296-363: LGTM: Thorough GPU reset event generation tests.The test cases effectively validate:
- Successful healthy event generation when metadata is available
- Error handling when metadata collector is not initialized
- Error handling when GPU UUID is not found in metadata
- Error handling when PCI address is missing from metadata
These scenarios ensure robust behavior across different metadata collector states.
365-436: LGTM: Critical remediation override logic is well-tested.These tests validate the core PR behavior:
- When GPU UUID is sourced from NVRM (dmesg), RESET_GPU is correctly overridden to RESTART_VM
- When GPU UUID is sourced from metadata collector, COMPONENT_RESET is preserved
This distinction is crucial for ensuring GPUs can only be reset when the same impacted entities can be resolved after the reset completes, as explained in the PR objectives.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (8)
33-35: LGTM: Clear constant for healthy event message.Using a named constant improves maintainability and ensures consistency across healthy GPU reset events.
79-82: LGTM: GPU reset detection properly positioned.Checking for GPU reset lines before XID parsing is correct, as reset lines have a different format and should generate healthy events rather than being parsed as XIDs.
101-108: LGTM: Simple and correct GPU reset line parsing.The implementation correctly extracts the GPU UUID from reset lines and returns an empty string when the pattern doesn't match.
133-148: Verify thread-safety of pciToGPUUUID map access.The
pciToGPUUUIDmap is written at line 70 and read here at line 143 without synchronization. IfProcessLinecan be called concurrently from multiple goroutines, this could cause a data race.Please verify whether
ProcessLineis called:
- Sequentially from a single goroutine (no issue)
- Concurrently from multiple goroutines (requires synchronization with sync.RWMutex)
If concurrent access is possible, consider protecting the map with a
sync.RWMutex:type XIDHandler struct { nodeName string defaultAgentName string defaultComponentClass string checkName string mu sync.RWMutex pciToGPUUUID map[string]string parser parser.Parser metadataReader *metadata.Reader }Then use
xidHandler.mu.Lock()/Unlock()for writes andxidHandler.mu.RLock()/RUnlock()for reads.
150-190: LGTM: Excellent documentation for complex logic.This comment block thoroughly explains the GPU reset workflow, including:
- Why both PCI and GPU UUID are required for COMPONENT_RESET
- The distinction between metadata-sourced vs dmesg-sourced UUIDs
- The remediation action override behavior
- Example unhealthy and healthy event flows
This documentation is invaluable for understanding the design decisions and will help future maintainers.
224-230: LGTM: Remediation override logic correctly implements PR requirements.The override from COMPONENT_RESET to RESTART_VM when the GPU UUID is not from metadata is a critical safety measure. This ensures GPU resets are only attempted when the same impacted entities (PCI + UUID) can be reliably resolved after the reset, as explained in the documentation comment and PR objectives.
The logging at line 227 provides good visibility into when this override occurs.
254-286: LGTM: Robust healthy event generation with proper validation.The function correctly:
- Looks up GPU info from metadata using the UUID (line 255)
- Returns an error if lookup fails or PCI is missing (lines 259-265), ensuring partial events aren't sent
- Uses the same
getDefaultImpactedEntitieshelper for consistency (line 267)- Sets appropriate fields for a healthy event (IsHealthy=true, IsFatal=false, RecommendedAction=NONE)
The comment at lines 256-258 clearly explains why returning an error for incomplete data is necessary (fault-quarantine-module matching requirements).
333-348: LGTM: Consistent impacted entities composition.This helper ensures that:
- PCI is always included as the first entity
- GPU_UUID is conditionally added when available
- The same entity structure is used for both unhealthy and healthy events
This consistency is critical for the fault-quarantine-module to match events correctly.
tests/gpu_health_monitor_test.go (1)
302-305: LGTM: updatedhelpers.DeletePod(ctx, t, ...)call site.
No concerns with the signature update here.tests/syslog_health_monitor_test.go (2)
204-234: Teardown ordering: consider restarting the pod before deleting injected metadata (if startup/clear logic depends on metadata).
Not necessarily wrong, but if the syslog monitor’s “healthy event” generation/impacted-entity matching needs metadata, deleting it first could add flakes.
592-611: LGTM: teardown updated to usehelpers.DeletePod(ctx, t, ...).
This aligns with the new helper signature and keeps cleanup consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/helpers/kube.go (1)
453-474: Fix lint + avoid mutating outererrinsiderequire.Eventuallyclosure.
This should address the whitespace lint failure and makes the polling closure side-effect free.Proposed diff
func DeletePod(ctx context.Context, t *testing.T, c klient.Client, namespace, podName string) error { + t.Helper() + pod := &v1.Pod{ ObjectMeta: metav1.ObjectMeta{ Name: podName, Namespace: namespace, }, } err := c.Resources().Delete(ctx, pod) if err != nil { return fmt.Errorf("failed to delete pod %s: %w", podName, err) } require.Eventually(t, func() bool { - err = c.Resources().Get(ctx, podName, namespace, pod) - if err != nil { - return apierrors.IsNotFound(err) + getErr := c.Resources().Get(ctx, podName, namespace, pod) + if getErr != nil { + return apierrors.IsNotFound(getErr) } + return false }, EventuallyWaitTimeout, WaitInterval, "pod %s should be removed from API", podName) return nil }tests/syslog_health_monitor_test.go (1)
204-234: Teardown “cleared” check may be too strict if conditions are removed instead of set healthy.
IfSysLogsXIDErroris ever cleared by deleting the condition (nil) rather than flipping it toStatus=Falsewith reasonSysLogsXIDErrorIsHealthy, this teardown can flake.Consider accepting either “healthy” or “not present”:
Proposed diff
require.Eventually(t, func() bool { condition, err := helpers.CheckNodeConditionExists(ctx, client, nodeName, "SysLogsXIDError", "SysLogsXIDErrorIsHealthy") if err != nil { return false } - return condition != nil && condition.Status == v1.ConditionFalse + return condition == nil || condition.Status == v1.ConditionFalse }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "SysLogsXIDError condition should be cleared")
🧹 Nitpick comments (4)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
150-190: Consider relocating detailed design rationale.While this documentation is valuable, the extensive implementation rationale and alternative approaches might be better suited for package-level godoc or a design document rather than inline comments. This would improve code readability while preserving the context.
As per coding guidelines, inline comments should be reserved for complex logic, and package-level documentation is preferred for broader context.
tests/gpu_health_monitor_test.go (1)
62-69: DeletePod now waits for API removal;WaitForPodsDeletedmay be redundant here.
Sincehelpers.DeletePod()now includes an eventual NotFound check, consider dropping the extrahelpers.WaitForPodsDeleted(...)to reduce latency/flakes (unless you’re intentionally waiting for additional downstream cleanup beyond API removal).tests/syslog_health_monitor_test.go (2)
239-305: GPU reset test is valuable, but hard-coded GPU UUID is brittle.
Since the test already creates/injects metadata, consider pulling the expected UUID from the injectedmetadata(or using a regex + separately asserting “same UUID before/after”) instead of pinningGPU-2222....
592-611: Same teardown strictness risk in SXID cleanup.
As with XID teardown, consider treating “condition removed” as cleared to avoid flakes if the controller clears by deletion instead of flipping toIsHealthy.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
docs/configuration/syslog-health-monitor.mddocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/helpers/kube.gotests/helpers/metadata.gotests/node_drainer_test.gotests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- health-monitors/syslog-health-monitor/pkg/metadata/reader.go
- tests/node_drainer_test.go
- docs/configuration/syslog-health-monitor.md
- docs/syslog-health-monitor.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gotests/helpers/kube.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/syslog_health_monitor_test.gotests/helpers/metadata.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/syslog_health_monitor_test.go
🧠 Learnings (8)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/syslog_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/helpers/kube.gotests/gpu_health_monitor_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧬 Code graph analysis (5)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
NewReader(46-50)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-38)health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (1)
Response(29-33)data-models/pkg/protos/health_event.pb.go (16)
HealthEvents(107-113)HealthEvents(126-126)HealthEvents(141-143)RecommendedAction_COMPONENT_RESET(44-44)RecommendedAction_RESTART_VM(47-47)HealthEvent(211-230)HealthEvent(243-243)HealthEvent(258-260)RecommendedAction(40-40)RecommendedAction(90-92)RecommendedAction(94-96)RecommendedAction(103-105)RecommendedAction_NONE(43-43)Entity(159-165)Entity(178-178)Entity(193-195)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-38)health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
NewXIDHandler(37-59)health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (2)
Response(29-33)XIDDetails(36-48)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
DeletePod(453-475)NVSentinelNamespace(64-64)
tests/syslog_health_monitor_test.go (7)
tests/helpers/metadata.go (3)
DeleteMetadata(183-229)CreateTestMetadata(59-108)InjectMetadata(110-181)tests/helpers/kube.go (9)
NVSentinelNamespace(64-64)DeletePod(453-475)RemoveNodeManagedByNVSentinelLabel(1395-1409)GetPodOnWorkerNode(1412-1442)PortForwardPod(1598-1642)SetNodeManagedByNVSentinel(1373-1392)InjectSyslogMessages(1804-1827)VerifyNodeConditionMatchesSequence(1830-1867)CheckNodeConditionExists(1267-1284)platform-connectors/pkg/transformers/metadata/transformer.go (1)
New(42-63)platform-connectors/pkg/pipeline/pipeline.go (1)
New(35-37)platform-connectors/pkg/pipeline/config.go (1)
Config(17-21)platform-connectors/pkg/transformers/overrides/config.go (1)
Config(25-28)commons/pkg/auditlogger/auditlogger.go (1)
Log(114-134)
🪛 GitHub Check: modules-lint-test (tests)
tests/helpers/kube.go
[failure] 471-471:
missing whitespace above this line (too many lines above return)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (34)
- GitHub Check: ko-build-test (event-exporter, .)
- GitHub Check: ko-build-test (platform-connectors, .)
- GitHub Check: ko-build-test (labeler, .)
- GitHub Check: ko-build-test (janitor, .)
- GitHub Check: ko-build-test (fault-quarantine, .)
- GitHub Check: ko-build-test (health-monitors/kubernetes-object-monitor, .)
- GitHub Check: ko-build-test (fault-remediation, .)
- GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/maintenance-notifier)
- GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/csp-health-monitor)
- GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
- GitHub Check: ko-build-test (node-drainer, .)
- GitHub Check: ko-build-test (health-events-analyzer, .)
- GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
- GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
- GitHub Check: modules-lint-test (event-exporter)
- GitHub Check: modules-lint-test (metadata-collector)
- GitHub Check: modules-lint-test (fault-quarantine)
- GitHub Check: modules-lint-test (platform-connectors)
- GitHub Check: modules-lint-test (health-events-analyzer)
- GitHub Check: modules-lint-test (commons)
- GitHub Check: modules-lint-test (janitor)
- GitHub Check: modules-lint-test (store-client)
- GitHub Check: modules-lint-test (fault-remediation)
- GitHub Check: modules-lint-test (labeler)
- GitHub Check: health-monitors-lint-test (syslog-health-monitor)
- GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
- GitHub Check: health-monitors-lint-test (csp-health-monitor)
- GitHub Check: CodeQL PR Analysis
- GitHub Check: E2E Tests (AMD64 + MongoDB)
- GitHub Check: E2E Tests (ARM64 + MongoDB)
- GitHub Check: E2E Tests (ARM64 + PostgreSQL)
- GitHub Check: E2E Tests (AMD64 + PostgreSQL)
🔇 Additional comments (20)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
24-27: LGTM!The new
gpuResetMapregex correctly matches the GPU reset line format with GPU UUID capture group. The alignment adjustment toreNvrmMapimproves readability.tests/helpers/metadata.go (1)
67-105: LGTM! Test data aligns with PCI normalization.The trimmed PCI addresses (removing
.0suffix) correctly match the normalized format used in production code. This ensures metadata lookups work correctly in tests.health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
176-221: LGTM! Test follows established patterns.The table-driven test structure correctly validates
GetInfoByUUIDfor success and error scenarios. The test is consistent with existing test patterns in this file.health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (7)
33-35: LGTM!Good practice to define the healthy event message as a package-level constant for consistency.
79-82: LGTM!The GPU reset detection is correctly placed before XID parsing to short-circuit processing. The logging provides good observability.
101-108: LGTM!The parsing logic correctly extracts the GPU UUID using the regex capture group, with appropriate bounds checking.
133-148: LGTM!The signature change correctly tracks UUID provenance, enabling the remediation override logic. The fallback from metadata to dmesg is handled gracefully with appropriate error logging.
191-252: LGTM!The remediation override logic correctly implements the requirement that
COMPONENT_RESETis only permitted when the GPU UUID is sourced from metadata. The logging provides good observability for the override.
254-286: LGTM!The healthy event generation correctly requires both UUID and PCI from metadata, failing fast if either is unavailable. This ensures entity matching consistency as documented in the implementation comments.
333-348: LGTM!Good refactoring to centralize impacted entity construction, ensuring consistent ordering and reducing duplication.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (7)
19-20: LGTM!The additional imports support filesystem-backed test metadata scenarios.
30-86: LGTM!The test metadata fixtures correctly use normalized PCI format and cover both valid and missing-PCI scenarios for comprehensive testing.
120-151: LGTM!The test provides good coverage of the GPU reset line parsing logic with valid, invalid, and non-matching cases. Follows the table-driven test pattern as per coding guidelines.
223-227: LGTM!The test setup correctly creates temporary metadata files for filesystem-backed test scenarios.
296-363: LGTM! Comprehensive GPU reset test coverage.The test cases thoroughly validate GPU reset event generation including:
- Successful healthy event creation with PCI+UUID from metadata
- Error handling when metadata is unavailable
- Error handling when GPU UUID is missing from metadata
- Error handling when PCI address is missing from metadata
All scenarios are correctly validated.
365-436: LGTM! Override logic thoroughly tested.The test cases correctly validate the remediation override behavior:
- RESET_GPU → RESTART_VM when UUID is sourced from dmesg
- RESET_GPU → COMPONENT_RESET (unchanged) when UUID is sourced from metadata
This ensures the PR requirement is properly implemented and tested.
569-580: LGTM!The updated PCI value expectation correctly reflects the normalized format without the
.0suffix, consistent with the normalization logic.tests/gpu_health_monitor_test.go (1)
302-305: DeletePod signature update looks correct.
Passingtaligns with the helper’s new behavior (eventual absence check).tests/helpers/kube.go (1)
696-742: DrainRunningPodsInNamespace callsite update is correct.
Passingtpreserves the helper’s improved delete verification behavior.tests/syslog_health_monitor_test.go (1)
307-329: Nice coverage of the “XID → reset → healthy” flow.
The end assertion matches the PR objective that a reset completion should emit a healthy event that clears the condition.
🛡️ CodeQL Analysis🚨 Found 1 security alert(s) |
0cb6782 to
b0e0bcf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/helpers/kube.go (1)
453-475:DeletePodhas conflicting semantics: it returnserrorbut usesrequire.Eventuallywhich hard-fails the test.When the condition in
require.Eventuallyis not met, it callsFailNow()on the*testing.Tinstance, terminating the test immediately rather than allowing the function to return an error. This conflicts with the function's error return type—callers expecting to handle an error will never receive one because the test fails first.Additionally, if the context is already canceled before entering the assertion block, the
Getcalls will continue failing untilEventuallyWaitTimeoutexpires (a long wait with no hope of success).The proposed fix correctly addresses this by:
- Replacing
require.Eventuallywith explicit timer/ticker logic- Checking
ctx.Err()upfront to return immediately if the context is already done- Returning errors to callers instead of hard-failing tests
- Handling the case where a pod doesn't exist after deletion (idempotent)
Proposed fix
func DeletePod(ctx context.Context, t *testing.T, c klient.Client, namespace, podName string) error { + t.Helper() + pod := &v1.Pod{ ObjectMeta: metav1.ObjectMeta{ Name: podName, Namespace: namespace, }, } err := c.Resources().Delete(ctx, pod) if err != nil { + if apierrors.IsNotFound(err) { + return nil + } return fmt.Errorf("failed to delete pod %s: %w", podName, err) } - require.Eventually(t, func() bool { - err = c.Resources().Get(ctx, podName, namespace, pod) - if err != nil { - return apierrors.IsNotFound(err) - } - return false - }, EventuallyWaitTimeout, WaitInterval, "pod %s should be removed from API", podName) + deadline := time.NewTimer(EventuallyWaitTimeout) + ticker := time.NewTicker(WaitInterval) + defer deadline.Stop() + defer ticker.Stop() + + for { + if ctx.Err() != nil { + return fmt.Errorf("context done while waiting for pod %s to be removed: %w", podName, ctx.Err()) + } + + err = c.Resources().Get(ctx, podName, namespace, pod) + if err != nil && apierrors.IsNotFound(err) { + return nil + } + + select { + case <-deadline.C: + return fmt.Errorf("pod %s was not removed from API within %s", podName, EventuallyWaitTimeout) + case <-ticker.C: + } + } - return nil }
🧹 Nitpick comments (1)
tests/syslog_health_monitor_test.go (1)
353-372: Teardown duplication: consider extracting a shared syslog-monitor cleanup helper.Multiple tests repeat “stop port-forward → DeleteMetadata → restart syslog-health-monitor → remove label”. A small helper (e.g.,
helpers.TeardownSyslogHealthMonitorTest(...)) would reduce drift between test suites.Also applies to: 592-600
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
docs/configuration/syslog-health-monitor.mddocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/helpers/kube.gotests/node_drainer_test.gotests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- health-monitors/syslog-health-monitor/pkg/metadata/reader.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/gpu_health_monitor_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gotests/helpers/kube.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/node_drainer_test.gotests/syslog_health_monitor_test.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/gpu_health_monitor_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/node_drainer_test.gotests/syslog_health_monitor_test.go
🧠 Learnings (8)
📓 Common learnings
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/gpu_health_monitor_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/node_drainer_test.gotests/syslog_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/gpu_health_monitor_test.gotests/helpers/kube.go
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
docs/configuration/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.godocs/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Applied to files:
docs/configuration/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.godocs/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:38:44.962Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:44.962Z
Learning: In NVSentinel, XID 74 errors will always have an NVLINK entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for NVLINK entities in XID 74-specific rules.
Applied to files:
docs/configuration/syslog-health-monitor.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
🧬 Code graph analysis (4)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
DeletePod(453-475)NVSentinelNamespace(64-64)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
NewReader(46-50)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-38)health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
NewXIDHandler(37-59)health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (3)
HealthEvent(54-128)Entity(46-52)RecommendedAction(14-24)health-monitors/syslog-health-monitor/pkg/xid/parser/parser.go (2)
Response(29-33)XIDDetails(36-48)
tests/node_drainer_test.go (1)
tests/helpers/kube.go (1)
DeletePod(453-475)
🔇 Additional comments (14)
docs/configuration/syslog-health-monitor.md (1)
5-5: Documentation updates properly reflect GPU reset monitoring.GPU reset support is consistently integrated throughout the configuration documentation:
- Line 5 correctly lists GPU reset events alongside other monitored events
- Line 45 accurately reflects that GPU reset monitoring is bundled under the existing SysLogsXIDError check (no new check type)
- Line 58 appropriately notes that SysLogsXIDError now monitors GPU reset messages in addition to XID errors
The updates maintain the right level of detail for a configuration reference without over-specifying implementation details.
Also applies to: 45-45, 58-58
docs/syslog-health-monitor.md (1)
5-5: Comprehensive and accurate GPU reset documentation.GPU reset support is well-integrated throughout the main documentation:
- Line 5 appropriately separates GPU resets from failure events as "other GPU-related events"
- Lines 11-16 clearly position GPU Reset alongside other notable occurrences
- Line 25 documents GPU reset log pattern detection as part of the monitoring pipeline
- Line 29 correctly adds gRPC mention for health event emission
- Line 42 clarifies that GPU reset monitoring is part of the SysLogsXIDError check
- Lines 81-82 provide a clear subsection explaining that GPU resets indicate completed remediation
The documentation accurately reflects that the system detects GPU resets in logs and emits corresponding healthy events, which aligns with the feature's remediation-completion semantics described in the PR objectives.
Also applies to: 11-16, 25-25, 29-29, 42-42, 81-82
tests/node_drainer_test.go (1)
96-99: DeletePod call-site update looks correct.tests/gpu_health_monitor_test.go (2)
65-69: DeletePod call-site update looks correct (setup restart path).
302-305: DeletePod call-site update looks correct (restart to trigger reconnection).tests/helpers/kube.go (1)
696-742: DrainRunningPodsInNamespace update is consistent with the new DeletePod signature.tests/syslog_health_monitor_test.go (2)
204-234: Teardown sequencing looks solid; consider idempotency once DeletePod is fixed.This cleanup (DeleteMetadata → restart pod → wait for healthy condition → remove label) is the right shape; once
helpers.DeletePodbecomes idempotent / ctx-aware, it should also reduce teardown flakes.
239-329: No action required—this test is not brittle. The UUIDGPU-22222222-2222-2222-2222-222222222222is deterministically assigned to PCI address0002:00:00byCreateTestMetadata, and the GPU reset detector uses a flexible regex pattern (GPU reset executed: (GPU-[0-9a-fA-F-]+)) that accepts any valid UUID format. The hard-coded values are intentional and correct.Likely an incorrect or invalid review comment.
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
176-221: LGTM! Well-structured test for UUID lookup.The test follows best practices with table-driven design and covers the essential scenarios: exact matches for different GPUs and error handling for non-existent UUIDs. The implementation is consistent with existing test patterns in the file.
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
24-27: LGTM! GPU reset pattern detection added.The new
gpuResetMapregex correctly matches GPU reset log lines and extracts the GPU UUID. The pattern aligns with the PR's goal of detecting GPU reset events.health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (4)
30-86: LGTM! Test metadata constants support comprehensive scenarios.The test metadata constants provide realistic data for testing GPU reset functionality with different configurations (complete metadata vs. missing PCI). The simplified UUID "GPU-123" makes test assertions clearer while still exercising the full code path.
120-151: LGTM! Thorough test coverage for GPU reset line parsing.The test validates both successful GPU UUID extraction and proper handling of invalid/mismatched patterns. The inclusion of an NVRM line as a negative test case is particularly good for ensuring pattern specificity.
222-363: LGTM! Comprehensive GPU reset test coverage.The test setup properly uses filesystem-backed metadata files for realistic testing. The GPU reset scenarios thoroughly validate:
- Successful healthy event creation with normalized PCI and GPU UUID
- Error handling when metadata is unavailable, UUID not found, or PCI missing
This aligns perfectly with the PR objective that GPU reset events require metadata-collector lookups for PCI resolution.
365-436: LGTM! Remediation action override logic correctly tested.These test cases validate the critical requirement from the PR objectives:
- COMPONENT_RESET is only used when GPU UUID is sourced from metadata-collector
- When UUID is from NVRM/dmesg, the action is overridden to RESTART_VM
The test setup properly distinguishes between the two UUID sources (pciToGPUUUID map vs. metadataFile), and the assertions confirm the correct remediation actions are applied in each scenario.
🛡️ CodeQL Analysis🚨 Found 1 security alert(s) |
Signed-off-by: Nathan Herz <[email protected]>
b0e0bcf to
262c7b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
222-507: MakeTestProcessLinehermetic: avoid/tmp/metadata.json(can exist on runners and break expectations).Several cases intentionally model “metadata collector not initialized / missing”, but using a fixed
/tmp/metadata.jsonpath makes the test outcome depend on the machine state. Prefer a guaranteed-missing path undert.TempDir()(and use the temp-backedmetadataFilewhen you want it initialized).Proposed fix (use a temp “missing” path instead of /tmp)
func TestProcessLine(t *testing.T) { tmpDir := t.TempDir() metadataFile := filepath.Join(tmpDir, "gpu_metadata.json") require.NoError(t, os.WriteFile(metadataFile, []byte(testMetadataJSON), 0600)) metadataFileMissingPCI := filepath.Join(tmpDir, "gpu_metadata_missing_pci.json") require.NoError(t, os.WriteFile(metadataFileMissingPCI, []byte(testMetadataMissingPCIJSON), 0600)) + metadataFileMissing := filepath.Join(tmpDir, "gpu_metadata_missing.json") testCases := []struct { name string message string setupHandler func() *XIDHandler expectEvent bool expectError bool validateEvent func(t *testing.T, events *pb.HealthEvents) }{ { name: "NVRM GPU Map Line", message: "NVRM: GPU at PCI:0000:00:08.0: GPU-12345678-1234-1234-1234-123456789012", setupHandler: func() *XIDHandler { - h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json") + h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing) h.parser = &mockParser{ parseFunc: func(msg string) (*parser.Response, error) { return nil, nil }, } return h }, expectEvent: false, expectError: false, }, { name: "Valid XID Message", message: "NVRM: Xid (PCI:0000:00:08.0): 79, pid=12345, name=test-process", setupHandler: func() *XIDHandler { - h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json") + h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing) h.parser = &mockParser{ parseFunc: func(msg string) (*parser.Response, error) { return &parser.Response{ Success: true, Result: parser.XIDDetails{ DecodedXIDStr: "Xid 79", PCIE: "0000:00:08.0", Mnemonic: "GPU has fallen off the bus", Resolution: "CONTACT_SUPPORT", Number: 79, }, }, nil }, } return h }, expectEvent: true, expectError: false, ... }, { name: "Valid GPU Reset Message with Metadata Collector not Initialized", message: "GPU reset executed: GPU-123", setupHandler: func() *XIDHandler { - h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json") + h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing) return h }, expectEvent: false, expectError: true, }, { name: "Valid XID with GPU UUID from NVRM: RESET_GPU overridden to RESTART_VM", message: "NVRM: Xid (PCI:0000:00:08.0): 79, pid=12345, name=test-process", setupHandler: func() *XIDHandler { - h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json") + h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing) h.parser = &mockParser{ ... } h.pciToGPUUUID["0000:00:08"] = "GPU-12345678-1234-1234-1234-123456789012" return h }, ... }, { name: "Parser Returns Error", message: "Some random message", setupHandler: func() *XIDHandler { - h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", "/tmp/metadata.json") + h, _ := NewXIDHandler("test-node", "test-agent", "GPU", "xid-check", "", metadataFileMissing) h.parser = &mockParser{ parseFunc: func(msg string) (*parser.Response, error) { return nil, errors.New("parse error") }, } return h }, expectEvent: false, expectError: false, }, ... }
🤖 Fix all issues with AI agents
In @docs/syslog-health-monitor.md:
- Around line 5-26: Update the “Enabled Checks” option list text under the
“Configuration Options”/“Enabled Checks” section to include “GPU reset”
alongside “XID, SXID, fallen-off-bus”, and ensure all related mentions in the
overview/config/check comments are updated to reflect the new set; also remove
any trailing whitespace in the affected paragraphs (the blocks that currently
list enabled checks and the surrounding overview text).
🧹 Nitpick comments (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
176-221: StrengthenTestGetInfoByUUIDassertions (currently only checksGPUID).Consider also asserting the returned
gpu.UUID(and optionallygpu.PCIAddressif present onmodel.GPUInfo) so the test guards against accidental UUID→wrong-GPU mapping regressions.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
docs/configuration/syslog-health-monitor.mddocs/syslog-health-monitor.mdhealth-monitors/syslog-health-monitor/pkg/metadata/reader.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/helpers/kube.gotests/node_drainer_test.gotests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- docs/configuration/syslog-health-monitor.md
- health-monitors/syslog-health-monitor/pkg/metadata/reader.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- health-monitors/syslog-health-monitor/pkg/xid/types.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/helpers/kube.gotests/gpu_health_monitor_test.gotests/node_drainer_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gotests/syslog_health_monitor_test.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/node_drainer_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gotests/syslog_health_monitor_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.godocs/syslog-health-monitor.md
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/gpu_health_monitor_test.gotests/node_drainer_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.gotests/syslog_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/metadata/reader_test.go
📚 Learning: 2025-12-12T07:36:34.289Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:34.289Z
Learning: In NVSentinel, XID 74 errors will always have a GPU_UUID entry in the healthevent.entitiesimpacted array, so null-checking with $ifNull is not required when filtering for GPU_UUID entities in XID 74-specific rules.
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.godocs/syslog-health-monitor.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/helpers/kube.gotests/gpu_health_monitor_test.go
🧬 Code graph analysis (4)
tests/gpu_health_monitor_test.go (1)
tests/helpers/kube.go (2)
DeletePod(453-476)NVSentinelNamespace(64-64)
tests/node_drainer_test.go (1)
tests/helpers/kube.go (1)
DeletePod(453-476)
health-monitors/syslog-health-monitor/pkg/metadata/reader_test.go (1)
health-monitors/syslog-health-monitor/pkg/metadata/reader.go (1)
NewReader(46-50)
tests/syslog_health_monitor_test.go (7)
tests/helpers/metadata.go (3)
DeleteMetadata(183-229)CreateTestMetadata(59-108)InjectMetadata(110-181)tests/helpers/kube.go (10)
NVSentinelNamespace(64-64)DeletePod(453-476)RemoveNodeManagedByNVSentinelLabel(1396-1410)GetPodOnWorkerNode(1413-1443)PortForwardPod(1599-1643)SetNodeManagedByNVSentinel(1374-1393)VerifyNodeConditionMatchesSequence(1831-1868)EventuallyWaitTimeout(61-61)WaitInterval(63-63)CheckNodeConditionExists(1268-1285)platform-connectors/pkg/transformers/metadata/transformer.go (1)
New(42-63)platform-connectors/pkg/pipeline/pipeline.go (1)
New(35-37)platform-connectors/pkg/pipeline/config.go (1)
Config(17-21)platform-connectors/pkg/transformers/overrides/config.go (1)
Config(25-28)commons/pkg/auditlogger/auditlogger.go (1)
Log(114-134)
🔇 Additional comments (7)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
120-151:TestParseGPUResetLinecoverage looks good.Covers valid/invalid reset lines and ensures it doesn’t false-positive on NVRM XID lines.
tests/node_drainer_test.go (1)
96-99: DeletePod call-site update matches the helper’s new signature.tests/gpu_health_monitor_test.go (1)
62-69: DeletePod call-sites updated correctly to passt.Also applies to: 302-305
tests/helpers/kube.go (1)
453-476: LGTM! Post-deletion verification enhances test reliability.The addition of the
*testing.Tparameter and post-deletion verification usingrequire.Eventuallyis a solid improvement. This ensures that the pod is actually removed from the API server before returning, preventing race conditions in tests that depend on pod absence.The signature change is consistently applied across all call sites (as noted in the AI summary), maintaining API consistency.
tests/syslog_health_monitor_test.go (3)
204-234: LGTM! Enhanced teardown improves test hygiene.The addition of metadata cleanup before pod deletion ensures proper test isolation and prevents metadata pollution between tests. The updated
DeletePodcall with thetparameter aligns with the improved helper signature, and the additional logging improves observability during teardown.
239-374: LGTM! Comprehensive GPU reset test follows established patterns.The new
TestSyslogHealthMonitorGPUResettest provides end-to-end coverage for GPU reset flows:
- First assess verifies XID error detection with COMPONENT_RESET action
- Second assess verifies that GPU reset events clear the error condition
- Teardown correctly omits the condition-clearing wait (line 361) because the last assess already verified it
The test structure follows existing conventions and includes appropriate logging for debugging.
592-599: LGTM! Consistent teardown enhancements.The metadata cleanup and updated
DeletePodcall maintain consistency with other test teardowns in this file, ensuring proper cleanup of test resources.
| The Syslog Health Monitor watches system logs for GPU-related errors that may not be caught by DCGM. It monitors journald/syslog for XID errors, SXID errors (NVSwitch/NVLink errors), and GPU fallen-off-bus events - critical failures that indicate serious GPU, NVSwitch, or driver problems. In addition to failures, it monitors system logs for other GPU-related events such as GPU resets to indicate that a required remediation action has completed. | ||
|
|
||
| Think of it as a log analyzer that reads between the lines - catching GPU and NVSwitch problems recorded in system logs that other monitoring might miss. | ||
|
|
||
| ### Why Do You Need This? | ||
|
|
||
| Some GPU and NVSwitch failures manifest in system logs before DCGM can detect them: | ||
| Some GPU and NVSwitch failures or events manifest in system logs before DCGM can detect them: | ||
|
|
||
| - **XID errors**: GPU hardware errors logged by the NVIDIA driver | ||
| - **SXID errors**: NVSwitch errors related to NVSwitch and NVLink interconnects | ||
| - **GPU fallen off the bus**: GPU became inaccessible to the system | ||
| - **GPU Reset**: A GPU reset was executed by nvidia-smi | ||
|
|
||
| These errors often appear in system logs first and can indicate imminent GPU or fabric failure, making early detection critical for preventing workload disruptions. | ||
| These errors or events often appear in system logs first and can indicate imminent GPU or fabric failure, making early detection critical for preventing workload disruptions or returning GPUs to service. | ||
|
|
||
| ## How It Works | ||
|
|
||
| The Syslog Health Monitor runs as a DaemonSet on GPU nodes: | ||
|
|
||
| 1. Reads journald logs from the host system | ||
| 2. Parses log entries for GPU-related error patterns (XID, SXID, fallen-off-bus) | ||
| 2. Parses log entries for GPU-related error patterns (XID, SXID, fallen-off-bus, GPU reset) | ||
| 3. Maintains cursor position to avoid re-processing old logs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs: update “Configuration Options” to include GPU reset (and consider trimming trailing whitespace).
You added GPU reset to overview/config/check comments, but the “Enabled Checks” option list still says it selects only “(XID, SXID, fallen-off-bus)”. That’s now incomplete given GPU reset monitoring.
Also applies to: 42-44, 58-60, 79-83
🤖 Prompt for AI Agents
In @docs/syslog-health-monitor.md around lines 5 - 26, Update the “Enabled
Checks” option list text under the “Configuration Options”/“Enabled Checks”
section to include “GPU reset” alongside “XID, SXID, fallen-off-bus”, and ensure
all related mentions in the overview/config/check comments are updated to
reflect the new set; also remove any trailing whitespace in the affected
paragraphs (the blocks that currently list enabled checks and the surrounding
overview text).
🛡️ CodeQL Analysis🚨 Found 1 security alert(s) |
Summary
Related design doc for GPU reset: https://github.com/NVIDIA/NVSentinel/blob/main/docs/designs/020-nvsentinel-gpu-reset.md.
This PR adds functionality to the syslog-health-monitor to send unhealthy HealthEvents that will result in a GPU reset as asll we to send healthy HealthEvents after a GPU reset has successfully occurred. In addition to the PCI, we will always add the GPU UUID as an impacted entity if it is available from either dmesg or from the metadata-collector. The COMPONENT_RESET remediation action requires the PCI and GPU UUID are available in the initial unhealthy event we're sending. Additionally, the corresponding healthy event triggered after the COMPONENT_RESET requires the same PCI and GPU UUID impacted entities are included as the initial event. As a result, we will only permit the COMPONENT_RESET action if the GPU UUID was sourced from the metadata-collector to ensure that the same impacted entities can be fetched after a reset occurs. If the GPU UUID does not exist or is sourced from dmesg, we will still include it as an impacted entity but override the remediation action from COMPONENT_RESET to RESTART_VM.
Unhealthy event generation:
Healthy event generation:
Implementation notes:
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.