Skip to content

Conversation

@jtschelling
Copy link
Contributor

@jtschelling jtschelling commented Nov 20, 2025

Summary

Implements #124

AI Disclosure: Cursor helped quite a bit here once I got the hand-made skeleton outlined. Broad strokes are me but the normalization is mostly Cursor - I had to tone down the amount of information pushed to the event (metadata) though. I almost always use gen-ai for testing and thats the same here. The testing seems quite good - solid test cases that do not use mocking and walk through the most important logic of the new pkg/normalizer.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Azure CSP support added: polls Azure VM maintenance, normalizes events, and emits them; configurable subscription source and polling interval.
  • Bug Fixes / Validation

    • CSP validation updated to enforce a single enabled provider and validate Azure polling intervals.
  • Tests

    • Unit tests added for Azure monitoring and event normalization.
  • Chores

    • .gitignore updated (version managers restored); dependency manifest updated to include Azure SDKs and supporting modules; restored event-exporter entry.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 20, 2025

Walkthrough

Adds Azure CSP support: new AzureConfig, CSPAzure constant, Azure client that resolves subscription (config or IMDS), uses DefaultAzureCredential and UpdatesClient, polls per-node maintenance updates via Kubernetes + Azure APIs, normalizes events, wires init in main, updates go.mod, and adds unit tests and normalizer.

Changes

Cohort / File(s) Summary
Repo metadata
\.gitignore
Restored event-exporter/event-exporter entry and added a Version Managers section containing mise.toml.
Module deps
health-monitors/csp-health-monitor/go.mod
Added Azure SDK deps (azcore, azidentity, armmaintenance) and related indirect modules.
Configuration
pkg/config
health-monitors/csp-health-monitor/pkg/config/config.go
Added AzureConfig (Enabled, SubscriptionID, PollingIntervalSeconds); extended CSP validation and single-CSP enforcement to include Azure.
Azure CSP client
pkg/csp/azure
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
New Client: resolves subscription (config or IMDS), creates DefaultAzureCredential and UpdatesClient, builds kube client and node informer, polls per-node Updates API, filters/normalizes updates, emits events to channel with metrics and context-aware concurrency.
Client tests
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
Added tests for no-event and single-event polling using fake Azure and fake Kubernetes clients.
Event normalization
pkg/event
health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go, health-monitors/csp-health-monitor/pkg/event/normalizer.go
Added AzureNormalizer with status mapping helpers; GetNormalizer returns AzureNormalizer for CSPAzure.
Normalizer tests
health-monitors/csp-health-monitor/pkg/event/azure_normalizer_test.go
Added tests for pending and in‑progress update normalization.
Model constants
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go
Added CSPAzure constant to CSP enum.
Main integration
cmd/csp-health-monitor
health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go
Init path updated to initialize Azure monitor when cfg.Azure.Enabled is true; initialization errors increment metrics and are logged.

Sequence Diagram(s)

sequenceDiagram
    participant Main as Monitor Init
    participant AzureClient as Azure Client
    participant IMDS as Instance Metadata
    participant Creds as Azure Credential
    participant K8s as Kubernetes API
    participant Updates as Azure Updates API
    participant Chan as Event Channel

    Main->>AzureClient: NewClient(ctx, AzureConfig)
    AzureClient->>Creds: DefaultAzureCredential()
    alt subscription in config
        AzureClient-->>AzureClient: use config.SubscriptionID
    else
        AzureClient->>IMDS: GET /metadata/instance?api-version=...
        IMDS-->>AzureClient: { "subscriptionId": "..." }
    end
    AzureClient->>Updates: New UpdatesClient(subscriptionID)
    AzureClient->>K8s: Build Kubernetes client / informer
    AzureClient-->>Main: Client ready

    Main->>AzureClient: StartMonitoring(ctx, eventChan)
    loop every PollingIntervalSeconds
        AzureClient->>K8s: List Nodes
        par per-node
            AzureClient->>Updates: ListUpdates(resourceID)
            Updates-->>AzureClient: []Update
            AzureClient->>AzureClient: filter & normalize
            AzureClient->>Chan: send MaintenanceEvent
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus areas:
    • pkg/csp/azure/azure.go — concurrency, context cancellation, Azure SDK paging, IMDS parsing, error handling, and metrics.
    • pkg/event/azure_normalizer.go — status mappings, metadata composition, event ID generation, and time fields.
    • pkg/config/config.go & cmd/csp-health-monitor/main.go — CSP selection/validation and initialization ordering.
    • Tests — verify fakes/mocks correctly simulate Azure responses and timing.

Poem

🐰 I hopped to Azure, ears all keen and bright,
Polling VMs by day and through the night,
I normalize each hop and nudge a gentle log,
Metrics twitch like whiskers in the fog,
A tiny rabbit guard, keeping clouds polite.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: azure maintenance events' clearly and concisely summarizes the main change: adding Azure maintenance event support to the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)

39-40: CSPAzure enum addition is good; fix NodeName BSON tag typo

The new CSPAzure constant wiring looks correct and consistent with the other CSP values.

However, NodeName’s BSON tag has a typo:

NodeName string `json:"nodeName,omitempty"           bson:"nodeName,omitemtpy"`

Because omitemtpy is not recognized, empty nodeName values will not be omitted as intended and may affect how events are stored/queried.

Consider correcting it to:

-	NodeName               string            `json:"nodeName,omitempty"           bson:"nodeName,omitemtpy"`
+	NodeName               string            `json:"nodeName,omitempty"           bson:"nodeName,omitempty"`

Also applies to: 55-59

🧹 Nitpick comments (6)
health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go (1)

37-43: Azure monitor init path is consistent; consider doc/log clarifications and confirm CSP priority

The Azure branch in initActiveMonitor is wired consistently with GCP/AWS: it respects cfg.Azure.Enabled, increments CSPMonitorErrors on init failure, logs, and returns the Azure monitor. A few minor points to consider:

  • The comment above initActiveMonitor still says “GCP/AWS”; it would be clearer to mention Azure as well.
  • CSP selection remains “first enabled wins” with priority GCP -> AWS -> Azure. That matches the existing pattern but means Azure is ignored if GCP or AWS are enabled; confirm this is the intended behavior.
  • The Azure init log uses cfg.Azure.SubscriptionID, which will be empty if the subscription ID is discovered via IMDS. If you care about logging the effective subscription ID, you may want to expose/log it from the Azure client instead.

None of these are blockers, but updating them would improve clarity and observability.

Also applies to: 218-220, 260-279

health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (1)

1-181: Azure polling tests are well‑structured and exercise key paths

Both tests use the k8s and Azure fakes effectively to cover the “no updates” and “single pending update” cases, and they validate the important fields on the emitted MaintenanceEvent (CSP, cluster, node, status, type, and non‑empty EventID). This gives good confidence in the Azure polling + normalization flow. If you find yourself adding more scenarios later, you might consider extracting a small helper to build the fake client setup, but it’s fine as‑is.

health-monitors/csp-health-monitor/pkg/event/azure_normalizer_test.go (1)

28-173: Azure normalizer tests validate key paths; consider extending for additional statuses

The two tests exercise the main happy paths (Pending + InProgress with impact) and validate CSP, resource fields, status mappings, metadata, and EventID format; they align with the current Normalize implementation. It would be useful to add cases for Completed and retry-related statuses (and perhaps ImpactTypeNone) to lock in the mapping behavior for those edge cases as well.

health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)

52-95: Event construction and status mappings look good; double‑check EventID semantics vs provider IDs

The MaintenanceEvent construction and mapAzureStatusToInternal / mapAzureStatusToCSPStatus logic look reasonable, and the tests cover the Pending/InProgress cases. One thing to re‑evaluate is EventID: it’s currently a synthetic azure-<resourceGroup>-<vmName>-<timestamp> string, while in the AWS path EventID is intentionally set to the provider entity ARN and reused downstream as the stable resource identifier. If any consumer assumes EventID is a provider‑native identifier, you may want to instead use the Azure resource ID or a stable update identifier (and move this synthetic value into Metadata if still needed). Based on learnings.

Also applies to: 100-118

health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (2)

111-135: Guard StartMonitoring against invalid polling intervals

time.NewTicker(time.Duration(c.config.PollingIntervalSeconds) * time.Second) will panic if PollingIntervalSeconds <= 0. Today config validation should prevent that, but Client can be constructed with arbitrary AzureConfig (e.g., tests or future callers) and would panic on a bad value. It’s safer to clamp or default the interval (and log a warning) before creating the ticker so the client fails gracefully instead of panicking if an invalid interval slips through.


157-171: Limit per‑node goroutine fan‑out when polling Azure updates

pollForMaintenanceEvents spawns one goroutine per node, and each goroutine pages through the Azure Updates API. On large clusters this can create significant parallel load against Azure and the Kubernetes API, and rely on UpdatesClient being fully concurrency‑safe. Consider introducing a bounded worker pool or semaphore (e.g., max N concurrent nodes per poll) to cap concurrency while still keeping per‑node isolation.

Also applies to: 190-216

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 663d0f7 and 29d0490.

⛔ Files ignored due to path filters (1)
  • health-monitors/csp-health-monitor/go.sum is excluded by !**/*.sum
📒 Files selected for processing (10)
  • .gitignore (2 hunks)
  • health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go (2 hunks)
  • health-monitors/csp-health-monitor/go.mod (4 hunks)
  • health-monitors/csp-health-monitor/pkg/config/config.go (4 hunks)
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1 hunks)
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (1 hunks)
  • health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1 hunks)
  • health-monitors/csp-health-monitor/pkg/event/azure_normalizer_test.go (1 hunks)
  • health-monitors/csp-health-monitor/pkg/event/normalizer.go (1 hunks)
  • health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/model/maintenance_event.go
  • health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧬 Code graph analysis (6)
health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
  • CSPAzure (56-56)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
  • AzureNormalizer (26-26)
health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go (3)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
  • NewClient (39-105)
health-monitors/csp-health-monitor/pkg/metrics/metrics.go (1)
  • CSPMonitorErrors (62-68)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
  • CSPAzure (56-56)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer_test.go (3)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
  • AzureNormalizer (26-26)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
  • CSPAzure (56-56)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_RESTART_VM (46-46)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (2)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
  • CSPAzure (56-56)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_RESTART_VM (46-46)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (4)
health-monitors/csp-health-monitor/pkg/config/config.go (2)
  • AzureConfig (67-71)
  • Config (39-48)
health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
  • Normalizer (25-30)
  • GetNormalizer (34-45)
health-monitors/csp-health-monitor/pkg/metrics/metrics.go (4)
  • CSPMonitorErrors (62-68)
  • CSPPollingDuration (39-46)
  • CSPAPIErrors (47-53)
  • CSPEventsReceived (32-38)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (3)
  • CSPAzure (56-56)
  • CSP (43-43)
  • MaintenanceEvent (22-40)
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (4)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
  • Client (28-36)
health-monitors/csp-health-monitor/pkg/config/config.go (1)
  • AzureConfig (67-71)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
  • AzureNormalizer (26-26)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
  • CSPAzure (56-56)
🔇 Additional comments (5)
.gitignore (2)

10-11: LGTM: Version manager support.

Adding mise.toml to the ignore list is appropriate for projects using the mise version manager. This is a reasonable housekeeping addition.


456-456: No duplicate entry found—the review comment's concern is unfounded.

The verification confirms that platform-connectors/platform-connectors appears only once in the .gitignore file (at line 456). There is no duplicate entry elsewhere in the file. The review comment's assertion that this is a duplicate ignore entry is incorrect.

Likely an incorrect or invalid review comment.

health-monitors/csp-health-monitor/go.mod (1)

10-11: Azure SDK/auth dependency additions look coherent; please verify versions/security externally

The added Azure SDK/auth/jwt/browser dependencies and versions line up with the new Azure maintenance client usage and test fakes; they look reasonable and narrowly scoped to the Azure feature. Please just double‑check that these specific versions match your target Azure SDK minor line and have no known security advisories in your environment/tooling before merging.

Also applies to: 42-45, 81-81, 92-92, 97-97

health-monitors/csp-health-monitor/pkg/event/normalizer.go (1)

35-41: Azure normalizer wiring into factory looks correct

Adding the model.CSPAzure branch that returns &AzureNormalizer{} is consistent with the existing GCP/AWS pattern and matches how NewClient requests the Azure normalizer.

health-monitors/csp-health-monitor/pkg/config/config.go (1)

39-48: Azure config wiring and CSP validation look consistent

AzureConfig is correctly added to Config, and validateCSPConfig now enforces a minimum Azure polling interval and single-CSP enablement in line with existing GCP/AWS behavior. Just ensure the new azure.* keys are covered in operator-facing config docs so users understand how to enable Azure monitoring.

Also applies to: 66-71, 185-223

go func(node v1.Node) {
defer wg.Done()

c.processNodeMaintenanceEvents(ctx, node, eventChan)
Copy link
Contributor

@KaivalyaMDabhadkar KaivalyaMDabhadkar Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I was under the impression that with the updates list api we would not need to query for each individual node. It seems that is not supported (we are back to the same original problem of having to query the api for each VM in the cluster). I looked into some other options to query for VM maintenance events at a cluster-level in azure, and it seems Azure resource graph: https://learn.microsoft.com/en-us/azure/virtual-machines/resource-graph-availability, this is the sdk: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resourcegraph/[email protected]#Client.Resources
Can this be used to query for the VM maintenance events? (we are doing something similar in GCP with the cloudlogging query)

@jtschelling jtschelling force-pushed the jtschelling/124-azure-maintenance-events branch from aa3ff61 to 6244c8a Compare December 1, 2025 20:37
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa3ff61 and 6244c8a.

⛔ Files ignored due to path filters (1)
  • health-monitors/csp-health-monitor/go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • .gitignore (2 hunks)
  • health-monitors/csp-health-monitor/go.mod (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • health-monitors/csp-health-monitor/go.mod
🧠 Learnings (3)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • health-monitors/csp-health-monitor/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • health-monitors/csp-health-monitor/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • health-monitors/csp-health-monitor/go.mod

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (4)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (3)

195-221: Consider Azure Resource Graph for cluster-level queries.

The current implementation queries maintenance updates for each VM individually, which may not scale well for large clusters. A previous review suggested using Azure Resource Graph to query maintenance events at the cluster level, similar to the GCP implementation's use of cloud logging queries.

See: https://learn.microsoft.com/en-us/azure/virtual-machines/resource-graph-availability
SDK: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resourcegraph/armresourcegraph


150-156: Use informer instead of direct API calls for listing nodes.

Direct API calls to list nodes can overload the Kubernetes API server, especially in production environments. The coding guidelines specify using informers to cache and watch resources instead.

Refer to the AWS CSP monitor implementation for the recommended informer pattern.


284-319: Complete the IMDS hardening—add timeout, check all errors, validate response.

While NewRequestWithContext was added (line 294), the remaining hardening items from the previous review are still missing:

  1. No timeout: HTTP client (line 292) can hang indefinitely outside Azure
  2. Ignored error: Line 294 discards the error from NewRequestWithContext
  3. No status validation: No check for resp.StatusCode after line 302
  4. Empty subscriptionId accepted: Line 318 returns empty string if JSON field is missing

Apply this diff to complete the hardening:

 	var PTransport = &http.Transport{Proxy: nil}
 
-	client := http.Client{Transport: PTransport}
+	client := http.Client{
+		Transport: PTransport,
+		Timeout:   5 * time.Second,
+	}
 
-	req, _ := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil)
+	req, err := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil)
+	if err != nil {
+		return "", fmt.Errorf("failed to construct IMDS request: %w", err)
+	}
 	req.Header.Add("Metadata", "True")
 
 	q := req.URL.Query()
 	q.Add("format", "json")
 	q.Add("api-version", "2021-02-01")
 	req.URL.RawQuery = q.Encode()
 
 	resp, err := client.Do(req)
 	if err != nil {
 		return "", err
 	}
 	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		return "", fmt.Errorf("unexpected IMDS status code: %d", resp.StatusCode)
+	}
 
 	// now that we have the response get the subscription ID from it
 	var result struct {
 		Compute struct {
 			SubscriptionID string `json:"subscriptionId"`
 		} `json:"compute"`
 	}
 	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
 		return "", fmt.Errorf("failed to decode IMDS response: %w", err)
 	}
+	if result.Compute.SubscriptionID == "" {
+		return "", fmt.Errorf("subscriptionId missing from IMDS response")
+	}
 
 	return result.Compute.SubscriptionID, nil
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (1)

38-183: Replace fake Kubernetes client with envtest.

Both test functions use fakek8s.NewSimpleClientset, which violates the established coding guidelines and a previous review comment. Tests should use envtest to provide a more realistic Kubernetes API server for controller testing.

Based on coding guidelines and learnings, refer to the AWS CSP monitor tests for the recommended pattern.

🧹 Nitpick comments (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)

123-127: Redundant context check before select.

The context cancellation is already handled in the select statement at line 132. The check at lines 123-127 is redundant and can be removed.

Apply this diff:

 	for {
-		if ctx.Err() != nil {
-			slog.Info("Azure monitoring stopping due to context cancellation.")
-			return ctx.Err()
-		}
-
 		c.pollForMaintenanceEvents(ctx, eventChan)
 
 		select {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c8415e and cbfffc7.

📒 Files selected for processing (2)
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1 hunks)
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
🧠 Learnings (4)
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
🧬 Code graph analysis (2)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (2)
health-monitors/csp-health-monitor/pkg/config/config.go (2)
  • AzureConfig (67-71)
  • Config (39-48)
health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
  • Normalizer (25-30)
  • GetNormalizer (34-45)
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (4)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
  • Client (42-50)
health-monitors/csp-health-monitor/pkg/config/config.go (1)
  • AzureConfig (67-71)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
  • AzureNormalizer (26-26)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
  • CSPAzure (56-56)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)

318-353: IMDS hardening issues remain unresolved.

A previous review flagged four hardening issues in getSubscriptionID, and while the comment was marked as addressed, all four issues are still present:

  1. No timeout (line 326): HTTP client can hang indefinitely outside Azure
  2. Ignored error (line 328): NewRequestWithContext error is discarded with _
  3. No status code check (after line 336): resp.StatusCode is not validated
  4. Empty subscriptionId accepted (line 352): Returns empty string as success if JSON decodes but field is missing

Apply the previously suggested diff:

 func getSubscriptionID(cfg config.AzureConfig) (string, error) {
 	if cfg.SubscriptionID != "" {
 		return cfg.SubscriptionID, nil
 	}
 
-	// pulled from https://github.com/Microsoft/azureimds/blob/master/imdssample.go
 	var PTransport = &http.Transport{Proxy: nil}
 
-	client := http.Client{Transport: PTransport}
+	client := http.Client{
+		Transport: PTransport,
+		Timeout:   5 * time.Second,
+	}
 
-	req, _ := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil)
+	req, err := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil)
+	if err != nil {
+		return "", fmt.Errorf("failed to construct IMDS request: %w", err)
+	}
 	req.Header.Add("Metadata", "True")
 
 	q := req.URL.Query()
 	q.Add("format", "json")
 	q.Add("api-version", "2021-02-01")
 	req.URL.RawQuery = q.Encode()
 
 	resp, err := client.Do(req)
 	if err != nil {
 		return "", err
 	}
 	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		return "", fmt.Errorf("unexpected IMDS status code: %d", resp.StatusCode)
+	}
 
-	// now that we have the response get the subscription ID from it
 	var result struct {
 		Compute struct {
 			SubscriptionID string `json:"subscriptionId"`
 		} `json:"compute"`
 	}
 	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
 		return "", fmt.Errorf("failed to decode IMDS response: %w", err)
 	}
+	if result.Compute.SubscriptionID == "" {
+		return "", fmt.Errorf("subscriptionId missing from IMDS response")
+	}
 
 	return result.Compute.SubscriptionID, nil
 }

Consider also accepting a context.Context parameter (from NewClient's ctx) to respect caller cancellation.

🧹 Nitpick comments (3)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (3)

119-119: Simplify resync period expression.

0*time.Second is unnecessarily verbose.

Apply this diff:

-	factory := informers.NewSharedInformerFactory(k8sClient, 0*time.Second)
+	factory := informers.NewSharedInformerFactory(k8sClient, 0)

157-172: Remove redundant context check.

The context check at lines 158-161 is redundant since the select block at lines 165-170 already handles ctx.Done(). Based on coding guidelines, this makes the flow more elegant.

Apply this diff:

 	for {
-		if ctx.Err() != nil {
-			slog.Info("Azure monitoring stopping due to context cancellation.")
-			return ctx.Err()
-		}
-
 		c.pollForMaintenanceEvents(ctx, eventChan)
 
 		select {

260-263: Remove obvious comment.

As per coding guidelines, the comment doesn't add significant value beyond what the code clearly expresses.

Apply this diff:

 func (c *Client) shouldReportUpdate(update *armmaintenance.Update) bool {
-	// These are the two fields we need to determine if a maintenance event needs to be reported
 	if update.Status == nil || update.ImpactType == nil {
 		return false
 	}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cbfffc7 and cfebb54.

📒 Files selected for processing (2)
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1 hunks)
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧠 Learnings (3)
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧬 Code graph analysis (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (4)
health-monitors/csp-health-monitor/pkg/config/config.go (2)
  • AzureConfig (67-71)
  • Config (39-48)
health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
  • Normalizer (25-30)
  • GetNormalizer (34-45)
health-monitors/csp-health-monitor/pkg/metrics/metrics.go (3)
  • CSPPollingDuration (39-46)
  • CSPAPIErrors (47-53)
  • CSPEventsReceived (32-38)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (3)
  • CSPAzure (56-56)
  • CSP (43-43)
  • MaintenanceEvent (22-40)
🔇 Additional comments (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)

56-56: Add godoc for exported function.

As per coding guidelines, all exported functions require documentation comments.

Add godoc before the function:

+// NewClient creates and initializes a new Azure health monitoring client.
+// It resolves the subscription ID from config or IMDS, creates Azure SDK clients,
+// initializes the Kubernetes client and node informer, and returns a ready-to-use Client.
 func NewClient(

Likely an incorrect or invalid review comment.

// See the License for the specific language governing permissions and
// limitations under the License.

package azure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Add package-level godoc.

As per coding guidelines, all Go packages require package-level documentation describing the package's purpose.

Add godoc before the package declaration:

+// Package azure implements Azure CSP health monitoring by polling the Azure Maintenance Updates API
+// for VM maintenance events and forwarding them to the main event pipeline.
 package azure
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
package azure
// Package azure implements Azure CSP health monitoring by polling the Azure Maintenance Updates API
// for VM maintenance events and forwarding them to the main event pipeline.
package azure
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around line 15,
there is no package-level godoc; add a descriptive package comment immediately
above the "package azure" line that succinctly explains the package's purpose,
what it provides (e.g., Azure CSP health-check utilities/clients/helpers), and
any high-level usage notes or important package responsibilities so the package
conforms to the project's Go documentation guidelines.


slog.Info("Azure Client: Successfully initialized Kubernetes client")

nodeInformer, _ := newNodeInformer(k8sClient)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Handle error from newNodeInformer.

The error from newNodeInformer is silently discarded. If informer initialization fails, nodeInformer will be nil, causing a nil pointer dereference at line 184 when c.nodeInformer.Lister().List() is called.

Apply this diff:

-	nodeInformer, _ := newNodeInformer(k8sClient)
+	nodeInformer, stopCh := newNodeInformer(k8sClient)
+	if nodeInformer == nil {
+		metrics.CSPMonitorErrors.WithLabelValues(string(model.CSPAzure), "node_informer_error").Inc()
+		return nil, fmt.Errorf("failed to initialize node informer")
+	}

Note: You'll also need to store stopCh in the Client struct and close it during shutdown to properly clean up the informer.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +124 to +132
_, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {},
DeleteFunc: func(obj interface{}) {},
UpdateFunc: func(oldObj, newObj interface{}) {},
})
if err != nil {
fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err)
return nil, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Remove no-op event handlers and use structured logging.

The event handlers are empty and serve no purpose. If handlers aren't needed, don't register them. Additionally, error handling should use slog for consistency with the rest of the codebase.

Apply this diff:

-	_, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
-		AddFunc:    func(obj interface{}) {},
-		DeleteFunc: func(obj interface{}) {},
-		UpdateFunc: func(oldObj, newObj interface{}) {},
-	})
-	if err != nil {
-		fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err)
+	// No event handlers needed; the informer is used only for listing/caching nodes
+	if err := nodeInformer.Informer().GetController(); err != nil {
+		slog.Error("failed to initialize node informer controller", "error", err)
 		return nil, nil
 	}

Or simply remove the handler registration entirely if not required:

-	_, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
-		AddFunc:    func(obj interface{}) {},
-		DeleteFunc: func(obj interface{}) {},
-		UpdateFunc: func(oldObj, newObj interface{}) {},
-	})
-	if err != nil {
-		fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err)
-		return nil, nil
-	}
-
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {},
DeleteFunc: func(obj interface{}) {},
UpdateFunc: func(oldObj, newObj interface{}) {},
})
if err != nil {
fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err)
return nil, nil
}
// No event handlers needed; the informer is used only for listing/caching nodes
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around lines
124-132, remove the no-op AddEventHandler registration (or omit calling
AddEventHandler entirely) since empty AddFunc/DeleteFunc/UpdateFunc serve no
purpose; if you do need handlers implement meaningful logic instead of empty
closures. Also replace the fmt.Fprintf error handling with the structured slog
logger (e.g., slog.Error with a descriptive message and the err value) to match
the codebase logging; preserve the current return behavior after logging.

Comment on lines +135 to +141
synced := factory.WaitForCacheSync(stopCh)
for v, ok := range synced {
if !ok {
fmt.Fprintf(os.Stderr, "caches failed to sync: %v", v)
return nil, nil
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Use structured logging for cache sync errors.

Replace fmt.Fprintf(os.Stderr, ...) with slog.Error for consistency with the codebase's structured logging approach.

Apply this diff:

 	synced := factory.WaitForCacheSync(stopCh)
 	for v, ok := range synced {
 		if !ok {
-			fmt.Fprintf(os.Stderr, "caches failed to sync: %v", v)
+			slog.Error("node informer cache failed to sync", "informer", v)
 			return nil, nil
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
synced := factory.WaitForCacheSync(stopCh)
for v, ok := range synced {
if !ok {
fmt.Fprintf(os.Stderr, "caches failed to sync: %v", v)
return nil, nil
}
}
synced := factory.WaitForCacheSync(stopCh)
for v, ok := range synced {
if !ok {
slog.Error("node informer cache failed to sync", "informer", v)
return nil, nil
}
}
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around lines 135 to
141, replace the fmt.Fprintf(os.Stderr, ...) call used when a cache fails to
sync with a structured slog.Error call; call slog.Error with a clear message
like "caches failed to sync" and include the failing informer key/value (v) as a
field (e.g., slog.String("informer", fmt.Sprintf("%v", v)) or
slog.Any("informer", v)) so the error is logged in structured form, and ensure
the function returns an appropriate error (not nil) after logging; also add any
missing slog import if needed.

Comment on lines +146 to +148
func (c *Client) GetName() model.CSP {
return model.CSPAzure
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Add godoc for exported method.

As per coding guidelines, all exported functions require documentation comments.

Apply this diff:

+// GetName returns the CSP identifier for this client.
 func (c *Client) GetName() model.CSP {
 	return model.CSPAzure
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (c *Client) GetName() model.CSP {
return model.CSPAzure
}
// GetName returns the CSP identifier for this client.
func (c *Client) GetName() model.CSP {
return model.CSPAzure
}
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around lines 146 to
148, the exported method GetName lacks a godoc comment; add a one-line
documentation comment immediately above the function that briefly describes what
GetName returns (e.g., "GetName returns the CSP identifier for Azure") and
mention that it returns model.CSPAzure, following Go doc comment style for
exported functions.

return model.CSPAzure
}

func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Add godoc for exported method.

As per coding guidelines, all exported functions require documentation comments.

Apply this diff:

+// StartMonitoring begins polling Azure for VM maintenance events at the configured interval.
+// It sends normalized events to eventChan and respects context cancellation.
 func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error {
// StartMonitoring begins polling Azure for VM maintenance events at the configured interval.
// It sends normalized events to eventChan and respects context cancellation.
func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error {
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around line 150,
the exported method StartMonitoring lacks a godoc comment; add a documentation
comment immediately above the function that begins with "StartMonitoring" and
concisely describes what the method does, its context parameter and the
eventChan parameter (chan<- model.MaintenanceEvent), and what the returned error
signifies, following standard Go godoc style and punctuation.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (7)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (7)

137-143: Use structured logging for cache sync errors.

Replace fmt.Fprintf with slog.Error for consistency with the codebase's structured logging.

Apply this diff:

 	synced := factory.WaitForCacheSync(stopCh)
 	for v, ok := range synced {
 		if !ok {
-			fmt.Fprintf(os.Stderr, "caches failed to sync: %v", v)
+			slog.Error("node informer cache failed to sync", "informer", v)
 			return nil, nil
 		}
 	}

As per coding guidelines, use structured logging via log/slog in Go code.


15-15: Add package-level godoc.

As per coding guidelines, all Go packages require package-level documentation.

Apply this diff:

+// Package azure implements Azure CSP health monitoring by polling the Azure Maintenance Updates API
+// for VM maintenance events and forwarding them to the main event pipeline.
 package azure

As per coding guidelines, package-level godoc is required for all Go packages.


152-152: Add godoc for exported method.

As per coding guidelines, all exported functions require documentation comments.

Apply this diff:

+// StartMonitoring begins polling Azure for VM maintenance events at the configured interval.
+// It sends normalized events to eventChan and respects context cancellation.
 func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error {

As per coding guidelines, function comments are required for all exported Go functions.


148-150: Add godoc for exported method.

As per coding guidelines, all exported functions require documentation comments.

Apply this diff:

+// GetName returns the CSP identifier for this client.
 func (c *Client) GetName() model.CSP {
 	return model.CSPAzure
 }

As per coding guidelines, function comments are required for all exported Go functions.


320-355: Critical issues in getSubscriptionID remain unaddressed.

Despite being marked as addressed in a previous commit, four critical issues remain:

  1. No timeout: HTTP call can hang indefinitely outside Azure
  2. Ignored error: Line 330 discards error from NewRequestWithContext
  3. No status code validation: No check for resp.StatusCode after client.Do
  4. Empty subscriptionId accepted: Line 354 returns empty string as success if JSON decodes but subscriptionId is missing, causing downstream failures in UpdatesClient construction

Apply this diff:

 func getSubscriptionID(cfg config.AzureConfig) (string, error) {
 	if cfg.SubscriptionID != "" {
 		return cfg.SubscriptionID, nil
 	}
 
 	// pulled from https://github.com/Microsoft/azureimds/blob/master/imdssample.go
 	var PTransport = &http.Transport{Proxy: nil}
 
-	client := http.Client{Transport: PTransport}
+	client := http.Client{
+		Transport: PTransport,
+		Timeout:   5 * time.Second,
+	}
 
-	req, _ := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil)
+	req, err := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil)
+	if err != nil {
+		return "", fmt.Errorf("failed to construct IMDS request: %w", err)
+	}
 	req.Header.Add("Metadata", "True")
 
 	q := req.URL.Query()
 	q.Add("format", "json")
 	q.Add("api-version", "2021-02-01")
 	req.URL.RawQuery = q.Encode()
 
 	resp, err := client.Do(req)
 	if err != nil {
 		return "", err
 	}
 	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		return "", fmt.Errorf("unexpected IMDS status code: %d", resp.StatusCode)
+	}
 
 	// now that we have the response get the subscription ID from it
 	var result struct {
 		Compute struct {
 			SubscriptionID string `json:"subscriptionId"`
 		} `json:"compute"`
 	}
 	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
 		return "", fmt.Errorf("failed to decode IMDS response: %w", err)
 	}
+	if result.Compute.SubscriptionID == "" {
+		return "", fmt.Errorf("subscriptionId missing from IMDS response")
+	}
 	return result.Compute.SubscriptionID, nil
 }

Consider also accepting a context.Context parameter from the caller to respect cancellation.


101-101: Handle error from newNodeInformer and store stopCh for proper cleanup.

The error and stopCh from newNodeInformer are discarded. If informer initialization fails, nodeInformer will be nil, causing a nil pointer dereference at line 186. Additionally, the stopCh is never stored or closed, causing a goroutine leak.

Apply this diff to handle the error and store stopCh:

-	nodeInformer, _ := newNodeInformer(k8sClient)
+	nodeInformer, stopCh := newNodeInformer(k8sClient)
+	if nodeInformer == nil {
+		metrics.CSPMonitorErrors.WithLabelValues(string(model.CSPAzure), "node_informer_error").Inc()
+		return nil, fmt.Errorf("failed to initialize node informer")
+	}

Then add stopCh chan struct{} to the Client struct (line 54) and close it in a cleanup method to prevent resource leaks.


125-133: Remove no-op event handlers.

The empty event handlers serve no purpose since the informer is only used for caching/listing nodes, not for reacting to node changes.

Apply this diff:

-	_, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
-		AddFunc:    func(obj interface{}) {},
-		DeleteFunc: func(obj interface{}) {},
-		UpdateFunc: func(oldObj, newObj interface{}) {},
-	})
-	if err != nil {
-		fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err)
-		return nil, nil
-	}
-

As per coding guidelines, extract informer event handler setup into helper methods only when handlers are meaningful.

🧹 Nitpick comments (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)

159-174: Refactor to eliminate redundant context checks.

The context cancellation is checked twice: once at line 160 and again in the select at line 168. This is redundant.

Apply this diff:

 	ticker := time.NewTicker(time.Duration(c.config.PollingIntervalSeconds) * time.Second)
 	defer ticker.Stop()
 
 	for {
-		if ctx.Err() != nil {
-			slog.Info("Azure monitoring stopping due to context cancellation.")
-			return ctx.Err()
-		}
-
 		c.pollForMaintenanceEvents(ctx, eventChan)
 
 		select {
 		case <-ctx.Done():
 			slog.Info("Azure monitoring stopping due to context cancellation.")
 			return ctx.Err()
 		case <-ticker.C:
 		}
 	}

Based on learnings, this pattern is more idiomatic and avoids duplicate checks.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfebb54 and e2888b3.

📒 Files selected for processing (1)
  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧠 Learnings (9)
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧬 Code graph analysis (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (3)
health-monitors/csp-health-monitor/pkg/config/config.go (2)
  • AzureConfig (67-71)
  • Config (39-48)
health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
  • Normalizer (25-30)
  • GetNormalizer (34-45)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (3)
  • CSPAzure (56-56)
  • CSP (43-43)
  • MaintenanceEvent (22-40)
🔇 Additional comments (2)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (2)

260-318: LGTM!

The filtering logic in shouldReportUpdate correctly validates required fields and filters for non-completed events with actual impact. The normalizeAndSendEvent method properly handles normalization, error cases, and context cancellation.


357-383: LGTM!

The parseAzureProviderID function correctly handles the Azure provider ID format, validates the structure, and extracts the required resource group and VM name with appropriate error handling.

Comment on lines +231 to +237
pager := c.updatesClient.NewListPager(
resourceGroup,
"Microsoft.Compute",
"virtualMachines",
vmName,
nil,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Azure Resource Graph API query maintenance updates events

💡 Result:

You can use Azure Resource Graph (ARG) to query maintenance notifications, planned maintenance windows, and update/patch events. Key ARG tables: maintenanceresources, ServiceHealthResources (Microsoft.ResourceHealth/events), patchassessmentresources/patchinstallationresources, and Resources. Examples and notes below.

Sample: find maintenance updates (from notifications / microsoft.maintenance/updates)
resources
| project resource = tolower(id)
| join kind=inner (
maintenanceresources
| where type == "microsoft.maintenance/updates"
| extend p = parse_json(properties)
| mv-expand d = p.value
| where d has 'notificationId' and d.notificationId == ''
| project resource = tolower(name), status = d.status, resourceGroup, location, startTimeUtc = d.startTimeUtc, endTimeUtc = d.endTimeUtc, impactType = d.impactType
) on resource
| project resource, status, resourceGroup, location, startTimeUtc, endTimeUtc, impactType. [1]

Sample: list active Service Health events (planned maintenance, service issues, advisories)
ServiceHealthResources
| where type =~ 'Microsoft.ResourceHealth/events'
| extend eventType = properties.EventType, status = properties.Status, title = properties.Title, trackingId = properties.TrackingId, start = properties.ImpactStartTime, mitigation = properties.ImpactMitigationTime
| where (eventType in ('HealthAdvisory','SecurityAdvisory','PlannedMaintenance') and mitigation > now()) or (eventType == 'ServiceIssue' and status == 'Active'). [4]

Patch / Update Manager logs (assessment/install results) are available in patchassessmentresources and patchinstallationresources (30-day / 7-day retention details in docs). Use these tables to query per-VM patch status and installationState. [3]

Notes / troubleshooting

  • Some fields (maintenanceStartTime/maintenanceEndTime/info) show null unless event is PlannedMaintenance or you have proper permissions; event details may be populated later in the event lifecycle. [5]
  • ARG table reference and supported resource-type tables listed in docs. Use ARG Explorer, az graph query or the Resource Graph REST API to run these queries at scale across subscriptions. [2]

Useful docs

  • Maintenance updates example (ARG query). [1]
  • ARG table list / query language reference. [2]
  • Update Manager / patch tables and schema. [3]
  • Service Health sample queries. [4]
  • Troubleshooting null maintenance fields. [5]

References:
[1] Configure advance notifications for planned maintenance events (sample ARG query).
[2] Understand the query language / ARG table reference.
[3] Query Resources with Azure Update Manager (patchassessmentresources / patchinstallationresources).
[4] Azure Resource Graph sample queries for Azure Service Health.
[5] Azure Resource Graph Overview — troubleshooting maintenance fields.


🏁 Script executed:

cat -n health-monitors/csp-health-monitor/pkg/csp/azure/azure.go | sed -n '200,240p'

Repository: NVIDIA/NVSentinel

Length of output: 178


Consider using Azure Resource Graph API for cluster-level maintenance event queries.

The current implementation queries the Azure Maintenance Updates API for each VM individually (lines 231-237), which may lead to API rate limiting and performance issues in large clusters. Azure Resource Graph API supports querying maintenance events at subscription and resource group level using the maintenanceresources and ServiceHealthResources tables, enabling more efficient cluster-wide queries similar to GCP's cloudlogging approach.

Note: Some maintenance event fields may be null depending on permissions and event lifecycle timing, so implementation should handle these cases appropriately.

🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around lines
231-237 the code creates a per-VM pager via c.updatesClient.NewListPager(...)
which will hit the Maintenance Updates API for each VM and can cause rate
limiting and poor performance for large clusters; replace this per-VM approach
with an Azure Resource Graph query against the maintenanceresources and/or
ServiceHealthResources tables at the subscription or resource-group scope to
fetch maintenance events in bulk (implement a Resource Graph client call,
construct a query that filters by subscription/resourceGroup and relevant
resource types, handle pagination of results and API throttling backoff), map
Resource Graph results to the existing internal event model (safely handling
nullable fields and missing attributes), and only fall back to per-VM
Maintenance Updates API when Resource Graph returns incomplete data or
permissions prevent access.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Support for monitoring Azure maintenance events

2 participants