fix: Retry on failures writing into MongoDB #670
base: main
Conversation
Signed-off-by: Suket Sharma <[email protected]>
📝 Walkthrough

Adds configurable retry logic for MongoDB health-event writes: a maxRetries field is threaded into the DatabaseStoreConnector, retry/backoff is handled via the ring buffer's rate-limited requeueing, and tests cover retry and drop behaviors.
Sequence Diagram

```mermaid
sequenceDiagram
    participant SC as Store Connector
    participant MongoDB as MongoDB
    participant RB as Ring Buffer
    participant RL as Rate Limiter
    SC->>MongoDB: InsertMany(events)
    alt Insert Success
        MongoDB-->>SC: Success
        SC->>RB: HealthMetricEleProcessingCompleted(events)
    else Insert Failure
        MongoDB-->>SC: Error
        SC->>SC: Compute retryCount via RB.NumRequeues(events)
        alt retryCount < maxRetries
            SC->>RB: AddRateLimited(events)
            RB->>RL: Enqueue with rate/backoff
        else retryCount >= maxRetries
            SC->>RB: HealthMetricEleProcessingCompleted(events) (Forget + Done -> drop)
        end
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
⚠️ Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0): level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"
Actionable comments posted: 0
🧹 Nitpick comments (2)
platform-connectors/pkg/connectors/store/store_connector_test.go (1)
21-21: Unused `sync` import.

The `sync.Mutex` at line 318 is declared, and `mu.Lock()`/`mu.Unlock()` are called in the `Eventually` callback, but `callCount` is only used within that same callback and the mutex doesn't protect any shared state between the test goroutine and the `FetchAndProcessHealthMetric` goroutine. The mock client's `Calls` slice is accessed, which is already thread-safe via testify's internal synchronization.

Consider removing the unused mutex and the `sync` import to simplify the test.

♻️ Suggested simplification
```diff
 import (
 	"context"
 	"errors"
 	"os"
-	"sync"
 	"testing"
 	"time"
```

And in the test function (lines 318-319, 351-353):
```diff
-	var mu sync.Mutex
-	callCount := 0
 	...
 	require.Eventually(t, func() bool {
-		mu.Lock()
-		defer mu.Unlock()
-		callCount = len(mockClient.Calls)
-		return callCount >= 3
+		return len(mockClient.Calls) >= 3
 	}, 500*time.Millisecond, 10*time.Millisecond, "Should retry and eventually succeed")
```

platform-connectors/pkg/ringbuffer/ring_buffer.go (1)
118-121: Correct workqueue retry pattern.

The sequence of `AddRateLimited` followed by `Done` is the proper pattern for workqueue retries:

- `AddRateLimited` schedules the item for re-processing after the backoff delay
- `Done` marks the current processing attempt as complete, allowing the queue to process other items

Consider adding a brief comment explaining this pattern for future maintainers.
📝 Optional: Add clarifying comment
```diff
 func (rb *RingBuffer) AddRateLimited(data *protos.HealthEvents) {
+	// Re-queue with exponential backoff delay, then mark current attempt done
 	rb.healthMetricQueue.AddRateLimited(data)
 	rb.healthMetricQueue.Done(data)
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- platform-connectors/pkg/connectors/store/store_connector.go
- platform-connectors/pkg/connectors/store/store_connector_test.go
- platform-connectors/pkg/ringbuffer/ring_buffer.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Follow standard Go conventions with gofmt and golint
- Use structured logging via `log/slog` in Go code
- Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
- Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
- Use meaningful variable names such as `synced` over `ok` for cache sync checks
- Use `client-go` for Kubernetes API interactions in Go code
- Prefer informers over direct API calls for watching Kubernetes resources
- Implement proper shutdown handling with context cancellation in Go code
- Package-level godoc required for all Go packages
- Function comments required for all exported Go functions
- Use inline comments for complex logic only in Go code
- TODO comments should reference issues in Go code
- Extract informer event handler setup into helper methods
- Use separate informers for different Kubernetes resource types
Files:
- platform-connectors/pkg/connectors/store/store_connector_test.go
- platform-connectors/pkg/connectors/store/store_connector.go
- platform-connectors/pkg/ringbuffer/ring_buffer.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Use `envtest` for testing Kubernetes controllers instead of fake clients
- Use `testify/assert` and `testify/require` for assertions in Go tests
- Write table-driven tests when testing multiple scenarios in Go
- Name Go tests descriptively using the format `TestFunctionName_Scenario_ExpectedBehavior`
Files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧠 Learnings (4)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧬 Code graph analysis (1)
platform-connectors/pkg/connectors/store/store_connector_test.go (3)
- platform-connectors/pkg/ringbuffer/ring_buffer.go (2): `NewRingBuffer` (55-82), `WithRetryConfig` (48-53)
- store-client/pkg/client/interfaces.go (1): `InsertManyResult` (95-97)
- platform-connectors/pkg/connectors/store/store_connector.go (1): `DatabaseStoreConnector` (42-48)
🔇 Additional comments (9)
platform-connectors/pkg/connectors/store/store_connector.go (2)
35-40: LGTM! Well-documented constant for retry cap.

The constant is appropriately documented, explaining both the purpose (preventing unbounded memory growth) and the resilience benefit. The value of 3 is reasonable for transient failures.
114-136: Retry logic correctly implements bounded retries with exponential backoff.

The flow is sound:
- Initial attempt fails → `NumRequeues` returns 0 → retry
- After 3 requeues, `NumRequeues` returns 3 → drop

Good inclusion of contextual information in the error log (node name, check name, event count) to aid debugging dropped events.
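For illustration, a minimal sketch of this flow, assuming the method names cited in this review (`NumRequeues`, `AddRateLimited`, `HealthMetricEleProcessingCompleted`); the receiver fields, helper name, and log attributes are assumptions, not the PR's exact code:

```go
// handleInsertFailure is a hypothetical helper showing the bounded-retry
// branch described above.
func (d *DatabaseStoreConnector) handleInsertFailure(events *protos.HealthEvents, err error) {
	retryCount := d.ringBuffer.NumRequeues(events)
	if retryCount < d.maxRetries {
		// Transient failure: requeue with exponential backoff and try again.
		slog.Warn("mongodb insert failed, retrying",
			"retryCount", retryCount, "error", err)
		d.ringBuffer.AddRateLimited(events)
		return
	}
	// Retry budget exhausted: Forget + Done drops the item so the queue
	// cannot grow without bound during a prolonged outage.
	slog.Error("dropping health events after max retries",
		"retryCount", retryCount, "error", err)
	d.ringBuffer.HealthMetricEleProcessingCompleted(events)
}
```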
platform-connectors/pkg/connectors/store/store_connector_test.go (2)
307-362: Good test coverage for retry success scenario.

The test correctly:
- Configures short delays for fast test execution
- Sets up mock to fail twice then succeed
- Verifies 3 total `InsertMany` calls
- Confirms the queue is emptied after success
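A sketch of the fail-twice-then-succeed mock arrangement described above, using testify's mock package; the matcher arguments and result wiring here are assumptions for illustration:

```go
// Hypothetical setup: the first two InsertMany calls fail, the third succeeds.
mockClient.On("InsertMany", mock.Anything, mock.Anything).
	Return(nil, errors.New("transient write failure")).Twice()
mockClient.On("InsertMany", mock.Anything, mock.Anything).
	Return(&client.InsertManyResult{}, nil).Once()
```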
364-414: Good test coverage for max retry drop scenario.

The test correctly verifies that:
- After 4 attempts (initial + 3 retries), the event is dropped
- The queue is emptied to prevent unbounded memory growth
- The expected number of `InsertMany` calls matches `MaxRetries + 1`

One minor observation: consider adding an assertion or comment to make the relationship between `MaxRetries` (3) and expected calls (4) more explicit for maintainability.

platform-connectors/pkg/ringbuffer/ring_buffer.go (5)
28-33: Good default retry configuration with clear documentation.

The timing values are well-chosen:
- 500ms base delay provides quick recovery for brief transient failures
- 3s max delay caps the backoff to avoid excessive waits
- The comment explaining the progression (500ms → 1.5s → 3s) aids understanding
41-53: Clean functional options pattern implementation.

This enables configuring shorter delays in tests while maintaining production defaults, aligning well with the PR objective of making retry logic "configurable to avoid slowing unit tests."
55-82: LGTM! Constructor properly integrates configurable retry settings.

The implementation correctly:
- Initializes config with production defaults
- Applies any provided options to override
- Uses the configured delays for the exponential backoff rate limiter
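As a rough sketch of how such a constructor can wire the configured delays into client-go's exponential failure rate limiter: `WithRetryConfig` and `NewRingBuffer` are this PR's names, but the option signature, struct fields, and queue wiring shown here are assumptions, not the actual implementation.

```go
package ringbuffer

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// retryConfig and Option are illustrative; the PR's actual types may differ.
type retryConfig struct {
	baseDelay time.Duration // first retry delay (production default: 500ms)
	maxDelay  time.Duration // backoff cap (production default: 3s)
}

type Option func(*retryConfig)

// WithRetryConfig overrides the backoff delays, e.g. to keep unit tests fast.
func WithRetryConfig(base, max time.Duration) Option {
	return func(c *retryConfig) {
		c.baseDelay = base
		c.maxDelay = max
	}
}

// NewRingBuffer applies options over production defaults and builds the
// rate-limited workqueue used for health-event retries.
func NewRingBuffer(opts ...Option) *RingBuffer {
	cfg := retryConfig{baseDelay: 500 * time.Millisecond, maxDelay: 3 * time.Second}
	for _, opt := range opts {
		opt(&cfg)
	}
	limiter := workqueue.NewItemExponentialFailureRateLimiter(cfg.baseDelay, cfg.maxDelay)
	return &RingBuffer{healthMetricQueue: workqueue.NewRateLimitingQueue(limiter)}
}
```

A test could then call something like `NewRingBuffer(WithRetryConfig(5*time.Millisecond, 20*time.Millisecond))` to keep backoff delays negligible.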
108-111: Correct: `Forget` clears rate limiter state before marking done.

Calling `Forget` before `Done` ensures the rate limiter's failure count for this item is cleared, preventing stale retry metadata from accumulating.
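Sketched out, the completion path might look like the following; the method and field names are taken from this review, while the exact body is an assumption:

```go
// Forget clears the rate limiter's failure count for this item, then Done
// marks the in-flight processing attempt as finished.
func (rb *RingBuffer) HealthMetricEleProcessingCompleted(data *protos.HealthEvents) {
	rb.healthMetricQueue.Forget(data)
	rb.healthMetricQueue.Done(data)
}
```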
123-125: LGTM! Correctly exposes requeue count for retry logic.
```go
// MaxRetries defines the maximum number of retry attempts for MongoDB writes
// before dropping the event. This prevents unbounded memory growth while
// providing resilience against transient failures.
MaxRetries = 3
```
Should we expose this as a config value so that we have the option to update it if required?
Yeah, makes sense to me. I moved this into the Helm values and read it off the ConfigMap, which is how I think we're wiring in all the config values, but let me know if that's not right.
/ok to test ab004a6
I believe for the k8s one we might already be using retry on error -- https://github.com/NVIDIA/NVSentinel/blob/main/platform-connectors/pkg/connectors/kubernetes/process_node_events.go#L88 -- which I believe should handle the retries for us.
Yeah, it is a misnomer: when we initially started off, we said we'd do a ring buffer, but we never ended up doing so since we didn't want to drop or overwrite events. If we do make a ring buffer, we will likely also have to maintain a WAL like you mentioned so that we can keep the memory growth bounded and flush to disk at some cadence to not lose events.
🛡️ CodeQL Analysis: 🚨 Found 1 security alert(s)
Force-pushed: ab004a6 to 36a2b21
Signed-off-by: Suket Sharma <[email protected]>
Force-pushed: 36a2b21 to 582424c
/ok to test 582424c
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @distros/kubernetes/nvsentinel/values.yaml:
- Line 115: Add an inline comment for the maxRetries field explaining what it
controls, its default value, acceptable range/units, and its behavior (e.g.,
whether it applies per request or overall retry attempts and how backoff is
handled); update the maxRetries line (the maxRetries key) to include a brief
one-line comment above or to the right that documents purpose, default (3), and
any limits or interaction with backoff/retry logic.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- distros/kubernetes/nvsentinel/templates/configmap.yaml
- distros/kubernetes/nvsentinel/values.yaml
- platform-connectors/main.go
- platform-connectors/pkg/connectors/store/store_connector.go
- platform-connectors/pkg/connectors/store/store_connector_test.go
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Follow standard Go conventions with gofmt and golint
- Use structured logging via `log/slog` in Go code
- Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
- Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
- Use meaningful variable names such as `synced` over `ok` for cache sync checks
- Use `client-go` for Kubernetes API interactions in Go code
- Prefer informers over direct API calls for watching Kubernetes resources
- Implement proper shutdown handling with context cancellation in Go code
- Package-level godoc required for all Go packages
- Function comments required for all exported Go functions
- Use inline comments for complex logic only in Go code
- TODO comments should reference issues in Go code
- Extract informer event handler setup into helper methods
- Use separate informers for different Kubernetes resource types
Files:
- platform-connectors/main.go
- platform-connectors/pkg/connectors/store/store_connector.go
- platform-connectors/pkg/connectors/store/store_connector_test.go
**/values.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Document all values in Helm chart `values.yaml` with inline comments
- Include examples for non-obvious configurations in Helm chart documentation
- Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/values.yaml
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Use `envtest` for testing Kubernetes controllers instead of fake clients
- Use `testify/assert` and `testify/require` for assertions in Go tests
- Write table-driven tests when testing multiple scenarios in Go
- Name Go tests descriptively using the format `TestFunctionName_Scenario_ExpectedBehavior`
Files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧠 Learnings (1)
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧬 Code graph analysis (2)
platform-connectors/main.go (2)
- platform-connectors/pkg/connectors/store/store_connector.go (2): `DatabaseStoreConnector` (35-42), `InitializeDatabaseStoreConnector` (58-80)
- platform-connectors/pkg/server/platform_connector_server.go (1): `InitializeAndAttachRingBufferForConnectors` (70-72)

platform-connectors/pkg/connectors/store/store_connector_test.go (3)
- platform-connectors/pkg/connectors/store/store_connector.go (2): `InitializeDatabaseStoreConnector` (58-80), `DatabaseStoreConnector` (35-42)
- platform-connectors/pkg/ringbuffer/ring_buffer.go (2): `NewRingBuffer` (55-82), `WithRetryConfig` (48-53)
- store-client/pkg/client/interfaces.go (1): `InsertManyResult` (95-97)
🔇 Additional comments (9)
distros/kubernetes/nvsentinel/templates/configmap.yaml (1)
28-28: LGTM! The config entry is correctly added to the Helm template and properly references the values.yaml field.
platform-connectors/main.go (1)
136-159: LGTM! The config-based initialization properly reads and converts the maxRetries value, following the established patterns in this file for other configuration parameters.
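As a rough sketch of that read-and-convert step (the key name, config type, and fallback behavior here are assumptions, not the PR's exact code):

```go
import (
	"log/slog"
	"strconv"
)

// parseMaxRetries is a hypothetical helper: it reads the maxRetries entry
// from an already-loaded string config map and falls back to the documented
// default of 3 when the value is missing or invalid.
func parseMaxRetries(cfg map[string]string) int {
	raw, ok := cfg["maxRetries"]
	if !ok {
		return 3
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		slog.Warn("invalid maxRetries value, falling back to default", "value", raw)
		return 3
	}
	return n
}
```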
platform-connectors/pkg/connectors/store/store_connector_test.go (3)
298-298: LGTM! The test is correctly updated to include the new maxRetries parameter.
305-357: Excellent test coverage for the retry-on-failure scenario.

The test correctly validates that:
- Failed writes are retried with exponential backoff
- Queue eventually drains after a successful retry
- InsertMany is called the expected number of times
The test setup with mock failures followed by success effectively exercises the retry logic.
359-411: Excellent test coverage for the drop-after-max-retries scenario.

The test correctly validates the critical behavior that prevents unbounded memory growth:
- Events are dropped after exceeding maxRetries
- Queue drains even when all attempts fail
- InsertMany is called exactly 4 times (1 initial + 3 retries) as expected with maxRetries=3
This test ensures the system won't accumulate events indefinitely during prolonged outages.
platform-connectors/pkg/connectors/store/store_connector.go (4)
41-41: LGTM! The maxRetries field is appropriately added to the connector struct to support configurable retry behavior.
44-56: LGTM! The constructor properly accepts and stores the maxRetries parameter.
58-80: LGTM! The initialization function correctly accepts maxRetries, passes it to the constructor, and includes it in the success log for observability.
111-129: Excellent retry logic implementation with proper safeguards.

The implementation correctly:
- Retries failed writes with exponential backoff via rate-limited requeue
- Drops events after exceeding maxRetries to prevent unbounded memory growth
- Provides rich structured logging for both retry and drop scenarios
- Includes relevant event details (node name, check name) in drop logs for troubleshooting
The condition `retryCount < maxRetries` means that with maxRetries=3, events are attempted up to 4 times total (1 initial + 3 retries), which aligns with the test expectations and is a reasonable interpretation.
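The attempt arithmetic can be sanity-checked with a few illustrative lines:

```go
// With maxRetries = 3, the condition retryCount < maxRetries admits requeues
// at retryCount 0, 1, and 2, so the item is processed 4 times in total.
attempts := 0
for retryCount := 0; ; retryCount++ {
	attempts++ // one InsertMany attempt per dequeue
	if retryCount >= 3 {
		break // retry budget exhausted: drop instead of requeueing
	}
}
fmt.Println(attempts) // prints 4
```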
```yaml
mongodbStore:
  enabled: false
  clientCertMountPath: "/etc/ssl/mongo-client"
  maxRetries: 3
```
🛠️ Refactor suggestion | 🟠 Major
Add inline documentation for the maxRetries field.
The field lacks an inline comment explaining its purpose and behavior.
As per coding guidelines for Helm values.yaml files, all configuration values must be documented with inline comments.
📝 Suggested documentation
```diff
 mongodbStore:
   enabled: false
   clientCertMountPath: "/etc/ssl/mongo-client"
-  maxRetries: 3
+  # Maximum number of retry attempts for MongoDB write operations (default: 3)
+  # With maxRetries: 3, each event will be attempted up to 4 times total (1 initial + 3 retries)
+  # before being dropped to prevent unbounded memory growth during outages
+  maxRetries: 3
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In @distros/kubernetes/nvsentinel/values.yaml at line 115: add an inline comment
for the maxRetries field explaining what it controls, its default value,
acceptable range/units, and its behavior (e.g., whether it applies per request
or overall retry attempts and how backoff is handled); update the maxRetries
line (the maxRetries key) to include a brief one-line comment above or to the
right that documents purpose, default (3), and any limits or interaction with
backoff/retry logic.
Summary
Going through the platform connectors, I noticed that we drop health events if there's a failure writing into MongoDB. I think there's value in retrying across a few seconds when writing to MongoDB before we bail. This ensures that if there's a primary failover happening in the underlying data layer or a brief networking interruption, we're resilient to it. Fwiw, I don't think `ring_buffer.go` is an actual implementation of a ring buffer - it seems to be a thin wrapper around workqueues and is effectively unbounded, so you could still run into memory issues; that's worth addressing at a later point.

I think there's value in also making the same changes for the K8s connector, since that has the same issue IMO, but I'll wait for feedback on this PR before doing that too.
In the long term, I think we should do the following -
I've added in some unit tests, but wasn't able to test this against an actual cluster yet. I'm doing that concurrently but figured I could get feedback in the interim.
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
Tests