-
Notifications
You must be signed in to change notification settings - Fork 33
ci: fixes for uat janitor provider install #671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: fixes for uat janitor provider install #671
Conversation
π WalkthroughWalkthroughThe changes rename the janitor service account and IAM role to janitor-provider across AWS EKS and Helm configurations. A new global janitorProvider.enabled flag is introduced, the CSP section key is renamed from janitor to janitor-provider in multiple Helm values files, and the installation script is refactored to use array-based argument assembly. Changes
Estimated code review effortπ― 2 (Simple) | β±οΈ ~10 minutes Poem
π₯ Pre-merge checks | β 3β Passed checks (3 passed)
βοΈ Tip: You can configure your own custom pre-merge checks in the settings. β¨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (2)
tests/uat/install-apps.sh (2)
17-17: Fixsetflags:set -euox pipefaillikely breaks (wrong-oargument).
You almost certainly wantset -euxo pipefail(orset -euo pipefailplus conditionalset -x).Proposed fix
-set -euox pipefail +set -euxo pipefail
63-85: Be careful logging + xtrace:GCP_SERVICE_ACCOUNT/ cloud identifiers can leak into CI logs.
At minimum, consider redacting these fields in the βUsing configurationβ log (and/or only enabling-xunder a debug flag).
π§Ή Nitpick comments (3)
tests/uat/aws/nvsentinel-values.yaml (1)
32-36: Same concern as other UAT values: ensurejanitorProvidertoggle behavior is intended alongsidejanitor.
Ifglobal.janitor.enabledandglobal.janitorProvider.enabledare meant to be independent, consider adding a short comment in the values file to prevent future toggling mistakes.tests/uat/kind/nvsentinel-values.yaml (1)
32-36: LGTM for enabling provider in kind UAT, but confirm this is required for all kind runs.
If kind is used for local dev too, enabling provider unconditionally may add overhead/flakes; consider gating if that becomes an issue.tests/uat/install-apps.sh (1)
278-299:--set janitor-provider...is fine, but verify chart supports hyphenated root key + required inputs.
Also consider failing fast whenCSP=gcpand any ofGCP_PROJECT_ID/GCP_ZONE/GCP_SERVICE_ACCOUNTare empty, to avoid silent misconfig.Possible fail-fast check (optional)
elif [[ "$CSP" == "gcp" ]]; then + : "${GCP_PROJECT_ID:?GCP_PROJECT_ID must be set when CSP=gcp}" + : "${GCP_ZONE:?GCP_ZONE must be set when CSP=gcp}" + : "${GCP_SERVICE_ACCOUNT:?GCP_SERVICE_ACCOUNT must be set when CSP=gcp}" extra_set_args+=( "--set" "janitor-provider.csp.gcp.project=$GCP_PROJECT_ID" "--set" "janitor-provider.csp.gcp.zone=$GCP_ZONE" "--set" "janitor-provider.csp.gcp.serviceAccount=$GCP_SERVICE_ACCOUNT" )
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (5)
tests/uat/aws/eks-cluster-config.yaml.templatetests/uat/aws/nvsentinel-values.yamltests/uat/gcp/nvsentinel-values.yamltests/uat/install-apps.shtests/uat/kind/nvsentinel-values.yaml
π§° Additional context used
π§ Learnings (4)
π Common learnings
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:20-28
Timestamp: 2026-01-09T18:55:38.501Z
Learning: The janitor-provider gRPC service only requires get/list/watch permissions on nodes in its ClusterRole. It reads node metadata and delegates to CSP APIs. The janitor controller (separate component) performs actual Kubernetes node modifications including deletion and has its own RBAC configuration with appropriate write permissions.
π Learning: 2026-01-09T18:55:38.501Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:20-28
Timestamp: 2026-01-09T18:55:38.501Z
Learning: The janitor-provider gRPC service only requires get/list/watch permissions on nodes in its ClusterRole. It reads node metadata and delegates to CSP APIs. The janitor controller (separate component) performs actual Kubernetes node modifications including deletion and has its own RBAC configuration with appropriate write permissions.
Applied to files:
tests/uat/aws/eks-cluster-config.yaml.templatetests/uat/aws/nvsentinel-values.yamltests/uat/kind/nvsentinel-values.yamltests/uat/install-apps.shtests/uat/gcp/nvsentinel-values.yaml
π Learning: 2025-12-12T07:38:37.023Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:37.023Z
Learning: In NVSentinel, XID 74 errors always include an NVLINK entry in healthevent.entitiesimpacted, so null-checking with $ifNull is unnecessary when filtering for NVLINK entities in XID 74-specific rules. Apply this rule to YAML test fixtures under tests/ data (e.g., tests/data/health-events-analyzer-config.yaml) and any similar health-event configuration tests. If applying in code, ensure downstream filters rely on the presence of NVLINK in entitiesimpacted for XID 74 only, but continue to guard other fields and XIDs with appropriate null checks.
Applied to files:
tests/uat/aws/nvsentinel-values.yamltests/uat/kind/nvsentinel-values.yamltests/uat/gcp/nvsentinel-values.yaml
π Learning: 2025-12-12T07:41:27.339Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:27.339Z
Learning: In tests/data/health-events-analyzer-config.yaml, the XID74Reg2Bit13Set rule intentionally omits the time window filter; tests should verify only the register bit pattern (bit 13 in REG2) on the incoming XID 74 event and should not rely on historical events or counts of repeats. If adding similar rules elsewhere, apply the same pattern and document that the time window filter is unnecessary for single-event bit checks.
Applied to files:
tests/uat/aws/nvsentinel-values.yamltests/uat/kind/nvsentinel-values.yamltests/uat/gcp/nvsentinel-values.yaml
π Additional comments (6)
tests/uat/gcp/nvsentinel-values.yaml (2)
32-36: Confirmglobal.janitorProvider.enabledis actually consumed (and clarify interaction withglobal.janitor.enabled).
Right now bothglobal.janitor.enabledandglobal.janitorProvider.enabledaretrue; if these control different components (controller vs provider) thatβs fine, but if not, this can double-deploy or conflict.
85-88: Hyphenated values key (janitor-provider) requires Helm templates to useindexaccess.
Make sure the chart templates are updated to reference this block viaindex .Values "janitor-provider"(Go templates canβt dot-access keys with-).tests/uat/aws/nvsentinel-values.yaml (1)
79-86: Verify AWS CSP values path matches script (janitor-provider.csp.aws.*) and chart expectations.
This relies on consistent naming across: Helm templates, the values fixture, andinstall-apps.sh--set janitor-provider.csp.aws.*.tests/uat/kind/nvsentinel-values.yaml (1)
88-91: Same Helm keying constraint:janitor-providermust be read viaindexin templates.tests/uat/aws/eks-cluster-config.yaml.template (1)
30-33: Ensure Helm deploy uses the same ServiceAccount name as the IRSA mapping (janitor-provider).
The eksctl IRSA mapping is now pinned tometadata.name: janitor-provider; if the chart creates/uses a different SA name, pods wonβt get the role.Also applies to: 52-54
tests/uat/install-apps.sh (1)
255-332: [Rewritten review comment]
[Exactly ONE classification tag]
π‘οΈ CodeQL Analysisπ¨ Found 1 security alert(s) π View details |
Summary
UAT tests need changes to reflect #490
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
βοΈ Tip: You can customize this high-level summary in your review settings.