@jtschelling jtschelling commented Jan 9, 2026

Summary

UAT tests need changes to reflect #490

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Chores
    • Updated infrastructure configuration naming conventions across multiple test environments (AWS, GCP, KinD) to align service account and role references.

coderabbitai bot commented Jan 9, 2026

📝 Walkthrough

The changes rename the janitor service account and IAM role to janitor-provider across AWS EKS and Helm configurations. A new global janitorProvider.enabled flag is introduced, the CSP section key is renamed from janitor to janitor-provider in multiple Helm values files, and the installation script is refactored to use array-based argument assembly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| EKS Cluster Configuration: tests/uat/aws/eks-cluster-config.yaml.template | Updated service account name from janitor to janitor-provider; updated IAM role name from ${CLUSTER_NAME}-janitor to ${CLUSTER_NAME}-janitor-provider. |
| Helm Values Files: tests/uat/aws/nvsentinel-values.yaml, tests/uat/gcp/nvsentinel-values.yaml, tests/uat/kind/nvsentinel-values.yaml | Added global.janitorProvider.enabled: true flag; renamed top-level CSP section key from janitor to janitor-provider while preserving nested provider configurations. |
| Installation Script: tests/uat/install-apps.sh | Updated janitor role references to janitor-provider; refactored NVSentinel Helm invocation from inline argument assembly to array-based helm_args with CSP-specific append operations. |
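
The array-based assembly described in the last row could look roughly like the sketch below. It is not the script's actual code: the function name `build_helm_args`, the release name, and the chart path are hypothetical, and only the GCP `--set` keys are taken from this review. The AWS keys are omitted because they are not spelled out here.

```shell
#!/usr/bin/env bash
# Sketch only: assemble Helm arguments in an array and append CSP-specific
# --set flags. Function name, release name, and chart path are hypothetical.
build_helm_args() {
    local csp="$1" values_file="$2"
    # Each array element stays one argument, even if a value contains spaces.
    helm_args=(
        upgrade --install nvsentinel ./chart
        --values "$values_file"
    )
    if [[ "$csp" == "gcp" ]]; then
        helm_args+=(
            --set "janitor-provider.csp.gcp.project=${GCP_PROJECT_ID}"
            --set "janitor-provider.csp.gcp.zone=${GCP_ZONE}"
            --set "janitor-provider.csp.gcp.serviceAccount=${GCP_SERVICE_ACCOUNT}"
        )
    fi
}

# Placeholder values for illustration only.
GCP_PROJECT_ID="demo-project"
GCP_ZONE="us-central1-a"
GCP_SERVICE_ACCOUNT="demo-sa"
build_helm_args gcp tests/uat/gcp/nvsentinel-values.yaml
printf '%s\n' "${helm_args[@]}"
```

The real invocation would then be `helm "${helm_args[@]}"`. Quoting the array expansion avoids the word-splitting problems that string concatenation of arguments tends to cause.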

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A rabbit hops through configs with glee,
Renaming each janitor to provider we see,
With helm_args now bundled in arrays so neat,
The CSP section's refactor complete! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'ci: fixes for uat janitor provider install' directly relates to the main change: updating UAT configuration files to reflect janitor-to-janitor-provider naming updates across AWS, GCP, and Kind clusters. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/uat/install-apps.sh (2)

17-17: Fix set flags: set -euox pipefail likely breaks (wrong -o argument).
You almost certainly want set -euxo pipefail (or set -euo pipefail plus conditional set -x).

Proposed fix
-set -euox pipefail
+set -euxo pipefail
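
Whether the original spelling actually breaks depends on how the shell parses the flag cluster. In bash, `-o` inside a cluster generally consumes the next word as its option name, so both orderings may behave the same. A minimal sketch for checking this on the bash at hand follows; `check_flags` is a hypothetical helper, not part of install-apps.sh.

```shell
#!/usr/bin/env bash
# Sketch: check, in a throwaway child bash, which options a flag cluster enables.
# check_flags is a hypothetical helper, not part of install-apps.sh.
check_flags() {
    local flags="$1"
    # stderr is discarded so xtrace output and parse errors stay out of the result.
    # The trailing fallback covers the case where the child exits before reporting.
    bash -c "set $flags pipefail; shopt -qo pipefail && shopt -qo errexit && shopt -qo nounset && shopt -qo xtrace && echo enabled || echo not-enabled" 2>/dev/null \
        || echo not-enabled
}

check_flags -euxo
check_flags -euox
```

If both lines report `enabled`, the reordering is a readability fix rather than a behavioral one. `set -euxo pipefail` remains the more conventional spelling.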

63-85: Be careful logging + xtrace: GCP_SERVICE_ACCOUNT / cloud identifiers can leak into CI logs.
At minimum, consider redacting these fields in the “Using configuration” log (and/or only enabling -x under a debug flag).
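
One way to act on this, sketched under assumptions: the `redact` helper and the `DEBUG` variable below are hypothetical and do not exist in install-apps.sh. The idea is to mask identifiers before they reach the log and to turn on xtrace only when explicitly requested.

```shell
#!/usr/bin/env bash
# Sketch: mask sensitive values before logging and enable xtrace only on request.
# redact and the DEBUG variable are hypothetical, not part of install-apps.sh.
redact() {
    local value="$1"
    if (( ${#value} <= 4 )); then
        # Too short to reveal any part of it safely.
        printf '****\n'
    else
        # Keep only the last four characters so entries remain distinguishable.
        printf '****%s\n' "${value: -4}"
    fi
}

# Tracing echoes every expanded command, so gate it behind an explicit opt-in.
if [[ "${DEBUG:-0}" == "1" ]]; then
    set -x
fi

# Placeholder value for illustration only.
GCP_SERVICE_ACCOUNT="uat-sa@demo-project.iam.gserviceaccount.com"
echo "Using configuration: GCP_SERVICE_ACCOUNT=$(redact "$GCP_SERVICE_ACCOUNT")"
```

Note that with `DEBUG=1` the trace itself would still print the unmasked value when `redact` is called, so redaction and gated tracing work best together rather than as alternatives.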

🧹 Nitpick comments (3)
tests/uat/aws/nvsentinel-values.yaml (1)

32-36: Same concern as other UAT values: ensure janitorProvider toggle behavior is intended alongside janitor.
If global.janitor.enabled and global.janitorProvider.enabled are meant to be independent, consider adding a short comment in the values file to prevent future toggling mistakes.

tests/uat/kind/nvsentinel-values.yaml (1)

32-36: LGTM for enabling provider in kind UAT, but confirm this is required for all kind runs.
If kind is used for local dev too, enabling provider unconditionally may add overhead/flakes; consider gating if that becomes an issue.

tests/uat/install-apps.sh (1)

278-299: --set janitor-provider... is fine, but verify chart supports hyphenated root key + required inputs.
Also consider failing fast when CSP=gcp and any of GCP_PROJECT_ID/GCP_ZONE/GCP_SERVICE_ACCOUNT are empty, to avoid silent misconfig.

Possible fail-fast check (optional)
     elif [[ "$CSP" == "gcp" ]]; then
+        : "${GCP_PROJECT_ID:?GCP_PROJECT_ID must be set when CSP=gcp}"
+        : "${GCP_ZONE:?GCP_ZONE must be set when CSP=gcp}"
+        : "${GCP_SERVICE_ACCOUNT:?GCP_SERVICE_ACCOUNT must be set when CSP=gcp}"
         extra_set_args+=(
             "--set" "janitor-provider.csp.gcp.project=$GCP_PROJECT_ID"
             "--set" "janitor-provider.csp.gcp.zone=$GCP_ZONE"
             "--set" "janitor-provider.csp.gcp.serviceAccount=$GCP_SERVICE_ACCOUNT"
         )
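
A variant of the fail-fast idea above, as a sketch only: the `:?` form stops at the first missing variable, whereas a small helper can report all of them in one pass. `require_vars` is a hypothetical name, not something in install-apps.sh.

```shell
#!/usr/bin/env bash
# Sketch: report every missing variable at once instead of stopping at the first.
# require_vars is a hypothetical helper, not part of install-apps.sh.
require_vars() {
    local name missing=0
    for name in "$@"; do
        # ${!name:-} is bash indirect expansion: the value of the variable named by $name.
        if [[ -z "${!name:-}" ]]; then
            echo "error: $name must be set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In the GCP branch this could be called as `require_vars GCP_PROJECT_ID GCP_ZONE GCP_SERVICE_ACCOUNT || exit 1` before the `--set` arguments are appended.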
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47f275b and 7850d3a.

📒 Files selected for processing (5)
  • tests/uat/aws/eks-cluster-config.yaml.template
  • tests/uat/aws/nvsentinel-values.yaml
  • tests/uat/gcp/nvsentinel-values.yaml
  • tests/uat/install-apps.sh
  • tests/uat/kind/nvsentinel-values.yaml
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:20-28
Timestamp: 2026-01-09T18:55:38.501Z
Learning: The janitor-provider gRPC service only requires get/list/watch permissions on nodes in its ClusterRole. It reads node metadata and delegates to CSP APIs. The janitor controller (separate component) performs actual Kubernetes node modifications including deletion and has its own RBAC configuration with appropriate write permissions.
📚 Learning: 2026-01-09T18:55:38.501Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:20-28
Timestamp: 2026-01-09T18:55:38.501Z
Learning: The janitor-provider gRPC service only requires get/list/watch permissions on nodes in its ClusterRole. It reads node metadata and delegates to CSP APIs. The janitor controller (separate component) performs actual Kubernetes node modifications including deletion and has its own RBAC configuration with appropriate write permissions.

Applied to files:

  • tests/uat/aws/eks-cluster-config.yaml.template
  • tests/uat/aws/nvsentinel-values.yaml
  • tests/uat/kind/nvsentinel-values.yaml
  • tests/uat/install-apps.sh
  • tests/uat/gcp/nvsentinel-values.yaml
📚 Learning: 2025-12-12T07:38:37.023Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:37.023Z
Learning: In NVSentinel, XID 74 errors always include an NVLINK entry in healthevent.entitiesimpacted, so null-checking with $ifNull is unnecessary when filtering for NVLINK entities in XID 74-specific rules. Apply this rule to YAML test fixtures under tests/ data (e.g., tests/data/health-events-analyzer-config.yaml) and any similar health-event configuration tests. If applying in code, ensure downstream filters rely on the presence of NVLINK in entitiesimpacted for XID 74 only, but continue to guard other fields and XIDs with appropriate null checks.

Applied to files:

  • tests/uat/aws/nvsentinel-values.yaml
  • tests/uat/kind/nvsentinel-values.yaml
  • tests/uat/gcp/nvsentinel-values.yaml
📚 Learning: 2025-12-12T07:41:27.339Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:27.339Z
Learning: In tests/data/health-events-analyzer-config.yaml, the XID74Reg2Bit13Set rule intentionally omits the time window filter; tests should verify only the register bit pattern (bit 13 in REG2) on the incoming XID 74 event and should not rely on historical events or counts of repeats. If adding similar rules elsewhere, apply the same pattern and document that the time window filter is unnecessary for single-event bit checks.

Applied to files:

  • tests/uat/aws/nvsentinel-values.yaml
  • tests/uat/kind/nvsentinel-values.yaml
  • tests/uat/gcp/nvsentinel-values.yaml
🔇 Additional comments (6)
tests/uat/gcp/nvsentinel-values.yaml (2)

32-36: Confirm global.janitorProvider.enabled is actually consumed (and clarify interaction with global.janitor.enabled).
Right now both global.janitor.enabled and global.janitorProvider.enabled are true; if these control different components (controller vs provider) that’s fine, but if not, this can double-deploy or conflict.


85-88: Hyphenated values key (janitor-provider) requires Helm templates to use index access.
Make sure the chart templates are updated to reference this block via index .Values "janitor-provider" (Go templates can’t dot-access keys with -).
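
If any template does read this block directly, a quick scan can surface dot-access of the hyphenated key. The helper below is a sketch: `find_dot_access` is a hypothetical name, and the templates directory to pass in is whatever path the chart actually uses.

```shell
#!/usr/bin/env bash
# Sketch: list template lines that dot-access a hyphenated values key.
# find_dot_access is a hypothetical helper; pass the chart's templates directory.
find_dot_access() {
    local templates_dir="$1" key="$2"
    # -F treats the pattern literally, so the dots and hyphen need no escaping.
    # Lines using index .Values "janitor-provider" do not match this pattern.
    grep -rnF ".Values.${key}" "$templates_dir"
}
```

Calling `find_dot_access <templates-dir> janitor-provider` prints any offending lines with file names and line numbers. An exit status of 1 with no output means none were found.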

tests/uat/aws/nvsentinel-values.yaml (1)

79-86: Verify AWS CSP values path matches script (janitor-provider.csp.aws.*) and chart expectations.
This relies on consistent naming across: Helm templates, the values fixture, and install-apps.sh --set janitor-provider.csp.aws.*.

tests/uat/kind/nvsentinel-values.yaml (1)

88-91: Same Helm keying constraint: janitor-provider must be read via index in templates.

tests/uat/aws/eks-cluster-config.yaml.template (1)

30-33: Ensure Helm deploy uses the same ServiceAccount name as the IRSA mapping (janitor-provider).
The eksctl IRSA mapping is now pinned to metadata.name: janitor-provider; if the chart creates/uses a different SA name, pods won’t get the role.

Also applies to: 52-54
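
A possible post-install check for this, as a sketch: compare the role name in the ServiceAccount's IRSA annotation with the `${CLUSTER_NAME}-janitor-provider` name from the eksctl template. Both helper names are hypothetical, and the sketch assumes the role has no IAM path prefix.

```shell
#!/usr/bin/env bash
# Sketch: compare the role name in a ServiceAccount's IRSA annotation with the
# name the eksctl template creates. Both helpers are hypothetical.
role_name_from_arn() {
    # arn:aws:iam::<account-id>:role/<name> -> <name> (assumes no IAM path prefix)
    printf '%s\n' "${1##*/}"
}

check_irsa_role() {
    local cluster_name="$1" annotated_arn="$2"
    [[ "$(role_name_from_arn "$annotated_arn")" == "${cluster_name}-janitor-provider" ]]
}
```

The annotation could be read with `kubectl get serviceaccount janitor-provider -n <namespace> -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'`, where the namespace depends on the install and is not stated in this PR.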

tests/uat/install-apps.sh (1)

255-332: [Rewritten review comment]
[Exactly ONE classification tag]

github-actions bot commented Jan 9, 2026

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

@lalitadithya lalitadithya merged commit 6af540e into NVIDIA:main Jan 10, 2026
37 checks passed