WIP: Add node-team plugin #457
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| { | ||
| "name": "node-team", | ||
| "description": "OpenShift Node team assistant for development, deployment, debugging, and workflow tasks across kubelet, MCO, CRI-O, crun, conmonrs, Kueue operator, Jira, Red Hat KB/support cases, Prometheus, and platform docs.", | ||
| "version": "0.12.0", | ||
| "author": { | ||
| "name": "github.com/openshift-eng" | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| { | ||
| "description": "Node Core team roster — maps Jira display names to GitHub handles", | ||
| "members": { | ||
| "Jira Display Name": "github-handle", | ||
| "Another Person": "their-github-handle" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| { | ||
| "description": "Node Devices (DRA) team roster — maps Jira display names to GitHub handles", | ||
| "members": { | ||
| "Jira Display Name": "github-handle", | ||
| "Another Person": "their-github-handle" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| --- | ||
| name: node-team | ||
| description: "OpenShift Node team assistant. Covers kubelet, MCO, CRI-O, crun, conmonrs, Kueue operator, Jira (OCPNODE/OCPBUGS), Red Hat KB/support cases, Prometheus, and K8s/OCP docs. Triggers on any OpenShift node-layer development, deployment, debugging, or workflow task." | ||
| allowed-tools: Bash(curl:*) | ||
| --- | ||
|
|
||
|
|
||
| ## How to use this skill | ||
|
|
||
| 1. Read [references/INDEX.md](references/INDEX.md) to route to the relevant reference | ||
| 2. Read the reference, then act on it — run scripts, fetch data, present results | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| # Node Skill Reference Index | ||
|
|
||
| Reference files contain only tribal knowledge and non-obvious nuances. For discoverable details (build commands, repo layout, test targets), browse the source code directly. | ||
|
|
||
| Root: `./` | ||
|
|
||
| ## Setup | ||
|
|
||
| |SETUP.md | ||
|
|
||
| ## Development | ||
|
|
||
| |development:{kubelet-dev.md,mco-dev.md,crio-dev.md,crun-conmon.md,kueue-operator-dev.md,worktrees.md} | ||
|
|
||
| ## Deployment | ||
|
|
||
| |deployment:{debug-binary.md} | ||
| |deployment/debug-binary:{crio.md,cross-compile.md,deploy.md,rollback.md,ssh-bastion.md} | ||
|
|
||
| ## Jira | ||
|
|
||
| |jira.md | ||
|
|
||
| ## Red Hat Support | ||
|
|
||
| |support.md | ||
|
|
||
| ## Platform Documentation | ||
|
|
||
| |platform-docs.md | ||
|
|
||
| ## Prometheus | ||
|
|
||
| |prometheus.md |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| # Standard Repo Setup | ||
|
|
||
| All node team repos follow the same clone + worktree workflow for feature work. | ||
|
|
||
| ## Clone | ||
|
|
||
| Clone into the current working directory (if not already present): | ||
|
|
||
| ```bash | ||
| git clone <repo-url> | ||
| cd <repo-name> | ||
| ``` | ||
|
|
||
| ## Worktree for Feature Work | ||
|
|
||
| Never work directly on the default branch. Create a worktree: | ||
|
|
||
| ```bash | ||
| git worktree add .worktrees/<name> -b wt/<name> | ||
| cd .worktrees/<name> | ||
| ``` | ||
|
|
||
| Deduce `<name>` from the task description (e.g., "reflink feature" -> `reflink`, "fix cgroup leak" -> `fix-cgroup-leak`, "OCPNODE-1234" -> `ocpnode-1234`). | ||
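The mechanical part of this naming rule (lowercase, hyphen-separated) could be sketched as a small shell helper. `worktree_name` is a hypothetical name and not part of the plugin. It does not drop filler words, so "reflink feature" would still need trimming to `reflink` by hand.

```bash
# Hypothetical helper: turn a task description or ticket key into a worktree name.
# Lowercases the input, replaces runs of non-alphanumerics with a single hyphen,
# and strips leading/trailing hyphens.
worktree_name() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

worktree_name "fix cgroup leak"   # prints fix-cgroup-leak
worktree_name "OCPNODE-1234"      # prints ocpnode-1234
```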
|
|
||
| ## Worktree for PR Work | ||
|
|
||
| To review or continue work on an existing PR: | ||
|
|
||
| ```bash | ||
| git fetch origin pull/<number>/head:pr-<number> | ||
| git worktree add .worktrees/pr-<number> pr-<number> | ||
| cd .worktrees/pr-<number> | ||
| ``` | ||
|
|
||
| If resuming work on a PR you've already fetched, check `git worktree list` first — the worktree may already exist. | ||
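One way to script that check is to look for the path in the machine-readable listing, assuming `git worktree list --porcelain` reports each worktree as a `worktree <absolute-path>` line. `worktree_exists` is a hypothetical helper name used only for illustration.

```bash
# Hypothetical helper: succeed if the given absolute path is already a registered worktree.
# Reads the output of `git worktree list --porcelain` on stdin and looks for an
# exact "worktree <path>" line.
worktree_exists() {
  grep -Fqx "worktree $1"
}

# Usage (assumes the current directory is the main checkout):
#   git worktree list --porcelain | worktree_exists "$PWD/.worktrees/pr-<number>"
```

Git prints resolved absolute paths, so the comparison may miss a worktree if the checkout is reached through a symlink.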
|
|
||
| ## Worktree for Jira Ticket Work | ||
|
|
||
| To investigate or fix a Jira issue: | ||
|
|
||
| 1. Fetch the issue details to determine the component (see [jira.md](../jira.md) for auth setup): | ||
| ```bash | ||
| curl -s -u "$JIRA_USER:$JIRA_API_TOKEN" "https://redhat.atlassian.net/rest/api/3/issue/OCPNODE-1234?fields=summary,components" | ||
| ``` | ||
| 2. Map the component to a repo (see Component to Repo Mapping and Repo URLs below), confirm with the user, and clone if needed. | ||
| 3. Create a worktree named after the ticket: | ||
| ```bash | ||
| git worktree add .worktrees/ocpnode-1234 -b wt/ocpnode-1234 | ||
| cd .worktrees/ocpnode-1234 | ||
| ``` | ||
|
|
||
| ## Component to Repo Mapping | ||
|
|
||
| | Jira Label / Component | Repo | | ||
| |-------------------------|------| | ||
| | `crio` | cri-o | | ||
| | `kubelet` | kubernetes | | ||
| | `mco` | machine-config-operator | | ||
| | `crun` | crun | | ||
| | `conmonrs` | conmon-rs | | ||
| | `kueue` | kueue-operator | | ||
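The mapping table could be encoded as a lookup so a script can resolve a component before cloning. This is a minimal sketch: `component_repo` is a hypothetical function name, and it covers only the six rows listed above.

```bash
# Hypothetical helper: map a Jira label/component to its repo name.
# Prints the repo name, or returns non-zero for anything not in the table.
component_repo() {
  case "$1" in
    crio)     echo "cri-o" ;;
    kubelet)  echo "kubernetes" ;;
    mco)      echo "machine-config-operator" ;;
    crun)     echo "crun" ;;
    conmonrs) echo "conmon-rs" ;;
    kueue)    echo "kueue-operator" ;;
    *)        echo "unknown component: $1" >&2; return 1 ;;
  esac
}

component_repo mco   # prints machine-config-operator
```

Failing loudly on unknown components keeps the "confirm with the user" step from being skipped when a ticket carries an unexpected label.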
|
|
||
| ## Enable Node Assistant in the Worktree | ||
|
|
||
| After creating a worktree, install the skill locally so it's available when you launch Claude there: | ||
|
|
||
| ```bash | ||
| cd .worktrees/<name> | ||
| claude plugin install node-assistant@node-skills --scope local | ||
| ``` | ||
|
|
||
| ## Repo URLs | ||
|
|
||
| | Component | Upstream | Downstream (OpenShift) | | ||
| |-----------|----------|------------------------| | ||
| | CRI-O | `https://github.com/cri-o/cri-o.git` | `https://github.com/openshift/cri-o.git` | | ||
| | Kubelet | `https://github.com/kubernetes/kubernetes.git` | `https://github.com/openshift/kubernetes.git` | | ||
| | MCO | — | `https://github.com/openshift/machine-config-operator.git` | | ||
| | crun | `https://github.com/containers/crun.git` | — | | ||
| | conmon-rs | `https://github.com/containers/conmon-rs.git` | — | | ||
| | Kueue Operator | `https://github.com/kubernetes-sigs/kueue.git` | `https://github.com/openshift/kueue-operator.git` | | ||
|
|
||
| For upstream features and bug fixes, clone upstream. For OpenShift-specific work, clone downstream. | ||
|
|
||
| ## Cleanup | ||
|
|
||
| ```bash | ||
| # List worktrees | ||
| git worktree list | ||
|
|
||
| # Remove when done | ||
| git worktree remove .worktrees/<name> | ||
| git branch -d wt/<name> | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,113 @@ | ||
| # Deploying Debug Binaries to RHCOS Nodes | ||
|
|
||
| Deploy a custom-built binary (CRI-O, crun, kubelet, etc.) to an OpenShift worker node running RHCOS for debugging or POC testing. | ||
|
|
||
| ## The Challenge | ||
|
|
||
| RHCOS (Red Hat Enterprise Linux CoreOS) has an **immutable `/usr` filesystem**. You cannot overwrite `/usr/bin/crio` or any other system binary directly. There is no package manager (`dnf`/`yum`), no compiler toolchain, and no development headers on the node. | ||
|
|
||
| ## The Solution | ||
|
|
||
| **Bind-mount** your custom binary over the original. The bind mount shadows the original file without modifying the rootfs. The original binary remains intact underneath and is instantly recoverable by unmounting. | ||
|
|
||
| ``` | ||
| mount --bind /home/core/crio /usr/bin/crio | ||
| ``` | ||
|
Review comment (Contributor) on lines +13 to +15:
Specify languages for fenced blocks in the overview doc. Line 13 and Line 97 are unlabeled fenced blocks. Suggested patch: label the fence that opens the `mount --bind /home/core/crio /usr/bin/crio` block as `bash`. Also applies to: 97-113. markdownlint-cli2 (0.22.1) warning at 13-13: Fenced code blocks should have a language specified (MD040, fenced-code-language).
||
|
|
||
| For cluster-wide deployment that survives reboots, use [layered images](layered-image.md) instead. | ||
|
|
||
| ## Four Phases | ||
|
|
||
| ### Phase 1: Build (Cross-Compile) | ||
|
|
||
| Cross-compile the binary for `linux/amd64` using a Docker container that matches the target OS libraries. The binary must be dynamically linked against compatible library versions (same sonames as RHCOS). | ||
|
|
||
| See [debug-binary/cross-compile.md](debug-binary/cross-compile.md) | ||
|
|
||
| ### Phase 2: Access (SSH Bastion) | ||
|
|
||
| Reach the worker node via an SSH bastion pod. RHCOS nodes are not directly accessible from outside the cluster. You need to discover the SSH key used at cluster install time and deploy a bastion DaemonSet. | ||
|
|
||
| See [debug-binary/ssh-bastion.md](debug-binary/ssh-bastion.md) | ||
|
|
||
| ### Phase 3: Deploy (Bind Mount) | ||
|
|
||
| Transfer the binary to the node, verify it works, cordon and drain the node, set the SELinux context, bind-mount over the original, and restart the service. This phase has the most gotchas around SELinux, systemd, and service dependencies. | ||
|
|
||
| See [debug-binary/deploy.md](debug-binary/deploy.md) | ||
|
|
||
| ### Phase 4: Rollback (Unmount) | ||
|
|
||
| Unmount the bind mount, remove any config drop-ins, and restart the service. The original binary is untouched underneath. | ||
|
|
||
| See [debug-binary/rollback.md](debug-binary/rollback.md) | ||
|
|
||
| ## Binary-Specific References | ||
|
|
||
| Each binary has its own reference with build dependencies, systemd units, and deployment details: | ||
|
|
||
| - **CRI-O**: [debug-binary/crio.md](debug-binary/crio.md) -- build tags, library deps, kubelet restart, config drop-ins | ||
|
|
||
| ## Safety Rules | ||
|
|
||
| These are non-negotiable. Skipping any of these can take a node out of the cluster. | ||
|
|
||
| 1. **Verify SSH bastion connectivity first.** Before building or deploying anything, confirm you can reach the target worker node via the bastion. Run `uname -a` over SSH. If you cannot reach the node, nothing else matters. | ||
|
|
||
| 2. **Always preflight-test the binary** before deploying. SCP it to `/home/core/`, run `ldd` to verify libraries resolve, and run `<binary> --version` to confirm it loads. If either fails, do not proceed. | ||
|
|
||
| 3. **Always cordon and drain first.** Never restart a container runtime on a node with running workloads. | ||
|
|
||
| 4. **Always test on only ONE worker node at a time.** Keep at least one other healthy worker to maintain cluster capacity. | ||
|
|
||
| 5. **Always set the SELinux context** before bind-mounting: | ||
| ```bash | ||
| sudo chcon --reference=/usr/bin/<original> /home/core/<binary> | ||
| ``` | ||
| Without the correct context (e.g., `container_runtime_exec_t` for CRI-O), systemd will refuse to execute the binary with `Permission denied`. | ||
|
|
||
| 6. **Know how to roll back before you deploy.** The rollback is: unmount, then restart the service. Read [debug-binary/rollback.md](debug-binary/rollback.md) before starting. | ||
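The library half of the preflight in rule 2 could be automated roughly as follows. This is a sketch, assuming `ldd` marks unresolved libraries with the text `not found`. `libs_resolve` is a hypothetical helper name, and it does not replace running `<binary> --version`.

```bash
# Hypothetical helper: reads `ldd` output on stdin.
# Prints any unresolved libraries and fails if at least one is missing.
libs_resolve() {
  ! grep -F 'not found'
}

# Usage on the node, before any bind mount:
#   ldd /home/core/<binary> | libs_resolve || echo "unresolved libraries, do not proceed"
```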
|
|
||
| ## Quick Reference | ||
|
|
||
| | Step | Command | | ||
| |------|---------| | ||
| | Check node OS | `oc get nodes -o wide` | | ||
| | Check current binary version | SSH in, `<binary> --version` | | ||
| | Cordon node | `oc adm cordon <node>` | | ||
| | Drain node | `oc adm drain <node> --ignore-daemonsets --delete-emptydir-data` | | ||
| | Uncordon node | `oc adm uncordon <node>` | | ||
| | Verify node health | `oc get node <node>` (wait for Ready) | | ||
| | Check bind mounts | `ssh core@<node> "mount \| grep /usr/bin"` | | ||
|
|
||
| ## Deciding: Bind Mount vs Layered Image | ||
|
|
||
| | | Bind Mount | Layered Image | | ||
| |---|---|---| | ||
| | Scope | Single node | All nodes in a pool | | ||
| | Survives reboot | No (unless systemd drop-in) | Yes | | ||
| | Speed | Minutes | 30-60 min (MCO rollout) | | ||
| | Use case | Quick debug/test | Cluster-wide validation, customer simulation | | ||
| | Rollback | `umount` | Delete MachineConfig | | ||
|
|
||
| Use bind mounts for quick single-node testing. Use [layered images](layered-image.md) when you need the binary on all nodes or need it to persist across reboots. | ||
|
|
||
| ## Workflow Diagram | ||
|
|
||
| ``` | ||
| Local Machine RHCOS Worker Node | ||
| ───────────── ───────────────── | ||
| 1. Cross-compile in Docker | ||
| (linux/amd64, matching libs) | ||
| │ | ||
| 2. SCP via bastion ─────────────► /home/core/<binary> | ||
| 3. ldd, --version (preflight) | ||
| 4. chcon (SELinux) | ||
| │ | ||
| oc adm cordon/drain | ||
| 5. mount --bind | ||
| 6. systemctl restart | ||
| │ | ||
| oc adm uncordon | ||
| 7. Verify: --version, node Ready | ||
| ``` | ||
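The ordering in the diagram can also be written out as a dry-run plan that only prints commands, which is easier to review before touching a node. This is a sketch under assumptions: `deploy_plan` is a hypothetical helper, the binary is assumed to have the same file name under `/home/core/` and `/usr/bin/`, and `sudo` is assumed for the mount and restart steps. The `oc` and `chcon` invocations are the ones given in the Safety Rules and Quick Reference above.

```bash
# Hypothetical helper: print the steps in diagram order without running any of them.
# $1 = node name, $2 = binary name, $3 = systemd service to restart.
deploy_plan() {
  cat <<EOF
ldd /home/core/$2
/home/core/$2 --version
sudo chcon --reference=/usr/bin/$2 /home/core/$2
oc adm cordon $1
oc adm drain $1 --ignore-daemonsets --delete-emptydir-data
sudo mount --bind /home/core/$2 /usr/bin/$2
sudo systemctl restart $3
oc adm uncordon $1
EOF
}

deploy_plan "<node>" crio crio
```

Printing rather than executing keeps the node-side steps (run over SSH) and the cluster-side `oc` steps (run locally) visible as one sequence. Splitting them across the right hosts is left to the operator.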
Review comment (analysis script executed against repository openshift-eng/ai-helpers):
Synchronize version with marketplace.json.
The plugin declares version `0.12.0`, but marketplace.json lists `0.11.0`. This mismatch will cause inconsistent version display and registration issues. Update marketplace.json to version `0.12.0`.