6 changes: 6 additions & 0 deletions .claude-plugin/marketplace.json
@@ -238,6 +238,12 @@
"source": "./plugins/marketplace-ops",
"description": "Maintenance commands for Claude Code plugin marketplaces",
"version": "0.1.2"
},
{
"name": "node-team",
"source": "./plugins/node-team",
"description": "OpenShift Node team assistant for development, deployment, debugging, and workflow tasks",
"version": "0.11.0"
}
]
}
8 changes: 8 additions & 0 deletions plugins/node-team/.claude-plugin/plugin.json
@@ -0,0 +1,8 @@
{
"name": "node-team",
"description": "OpenShift Node team assistant for development, deployment, debugging, and workflow tasks across kubelet, MCO, CRI-O, crun, conmonrs, Kueue operator, Jira, Red Hat KB/support cases, Prometheus, and platform docs.",
"version": "0.12.0",
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check marketplace.json for node-team plugin version

# Search for node-team entry in marketplace and extract version
rg -A 5 -B 2 '"node-team"' .claude-plugin/marketplace.json

Repository: openshift-eng/ai-helpers

Length of output: 296


Synchronize version with marketplace.json.

The plugin declares version 0.12.0, but marketplace.json lists 0.11.0. This mismatch will cause inconsistent version display and registration issues. Update marketplace.json to version 0.12.0.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/node-team/.claude-plugin/plugin.json` at line 4, The plugin.json
declares "version": "0.12.0" while marketplace.json still lists "0.11.0"; update
the "version" field in marketplace.json to "0.12.0" so both files match. Locate
the "version" key in marketplace.json (the counterpart to plugin.json's version)
and change its value to "0.12.0", then run any validation or lint steps you have
to ensure the manifest stays valid.

"author": {
"name": "github.com/openshift-eng"
}
}
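
The mismatch flagged in the review comment above could be caught mechanically. The following is a minimal text-based sketch, not part of the PR: `plugin_version_in` and `versions_match` are hypothetical helpers, and they assume each manifest keeps `"name": "<plugin>"` on its own line with the matching `"version"` line within the next five lines, as in the diff shown here. A JSON-aware tool would be more robust.

```bash
# Hypothetical helper: print the version that follows a plugin's "name" entry.
# $1 = plugin name, $2 = JSON file. Relies on line layout, not on JSON parsing.
plugin_version_in() {
  grep -A 5 "\"name\": \"$1\"" "$2" \
    | sed -n 's/.*"version": "\([^"]*\)".*/\1/p' \
    | head -n 1
}

# Succeed only when both files report the same, non-empty version.
# $1 = plugin name, $2 and $3 = the two JSON files to compare.
versions_match() {
  a="$(plugin_version_in "$1" "$2")"
  b="$(plugin_version_in "$1" "$3")"
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Run from the repository root, `versions_match node-team .claude-plugin/marketplace.json plugins/node-team/.claude-plugin/plugin.json` would exit non-zero for the versions in this diff.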
7 changes: 7 additions & 0 deletions plugins/node-team/config/team-roster-core.example.json
@@ -0,0 +1,7 @@
{
"description": "Node Core team roster — maps Jira display names to GitHub handles",
"members": {
"Jira Display Name": "github-handle",
"Another Person": "their-github-handle"
}
}
7 changes: 7 additions & 0 deletions plugins/node-team/config/team-roster-dra.example.json
@@ -0,0 +1,7 @@
{
"description": "Node Devices (DRA) team roster — maps Jira display names to GitHub handles",
"members": {
"Jira Display Name": "github-handle",
"Another Person": "their-github-handle"
}
}
10 changes: 10 additions & 0 deletions plugins/node-team/skills/node/SKILL.md
@@ -0,0 +1,10 @@
---
name: node-team
description: "OpenShift Node team assistant. Covers kubelet, MCO, CRI-O, crun, conmonrs, Kueue operator, Jira (OCPNODE/OCPBUGS), Red Hat KB/support cases, Prometheus, and K8s/OCP docs. Triggers on any OpenShift node-layer development, deployment, debugging, or workflow task."
allowed-tools: Bash(curl:*)
---

## How to use this skill

1. Read [references/INDEX.md](references/INDEX.md) to route to the relevant reference
2. Read the reference, then act on it — run scripts, fetch data, present results
34 changes: 34 additions & 0 deletions plugins/node-team/skills/node/references/INDEX.md
@@ -0,0 +1,34 @@
# Node Skill Reference Index

Reference files contain only tribal knowledge and non-obvious nuances. For discoverable details (build commands, repo layout, test targets), browse the source code directly.

Root: `./`

## Setup

|SETUP.md

## Development

|development:{kubelet-dev.md,mco-dev.md,crio-dev.md,crun-conmon.md,kueue-operator-dev.md,worktrees.md}

## Deployment

|deployment:{debug-binary.md}
|deployment/debug-binary:{crio.md,cross-compile.md,deploy.md,rollback.md,ssh-bastion.md}

## Jira

|jira.md

## Red Hat Support

|support.md

## Platform Documentation

|platform-docs.md

## Prometheus

|prometheus.md
94 changes: 94 additions & 0 deletions plugins/node-team/skills/node/references/SETUP.md
@@ -0,0 +1,94 @@
# Standard Repo Setup

All node team repos follow the same clone + worktree workflow for feature work.

## Clone

Clone into the current working directory (if not already present):

```bash
git clone <repo-url>
cd <repo-name>
```

## Worktree for Feature Work

Never work directly on the default branch. Create a worktree:

```bash
git worktree add .worktrees/<name> -b wt/<name>
cd .worktrees/<name>
```

Deduce `<name>` from the task description (e.g., "reflink feature" -> `reflink`, "fix cgroup leak" -> `fix-cgroup-leak`, "OCPNODE-1234" -> `ocpnode-1234`).
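
One way to make the naming mechanical is a small slug function. This is a sketch only; `worktree_name` is a hypothetical helper, not something the plugin ships. It lowercases the description and collapses anything that is not a letter or digit into a single hyphen.

```bash
# Hypothetical helper: turn a task description into a worktree-friendly name.
# Lowercases the input, replaces runs of non-alphanumerics with "-",
# and trims a leading or trailing hyphen.
worktree_name() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//'
}
```

The function does not drop filler words, so "reflink feature" becomes `reflink-feature` rather than `reflink`; shortening the name further remains a judgment call.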

## Worktree for PR Work

To review or continue work on an existing PR:

```bash
git fetch origin pull/<number>/head:pr-<number>
git worktree add .worktrees/pr-<number> pr-<number>
cd .worktrees/pr-<number>
```

If resuming work on a PR you've already fetched, check `git worktree list` first — the worktree may already exist.
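
That check can be scripted against the machine-readable listing. A minimal sketch, assuming the `.worktrees/pr-<number>` layout used above; `has_pr_worktree` is a hypothetical helper that reads `git worktree list --porcelain` output on stdin.

```bash
# Hypothetical helper: succeed if the porcelain worktree listing on stdin
# contains a worktree whose path ends in .worktrees/pr-<number>.
# $1 = PR number.
has_pr_worktree() {
  grep -q "^worktree .*/\.worktrees/pr-$1\$"
}
```

For example, `git worktree list --porcelain | has_pr_worktree 1234 && cd .worktrees/pr-1234` would reuse an existing worktree instead of fetching again.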

## Worktree for Jira Ticket Work

To investigate or fix a Jira issue:

1. Fetch the issue details to determine the component (see [jira.md](jira.md) for auth setup):
```bash
curl -s -u "$JIRA_USER:$JIRA_API_TOKEN" "https://redhat.atlassian.net/rest/api/3/issue/OCPNODE-1234?fields=summary,components"
```
2. Map the component to a repo (see Repo URLs below), confirm with the user, and clone if needed.
3. Create a worktree named after the ticket:
```bash
git worktree add .worktrees/ocpnode-1234 -b wt/ocpnode-1234
cd .worktrees/ocpnode-1234
```

## Component to Repo Mapping

| Jira Label / Component | Repo |
|-------------------------|------|
| `crio` | cri-o |
| `kubelet` | kubernetes |
| `mco` | machine-config-operator |
| `crun` | crun |
| `conmonrs` | conmon-rs |
| `kueue` | kueue-operator |
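
The table above translates directly into a lookup. A sketch, where `repo_for_component` is a hypothetical helper name and the keys are exactly the labels listed in the table:

```bash
# Hypothetical helper: map a Jira label/component to its repo name,
# using only the pairs from the table above.
repo_for_component() {
  case "$1" in
    crio)     echo "cri-o" ;;
    kubelet)  echo "kubernetes" ;;
    mco)      echo "machine-config-operator" ;;
    crun)     echo "crun" ;;
    conmonrs) echo "conmon-rs" ;;
    kueue)    echo "kueue-operator" ;;
    *)        echo "unknown component: $1" >&2; return 1 ;;
  esac
}
```

An unknown component returns a non-zero status, which is a natural point to stop and confirm the repo with the user.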

## Enable Node Assistant in the Worktree

After creating a worktree, install the skill locally so it's available when you launch Claude there:

```bash
cd .worktrees/<name>
claude plugin install node-assistant@node-skills --scope local
```

## Repo URLs

| Component | Upstream | Downstream (OpenShift) |
|-----------|----------|------------------------|
| CRI-O | `https://github.com/cri-o/cri-o.git` | `https://github.com/openshift/cri-o.git` |
| Kubelet | `https://github.com/kubernetes/kubernetes.git` | `https://github.com/openshift/kubernetes.git` |
| MCO | — | `https://github.com/openshift/machine-config-operator.git` |
| crun | `https://github.com/containers/crun.git` | — |
| conmon-rs | `https://github.com/containers/conmon-rs.git` | — |
| Kueue Operator | `https://github.com/kubernetes-sigs/kueue.git` | `https://github.com/openshift/kueue-operator.git` |

For upstream features and bug fixes, clone upstream. For OpenShift-specific work, clone downstream.
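
The upstream/downstream choice can be sketched as a lookup over the URL table. `repo_url` is a hypothetical helper, and the short keys (`crio`, `kubelet`, `mco`, `crun`, `conmonrs`, `kueue`) are an assumption borrowed from the component labels earlier in this file. Combinations the table marks with a dash return an error instead of guessing.

```bash
# Hypothetical helper: print the clone URL for a component.
# $1 = component key, $2 = "upstream" or "downstream".
repo_url() {
  case "$1:$2" in
    crio:upstream)      echo "https://github.com/cri-o/cri-o.git" ;;
    crio:downstream)    echo "https://github.com/openshift/cri-o.git" ;;
    kubelet:upstream)   echo "https://github.com/kubernetes/kubernetes.git" ;;
    kubelet:downstream) echo "https://github.com/openshift/kubernetes.git" ;;
    mco:downstream)     echo "https://github.com/openshift/machine-config-operator.git" ;;
    crun:upstream)      echo "https://github.com/containers/crun.git" ;;
    conmonrs:upstream)  echo "https://github.com/containers/conmon-rs.git" ;;
    kueue:upstream)     echo "https://github.com/kubernetes-sigs/kueue.git" ;;
    kueue:downstream)   echo "https://github.com/openshift/kueue-operator.git" ;;
    *)                  echo "no $2 repo listed for $1" >&2; return 1 ;;
  esac
}
```

With this in place, `git clone "$(repo_url crio downstream)"` would clone the OpenShift fork of CRI-O.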

## Cleanup

```bash
# List worktrees
git worktree list

# Remove when done
git worktree remove .worktrees/<name>
git branch -d wt/<name>
```
113 changes: 113 additions & 0 deletions plugins/node-team/skills/node/references/deployment/debug-binary.md
@@ -0,0 +1,113 @@
# Deploying Debug Binaries to RHCOS Nodes

Deploy a custom-built binary (CRI-O, crun, kubelet, etc.) to an OpenShift worker node running RHCOS for debugging or POC testing.

## The Challenge

RHCOS (Red Hat Enterprise Linux CoreOS) has an **immutable `/usr` filesystem**. You cannot overwrite `/usr/bin/crio` or any other system binary directly. There is no package manager (`dnf`/`yum`), no compiler toolchain, and no development headers on the node.

## The Solution

**Bind-mount** your custom binary over the original. The bind mount shadows the original file without modifying the rootfs. The original binary remains intact underneath and is instantly recoverable by unmounting.

```
mount --bind /home/core/crio /usr/bin/crio
```
Comment on lines +13 to +15
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Specify languages for fenced blocks in the overview doc.

Line 13 and Line 97 are unlabeled fenced blocks. Add `bash` for the bind-mount command and `text` for the ASCII diagram to resolve MD040.

Suggested patch

    -```
    +```bash
     mount --bind /home/core/crio /usr/bin/crio
     ```

    -```
    +```text
     Local Machine                          RHCOS Worker Node
     ─────────────                          ─────────────────

     1. Cross-compile in Docker
        (linux/amd64, matching libs)
    @@
        oc adm uncordon
                                            7. Verify: --version, node Ready
     ```

Also applies to: 97-113

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 13-13: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/node-team/skills/node/references/deployment/debug-binary.md` around
lines 13 - 15, The two unlabeled fenced code blocks in debug-binary.md should be
annotated with languages to satisfy MD040: add the `bash` language to the fence
that opens the bind-mount command block containing
"mount --bind /home/core/crio /usr/bin/crio", and add the `text` language to the
fence that opens the ASCII diagram block whose header row reads "Local Machine"
and "RHCOS Worker Node" (the multi-line diagram spanning the steps), so the
first block is marked as bash and the diagram as text.


For cluster-wide deployment that survives reboots, use [layered images](layered-image.md) instead.

## Four Phases

### Phase 1: Build (Cross-Compile)

Cross-compile the binary for `linux/amd64` using a Docker container that matches the target OS libraries. The binary must be dynamically linked against compatible library versions (same sonames as RHCOS).

See [debug-binary/cross-compile.md](debug-binary/cross-compile.md)

### Phase 2: Access (SSH Bastion)

Reach the worker node via an SSH bastion pod. RHCOS nodes are not directly accessible from outside the cluster. You need to discover the SSH key used at cluster install time and deploy a bastion DaemonSet.

See [debug-binary/ssh-bastion.md](debug-binary/ssh-bastion.md)

### Phase 3: Deploy (Bind Mount)

Transfer the binary to the node, verify that it works, cordon and drain the node, set the SELinux context, bind-mount the binary over the original, and restart the service. This phase has the most gotchas, mainly around SELinux, systemd, and service dependencies.

See [debug-binary/deploy.md](debug-binary/deploy.md)
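
The ordering matters more than the individual commands, so it can help to print the sequence before running anything. A dry-run sketch only: `deploy_plan` is a hypothetical helper, it assumes the binary is already at `/home/core/<binary>`, and it assumes the systemd unit shares the binary's name (true for `crio`; check the binary-specific reference for others). The `oc` lines run from the local machine and the `sudo` lines run on the node over SSH.

```bash
# Hypothetical dry-run helper: print the Phase 3 commands in order
# without executing them. $1 = node name, $2 = binary name.
# Assumes the systemd unit is named after the binary (e.g. crio).
deploy_plan() {
  node="$1"
  bin="$2"
  cat <<EOF
oc adm cordon $node
oc adm drain $node --ignore-daemonsets --delete-emptydir-data
sudo chcon --reference=/usr/bin/$bin /home/core/$bin
sudo mount --bind /home/core/$bin /usr/bin/$bin
sudo systemctl restart $bin
oc adm uncordon $node
EOF
}
```

Note that `chcon --reference` comes before the bind mount: once the mount is in place, `/usr/bin/<binary>` resolves to the custom binary and no longer carries the original label to copy from.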

### Phase 4: Rollback (Unmount)

Unmount the bind mount, remove any config drop-ins, and restart the service. The original binary is untouched underneath.

See [debug-binary/rollback.md](debug-binary/rollback.md)

## Binary-Specific References

Each binary has its own reference with build dependencies, systemd units, and deployment details:

- **CRI-O**: [debug-binary/crio.md](debug-binary/crio.md) -- build tags, library deps, kubelet restart, config drop-ins

## Safety Rules

These are non-negotiable. Skipping any of these can take a node out of the cluster.

1. **Verify SSH bastion connectivity first.** Before building or deploying anything, confirm you can reach the target worker node via the bastion. Run `uname -a` over SSH. If you cannot reach the node, nothing else matters.

2. **Always preflight-test the binary** before deploying. SCP it to `/home/core/`, run `ldd` to verify libraries resolve, and run `<binary> --version` to confirm it loads. If either fails, do not proceed.

3. **Always cordon and drain first.** Never restart a container runtime on a node with running workloads.

4. **Always test on ONE worker node.** Keep at least one healthy worker to maintain cluster capacity.

5. **Always set the SELinux context** before bind-mounting:
```bash
sudo chcon --reference=/usr/bin/<original> /home/core/<binary>
```
Without the correct context (e.g., `container_runtime_exec_t` for CRI-O), systemd will refuse to execute the binary and the service will fail with `Permission denied`.

6. **Know how to rollback before you deploy.** The rollback is: unmount, restart service. Read [debug-binary/rollback.md](debug-binary/rollback.md) before starting.
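
The library half of the preflight test in rule 2 can be sketched as a filter over `ldd` output. `ldd_resolves` is a hypothetical helper; it relies on `ldd` printing `not found` for any shared library it cannot resolve.

```bash
# Hypothetical helper: read `ldd` output on stdin and fail if any
# shared library is reported as "not found".
ldd_resolves() {
  ! grep -q 'not found'
}
```

On the node, `ldd /home/core/crio | ldd_resolves && /home/core/crio --version` would cover both checks; if either command fails, stop before cordoning anything.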

## Quick Reference

| Step | Command |
|------|---------|
| Check node OS | `oc get nodes -o wide` |
| Check current binary version | SSH in, `<binary> --version` |
| Cordon node | `oc adm cordon <node>` |
| Drain node | `oc adm drain <node> --ignore-daemonsets --delete-emptydir-data` |
| Uncordon node | `oc adm uncordon <node>` |
| Verify node health | `oc get node <node>` (wait for Ready) |
| Check bind mounts | `ssh core@<node> "mount \| grep /usr/bin"` |

## Deciding: Bind Mount vs Layered Image

| | Bind Mount | Layered Image |
|---|---|---|
| Scope | Single node | All nodes in a pool |
| Survives reboot | No (unless systemd drop-in) | Yes |
| Speed | Minutes | 30-60 min (MCO rollout) |
| Use case | Quick debug/test | Cluster-wide validation, customer simulation |
| Rollback | `umount` | Delete MachineConfig |

Use bind mounts for quick single-node testing. Use [layered images](layered-image.md) when you need the binary on all nodes or need it to persist across reboots.

## Workflow Diagram

```
Local Machine                          RHCOS Worker Node
─────────────                          ─────────────────
1. Cross-compile in Docker
   (linux/amd64, matching libs)
2. SCP via bastion ─────────────►      /home/core/<binary>
                                       3. ldd, --version (preflight)
                                       4. chcon (SELinux)
   oc adm cordon/drain
                                       5. mount --bind
                                       6. systemctl restart
   oc adm uncordon
                                       7. Verify: --version, node Ready
```