Utilities for OpenShift cluster operations including package management and cluster validation.
This directory contains multiple helper tools for various OpenShift cluster operations:
- Resource Agent Patching: Scripts and playbooks (recommended usage) for installing RPM packages on cluster nodes using rpm-ostree's override functionality
- Fencing Validation: Tools for validating two-node cluster fencing configuration and health
- Custom OCP 5.x payload (optional): resource-agents-build/custom-payload.sh to build a custom RHCOS layer from the resource-agents RPM and publish a custom release image with oc adm release new
- Ansible playbooks. See below for specific prerequisites for each
- oc CLI tool (logged into OpenShift cluster)
- jq for JSON processing
- SSH access to cluster nodes
- oc CLI tool (logged into OpenShift cluster)
- For SSH transport: passwordless sudo access to cluster nodes
- Two-node cluster with fencing topology
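The tool prerequisites listed above can be checked up front before running any helper. The following is a minimal sketch, not part of the shipped scripts; the function names `missing_tools` and `oc_logged_in` are hypothetical.

```shell
# Print every required tool that is NOT found on PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed only when `oc` is installed and has a valid cluster login.
oc_logged_in() {
    oc whoami >/dev/null 2>&1
}
```

For example, `missing_tools oc jq ssh` prints nothing when all three tools are installed, and `oc_logged_in || echo "run oc login first"` flags a missing session.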
Automates etcd cluster recovery by configuring CIB (Cluster Information Base) attributes to force a new etcd cluster formation. This is useful when etcd quorum is lost and manual intervention is required to restore cluster functionality.
Features:
- Automated etcd snapshot creation before recovery operations
- CIB attribute management for force-new-cluster operations
- Leader/follower node detection and verification
- Etcd member list management
- Automatic cleanup and resource recovery
- STONITH management during operations
Usage:
# From helpers/ directory
ansible-playbook -i ../deploy/openshift-clusters/inventory.ini force-new-cluster.yml
Prerequisites:
- Inventory file with exactly 2 nodes in cluster_vms group
- SSH access to cluster VMs with sudo privileges
- Running Pacemaker cluster with etcd resources
What it does:
- Validates cluster has exactly 2 nodes
- Disables STONITH temporarily for safety
- Takes etcd snapshots on both nodes (if etcd is not running)
- Clears existing CIB attributes (learner_node, standalone_node, force_new_cluster)
- Sets force_new_cluster attribute on the leader node (first node in cluster_vms)
- Verifies CIB attributes on both nodes
- Removes follower from etcd member list
- Performs pcs resource cleanup on both nodes
- Re-enables STONITH after completion
Attribution: Original shell script by Carlo Lobrano
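The two-node requirement can be checked against an inventory before the playbook runs. The sketch below assumes an INI-style inventory and is not how the playbook itself performs the validation; `count_group_hosts` and `require_two_nodes` are hypothetical helper names.

```shell
# Count host entries listed under [group] in an INI-style Ansible inventory.
# Blank lines and comment lines (# or ;) are ignored.
count_group_hosts() {
    local inventory="$1" group="$2"
    awk -v grp="[$group]" '
        /^[ \t]*$/    { next }                 # skip blank lines
        /^[ \t]*[#;]/ { next }                 # skip comments
        /^[ \t]*\[/ {                          # section header
            sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")
            in_group = ($0 == grp)
            next
        }
        in_group { count++ }                   # host line inside the group
        END { print count + 0 }
    ' "$inventory"
}

# Fail unless the cluster_vms group has exactly two hosts.
require_two_nodes() {
    local n
    n=$(count_group_hosts "$1" cluster_vms)
    if [ "$n" -ne 2 ]; then
        echo "expected exactly 2 hosts in cluster_vms, found $n" >&2
        return 1
    fi
}
```

Sections such as `[cluster_vms:vars]` are treated as separate headers, so variable lines are not counted as hosts.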
Collects etcd-related logs from cluster VMs.
Usage:
From deploy/ directory (recommended):
make get-tnf-logs
Using Ansible directly:
# From helpers/ directory
ansible-playbook -i ../deploy/openshift-clusters/inventory.ini collect-tnf-logs.yml
Prerequisites:
- Inventory file with cluster_vms group
- SSH access to cluster VMs via ProxyJump
- oc CLI tool on cluster nodes
Validates fencing configuration and health for two-node OpenShift clusters with STONITH-enabled Pacemaker.
Features:
- Non-disruptive validation (default): Checks STONITH presence/enabled status, node health, etcd quorum, and daemon status
- Disruptive testing: Performs actual fencing of both nodes to verify recovery (optional with --disruptive)
- Multiple transport methods: Auto-detection, SSH, or oc debug
- IPv4/IPv6 support with automatic node discovery
Usage:
From outside the hypervisor (uses oc debug transport by default):
# Non-disruptive validation (recommended)
./fencing_validator.sh
# With custom hosts
./fencing_validator.sh --hosts "10.0.0.10,10.0.0.11"
From inside the hypervisor via Ansible (requires hypervisor deployed via make deploy):
# Copy script to hypervisor and execute remotely
ansible all -i deploy/openshift-clusters/inventory.ini -m copy -a "src=helpers/fencing_validator.sh dest=~/fencing_validator.sh mode=0755"
ansible all -i deploy/openshift-clusters/inventory.ini -m shell -a "./fencing_validator.sh"
Disruptive testing options:
# Disruptive testing (NOTE: Not yet supported - under development)
./fencing_validator.sh --disruptive
# Dry run to see what would be tested
./fencing_validator.sh --disruptive --dry-run
Note: Disruptive testing functionality is not yet fully supported and should not be used in production environments.
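The --hosts option takes a comma-separated list that may mix IPv4 and IPv6 addresses. As an illustration only (this is not the validator's actual parsing code, and both function names are hypothetical), such a list could be split and classified like this:

```shell
# Split a comma-separated host list into one host per line,
# trimming surrounding whitespace and dropping empty entries.
split_hosts() {
    printf '%s\n' "$1" | tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Succeed when the argument looks like an IPv6 literal (contains a colon).
is_ipv6() {
    case "$1" in
        *:*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

A caller could loop over `split_hosts "$hosts"` and wrap IPv6 literals in brackets where a tool expects host:port syntax.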
The build-and-patch-resource-agents.yml playbook automates the entire workflow:
- Builds the resource-agents RPM on the hypervisor
- Copies the RPM back to your laptop
- Automatically calls apply-rpm-patch.yml to patch cluster nodes
# From the deploy/ directory
# Simplest, no customization. Uses resource-agents repo, main branch, auto sets next version
make patch-nodes
# From the helpers/ directory
# Use defaults (ClusterLabs repo, main branch, version 4.11)
ansible-playbook -i ../deploy/openshift-clusters/inventory.ini \
build-and-patch-resource-agents.yml
# Specify custom version
ansible-playbook -i ../deploy/openshift-clusters/inventory.ini \
build-and-patch-resource-agents.yml \
-e rpm_version=4.12
# Use custom repository and branch
ansible-playbook -i ../deploy/openshift-clusters/inventory.ini \
build-and-patch-resource-agents.yml \
-e repo_url=https://github.com/myorg/resource-agents \
-e rpm_branch=my-feature-branch \
-e rpm_version=5.0
Prerequisites:
- Inventory file at ../deploy/openshift-clusters/inventory.ini with both metal_machine and cluster_vms groups
- SSH access to hypervisor (metal_machine)
- ProxyJump SSH configuration for cluster VMs (automatically configured by setup.yml)
What it does:
- Validates inventory contains both [metal_machine] and [cluster_vms] groups
- Installs build dependencies on hypervisor
- Clones resource-agents repository on hypervisor
- Builds RPM using make rpm VERSION=<version>
- Fetches RPM back to helpers/ directory
- Automatically patches cluster_vms group with the new RPM
- Reboots cluster nodes one at a time with etcd health verification
Variables:
- repo_url: Git repository URL (default: https://github.com/ClusterLabs/resource-agents)
- rpm_branch: Git branch to check out (default: main)
- rpm_version: Version string for the RPM (default: 4.11)
If the RPM to be installed is already available to you, this Ansible playbook provides automated installation and rebooting with proper orchestration.
Use with the automatically-generated inventory from the openshift-clusters deployment:
# Target the cluster_vms group
ansible-playbook -i /path/to/inventory.ini \
apply-rpm-patch.yml \
-l cluster_vms \
-e rpm_full_path=/absolute/path/to/package.rpm
Prerequisites:
- Inventory with cluster_vms group (created automatically by update-cluster-inventory.yml task)
- ProxyJump SSH configuration through hypervisor (automatically configured in inventory)
- Absolute path to RPM file on your laptop
Process:
- Validates RPM file exists on localhost
- Copies RPM to cluster VMs via ProxyJump
- Installs using rpm-ostree override with privilege escalation
- Reboots nodes one at a time
- Verifies etcd health after reboot
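The "one node at a time, then verify etcd" pattern in the process above can be sketched in shell. This is a simplified illustration, not the playbook's implementation: the reboot and health-check commands are passed in by name, and `MAX_ATTEMPTS` and `RETRY_DELAY` are hypothetical tuning variables.

```shell
# Reboot nodes one at a time; stop at the first node whose health check
# does not pass within the retry budget.
#   $1 = reboot command, $2 = health-check command, remaining args = nodes
rolling_reboot() {
    local reboot_cmd="$1" health_cmd="$2" node attempt
    shift 2
    for node in "$@"; do
        "$reboot_cmd" "$node" || return 1
        attempt=0
        # Poll until the health check passes or the retry budget runs out.
        until "$health_cmd" "$node"; do
            attempt=$((attempt + 1))
            if [ "$attempt" -ge "${MAX_ATTEMPTS:-30}" ]; then
                echo "etcd not healthy after rebooting $node; aborting" >&2
                return 1
            fi
            sleep "${RETRY_DELAY:-10}"
        done
    done
}
```

Gating each reboot on the previous node's health keeps at least one etcd member serving at any time, which matters on a two-node cluster.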
Use with a custom inventory directly on the hypervisor:
# On the hypervisor, create a simple inventory file first
# See inventory_ocp_hosts.sample for reference
ansible-playbook -i inventory_ocp_hosts \
apply-rpm-patch.yml \
-e rpm_full_path=/path/to/package.rpm
Prerequisites:
- Copy RPM file and apply-rpm-patch.yml playbook to hypervisor
- Create inventory file listing cluster VM IPs (see inventory_ocp_hosts.sample)
Process:
- Validates RPM file existence
- Copies RPM to all nodes
- Installs using rpm-ostree override with privilege escalation
- Reboots nodes one at a time
- Verifies etcd health after reboot
If you cannot or prefer not to use the Ansible playbooks above, you can use this shell script. It must be invoked from within the hypervisor, because it requires direct SSH access to the nodes and assumes the "core" user.
./apply-rpm-patch.sh /path/to/package.rpm
Process:
- Validates required tools and RPM file
- Discovers all node IPs via OpenShift API
- Copies RPM to each node using SCP
- Installs package with rpm-ostree override replace
- Provides manual reboot commands
Note: The shell script does not handle reboots automatically. You must manually reboot nodes after installation. Follow the instructions provided at the end of the script execution.
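To make the copy-and-install steps concrete, the sketch below prints the per-node commands without running them. It is an approximation, not the contents of apply-rpm-patch.sh: the function name `print_patch_plan` and the staging path under /var/home/core are assumptions.

```shell
# Print (do not run) the copy and install commands for each node IP.
#   $1 = local RPM path, remaining args = node IPs
print_patch_plan() {
    local rpm="$1" ip remote
    shift
    remote="/var/home/core/$(basename "$rpm")"   # assumed staging path
    for ip in "$@"; do
        printf 'scp %s core@%s:%s\n' "$rpm" "$ip" "$remote"
        printf 'ssh core@%s sudo rpm-ostree override replace %s\n' "$ip" "$remote"
    done
}
```

The node IPs could be taken from `oc get nodes -o wide`. The new package only becomes active after each node is rebooted.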
The resource-agents-build/ directory contains Dockerfiles and a script for validating that resource-agents compiles correctly on CentOS Stream 9 and 10, without needing a hypervisor or cluster. This is useful for quickly verifying a branch builds before running the full build-and-patch-resource-agents.yml playbook.
Usage:
cd helpers/resource-agents-build
# Run both builds — prompts for repo and ref, press Enter to use defaults
./local-build-test.sh
# Skip prompts by providing values via flags
./local-build-test.sh --repo https://github.com/myorg/resource-agents --ref my-feature-branch
# Build individually with podman
podman build -f Dockerfile.stream9 -t localhost/tnf-resource-agents-build:stream9 .
podman build -f Dockerfile.stream10 -t localhost/tnf-resource-agents-build:stream10 .
Script options:
| Option | Description |
|---|---|
| --repo URL | Git repository URL (default: https://github.com/ClusterLabs/resource-agents) |
| --ref REF | Git branch, tag, or commit (default: main) |
| -h, --help | Show help |
When no flags are provided, the script prompts for each value. Press Enter to use the default.
Extracting the built RPM from the container (Stream 9 only — Stream 10 skips make rpm):
# Build from a specific branch
./local-build-test.sh --ref my-feature-branch
# Copy the RPM out of the Stream 9 image
podman create --name ra-build localhost/tnf-resource-agents-build:stream9
podman cp ra-build:/tmp/resource-agents.rpm ./resource-agents.rpm
podman rm ra-build
# Then patch your cluster nodes with it
ansible-playbook -i ../../deploy/openshift-clusters/inventory.ini \
../apply-rpm-patch.yml \
-l cluster_vms \
-e rpm_full_path=$(pwd)/resource-agents.rpm
This is useful when you want to validate the RPM locally before patching.
Stream 10 limitation: libqb-devel is not yet available in EPEL 10. The Dockerfile builds libqb from source for configure/make validation, but skips make rpm since rpmbuild's BuildRequires: libqb-devel cannot be satisfied without the actual RPM package.
custom-payload.sh ties the containerized build to an OpenShift 5.x release: it resolves a nightly payload, prints the base rhel-coreos / rhel-coreos-10 image from oc adm release info, and generates Dockerfile.custompayload by combining the contents of Dockerfile.stream9 or Dockerfile.stream10 (depending on --base-os) with an rpm-ostree override snippet. The script does not change Dockerfile.stream9 or Dockerfile.stream10 on disk. Then, unless you use dry-run-style flags, it runs podman to build and push the custom OS image and runs oc adm release new to publish a custom release payload image that points that OS layer at the correct component name (for example rhel-coreos-10=...).
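The "combine without modifying" step can be pictured as a plain concatenation into a new file. This is a sketch of the idea only: `combine_dockerfile` is a hypothetical name, and the override snippet is supplied as a file because its real contents are generated by the script and are not reproduced here.

```shell
# Write a combined Dockerfile to $3 from base Dockerfile $1 plus snippet $2.
# The base file is only read, so it stays unchanged on disk.
combine_dockerfile() {
    local base="$1" snippet="$2" out="$3"
    { cat "$base"; printf '\n'; cat "$snippet"; } > "$out"
}
```

For example, `combine_dockerfile Dockerfile.stream10 override.snippet Dockerfile.custompayload` leaves Dockerfile.stream10 untouched, where override.snippet is a placeholder file name.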
When to use it: after you are happy with a resource-agents RPM from local-build-test.sh (or an equivalent build), and you want a full custom payload image to install or test on a 5.0 line cluster—without going through the hypervisor-based build-and-patch-resource-agents.yml path for node RPM overrides alone.
Requirements: curl, oc, podman, and jq or python3 (for --auto-release). Pull secret handling works the same way as other oc / podman registry operations (-a / PULL_SECRET_PATH, or the default ~/.docker/config.json / ~/.config/containers/auth.json as documented in the script's --help).
Usage (from helpers/resource-agents-build/):
cd helpers/resource-agents-build
# Use the newest Accepted 5.0.0-0.nightly tag from the release API
./custom-payload.sh --auto-release
# Or pin a full release pullspec
./custom-payload.sh --release registry.ci.openshift.org/ocp/release-5:5.0.0-0.nightly-...
# Explicit pull secret (optional if a default auth file exists)
./custom-payload.sh --auto-release -a /path/to/pull-secret
# Only print Dockerfile and commands; do not run oc / skip real builds (see --help)
./custom-payload.sh --auto-release --print-only
./custom-payload.sh --auto-release --no-build
Run ./custom-payload.sh --help for flags such as --base-os (rhel-coreos vs rhel-coreos-10), --to-os-image, and --to-payload-image (custom registry targets and tags).
Pull secret and the same registry twice: if you use different tokens on the same registry host—for example one credential for the general quay.io pull path and another for a specific org such as quay.io/rh-edge-enablement—put them in separate auths entries, for example one key quay.io and another quay.io/rh-edge-enablement. A single quay.io entry may not match both paths reliably; splitting them keeps pulls and pushes unambiguous for oc and podman.
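Why separate entries help: the containers-auth.json format used by podman documents that the most specific matching auths key is selected for a given repository. The sketch below imitates that longest-prefix rule for illustration. It is not code from podman or oc, and whether oc resolves keys in exactly the same way is an assumption. `best_auth_key` is a hypothetical name.

```shell
# Print the most specific auths key that matches a repository path.
#   $1 = repository without tag or digest, remaining args = candidate keys
# A key matches when it equals the repository or is a prefix ending at "/".
best_auth_key() {
    local repo="$1" key best=""
    shift
    for key in "$@"; do
        case "$repo" in
            "$key"|"$key"/*)
                # Keep the longest (most specific) matching key.
                if [ "${#key}" -gt "${#best}" ]; then
                    best="$key"
                fi
                ;;
        esac
    done
    if [ -n "$best" ]; then
        printf '%s\n' "$best"
    else
        return 1
    fi
}
```

With keys quay.io and quay.io/rh-edge-enablement, a repository under quay.io/rh-edge-enablement resolves to the org-scoped key, while any other quay.io repository falls back to the plain quay.io entry.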
Example pull secret:
{
"auths": {
....
"quay.io": {
"auth": "<secret>"
},
"quay.io/eggfoobar": {
"auth": "<secret_same>"
},
"quay.io/rh-edge-enablement": {
"auth": "<secret_same>"
},
....
}
}
- Both tools use rpm-ostree override replace, which is appropriate for updating existing packages
- The Ansible playbooks handle rebooting automatically with proper orchestration; the shell script requires manual intervention
- Plan reboots carefully to maintain cluster availability
- Monitor cluster health during the patching process