From 6ea44c35c11e5b0302127519388718b6bde6800a Mon Sep 17 00:00:00 2001 From: Michael Burke Date: Mon, 9 Feb 2026 11:47:57 -0500 Subject: [PATCH] [enterprise-4.20] OSDOCS 16930 CQA2.0 of NODES-2: Node Management and Maintenance Part II --- .../mco-update-boot-images-configuring.adoc | 148 +++++++++++++----- modules/mco-update-boot-images-disable.adoc | 31 ++-- modules/nodes-nodes-viewing-listing-pods.adoc | 3 +- modules/nodes-nodes-viewing-listing.adoc | 51 +++--- modules/nodes-nodes-viewing-memory.adoc | 5 +- ...des-nodes-working-deleting-bare-metal.adoc | 10 +- modules/nodes-nodes-working-deleting.adoc | 9 +- modules/nodes-nodes-working-evacuating.adoc | 48 +++--- modules/nodes-nodes-working-marking.adoc | 10 +- modules/nodes-nodes-working-updating.adoc | 5 +- .../sno-clusters-reboot-without-drain.adoc | 3 + ...ere-virtual-hardware-on-compute-nodes.adoc | 2 +- nodes/nodes/nodes-nodes-viewing.adoc | 23 ++- nodes/nodes/nodes-nodes-working.adoc | 16 +- ...-remediating-fencing-maintaining-rhwa.adoc | 5 +- nodes/nodes/nodes-update-boot-images.adoc | 18 ++- snippets/mco-update-boot-images-abstract.adoc | 7 + ...-hardware-on-nodes-running-on-vsphere.adoc | 2 +- 18 files changed, 243 insertions(+), 153 deletions(-) create mode 100644 snippets/mco-update-boot-images-abstract.adoc diff --git a/modules/mco-update-boot-images-configuring.adoc b/modules/mco-update-boot-images-configuring.adoc index 3827eaffb3f7..bfa7d3ee4d31 100644 --- a/modules/mco-update-boot-images-configuring.adoc +++ b/modules/mco-update-boot-images-configuring.adoc @@ -7,6 +7,9 @@ [id="mco-update-boot-images-configuring_{context}"] = Enabling boot image management +[role="_abstract"] +include::snippets/mco-update-boot-images-abstract.adoc[] + By default, for {gcp-first} and {aws-first} clusters, the Machine Config Operator (MCO) updates the boot image in the machine sets in your cluster whenever you update your cluster. If you disabled the boot image management feature, so that the boot images are not updated, you can re-enable the feature by editing the `MachineConfiguration` object. @@ -21,30 +24,6 @@ Enabling the feature updates the boot image to the current {product-title} versi .Prerequisites * For {vmw-short}, enable the `TechPreviewNoUpgrade` feature set on the cluster. For more information, see "Enabling features using feature gates". -+ -[NOTE] -==== -Enabling the `TechPreviewNoUpgrade` feature set cannot be undone and prevents minor version updates. These feature sets are not recommended on production clusters. -==== -+ -Wait until the `managedBootImagesStatus` stanza displays in the `MachineConfiguration` object. -+ -[source,yaml] ----- -apiVersion: operator.openshift.io/v1 -kind: MachineConfiguration -metadata: - name: cluster -# ... -status: -# ... - managedBootImagesStatus: - machineManagers: - - apiGroup: machine.openshift.io - resource: machinesets - selection: - mode: None ----- .Procedure @@ -65,17 +44,18 @@ metadata: name: cluster spec: # ... - managedBootImages: <1> + managedBootImages: machineManagers: - - apiGroup: machine.openshift.io <2> - resource: machinesets <3> + - apiGroup: machine.openshift.io + resource: machinesets selection: - mode: All <4> + mode: All ---- -<1> Configures the boot image management feature. -<2> Specifies the API group. This must be `machine.openshift.io`. -<3> Specifies the resource within the specified API group to apply the change. This must be `machinesets`. -<4> Specifies that the feature is enabled for all machine sets in the cluster. 
+where: + +`spec.managedBootImages`:: Configures the boot image management feature. +`spec.managedBootImages.machineManagers.selection.mode`:: Specifies that all the machine sets in the cluster are to be updated. + * Optional: Enable the boot image management feature for specific machine sets: + @@ -87,21 +67,21 @@ metadata: name: cluster spec: # ... - managedBootImages: <1> + managedBootImages: machineManagers: - - apiGroup: machine.openshift.io <2> - resource: machinesets <3> + - apiGroup: machine.openshift.io + resource: machinesets selection: - mode: Partial <4> + mode: Partial partial: machineResourceSelector: matchLabels: region: "east" ---- -<1> Configures the boot image update feature. -<2> Specifies the API group. This must be `machine.openshift.io`. -<3> Specifies the resource within the specified API group to apply the change. This must be `machinesets`. -<4> Specifies that the feature is enabled for machine sets with the specified label. +where: + +`spec.managedBootImages`:: Configures the boot image management feature. +`spec.managedBootImages.machineManagers.selection.partial.machineResourceSelector.matchLabels`:: Specifies that any machine set with this label is to be updated. + [TIP] ==== @@ -114,4 +94,90 @@ $ oc label machineset.machine ci-ln-hmy310k-72292-5f87z-worker-a region="east" - .Verification -include::snippets/mco-update-boot-images-verification.adoc[] +. View the current state of the boot image management feature by using the following command to view the machine configuration object: ++ +[source,terminal] +---- +$ oc get machineconfiguration cluster -o yaml +---- ++ +.Example machine set with the boot image reference +[source,yaml] +---- +kind: MachineConfiguration +metadata: + name: cluster +# ... +status: + conditions: + - lastTransitionTime: "2025-05-01T20:11:49Z" + message: Reconciled 2 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets + | Reconciled 0 of 0 CAPI MachineDeployments + reason: BootImageUpdateConfigurationUpdated + status: "True" + type: BootImageUpdateProgressing + - lastTransitionTime: "2025-05-01T19:30:13Z" + message: 0 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments + reason: BootImageUpdateConfigurationUpdated + status: "False" + type: BootImageUpdateDegraded + managedBootImagesStatus: + machineManagers: + - apiGroup: machine.openshift.io + resource: controlplanemachinesets + selection: + mode: All + - apiGroup: machine.openshift.io + resource: machinesets + selection: + mode: All +---- ++ +-- +where: + +`status.managedBootImagesStatus.machineManagers.selection.mode`:: Specifies that the boot image management feature is enabled when set to `All`. +-- + +. Scale a machine set to create a new node by using a command similar to the following. The boot image is updated only for new nodes. ++ +[source,terminal] +---- +$ oc scale --replicas=2 machinesets.machine.openshift.io -n openshift-machine-api +---- + +. If your cluster was using an older boot image version, you can see the new boot image version when the new node reaches the `READY` state. View the {op-system-first} version on a nodes: + +.. Log in to the node by using a command similar to the following: ++ +[source,terminal] +---- +$ oc debug node/ +---- + +.. Set `/host` as the root directory within the debug shell by using the following command: ++ +[source,terminal] +---- +sh-5.1# chroot /host +---- + +.. 
View the `/sysroot/.coreos-aleph-version.json` file by using a command similar to the following: ++ +[source,terminal] +---- +sh-5.1# cat /sysroot/.coreos-aleph-version.json +---- ++ +.Example output +[source,yaml] +---- +{ +# ... + "ref": "docker://ostree-image-signed:oci-archive:/rhcos-9.6.20251015-1-ostree.x86_64.ociarchive", + "version": "9.6.20251015-1" +} +---- +where: + +``:: Specifies the boot image version. diff --git a/modules/mco-update-boot-images-disable.adoc b/modules/mco-update-boot-images-disable.adoc index b26c9cab6322..f81d043ee6df 100644 --- a/modules/mco-update-boot-images-disable.adoc +++ b/modules/mco-update-boot-images-disable.adoc @@ -7,7 +7,8 @@ [id="mco-update-boot-images-disable_{context}"] = Disabling boot image management -By default, for {gcp-first} and {aws-first} clusters, the Machine Config Operator (MCO) manages and updates the boot image in the machine sets in your cluster whenever you update your cluster. For {vmw-first}, you can enable boot image management as a Technology Preview feature. +[role="_abstract"] +You can disable the boot image management feature so that the Machine Config Operator (MCO) no longer manages or updates the boot image in the affected machine sets. For example, you could disable this feature for the worker nodes in order to use a custom boot image that you do not want changed. You can disable the boot image management feature for your cluster by editing the `MachineConfiguration` object. When disabled, the Machine Config Operator (MCO) no longer manages the boot image in your cluster and no longer updates the boot image with each cluster update. @@ -43,10 +44,8 @@ spec: ---- + -- -<1> Configures the boot image management feature. -<2> Specifies an API group. This must be `machine.openshift.io`. -<3> Specifies the resource within the specified API group to apply the change. This must be `machinesets`. -<4> Specifies that the feature is disabled for all machine sets in the cluster. +`spec.managedBootImages`:: Configures the boot image management feature. +`spec.managedBootImages.machineManagers.selection.mode.None`:: Specifies that the feature is disabled for all machine sets in the cluster. -- //// @@ -61,22 +60,24 @@ metadata: name: cluster spec: # ... - managedBootImages: <1> + managedBootImages: machineManagers: - - apiGroup: machine.openshift.io <2> - resource: machinesets <3> + - apiGroup: machine.openshift.io + resource: machinesets selection: - mode: Partial <4> + mode: Partial partial: - machineResourceSelector: <5> + machineResourceSelector: matchLabels: region: "east" ---- -<1> Configures the boot image management feature. -<2> Specifies an API group. This must be `machine.openshift.io`. -<3> Specifies the resource within the specified API group to apply the change. This must be `machinesets`. -<4> Specifies that the feature is disabled for specific machine sets. -<5> Specifies that the feature is enabled only for machine sets with these labels. The feature is disabled for any machine set that does not contain the listed labels. +where: + +`spec.managedBootImages`:: Specifies the configuration of the boot image management feature. +`spec.managedBootImages.machineManagers.apiGroup`:: Specifies an API group. This must be `machine.openshift.io`. +`spec.managedBootImages.machineManagers.resource`:: Specifies the resource within the specified API group to apply the change. This must be `machinesets`. +`spec.managedBootImages.machineManagers.selection.mode`:: Specifies that the feature is disabled for specific machine sets. 
+`spec.managedBootImages.machineManagers.selection.partial.machineResourceSelector`:: Specifies that the feature is enabled only for machine sets with these labels. The feature is disabled for any machine set that does not contain the listed labels. //// .Verification diff --git a/modules/nodes-nodes-viewing-listing-pods.adoc b/modules/nodes-nodes-viewing-listing-pods.adoc index f08fb29381c4..e87045d5ca84 100644 --- a/modules/nodes-nodes-viewing-listing-pods.adoc +++ b/modules/nodes-nodes-viewing-listing-pods.adoc @@ -6,7 +6,8 @@ [id="nodes-nodes-viewing-listing-pods_{context}"] = Listing pods on a node in your cluster -You can list all the pods on a specific node. +[role="_abstract"] +You can list all of the pods on a node by using the `oc get pods` command along with specific flags. This command shows the number of pods on that node, the state of the pods, number of pod restarts, and the age of the pods. .Procedure diff --git a/modules/nodes-nodes-viewing-listing.adoc b/modules/nodes-nodes-viewing-listing.adoc index 3fab77913df8..b10b5f6a9286 100644 --- a/modules/nodes-nodes-viewing-listing.adoc +++ b/modules/nodes-nodes-viewing-listing.adoc @@ -6,7 +6,8 @@ [id="nodes-nodes-viewing-listing_{context}"] = About listing all the nodes in a cluster -You can get detailed information on the nodes in the cluster. +[role="_abstract"] +You can get detailed information about the nodes in the cluster, which can help you understand the state of the nodes in your cluster. * The following command lists all nodes: + @@ -108,11 +109,11 @@ include::snippets/osd-aws-example-only.adoc[] .Example output [source,text] ---- -Name: node1.example.com <1> -Roles: worker <2> +Name: node1.example.com +Roles: worker Labels: kubernetes.io/os=linux kubernetes.io/hostname=ip-10-0-131-14 - kubernetes.io/arch=amd64 <3> + kubernetes.io/arch=amd64 node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=m4.large node.openshift.io/os_id=rhcos @@ -120,15 +121,15 @@ Labels: kubernetes.io/os=linux region=east topology.kubernetes.io/region=us-east-1 topology.kubernetes.io/zone=us-east-1a -Annotations: cluster.k8s.io/machine: openshift-machine-api/ahardin-worker-us-east-2a-q5dzc <4> +Annotations: cluster.k8s.io/machine: openshift-machine-api/ahardin-worker-us-east-2a-q5dzc machineconfiguration.openshift.io/currentConfig: worker-309c228e8b3a92e2235edd544c62fea8 machineconfiguration.openshift.io/desiredConfig: worker-309c228e8b3a92e2235edd544c62fea8 machineconfiguration.openshift.io/state: Done volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Wed, 13 Feb 2019 11:05:57 -0500 -Taints: <5> +Taints: Unschedulable: false -Conditions: <6> +Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- OutOfDisk False Wed, 13 Feb 2019 15:09:42 -0500 Wed, 13 Feb 2019 11:05:57 -0500 KubeletHasSufficientDisk kubelet has sufficient disk space available @@ -136,11 +137,11 @@ Conditions: <6> DiskPressure False Wed, 13 Feb 2019 15:09:42 -0500 Wed, 13 Feb 2019 11:05:57 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Wed, 13 Feb 2019 15:09:42 -0500 Wed, 13 Feb 2019 11:05:57 -0500 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Wed, 13 Feb 2019 15:09:42 -0500 Wed, 13 Feb 2019 11:07:09 -0500 KubeletReady kubelet is posting ready status -Addresses: <7> +Addresses: InternalIP: 10.0.140.16 InternalDNS: ip-10-0-140-16.us-east-2.compute.internal Hostname: 
ip-10-0-140-16.us-east-2.compute.internal -Capacity: <8> +Capacity: attachable-volumes-aws-ebs: 39 cpu: 2 hugepages-1Gi: 0 @@ -154,7 +155,7 @@ Allocatable: hugepages-2Mi: 0 memory: 7558116Ki pods: 250 -System Info: <9> +System Info: Machine ID: 63787c9534c24fde9a0cde35c13f1f66 System UUID: EC22BF97-A006-4A58-6AF8-0A38DEEA122A Boot ID: f24ad37d-2594-46b4-8830-7f7555918325 @@ -167,7 +168,7 @@ System Info: <9> Kube-Proxy Version: v1.33.4 PodCIDR: 10.128.4.0/24 ProviderID: aws:///us-east-2a/i-04e87b31dc6b3e171 -Non-terminated Pods: (12 in total) <10> +Non-terminated Pods: (12 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits --------- ---- ------------ ---------- --------------- ------------- openshift-cluster-node-tuning-operator tuned-hdl5q 0 (0%) 0 (0%) 0 (0%) 0 (0%) @@ -188,7 +189,7 @@ Allocated resources: cpu 380m (25%) 270m (18%) memory 880Mi (11%) 250Mi (3%) attachable-volumes-aws-ebs 0 0 -Events: <11> +Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal NodeHasSufficientPID 6d (x5 over 6d) kubelet, m01.example.com Node m01.example.com status is now: NodeHasSufficientPID @@ -200,17 +201,21 @@ Events: <11> Normal Starting 6d kubelet, m01.example.com Starting kubelet. #... ---- -<1> The name of the node. -<2> The role of the node, either `master` or `worker`. -<3> The labels applied to the node. -<4> The annotations applied to the node. -<5> The taints applied to the node. -<6> The node conditions and status. The `conditions` stanza lists the `Ready`, `PIDPressure`, `MemoryPressure`, `DiskPressure` and `OutOfDisk` status. These condition are described later in this section. -<7> The IP address and hostname of the node. -<8> The pod resources and allocatable resources. -<9> Information about the node host. -<10> The pods on the node. -<11> The events reported by the node. +where: ++ +-- +`Names`:: Specifies the name of the node. +`Roles`:: Specifies the role of the node, either `master` or `worker`. +`Labels`:: Specifies the labels applied to the node. +`Annotations`:: Specifies the annotations applied to the node. +`Taints`:: Specifies the taints applied to the node. +`Conditions`:: Specifies the node conditions and status. The `conditions` stanza lists the `Ready`, `PIDPressure`, `MemoryPressure`, `DiskPressure` and `OutOfDisk` status. These condition are described later in this section. +`Addresses`:: Specifies the IP address and hostname of the node. +`Capacity`:: Specifies the pod resources and allocatable resources. +`Information`:: Specifies information about the node host. +`Non-terminated Pods`:: Specifies the pods on the node. +`Events`:: Specifies the events reported by the node. +-- + ifndef::openshift-rosa,openshift-rosa-hcp,openshift-dedicated[] [NOTE] diff --git a/modules/nodes-nodes-viewing-memory.adoc b/modules/nodes-nodes-viewing-memory.adoc index 5ddc6a916414..fcd6b223f032 100644 --- a/modules/nodes-nodes-viewing-memory.adoc +++ b/modules/nodes-nodes-viewing-memory.adoc @@ -6,9 +6,8 @@ [id="nodes-nodes-viewing-memory_{context}"] = Viewing memory and CPU usage statistics on your nodes -You can display usage statistics about nodes, which provide the runtime -environments for containers. These usage statistics include CPU, memory, and -storage consumption. +[role="_abstract"] +You can display usage statistics about nodes, including CPU, memory, and storage consumption. These statistics can help you ensure your cluster is running efficiently. 
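For example, a quick way to check current CPU and memory consumption across all nodes is the `oc adm top nodes` command. This is shown only as a sketch; the output values are illustrative:

[source,terminal]
----
$ oc adm top nodes
----

.Example output
[source,text]
----
NAME                                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-131-14.us-east-2.compute.internal   380m         25%    880Mi           11%
ip-10-0-140-16.us-east-2.compute.internal   420m         28%    1250Mi          16%
----
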
.Prerequisites diff --git a/modules/nodes-nodes-working-deleting-bare-metal.adoc b/modules/nodes-nodes-working-deleting-bare-metal.adoc index a78bfc30268b..493e73c30b70 100644 --- a/modules/nodes-nodes-working-deleting-bare-metal.adoc +++ b/modules/nodes-nodes-working-deleting-bare-metal.adoc @@ -7,16 +7,18 @@ [id="nodes-nodes-working-deleting-bare-metal_{context}"] = Deleting nodes from a bare metal cluster +[role="_abstract"] +You can delete a node from a {product-title} cluster that does not use machine sets by using the `oc delete node` command and decommissioning the node. + When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods not backed by a replication controller become inaccessible to {product-title}. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods. -.Procedure +The following procedure deletes a node from an {product-title} cluster running on bare metal. -Delete a node from an {product-title} cluster running on bare metal by completing -the following steps: +.Procedure . Mark the node as unschedulable: + @@ -32,7 +34,7 @@ $ oc adm cordon $ oc adm drain --force=true ---- + -This step might fail if the node is offline or unresponsive. Even if the node does not respond, it might still be running a workload that writes to shared storage. To avoid data corruption, power down the physical hardware before you proceed. +This step might fail if the node is offline or unresponsive. Even if the node does not respond, the node might still be running a workload that writes to shared storage. To avoid data corruption, power down the physical hardware before you proceed. . Delete the node from the cluster: + diff --git a/modules/nodes-nodes-working-deleting.adoc b/modules/nodes-nodes-working-deleting.adoc index d6fdbba51a1f..1a346b60c675 100644 --- a/modules/nodes-nodes-working-deleting.adoc +++ b/modules/nodes-nodes-working-deleting.adoc @@ -6,7 +6,8 @@ [id="nodes-nodes-working-deleting_{context}"] = Deleting nodes from a cluster -To delete a node from the {product-title} cluster, scale down the appropriate `MachineSet` object. +[role="_abstract"] +You can delete a node from a {product-title} cluster by scaling down the appropriate `MachineSet` object. [IMPORTANT] ==== @@ -58,7 +59,9 @@ metadata: namespace: openshift-machine-api # ... spec: - replicas: 2 # <1> + replicas: 2 # ... ---- -<1> Specify the number of replicas to scale down to. \ No newline at end of file +where: + +`spec.replicas`:: Specifies the number of replicas to scale down to. diff --git a/modules/nodes-nodes-working-evacuating.adoc b/modules/nodes-nodes-working-evacuating.adoc index 332a06c57659..ce4335191d6e 100644 --- a/modules/nodes-nodes-working-evacuating.adoc +++ b/modules/nodes-nodes-working-evacuating.adoc @@ -4,23 +4,23 @@ :_mod-docs-content-type: PROCEDURE [id="nodes-nodes-working-evacuating_{context}"] -= Understanding how to evacuate pods on nodes += Evacuating pods on nodes -Evacuating pods allows you to migrate all or selected pods from a given node or -nodes. +[role="_abstract"] +You can remove, or evacuate, pods from a given node or nodes. Evacuating pods allows you to migrate all or selected pods to other nodes. -You can only evacuate pods backed by a replication controller. The replication controller creates new pods on +You can evacuate only pods that are backed by a replication controller. 
The replication controller creates new pods on other nodes and removes the existing pods from the specified node(s). Bare pods, meaning those not backed by a replication controller, are unaffected by default. -You can evacuate a subset of pods by specifying a pod-selector. Pod selectors are -based on labels, so all the pods with the specified label will be evacuated. +You can evacuate a subset of pods by specifying a pod selector. Because pod selectors are +based on labels, all of the pods with the specified label are evacuated. .Procedure -. Mark the nodes unschedulable before performing the pod evacuation. +. Mark the nodes as unschedulable before performing the pod evacuation. -.. Mark the node as unschedulable: +.. Mark the node as unschedulable by running the following command: + [source,terminal] ---- @@ -33,7 +33,7 @@ $ oc adm cordon node/ cordoned ---- -.. Check that the node status is `Ready,SchedulingDisabled`: +.. Check that the node status is `Ready,SchedulingDisabled` by running the following command: + [source,terminal] ---- @@ -47,26 +47,26 @@ NAME STATUS ROLES AGE VERSION Ready,SchedulingDisabled worker 1d v1.33.4 ---- -. Evacuate the pods using one of the following methods: +. Evacuate the pods by using one of the following methods: -** Evacuate all or selected pods on one or more nodes: +** Evacuate all or selected pods on one or more nodes by running the `oc adm drain` command: + [source,terminal] ---- $ oc adm drain [--pod-selector=] ---- -** Force the deletion of bare pods using the `--force` option. When set to +** Force the deletion of bare pods by using the `--force` option with the `oc adm drain` command. When set to `true`, deletion continues even if there are pods not managed by a replication -controller, replica set, job, daemon set, or stateful set: +controller, replica set, job, daemon set, or stateful set. + [source,terminal] ---- $ oc adm drain --force=true ---- -** Set a period of time in seconds for each pod to -terminate gracefully, use `--grace-period`. If negative, the default value specified in the pod will +** Set a period of time in seconds for each pod to +terminate gracefully by using the `--grace-period` option with the `oc adm drain` command. If negative, the default value specified in the pod will be used: + [source,terminal] @@ -74,31 +74,31 @@ be used: $ oc adm drain --grace-period=-1 ---- -** Ignore pods managed by daemon sets using the `--ignore-daemonsets` flag set to `true`: +** Ignore pods managed by daemon sets by using the `--ignore-daemonsets=true` option with the `oc adm drain` command: + [source,terminal] ---- $ oc adm drain --ignore-daemonsets=true ---- -** Set the length of time to wait before giving up using the `--timeout` flag. A -value of `0` sets an infinite length of time: +** Set the length of time to wait before giving up using the `--timeout` option with the `oc adm drain` command. A +value of `0` sets an infinite length of time. + [source,terminal] ---- $ oc adm drain --timeout=5s ---- -** Delete pods even if there are pods using `emptyDir` volumes by setting the `--delete-emptydir-data` flag to `true`. Local data is deleted when the node -is drained: +** Delete pods even if there are pods using `emptyDir` volumes by setting the `--delete-emptydir-data=true` option with the `oc adm drain` command. Local data is deleted when the node +is drained. 
+ [source,terminal] ---- $ oc adm drain --delete-emptydir-data=true ---- -** List objects that will be migrated without actually performing the evacuation, -using the `--dry-run` option set to `true`: +** List objects that would be migrated without actually performing the evacuation, +by using the `--dry-run=true` option with the `oc adm drain` command: + [source,terminal] ---- @@ -106,10 +106,10 @@ $ oc adm drain --dry-run=true ---- + Instead of specifying specific node names (for example, ` `), you -can use the `--selector=` option to evacuate pods on selected +can use the `--selector=` option with the `oc adm drain` command to evacuate pods on selected nodes. -. Mark the node as schedulable when done. +. Mark the node as schedulable when done by using the following command. + [source,terminal] ---- diff --git a/modules/nodes-nodes-working-marking.adoc b/modules/nodes-nodes-working-marking.adoc index 6fda48df5633..56f64b4e264c 100644 --- a/modules/nodes-nodes-working-marking.adoc +++ b/modules/nodes-nodes-working-marking.adoc @@ -6,10 +6,14 @@ [id="nodes-nodes-working-marking_{context}"] = Understanding how to mark nodes as unschedulable or schedulable +[role="_abstract"] +You can mark a node as unschedulable in order to block any new pods from being scheduled on the node. + +When you mark a node as unschedulable, existing pods on the node are not affected. + By default, healthy nodes with a `Ready` status are marked as schedulable, which means that you can place new pods on the -node. Manually marking a node as unschedulable blocks any new pods from being -scheduled on the node. Existing pods on the node are not affected. +node. * The following command marks a node or nodes as unschedulable: + @@ -42,5 +46,5 @@ node1.example.com kubernetes.io/hostname=node1.example.com Ready,Schedul $ oc adm uncordon ---- + -Alternatively, instead of specifying specific node names (for example, ``), you can use the `--selector=` option to mark selected +Instead of specifying specific node names (for example, ``), you can use the `--selector=` option to mark selected nodes as schedulable or unschedulable. diff --git a/modules/nodes-nodes-working-updating.adoc b/modules/nodes-nodes-working-updating.adoc index 45886e8c5c04..b431d094c5c1 100644 --- a/modules/nodes-nodes-working-updating.adoc +++ b/modules/nodes-nodes-working-updating.adoc @@ -6,7 +6,8 @@ [id="nodes-nodes-working-updating_{context}"] = Understanding how to update labels on nodes -You can update any label on a node. +[role="_abstract"] +You can update any label on a node in order to adapt your cluster to evolving needs. Node labels are not persisted after a node is deleted even if the node is backed up by a Machine. @@ -66,4 +67,4 @@ $ oc label pods --all status=unhealthy In {product-title} 4.12 and later, newly installed clusters include both the `node-role.kubernetes.io/control-plane` and `node-role.kubernetes.io/master` labels on control plane nodes by default. In {product-title} versions earlier than 4.12, the `node-role.kubernetes.io/control-plane` label is not added by default. Therefore, you must manually add the `node-role.kubernetes.io/control-plane` label to control plane nodes in clusters upgraded from earlier versions. 
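For example, a command of the following form adds the missing label to a control plane node. The node name is a placeholder for illustration; the trailing `=` assigns the label with an empty value, which is how node role labels are set:

[source,terminal]
----
$ oc label node <node_name> node-role.kubernetes.io/control-plane=
----
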
-==== \ No newline at end of file +==== diff --git a/modules/sno-clusters-reboot-without-drain.adoc b/modules/sno-clusters-reboot-without-drain.adoc index d31dc8890650..b34aa642e341 100644 --- a/modules/sno-clusters-reboot-without-drain.adoc +++ b/modules/sno-clusters-reboot-without-drain.adoc @@ -6,6 +6,9 @@ [id="sno-clusters-reboot-without-drain_{context}"] = Handling errors in {sno} clusters when the node reboots without draining application pods +[role="_abstract"] +You can remove failed pods from a node by using the `--field-selector status.phase=Failed` flag with the `oc delete pods` command. + In {sno} clusters and in {product-title} clusters in general, a situation can arise where a node reboot occurs without first draining the node. This can occur where an application pod requesting devices fails with the `UnexpectedAdmissionError` error. `Deployment`, `ReplicaSet`, or `DaemonSet` errors are reported because the application pods that require those devices start before the pod serving those devices. You cannot control the order of pod restarts. While this behavior is to be expected, it can cause a pod to remain on the cluster even though it has failed to deploy successfully. The pod continues to report `UnexpectedAdmissionError`. This issue is mitigated by the fact that application pods are typically included in a `Deployment`, `ReplicaSet`, or `DaemonSet`. If a pod is in this error state, it is of little concern because another instance should be running. Belonging to a `Deployment`, `ReplicaSet`, or `DaemonSet` guarantees the successful creation and execution of subsequent pods and ensures the successful deployment of the application. diff --git a/modules/update-vsphere-virtual-hardware-on-compute-nodes.adoc b/modules/update-vsphere-virtual-hardware-on-compute-nodes.adoc index 3487d25d0762..4e0c12507102 100644 --- a/modules/update-vsphere-virtual-hardware-on-compute-nodes.adoc +++ b/modules/update-vsphere-virtual-hardware-on-compute-nodes.adoc @@ -52,7 +52,7 @@ $ oc adm cordon $ oc adm drain [--pod-selector=] ---- + -See the "Understanding how to evacuate pods on nodes" section for other options to evacuate pods from a node. +See the "Evacuating pods on nodes" section for other options to evacuate pods from a node. . Shut down the virtual machine (VM) associated with the compute node. Do this in the vSphere client by right-clicking the VM and selecting *Power* -> *Shut Down Guest OS*. Do not shut down the VM using *Power Off* because it might not shut down safely. diff --git a/nodes/nodes/nodes-nodes-viewing.adoc b/nodes/nodes/nodes-nodes-viewing.adoc index 8ee4b9eeb752..9e1eb055895c 100644 --- a/nodes/nodes/nodes-nodes-viewing.adoc +++ b/nodes/nodes/nodes-nodes-viewing.adoc @@ -6,6 +6,7 @@ include::_attributes/common-attributes.adoc[] toc::[] +[role="_abstract"] You can list all the nodes in your cluster to obtain information such as status, age, memory usage, and details about the nodes. When you perform node management operations, the CLI interacts with node objects that are representations of actual node hosts. 
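For example, listing the nodes with the `-o wide` option is a quick way to see much of this information at a glance, including each node's internal IP address and OS image. This is shown only as an illustration; the detailed listing options are covered in the modules that follow:

[source,terminal]
----
$ oc get nodes -o wide
----
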
@@ -23,23 +24,19 @@ endif::openshift-rosa,openshift-rosa-hcp[] include::modules/nodes-nodes-viewing-listing.adoc[leveloffset=+1] -ifndef::openshift-rosa,openshift-rosa-hcp,openshift-dedicated[] - -[role="_additional-resources"] -.Additional resources - -* xref:../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-updating_nodes-nodes-working[Understanding how to update labels on nodes] - -endif::openshift-rosa,openshift-rosa-hcp,openshift-dedicated[] - include::modules/nodes-nodes-viewing-listing-pods.adoc[leveloffset=+1] include::modules/nodes-nodes-viewing-memory.adoc[leveloffset=+1] -.Additional resources +[role="_additional-resources"] +[id="additional-resources_{context}"] +== Additional resources +ifndef::openshift-rosa,openshift-rosa-hcp,openshift-dedicated[] +* xref:../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-updating_nodes-nodes-working[Understanding how to update labels on nodes] +endif::openshift-rosa,openshift-rosa-hcp,openshift-dedicated[] ifdef::openshift-rosa[] -* xref:../../rosa_architecture/rosa_policy_service_definition/rosa-service-definition.adoc#rosa-sdpolicy-node-lifecycle_rosa-service-definition[Node lifecycle]. +* xref:../../rosa_architecture/rosa_policy_service_definition/rosa-service-definition.adoc#rosa-sdpolicy-node-lifecycle_rosa-service-definition[Node lifecycle] endif::openshift-rosa[] ifdef::openshift-rosa-hcp[] -* xref:../../rosa_architecture/rosa_policy_service_definition/rosa-hcp-service-definition.adoc#rosa-sdpolicy-node-lifecycle_rosa-hcp-service-definition[Node lifecycle]. -endif::openshift-rosa-hcp[] \ No newline at end of file +* xref:../../rosa_architecture/rosa_policy_service_definition/rosa-hcp-service-definition.adoc#rosa-sdpolicy-node-lifecycle_rosa-hcp-service-definition[Node lifecycle] +endif::openshift-rosa-hcp[] diff --git a/nodes/nodes/nodes-nodes-working.adoc b/nodes/nodes/nodes-nodes-working.adoc index 0a4fd9311c26..b10c821d5003 100644 --- a/nodes/nodes/nodes-nodes-working.adoc +++ b/nodes/nodes/nodes-nodes-working.adoc @@ -7,6 +7,7 @@ include::_attributes/common-attributes.adoc[] toc::[] +[role="_abstract"] As an administrator, you can perform several tasks to make your clusters more efficient. ifdef::openshift-rosa,openshift-rosa-hcp[] You can use the `oc adm` command to cordon, uncordon, and drain a specific node. 
@@ -34,20 +35,15 @@ include::modules/nodes-nodes-working-marking.adoc[leveloffset=+1] include::modules/sno-clusters-reboot-without-drain.adoc[leveloffset=+1] -[role="_additional-resources"] -.Additional resources - -* xref:../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-evacuating_nodes-nodes-working[Understanding how to evacuate pods on nodes] -== Deleting nodes +include::modules/nodes-nodes-working-deleting.adoc[leveloffset=+1] -include::modules/nodes-nodes-working-deleting.adoc[leveloffset=+2] +include::modules/nodes-nodes-working-deleting-bare-metal.adoc[leveloffset=+2] [role="_additional-resources"] -.Additional resources +[id="additional-resources_{context}"] +== Additional resources +* xref:../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-evacuating_nodes-nodes-working[Evacuating pods on nodes] * xref:../../machine_management/manually-scaling-machineset.adoc#machineset-manually-scaling-manually-scaling-machineset[Manually scaling a compute machine set] -include::modules/nodes-nodes-working-deleting-bare-metal.adoc[leveloffset=+2] - endif::openshift-rosa,openshift-rosa-hcp[] diff --git a/nodes/nodes/nodes-remediating-fencing-maintaining-rhwa.adoc b/nodes/nodes/nodes-remediating-fencing-maintaining-rhwa.adoc index a0bc606fcddc..aa628fe4e646 100644 --- a/nodes/nodes/nodes-remediating-fencing-maintaining-rhwa.adoc +++ b/nodes/nodes/nodes-remediating-fencing-maintaining-rhwa.adoc @@ -6,6 +6,9 @@ include::_attributes/common-attributes.adoc[] toc::[] -When node-level failures occur, such as the kernel hangs or network interface controllers (NICs) fail, the work required from the cluster does not decrease, and workloads from affected nodes need to be restarted somewhere. Failures affecting these workloads risk data loss, corruption, or both. It is important to isolate the node, known as `fencing`, before initiating recovery of the workload, known as `remediation`, and recovery of the node. +[role="_abstract"] +When node-level failures occur, such as a kernel hang or a network interface controller (NIC) failure, it is important to isolate the node, known as _fencing_, before initiating recovery of the workload, known as _remediation_. You can then attempt to recover the node. + +During node failures, the work required from the cluster does not decrease, and workloads from affected nodes need to be restarted somewhere. Failures affecting these workloads risk data loss, corruption, or both. For more information on remediation, fencing, and maintaining nodes, see the link:https://access.redhat.com/documentation/en-us/workload_availability_for_red_hat_openshift[Workload Availability for Red Hat OpenShift] documentation.
diff --git a/nodes/nodes/nodes-update-boot-images.adoc b/nodes/nodes/nodes-update-boot-images.adoc index e1665000ff41..e14c57fbc2ef 100644 --- a/nodes/nodes/nodes-update-boot-images.adoc +++ b/nodes/nodes/nodes-update-boot-images.adoc @@ -6,22 +6,24 @@ include::_attributes/common-attributes.adoc[] toc::[] +[role="_abstract"] +include::snippets/mco-update-boot-images-abstract.adoc[] + include::snippets/mco-update-boot-images-intro.adoc[] :FeatureName: Boot image management on {vmw-short} include::snippets/technology-preview.adoc[] -[role="_additional-resources"] -.Additional resources -* xref:../../nodes/clusters/nodes-cluster-enabling-features.adoc#nodes-cluster-enabling-features[Enabling features using feature gates] - include::modules/mco-update-boot-images-about.adoc[leveloffset=+1] +include::modules/mco-update-boot-images-disable.adoc[leveloffset=+1] + +include::modules/mco-update-boot-images-configuring.adoc[leveloffset=+1] + [role="_additional-resources"] -.Additional resources +[id="additional-resources_{context}"] +== Additional resources +* xref:../../nodes/clusters/nodes-cluster-enabling-features.adoc#nodes-cluster-enabling-features[Enabling features using feature gates] * xref:../../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-disable_machine-configs-configure[Disabling boot image management] * xref:../../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-configuring_machine-configs-configure[Enabling boot image management] -include::modules/mco-update-boot-images-disable.adoc[leveloffset=+1] - -include::modules/mco-update-boot-images-configuring.adoc[leveloffset=+1] diff --git a/snippets/mco-update-boot-images-abstract.adoc b/snippets/mco-update-boot-images-abstract.adoc new file mode 100644 index 000000000000..15c0997b0190 --- /dev/null +++ b/snippets/mco-update-boot-images-abstract.adoc @@ -0,0 +1,7 @@ +// +// * machine_configuration/mco-update-boot-images.adoc +// * nodes/nodes/nodes-update-boot-images.adoc + +:_mod-docs-content-type: SNIPPET + +For supported platforms, the Machine Config Operator (MCO) can manage and update the boot image on each node to ensure the {op-system-first} version of the boot image matches the {op-system-first} version appropriate for your cluster. diff --git a/updating/updating_a_cluster/updating-hardware-on-nodes-running-on-vsphere.adoc b/updating/updating_a_cluster/updating-hardware-on-nodes-running-on-vsphere.adoc index 9ee342b2b0b7..63fbec07bbc1 100644 --- a/updating/updating_a_cluster/updating-hardware-on-nodes-running-on-vsphere.adoc +++ b/updating/updating_a_cluster/updating-hardware-on-nodes-running-on-vsphere.adoc @@ -45,7 +45,7 @@ include::modules/update-vsphere-virtual-hardware-on-template.adoc[leveloffset=+2 [role="_additional-resources"] .Additional resources -* xref:../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-evacuating_nodes-nodes-working[Understanding how to evacuate pods on nodes] +* xref:../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-evacuating_nodes-nodes-working[Evacuating pods on nodes] // Scheduling an update for virtual hardware on vSphere include::modules/scheduling-virtual-hardware-update-on-vsphere.adoc[leveloffset=+1]
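As a quick check on the boot image management status described earlier, you can read the `BootImageUpdateProgressing` condition directly instead of paging through the full YAML. This is a sketch; the condition type and the message format are taken from the `MachineConfiguration` example output shown earlier:

[source,terminal]
----
$ oc get machineconfiguration cluster -o jsonpath='{.status.conditions[?(@.type=="BootImageUpdateProgressing")].message}'
----

.Example output
[source,text]
----
Reconciled 2 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
----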