RKE/Kubernetes is good about recovering from a cluster shutdown and requires little intervention, though there is a specific order in which things should be powered back on to minimize errors. Etcd is our primary concern because the rest of the services are stateless. Etcd uses a write-ahead log (WAL) to store certain updates before applying them. If a member crashes and restarts between snapshots, it can locally recover transactions done since the last snapshot by looking at the content of the WAL. NOTE: Etcd uses fdatasync to flush writes from cache to disk.
- Prerequisites
- RKE v0.3.2
- Latest kubectl
- 7 VMs (2 core, 4GB of RAM, 20GB root)
- SSH access to root on all nodes
- Internet access to github and docker hub.
- Running Docker Install Script
- Edit the cluster.yml to include your node IPs and S3 settings
vi ./cluster.yml
- Stand up the cluster
bash ./build.sh
- Verify the cluster is up and healthy
bash ./verify.sh
- Break the cluster
bash ./break.sh
-
Power on any storage devices if applicable. Check with your storage vendor to properly power on your storage devices and verify that they are ready.
-
For each etcd node:
- Power on the system/start the instance.
- Log into the system via ssh.
- Ensure docker has started sudo service docker status or sudo systemctl status docker
- Ensure etcd and kubelet’s status shows Up in Docker sudo docker ps
-
For each control plane node:
- Power on the system/start the instance.
- Log into the system via ssh.
- Ensure docker has started sudo service docker status or sudo systemctl status docker
- Ensure kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet’s status shows Up in Docker sudo docker ps
-
For each worker node:
- Power on the system/start the instance.
- Log into the system via ssh.
- Ensure docker has started sudo service docker status or sudo systemctl status docker
- Ensure kubelet’s status shows Up in Docker sudo docker ps
- Log into the Rancher UI (or use kubectl) and check your various projects to ensure workloads have started as expected. This may take a few minutes, depending on the number of workloads and your server capacity.
- How to shutdown a Kubernetes cluster
- You should make sure async is enabled on your storage subsystem.