This is a collection of etcd tools for long and tedious tasks. Currently there is a restore tool for restoring a snapshot to a single node, and a join tool for rejoining the other members after the restore has completed on that node. These tools have been tested on RKE-deployed clusters, Rancher-deployed clusters (tested on AWS), and Rancher custom clusters. They assume you have not changed node IPs or removed the cluster from the Rancher interface. If either of these has been done, your cluster will not be in a healthy state after the restore.
- Take an etcd snapshot before starting, using one of the following commands (only one will work, depending on whether your cluster requires --endpoints):
docker exec etcd etcdctl snapshot save /tmp/snapshot.db && docker cp etcd:/tmp/snapshot.db .
docker exec etcd sh -c "etcdctl snapshot --endpoints=\$ETCDCTL_ENDPOINT save /tmp/snapshot.db" && docker cp etcd:/tmp/snapshot.db .
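Since only one of the two snapshot commands will succeed on a given cluster, the choice can be left to the shell. The following is a minimal sketch, not part of these tools: `take_snapshot` is a hypothetical helper name, and it assumes the etcd container is named `etcd` as in the commands above.

```shell
# Hypothetical helper: try the plain snapshot command first, then fall back
# to the --endpoints variant. Both commands are the ones listed above.
take_snapshot() {
  docker exec etcd etcdctl snapshot save /tmp/snapshot.db ||
    docker exec etcd sh -c "etcdctl snapshot --endpoints=\$ETCDCTL_ENDPOINT save /tmp/snapshot.db" ||
    { echo "both snapshot variants failed" >&2; return 1; }
  # Copy the snapshot out of the container into the current directory.
  docker cp etcd:/tmp/snapshot.db .
}
```

Calling `take_snapshot` on an etcd node should leave `snapshot.db` in the current directory if either variant works.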
- Stop etcd on all nodes except for the one you are restoring:
docker update --restart=no etcd && docker stop etcd
- Run the restore:
curl -LO
bash ./ </path/to/snapshot>
You can also restore lost quorum with the following command instead:
- Rejoin the other etcd nodes by running the following commands on each node where you shut down etcd in step 2. Do not run them on the node where you performed the restore. The SSH key argument is optional if your SSH account already has a default key set.
Automatic mode:
curl -LO
bash ./ <ssh user> <recently restored etcd node IP> [path to ssh key for remote box]
Manual mode (good for scenarios where you can't set up SSH keys between etcd nodes):
curl -LO
NOTE: If you are using the join tool to rejoin a node to a cluster that wasn't recently restored with the restore tool, make sure the etcd node is not a member of the etcd cluster before running the script. If it is still a member, it will fail to rejoin. Examples are given below for clusters that require --endpoints and for clusters that don't.
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member list"
docker exec etcd sh -c "etcdctl --endpoints=\$ETCDCTL_ENDPOINT member remove <id>"
docker exec etcd sh -c "etcdctl member list"
docker exec etcd sh -c "etcdctl member remove <id>"
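Finding the right `<id>` in the member list by eye is error-prone when IPs look alike. The sketch below picks the ID for a given node IP. `member_id_for` is a hypothetical helper, and it assumes the comma-separated `member list` output of the etcd v3 API, with the member ID in the first field and URLs of the form `https://<ip>:<port>`. Check your own output format before relying on it.

```shell
# Hypothetical helper: read `etcdctl member list` output on stdin and print
# the ID of the member whose URLs contain the given IP.
# Assumes comma-separated lines with the member ID in the first field.
member_id_for() {
  # Matching on "//<ip>:" keeps 10.0.0.1 from also matching 10.0.0.11.
  awk -F', ' -v ip="$1" 'index($0, "//" ip ":") { print $1 }'
}

# Example use (not run here):
#   id=$(docker exec etcd sh -c "etcdctl member list" | member_id_for 10.0.0.1)
#   [ -n "$id" ] && docker exec etcd sh -c "etcdctl member remove $id"
```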
- Restart kubelet and kube-apiserver on all servers where they have not already been restarted for you by the script.
docker restart kubelet kube-apiserver
Generate and copy. This method is quickest if you have a password login you can use on the remote end.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/etcd -N "" <<< y >/dev/null
ssh-copy-id -i ~/.ssh/etcd user@host
Generate and copy manually. This method is quickest if you already have SSH sessions open and no other way to log in directly without a key.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/etcd -N "" <<< y >/dev/null
cat ~/.ssh/
Copy the output and, on the other host, paste it in like so:
cat >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
- If either the restore or the join script fails for any reason, your etcd container may have been renamed to something like etcd-old--2020-03-20--031746. Before re-running either script, ensure that the etcd container exists. If it does not, use docker rename to rename the most recently timestamped etcd-old container back to etcd. Depending on where the failure occurred, you may also have to remove the pending etcd node from the recently restored node. After a failure, it is recommended to run "docker exec etcd etcdctl member list" on the recently restored node. If you see the pending node, remove it with "docker exec etcd etcdctl member remove <id>".
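Picking the most recently timestamped etcd-old container can be scripted. This is a minimal sketch under one assumption: the backup names follow the `etcd-old--YYYY-MM-DD--HHMMSS` pattern shown in the example above, so the newest name sorts last. `latest_etcd_old` is a hypothetical helper name.

```shell
# Hypothetical helper: read container names on stdin (one per line) and print
# the most recent etcd-old backup, relying on the sortable timestamp suffix.
latest_etcd_old() {
  grep '^etcd-old--' | LC_ALL=C sort | tail -n 1
}

# Example use (not run here), only when no container named etcd exists:
#   name=$(docker ps -a --format '{{.Names}}' | latest_etcd_old)
#   [ -n "$name" ] && docker rename "$name" etcd
```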