Kops cluster upgrade from 1.28.7 to 1.29.2 - warmpool instances join cluster and remain in notReady state #16871
Comments
Hi, I attempted to troubleshoot the issue by performing the following steps:
We have the same issue!
Hi @hakman, @johngmyers, sorry for the direct ping; last time you helped solve an issue quickly :). We rely heavily on kOps (40+ clusters) and use warm pools. In the recent 1.29 releases the warm pool behaviour was changed by the following PRs, which introduced the mentioned issue.
We would appreciate it if you could take a look and fix this! If there is any way we can support you in making it happen quickly, please let us know.
Any update?
Can you SSH into an instance that is still warming and dump the logs? It could be related to #16213 or https://github.com/kubernetes/kops/pull/16460/files#diff-0e14cc1cc6d0d21dacab069a7fe628d8c3fc3287a0fb3ad4468194d613a88a5e
Hi @rifelpet, thank you for the reply; you can find the log file in the attachment. Best regards,
Based on your logs, nodeup is definitely skipping the warm pool logic. Just to confirm, can you run this on an instance that is still warming and paste its output here?
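The exact command referenced above is not preserved in this extract. As a hedged sketch (not the maintainer's original command), one way to check from the instance itself whether EC2 reports it as a warm pool instance is to query the instance metadata service for the Auto Scaling target lifecycle state (IMDSv2 shown; a warming instance should report a Warmed:* state rather than InService):
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state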
Hi @rifelpet, Here is the output from the command that you sent:
I am also attaching the log file. It does say this at the end of the log. After that the machine is powered off, but it still stays in the Kubernetes cluster:
Best regards,
I believe I know what the issue is; can you test a kops build from this PR? If you can run the kops CLI on linux/amd64, download the kops binary from here:
Otherwise you'll need to check out the branch and build it yourself. Set this environment variable:
Then run your normal kops commands.
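The exact build steps and environment variable are not preserved above. As a rough, hedged sketch of building the kops CLI from a PR branch (assuming a working Go toolchain and the usual kops Makefile targets; <PR_NUMBER> is a placeholder, not the PR referenced in the comment):
git clone https://github.com/kubernetes/kops.git && cd kops
# Fetch the PR branch under test into a local branch
git fetch origin pull/<PR_NUMBER>/head:pr-under-test && git checkout pr-under-test
# Build the kops CLI; the output location under .build/ may vary by kops version
make kops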
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.29.2 (git-v1.29.2)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
v1.29.6
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
After editing the kops config with the new Kubernetes version, I ran the following commands:
kops get assets --copy --state $KOPS_REMOTE_STATE
kops update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE --allow-kops-downgrade
kops update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE --post-drain-delay 75s --drain-timeout 30m
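For context, the warm pool behaviour exercised here is configured on the instance group spec; a quick way to confirm it is present (a hedged sketch, assuming the instance group is named nodes) is:
kops get ig nodes --name $CLUSTER_NAME --state $KOPS_REMOTE_STATE -o yaml | grep -A 3 warmPool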
5. What happened after the commands executed?
The initiation of the cluster upgrade went smoothly. The master nodes were successfully updated; however, an issue arose while updating the autoscaling groups that use a warm pool. The update became stuck because warm pool instances were joining the cluster instead of simply warming up and then powering off.
The following error was appearing in the kops update logs:
I1002 12:02:19.415658 31 instancegroups.go:565] Cluster did not pass validation, will retry in "30s": node "i-04b854ec78e845f96" of role "node" is not ready, system-node-critical pod "aws-node-4chll" is pending, system-node-critical pod "ebs-csi-node-wcz74" is pending, system-node-critical pod "efs-csi-node-7q2j8" is pending, system-node-critical pod "kube-proxy-i-04b854ec78e845f96" is pending, system-node-critical pod "node-local-dns-mdvq7" is pending.
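As a hedged aside (not part of the original report), the same validation that rolling-update retries can also be run by hand to watch the failure independently:
kops validate cluster --name $CLUSTER_NAME --state $KOPS_REMOTE_STATE --wait 10m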
Those nodes in the Kubernetes cluster were displayed as 'NotReady,SchedulingDisabled' when using the 'kubectl get nodes' command. I waited for 10 minutes, but there was no progress. Subsequently, I resorted to manually deleting the problematic nodes. This action successfully resolved the issue, allowing the cluster upgrade process to resume smoothly.
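For reference, the manual cleanup described above amounts to roughly the following (a sketch; the node name is taken from the validation error earlier in this report):
kubectl get nodes | grep 'NotReady,SchedulingDisabled'
kubectl delete node i-04b854ec78e845f96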
After completing the upgrade, I conducted another test by manually removing warmed-up instances from the AWS console. This led to the creation of new warm pool instances, which were again added to the Kubernetes cluster. These newly added nodes remained in a 'NotReady,SchedulingDisabled' state until I removed them manually.
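The same reproduction can be driven from the AWS CLI instead of the console (a hedged sketch; the Auto Scaling group name and instance ID are placeholders):
# List the instances currently held in the group's warm pool
aws autoscaling describe-warm-pool --auto-scaling-group-name <asg-name>
# Terminating a warmed instance causes the ASG to launch a replacement,
# which then joined the cluster in a NotReady,SchedulingDisabled state
aws ec2 terminate-instances --instance-ids <warm-instance-id>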
Autoscaler logs for one of those nodes:
I1002 13:02:34.149584 1 pre_filtering_processor.go:57] Node i-0cfcda3548f955e05 should not be processed by cluster autoscaler (no node group config)
And the relevant log line from the kops-controller:
E1002 13:02:10.796429 1 controller.go:329] "msg"="Reconciler error" "error"="error identifying node \"i-0cfcda3548f955e05\": found instance \"i-0cfcda3548f955e05\", but state is \"stopped\"" "Node"={"name":"i-0cfcda3548f955e05"} "controller"="node" "controllerGroup"="" "controllerKind"="Node" "name"="i-0cfcda3548f955e05" "namespace"="" "reconcileID"="b532008b-db8f-4273-90ad-f0bf9d40858c"
Also, the kube-system pods for those nodes are stuck in Pending for some reason:
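The pod listing itself is not preserved in this extract; as a hedged example, the pending system pods on one of these nodes can be listed with (node name taken from the kops-controller log above):
kubectl get pods -n kube-system --field-selector spec.nodeName=i-0cfcda3548f955e05 -o wide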
6. What did you expect to happen?
I expected the warm pool nodes to be started and subsequently shut down without joining the cluster.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know?