Rotating cloud instances with PVCs in a StatefulSet #181
@joekohlsdorf I guess we are facing the same issue: #65 (comment) How do you plan to solve that?
There is no Kubernetes-official way right now because Kubernetes will not unbind or delete PVCs. It's up to the users to recover from this situation. I have a plan to write a cloud controller to handle this automatically. When a new node with a different name is created to replace the old node (e.g. an auto-scaling group in AWS), PVs belonging to the old node are invalid. The PVCs must be deleted, then the scheduler can find feasible PVs on other nodes. By the way, if pods have already been recreated on node deletion and are stuck at Pending, they must be recreated to trigger the StatefulSet to create PVCs again. In GKE, this is a little different because the managed instance group recreates the underlying instance but reuses the old node name.
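A minimal sketch of the manual recovery described above, using the Python Kubernetes client; the PVC and pod names are hypothetical placeholders, not from this thread:

```python
# Sketch only: recover a single stuck StatefulSet pod whose PVC points at a
# PV on a node that no longer exists. Names below are example placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

# Delete the stale PVC first, then the Pending pod; the StatefulSet controller
# recreates both, and the new PVC can bind to a PV on a live node.
v1.delete_namespaced_persistent_volume_claim(name="data-myapp-0", namespace="default")
v1.delete_namespaced_pod(name="myapp-0", namespace="default")
```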
What I did is I wrote a janitor which, every 20 seconds, looks for pending pods that have PVCs bound to PVs on dead hosts and removes the PVC if necessary. It then deletes the pending pod to get it scheduled again. My nodes for this service are static and have labels, so I can be sure that the host isn't just rebooting. I know that my service runs on X nodes, so if I see X nodes online and a PV on a node that doesn't exist, I know it's dead and not coming back. If this doesn't happen on GKE, maybe some workaround could be found with custom node tags. You could have an ASG for every node so tags would stay the same even if a node dies.
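The original janitor script is not included here, but a rough sketch of the loop described above could look like the following, assuming the PVs are local volumes carrying the kubernetes.io/hostname node-affinity term that local-volume-provisioner sets:

```python
# janitor.py -- a rough illustration of the cleanup loop described above, not
# the original script. Assumes PVs are local volumes pinned to a node via a
# kubernetes.io/hostname node-affinity term.
import time
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

def pv_host(pv):
    """Return the hostname a local PV is pinned to, or None."""
    affinity = pv.spec.node_affinity
    if not affinity or not affinity.required:
        return None
    for term in affinity.required.node_selector_terms:
        for expr in term.match_expressions or []:
            if expr.key == "kubernetes.io/hostname":
                return expr.values[0]
    return None

while True:
    nodes = {n.metadata.name for n in v1.list_node().items}
    pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending").items
    for pod in pending:
        for vol in pod.spec.volumes or []:
            if not vol.persistent_volume_claim:
                continue
            ns = pod.metadata.namespace
            claim = vol.persistent_volume_claim.claim_name
            pvc = v1.read_namespaced_persistent_volume_claim(claim, ns)
            if not pvc.spec.volume_name:
                continue
            pv = v1.read_persistent_volume(pvc.spec.volume_name)
            host = pv_host(pv)
            if host and host not in nodes:
                # The PV lives on a node that is gone: drop the PVC and the pod
                # so the StatefulSet recreates them and the pod can reschedule.
                v1.delete_namespaced_persistent_volume_claim(claim, ns)
                v1.delete_namespaced_pod(pod.metadata.name, ns)
    time.sleep(20)
```

In a real cluster you would also want the extra safety check described in the comment above (static nodes with labels, or custom node tags) to be sure a node is permanently gone rather than just rebooting.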
Thanks for your answers, @cofyc and @joekohlsdorf. @joekohlsdorf, could you share your janitor with us? I'd be glad if you can.
@NickrenREN may have written a similar controller in the past
Well, I certainly would strongly advise against doing what I did, but here is the unedited janitor I hacked up. Please only use it as a reference; I had to get this done in a time crunch.
Hi, @joekohlsdorf. Thank you for the script.
There is an issue that is similar to this. Actually, we have implemented a NodeFencing feature (external controller and agent) in our own production environment.
@NickrenREN Are you using this implementation: https://github.com/kvaps/kube-fencing ? If yes, what kind of agent do you use to deal with PV/PVC issues? Thank you.
@nerddelphi No, our fencing controller and agent are implemented by ourselves.
The design above is for bare metal; for cloud providers, it may be a little bit different.
I am sorry that I am entering this discussion even though I am not a Kubernetes expert like you. I am running a simple Kubernetes cluster with only a few nodes. I guess this is a completely different environment from the ones you discuss here, but let me describe my scenario to give you a different view on the problem:
I understand all your concerns about the data and what can happen to it if a volume is automatically detached. From my point of view, it is not Kubernetes' job to interfere with my data management. PLEASE give us a switch with which we can turn off this behavior and get terminating pods detached from a volume.
@rsoika Thanks for your input.
@NickrenREN Thanks for your clarification. So there is no self-healing mechanism in Kubernetes for this scenario?
@rsoika For now, yes.
@rsoika Since progress on the "Node Shutdown Taint" feature is slow, we are considering creating a new proposal and project to open-source a "NodeFencing" solution. It could be another option.
@joekohlsdorf Hi. Are your PVs (bound to the deleted PVCs) deleted as well? In my cluster (GKE) they remain in status RELEASED, even after their PVCs are deleted by the janitor and my StorageClass reclaim policy is DELETE. Are you experiencing that behavior? I guess I won't be billed for a non-existent local SSD, but I should still find a way to delete these RELEASED PVs as well. @cofyc @NickrenREN Is that behavior normal/expected? Shouldn't the previous PVs be deleted automatically once their PVCs don't exist anymore? Thanks.
If the nodes these PVs belong to no longer exist, you need to delete the PVs manually, because no local-volume-provisioner can run on those nodes to recycle them.
@nerddelphi For now, the k8s controller will just send Delete events (setting the deletion timestamp), and as @cofyc said, the drivers (or kubelet) on the broken node are down too, so the cleanup work won't happen.
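As an illustration of that manual PV cleanup, a small sketch with the Python Kubernetes client might look like this, again assuming local PVs pinned via a kubernetes.io/hostname node-affinity term:

```python
# Sketch only: delete Released local PVs whose node no longer exists, since no
# provisioner is left on that node to reclaim them.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

nodes = {n.metadata.name for n in v1.list_node().items}
for pv in v1.list_persistent_volume().items:
    affinity = pv.spec.node_affinity
    if pv.status.phase != "Released" or not affinity or not affinity.required:
        continue
    hosts = [expr.values[0]
             for term in affinity.required.node_selector_terms
             for expr in (term.match_expressions or [])
             if expr.key == "kubernetes.io/hostname"]
    if hosts and hosts[0] not in nodes:
        v1.delete_persistent_volume(pv.metadata.name)
```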
Is the NodeFencing feature officially planned, or is it still only under discussion? https://github.com/kvaps/kube-fencing
IIRC, NodeFencing was discussed before but we didn't reach an agreement 😓
Ok, guys. Thank you!
@NickrenREN Can you share the discussion about the NodeFencing feature? I would like to better understand the background.
It was originally discussed here: kubernetes/kubernetes#65392. There are also several KEPs, but they didn't get merged. We didn't reach an agreement, and if needed, I'd like to reopen the discussion.
I can't believe that this is true. I invested so much time to migrate from docker-swarm to Kubernetes. Now I have had to learn that Kubernetes is not the self-healing system it is promoted as everywhere. I think I understand the discussion and the concerns about the pros and cons very well, but I am personally not at a level where I can discuss this in the referred groups. It is absolutely strange: it makes no sense to set up a Ceph cluster and connect it to my Kubernetes cluster because of this limitation. I am running a small environment with about 100 pods on 5 virtual nodes hosted by my cloud provider (Hetzner). My only hope now is that the Longhorn team will solve this issue in their storage solution without the help of the Kubernetes framework. I can't believe that Kubernetes is only focusing on stateless services...
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
/lifecycle frozen |
/cc @oomichi |
Does this seem like something Azure Kubernetes Service solved on their own in their Container Storage Interface (CSI) driver? https://github.com/kubernetes-sigs/azuredisk-csi-driver/tree/main_v2
Online you can find a bunch of examples (even in the official docs) which show how to use the local-volume-provisioner in combination with PersistentVolumeClaims in a StatefulSet.
All works fine until a node goes away and your cloud provider brings up a new one, be it due to an issue on their side or due to you bringing up some new nodes because you are upgrading K8s.
What happens in this case is that the PVC stays bound to a PV which no longer exists. This prevents the pod in the StatefulSet from coming up until you manually delete the PVC. This makes sense, because there is no way of knowing whether the node was shut down for maintenance and will come back later, or whether it's gone forever.
However, I'd just like the node to be assumed dead, because I'm never going to reboot nodes intentionally; I'll just roll the cluster. If the pod can be scheduled on another node, I know 100% that the node was replaced (due to my affinity settings).
Is there any official way of dealing with this, or any config option I'm overlooking?
I can write a job which takes care of this but surely others must have hit this issue?!