
Rotating cloud instances with PVCs in a StatefulSet #181

Open
joekohlsdorf opened this issue Mar 14, 2020 · 31 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@joekohlsdorf

Online you can find a bunch of examples (even in the official docs) which show how to use the local-volume-provisioner in combination with PersistentVolumeClaims in a StatefulSet.

All works fine until a node goes away and your cloud provider brings up a new one, be it due to an issue on their side or because you bring up new nodes while upgrading K8s.

What happens in this case is that the PVC stays bound to a PV which no longer exists. This prevents the pod in the StatefulSet from coming up until you manually delete the PVC. This makes sense, because there is no way of knowing whether the node was shut down for maintenance and will come back later or whether it's gone forever.

However, I'd just like the node to be assumed dead, because I'm never going to reboot nodes intentionally; I'll just roll the cluster. If the pod can be scheduled on another node I know 100% that the node was replaced (due to my affinity settings).
Is there any official way of dealing with this or any config option I'm overlooking?

I can write a job which takes care of this but surely others must have hit this issue?!

@nerddelphi

@joekohlsdorf I guess we are facing the same issue: #65 (comment)

How do you plan to solve that?

@cofyc
Member

cofyc commented Apr 10, 2020

There is no Kubernetes-official way right now because Kubernetes will not unbind or delete PVCs. It's up to the users to recover from this situation. I have a plan to write a cloud controller to handle this automatically.

When a new node with a different name is created to replace the old node (e.g. in an AWS auto-scaling group), the PVs belonging to the old node are invalid. The PVCs must be deleted so that the scheduler can find feasible PVs on other nodes. By the way, if the pods were already recreated on node deletion and are stuck at Pending, they must be deleted as well so that the StatefulSet recreates them and creates the PVCs again.

In GKE, this is a little different because the managed instance group recreates the underlying instance but uses the old node name.
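
For reference, a minimal sketch of that manual recovery using the official Kubernetes Python client. The namespace, PVC name, and pod name are placeholders, not values from this thread; treat it as an illustration of the steps above, not a drop-in tool.

```python
# Sketch of the recovery described above: delete the stale PVC, then delete
# the stuck Pending pod so the StatefulSet controller recreates both and the
# scheduler can bind to a healthy node. Names and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
core = client.CoreV1Api()

namespace = "default"       # placeholder
pvc_name = "data-mysts-0"   # placeholder: <volumeClaimTemplate>-<pod name>
pod_name = "mysts-0"        # placeholder

# 1. Delete the PVC that is still bound to the PV of the replaced node.
core.delete_namespaced_persistent_volume_claim(pvc_name, namespace)

# 2. Delete the Pending pod; the StatefulSet recreates it with a new PVC.
core.delete_namespaced_pod(pod_name, namespace)
```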

@joekohlsdorf
Author

joekohlsdorf commented Apr 10, 2020

What I did was write a janitor which every 20 seconds looks for pending pods whose PVCs are bound to PVs on dead hosts and removes the PVC if necessary. It then deletes the pending pod to get it scheduled again.

My nodes for this service are static and have labels; this way I can be sure that the host isn't just rebooting. I know that my service runs on X nodes, so if I see X nodes online and a PV on a node that doesn't exist, I know it's dead and not coming back.

If this doesn't happen on GKE maybe some workaround could be found with custom node tags. You could have an ASG for every node so tags would stay the same even if a node dies.
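
To illustrate the detection side of such a janitor (this is not the author's script, which is linked in a later comment, but a hedged sketch assuming local PVs carry the usual kubernetes.io/hostname node affinity written by local-volume-provisioner; all other names are hypothetical):

```python
# Sketch: find local PVs pinned to nodes that no longer exist.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

node_names = {n.metadata.name for n in core.list_node().items}

def pv_hostname(pv):
    """Return the kubernetes.io/hostname the PV is pinned to, or None."""
    affinity = pv.spec.node_affinity
    if not (affinity and affinity.required):
        return None
    for term in affinity.required.node_selector_terms:
        for expr in term.match_expressions or []:
            if expr.key == "kubernetes.io/hostname":
                return expr.values[0]
    return None

for pv in core.list_persistent_volume().items:
    host = pv_hostname(pv)
    if host and host not in node_names:
        print(f"PV {pv.metadata.name} is pinned to missing node {host}")
        # ...then delete the bound PVC and the Pending pod as in the sketch above.
```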

@nerddelphi

Thanks for your answers, @cofyc and @joekohlsdorf.

@joekohlsdorf could you share your janitor with us? I'd be glad if you can.

@msau42
Contributor

msau42 commented Apr 10, 2020

@NickrenREN may have written a similar controller in the past

@joekohlsdorf
Author

Well, I would certainly advise against doing what I did, but here is the unedited janitor I hacked up. Please only use it as a reference; I had to get this done in a time crunch.
https://gist.github.com/joekohlsdorf/2658f03b1e1b6194ebe6b61bd8770647

@nerddelphi

Hi, @joekohlsdorf. Thank you for the script.

@NickrenREN
Contributor

NickrenREN commented Apr 13, 2020

There is an issue that is similar to this.
Some colleagues and I proposed introducing NodeFencing to solve this, because it suits both cloud providers and bare metal and the reaction is relatively simple.
But others decided to take the NodeShutdown taint approach; there is an ongoing proposal: kubernetes/enhancements#1116.

Actually, we have implemented the NodeFencing feature (external controller and agent) in our own production environment.

@nerddelphi

nerddelphi commented Apr 13, 2020

@NickrenREN Are you using this implementation: https://github.com/kvaps/kube-fencing ? If yes, what kind of agent do you use to deal with the PV/PVC issues?
My clusters are on GKE.

Thank you.

@NickrenREN
Contributor

@nerddelphi No, we implemented our fencing controller and agent ourselves.
The agent is designed to shut down machines forcefully; the control logic, race conditions, and cleanup work are handled by the controller.

@NickrenREN
Contributor

The design above is for bare metal; for cloud providers it may be a little different.

@rsoika

rsoika commented Apr 18, 2020

I am sorry to enter this discussion even though I am not a Kubernetes expert like you.
But I have been dealing with this problem for some weeks and have also followed this long-running discussion.

I am running a simple Kubernetes cluster with only a few nodes. I guess this is a completely different environment from the ones you discuss here, but let me describe my scenario to give you a different view on the problem:

  • I have set up distributed storage based on Ceph or Longhorn (same behavior for both).
  • I deploy a PostgreSQL database using a persistent volume claim.
  • I kill (for testing) the node on which the database pod is running.
  • Now I run into the problem that Kubernetes gets stuck while restarting the database pod on a new node, because the broken pod does not get detached from the volume.
  • I have to manually delete the VolumeAttachment to get rid of this situation (see the sketch after this comment).

I understand all your concerns about the data and what can happen to it if a volume is automatically detached.
But I, as the administrator of my cluster, trust my Longhorn or Ceph cluster. And of course, something can always go wrong, but it's my job to secure my data.

From my point of view, it is not Kubernetes' job to interfere in my data management. PLEASE give us a switch with which we can turn off this behavior and get terminating pods detached from their volumes.
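
As a sketch of the manual VolumeAttachment cleanup mentioned in the list above (the dead node name is a placeholder; this is an illustration, not a recommended automation):

```python
# Sketch: remove VolumeAttachments still pointing at a node that was killed,
# which is the manual step described above. The node name is a placeholder.
from kubernetes import client, config

config.load_kube_config()
storage = client.StorageV1Api()

dead_node = "worker-3"  # placeholder for the killed node

for va in storage.list_volume_attachment().items:
    if va.spec.node_name == dead_node:
        # Deleting the stale attachment lets the volume be attached elsewhere.
        storage.delete_volume_attachment(va.metadata.name)
        print(f"deleted VolumeAttachment {va.metadata.name}")
```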

@NickrenREN
Contributor

NickrenREN commented Apr 20, 2020

@rsoika Thanks for your input.
IIUIC, your scenario is a case NodeFencing can solve. If the node is dead (or Unknown), it will be forcibly shut down and we do not expect it to come back. As you described, data management isn't Kubernetes' job, so the reaction is easy: go ahead and detach the volume forcefully.
And of course, if you want to bring your node back, you need to do the cleanup work first (this is also the job of the relevant Kubernetes team).

@rsoika

rsoika commented Apr 20, 2020

@NickrenREN Thanks for your clarification. So there is no self-healing mechanism in Kubernetes for this scenario?

@NickrenREN
Contributor

@rsoika For now, yes

@NickrenREN
Contributor

@rsoika Since progress on the "Node Shutdown Taint" feature is slow, we are considering creating a new proposal and projects to open-source the "NodeFencing" solution. It can be another option.

@nerddelphi

@joekohlsdorf Hi.

Are your PVs (bound to the deleted PVCs) deleted as well? In my cluster (GKE) they remain in status RELEASED, even after their PVCs are deleted by a janitor and my StorageClass reclaim policy is DELETE.
[screenshot: PVs stuck in Released status]

Are you experiencing that behavior?

I guess I won't be billed for a non-existent local SSD, but I should find a way to delete these RELEASED PVs as well.

@cofyc @NickrenREN Is that behavior normal/expected? Shouldn't the previous PVs be deleted automatically once their PVCs don't exist anymore?

Thanks.

@cofyc
Member

cofyc commented Apr 21, 2020

If the nodes these PVs belong to no longer exist, you need to delete the PVs manually, because no local-volume-provisioner can run on those nodes to recycle them.
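
A sketch of that manual cleanup, assuming the leftover PVs are in the Released phase and pinned via the usual kubernetes.io/hostname node affinity set by local-volume-provisioner (nothing here comes from this repository; verify the list before deleting anything):

```python
# Sketch: delete Released local PVs whose node no longer exists.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

node_names = {n.metadata.name for n in core.list_node().items}

def pv_hostname(pv):
    """Return the kubernetes.io/hostname the PV is pinned to, or None."""
    affinity = pv.spec.node_affinity
    if not (affinity and affinity.required):
        return None
    for term in affinity.required.node_selector_terms:
        for expr in term.match_expressions or []:
            if expr.key == "kubernetes.io/hostname":
                return expr.values[0]
    return None

for pv in core.list_persistent_volume().items:
    host = pv_hostname(pv)
    if pv.status.phase == "Released" and host and host not in node_names:
        core.delete_persistent_volume(pv.metadata.name)
        print(f"deleted Released PV {pv.metadata.name}; node {host} is gone")
```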

@NickrenREN
Contributor

NickrenREN commented Apr 21, 2020

@nerddelphi For now, the k8s controller will just send delete events (setting the deletion timestamp), and as @cofyc said, the drivers (or kubelet) on the broken node are down too, so they won't do the cleanup work.
But with the NodeFencing feature, these PVs can be released automatically (forcefully).

@rsoika

rsoika commented Apr 21, 2020

Is the NodeFencing feature officially planned, or is it still only under discussion?
I found these projects that seem to address the problem:

https://github.com/kvaps/kube-fencing
https://github.com/rootfs/node-fencing

@NickrenREN
Contributor

IIRC, NodeFencing was discussed before but we didn't reach an agreement 😓

@nerddelphi

Ok, guys. Thank you!


@rsoika

rsoika commented Apr 21, 2020

@NickrenREN can you share the discussion about the NodeFencing feature? I would like to better understand the background.

@NickrenREN
Contributor

It was originally discussed here: kubernetes/kubernetes#65392
We also discussed it several times offline on Slack.

There are also several KEPs, but they didn't get merged:
kubernetes/community#2763
kubernetes/community#1416

We didn't reach an agreement, and if needed, I'd like to reopen the discussion.

@rsoika

rsoika commented Apr 22, 2020

I can't believe that this is true. I invested so much time migrating from Docker Swarm to Kubernetes. Now I have to learn that Kubernetes is not the self-healing system it is promoted as everywhere. I think I understand the discussion and the concerns about the pros and cons very well, but I am personally not at a level where I can discuss this in the referenced groups.

It is absolutely strange: it makes no sense to set up a Ceph cluster and connect it to my Kubernetes cluster because of this limitation. I am running a small environment with about 100 pods on 5 virtual nodes hosted by my cloud provider (Hetzner).
I can be sure that if my cloud provider has a problem in one of its data centers (which are spread across different locations in Germany), my applications running on that node will get stuck in a terminating state. My customers will call me because they can no longer work. I will have to figure out all the affected VolumeAttachments and delete them manually. This is of course no solution. We are a small company with no 24x7 admin team.

My only hope now is that the Longhorn team will solve this issue in their storage solution without help from the Kubernetes framework.

I can't believe that Kubernetes is only focusing on stateless services...
I am not only talking about databases like Postgres but also about services like Apache Solr for full-text search indexes or the spaCy project for ML services. All these services need a data volume in the end. If you see a way to re-energize this discussion, I would like to support you.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2020
@cofyc
Member

cofyc commented Jul 21, 2020

/remove-lifecycle stale
/lifecycle fronzen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2020
@cofyc
Member

cofyc commented Jul 21, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 21, 2020
@oomichi

oomichi commented Jun 7, 2021

/cc @oomichi

@eduardobr

eduardobr commented Jun 29, 2022

Does this look like a solution Azure Kubernetes Service implemented in their own Container Storage Interface (CSI) driver?
https://azure.microsoft.com/da-dk/updates/public-preview-azure-disk-csi-driver-v2-in-aks/

https://github.com/kubernetes-sigs/azuredisk-csi-driver/tree/main_v2
