Local volume health monitoring #10

Open
msau42 opened this issue Dec 19, 2018 · 12 comments
Labels: kind/feature (Categorizes issue or PR as related to a new feature.) · lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.)

Comments

@msau42 (Contributor) commented Dec 19, 2018

Migrating from kubernetes-retired/external-storage#817

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 19, 2018
@msau42 (Contributor, Author) commented Dec 19, 2018

To summarize, I think we can potentially split this up into these main areas:

  • A common way to report PV health
  • DaemonSet controller that monitors local disks per node
  • Cluster controller that monitors nodes that get deleted
  • Workload controller that reacts to PV health

@yanniszark commented:

Link to the effort of @NickrenREN so far for a local storage monitor:

I agree with the first three but I am unsure about the last one.
Some thoughts:

A common way to report PV health

  • How to expose: annotations can be used on the PV objects to indicate certain conditions, like the mount point disappearing (a minimal sketch of this approach follows the list). The local storage design doc proposes adding taints to PV objects so that Pods can avoid binding to specific PVs. (A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that a volume is unhealthy.)
  • What to include: I can think of two generic conditions at the moment, corresponding to different failure scenarios:
    1. Availability: indicates whether the disk is available for operations. If the mount point disappears, the PV is not available. If the node that the local PV is defined on is deleted, the PV is not available (maybe this should be split into a separate condition?). Each storage provider can also determine whether a PV it provisioned is available and update this condition.
    2. Health: indicates an estimate of the health of the disk based on metrics (e.g. SMART).
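To make the annotation idea concrete, here is a minimal Go sketch, assuming a context-aware client-go (v0.18+). The annotation keys and the PV name are purely illustrative, not an existing convention:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Hypothetical annotation keys; not part of any existing API.
const (
	annAvailable = "storage.example.com/available"
	annHealth    = "storage.example.com/health"
)

// markPVUnavailable records a hypothetical availability condition on a PV
// object as annotations, which is one way a monitor could surface health.
func markPVUnavailable(cs kubernetes.Interface, pvName, reason string) error {
	ctx := context.TODO()
	pv, err := cs.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if pv.Annotations == nil {
		pv.Annotations = map[string]string{}
	}
	pv.Annotations[annAvailable] = "false"
	pv.Annotations[annHealth] = reason
	_, err = cs.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// "local-pv-abc123" is a made-up PV name for the example.
	if err := markPVUnavailable(cs, "local-pv-abc123", "mount point disappeared"); err != nil {
		fmt.Println("failed to annotate PV:", err)
	}
}
```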

A problem here is that in order to take action on an unhealthy PV, you need to manually bind PVs and PVCs. If a PV is reported to be unhealthy and you want to prevent a Pod from using it, there isn't anything you can do other than manually binding the PVC (sketched below). A mechanism like PV taints, as proposed in the local storage document, would be very helpful here. Please correct me if there's something I'm missing.
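For reference, manual pre-binding today means setting spec.volumeName on the claim so it can only bind to a PV you have vetted. A minimal sketch with k8s.io/api types (assuming a pre-1.29 k8s.io/api where PersistentVolumeClaimSpec.Resources is a ResourceRequirements; the names, class, and size are made up):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPreBoundPVC builds a claim that is manually pre-bound to a specific PV by
// setting spec.volumeName, which is the only way today to steer a workload
// away from (or onto) a particular local volume.
func newPreBoundPVC(namespace, name, pvName, storageClass string) *corev1.PersistentVolumeClaim {
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &storageClass,
			// Pre-bind to a known-healthy PV; an unhealthy PV is simply never named here.
			VolumeName: pvName,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("100Gi"),
				},
			},
		},
	}
}

func main() {
	pvc := newPreBoundPVC("default", "data-myapp-0", "local-pv-abc123", "local-storage")
	fmt.Printf("claim %s pre-bound to PV %s\n", pvc.Name, pvc.Spec.VolumeName)
}
```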

DaemonSet controller that monitors local disks per node

At first, the DaemonSet can watch the mount points and collect SMART data, then make the above conditions available through annotations on PV objects. We should be careful to avoid scenarios like node repair on GKE, where a failed node comes back with the same name and disks mounted in the same places, but without any data. When that happens, the PVs will still work but now point to empty volumes. Instead, the PVs should not work, as the underlying disks are essentially different; they are just mounted at the same points. To prevent this type of scenario, the DaemonSet could create symlinks for each discovered directory, include the filesystem UUID in the symlink name, and pass the path to the symlink in the PV object (a rough sketch follows).
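A rough sketch of the UUID-symlink idea, using only the Go standard library and /dev/disk/by-uuid. The device, mount point, and discovery directory are made up, and a real DaemonSet would need the host's /dev and mount tree available inside the container:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkByUUID creates a symlink whose name embeds the filesystem UUID, so that
// a PV keeps pointing at the same filesystem even if the node is rebuilt and a
// fresh, empty disk ends up mounted at the old path.
func linkByUUID(device, mountPoint, discoveryDir string) (string, error) {
	byUUID := "/dev/disk/by-uuid"
	entries, err := os.ReadDir(byUUID)
	if err != nil {
		return "", err
	}
	for _, e := range entries {
		target, err := filepath.EvalSymlinks(filepath.Join(byUUID, e.Name()))
		if err != nil {
			continue
		}
		if target != device {
			continue
		}
		// e.Name() is the filesystem UUID; embed it in the symlink name.
		link := filepath.Join(discoveryDir, e.Name())
		if err := os.Symlink(mountPoint, link); err != nil && !os.IsExist(err) {
			return "", err
		}
		return link, nil
	}
	return "", fmt.Errorf("no filesystem UUID found for %s", device)
}

func main() {
	// Hypothetical example: /dev/sdb1 mounted at /mnt/disks/ssd0; the
	// provisioner would then be pointed at /mnt/disks-by-uuid instead.
	link, err := linkByUUID("/dev/sdb1", "/mnt/disks/ssd0", "/mnt/disks-by-uuid")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("created", link)
}
```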

Cluster controller that monitors nodes that get deleted

We should be careful to check all PVs on startup of the controller. If a Node is deleted before the controller is started, then we need to clean up the PVs belonging to that Node. If a Node is deleted and at the same time this controller crashes, a subsequent list+watch will not return the deleted Node and we won't be notified that something happened (a startup reconciliation sketch follows).
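A sketch of that startup pass with client-go (again assuming a context-aware client-go). It only reports orphaned local PVs; what to actually do with them (annotate, taint, delete) is intentionally left open, as in the discussion above:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// nodeNameFromPV pulls the hostname out of a local PV's required node
// affinity, which is where the static provisioner records the owning node.
func nodeNameFromPV(pv *corev1.PersistentVolume) string {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return ""
	}
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key == "kubernetes.io/hostname" && len(expr.Values) > 0 {
				return expr.Values[0]
			}
		}
	}
	return ""
}

// reconcileOrphanedPVs is the startup pass: list every Node and every local
// PV, and report PVs whose Node no longer exists.
func reconcileOrphanedPVs(ctx context.Context, cs kubernetes.Interface) error {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	existing := map[string]bool{}
	for _, n := range nodes.Items {
		existing[n.Name] = true
	}

	pvs, err := cs.CoreV1().PersistentVolumes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pv := range pvs.Items {
		if pv.Spec.Local == nil {
			continue // only local PVs are of interest here
		}
		if node := nodeNameFromPV(&pv); node != "" && !existing[node] {
			fmt.Printf("PV %s references deleted node %s\n", pv.Name, node)
		}
	}
	return nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	if err := reconcileOrphanedPVs(context.Background(), cs); err != nil {
		panic(err)
	}
}
```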

Workload controller that reacts to PV health

As I mentioned, I don't see any real way of reacting. Deleting the PV doesn't work, because the local provisioner will recreate it. Recreating the PVC and the Pod using it will not work either, because the PVC could still end up bound to that PV. Is there anything we can do here? The local volume doc recommends introducing taints for PVs.

@msau42 (Contributor, Author) commented Jan 7, 2019

cc @gnufied
Detecting disks properly is something that the local storage operator may be addressing

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@cofyc (Member) commented Apr 28, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@NickrenREN (Contributor) commented Apr 29, 2019

Thanks for removing the stale label @cofyc, and thanks for your comment @yanniszark.
I agree with most of yanniszark's points. I am updating the PV monitor design doc; here is the link: https://docs.google.com/document/d/1WG51DZZeXyP50EKyhECYg5m5KEB4F0AhYBhA555_Gs4/edit.

We only focus on the monitoring mechanism in the first stage; reaction is not in the scope of that doc.
Comments are welcome, thanks.

Will submit a new PR for storage monitor later.

@NickrenREN (Contributor) commented:

PR submitted, comments are welcome, thanks: kubernetes/enhancements#1077

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 28, 2019
@NickrenREN (Contributor) commented:

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2019
@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 28, 2019
@msau42 (Contributor, Author) commented Dec 24, 2019

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 24, 2019
jsafrane pushed a commit to jsafrane/sig-storage-local-static-provisioner that referenced this issue Apr 14, 2020
@arianvp (Contributor) commented Apr 12, 2021

Now that the required logic in kubelet seems to have landed in 1.21, are there any concrete plans to add this to the static provisioner? https://kubernetes.io/blog/2021/04/08/kubernetes-1-21-release-announcement/#persistentvolume-health-monitor
