## Release Signoff Checklist
Items marked with (R) are required *prior to targeting to a milestone / release*.
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [X] (R) KEP approvers have approved the KEP status as `implementable`
- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [X] (R) Production readiness review completed
- [X] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary
This proposal applies to the use of quotas for ephemeral-storage

future allow adding additional data without having to change code
other than that which uses the new information.

### Test Plan
#### Testing Strategy

The quota code is by and large not very amenable to unit tests. While
manager, particularly under stress). It also requires setup in the
form of a prepared filesystem. It would be better served by
appropriate end-to-end tests.

- [x] I/we understand the owners of the involved components may require updates to
  existing tests to make this code solid enough prior to committing the changes necessary
  to implement this enhancement.

##### Prerequisite testing updates
<!--
Based on reviewers' feedback, describe what additional tests need to be added prior to
implementing this enhancement to ensure the enhancement has solid foundations.
-->

##### Unit tests
The main unit tests are in the package `pkg/volume/util/fsquota/`.
- `pkg/volume/util/fsquota/`: `2022-06-20` - `73%`
  - `project.go`: 75.7%
  - `quota.go`: 100%
  - `quota_linux.go`: 70.6%

See details in <https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit&include-filter-by-regex=fsquota>.
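
The coverage figures can be reproduced locally; a minimal sketch, assuming a
`kubernetes/kubernetes` checkout and a working Go toolchain:

```bash
# From the root of a kubernetes/kubernetes checkout, run the fsquota
# unit tests with coverage reporting enabled.
go test -cover ./pkg/volume/util/fsquota/...
```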
##### Integration tests
N/A
##### e2e tests
The e2e test (`LocalStorageCapacityIsolationQuotaMonitoring [Slow] [Serial] [Disruptive] [Feature:LocalStorageCapacityIsolationQuota][NodeFeature:LSCIQuotaMonitoring]`) can be found in [`test/e2e_node/quota_lsci_test.go`](https://github.com/kubernetes/kubernetes/blob/8cd689e40d253e520b1698d4bcf33992f0ae1d20/test/e2e_node/quota_lsci_test.go#L93-L103).
Because the e2e tests are slow and serial, we will not promote them to
conformance tests. There is no history of failures or flakes at
<https://storage.googleapis.com/k8s-triage/index.html?test=LocalStorageCapacityIsolationQuotaMonitoring>.

### Risks and Mitigations
* The SIG raised the possibility of a container being unable to exit
The following criteria apply to
- Unit test coverage
- Node e2e test

### Phase 2: Beta (target 1.25)

- User feedback
- Benchmarks to determine latency and overhead of using quotas

files. The operations performed were as follows, in sequence:

* *Create Files*: Create 4K directories each containing 2K files as
  described, in depth-first order.

* *du*: run `du` immediately after creating the files.

* *quota*: where applicable, run `xfs_quota` immediately after `du`.

* *du (after remount)*: run `mount -o remount <filesystem>`
  immediately followed by `du`.

* *quota (after remount)*: run `mount -o remount <filesystem>`
  immediately followed by `xfs_quota`.

* *unmount*: `umount` the filesystem.

* *mount*: `mount` the filesystem.

* *du after umount/mount*: run `du` after unmounting and
  mounting the filesystem.

* *Remove Files*: remove the test files.

The test was performed on four separate filesystems:

and are not reported here.

| du after umount/mount | 66.0 | 82.4 | 29.2 | 28.1 |
| Remove Files | 188.6 | 156.6 | 90.4 | 81.8 |

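The operation sequence above could be reproduced with a sketch along these
lines; `/mnt/test` and `/dev/sdc` are placeholder names, assuming an XFS
filesystem mounted with project quotas:

```bash
# Sketch of the benchmark sequence; paths and device are placeholders.
cd /mnt/test
for d in $(seq 1 4096); do            # 4K directories...
  mkdir "dir$d"
  touch "dir$d"/file{1..2048}         # ...each containing 2K files
done
time du -s /mnt/test                        # du: right after creating files
time xfs_quota -x -c 'report -h' /dev/sdc   # quota: right after du
mount -o remount /mnt/test
time du -s /mnt/test                        # du (after remount)
time xfs_quota -x -c 'report -h' /dev/sdc   # quota (after remount)
```
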
## Production Readiness Review Questionnaire
### Feature Enablement and Rollback
###### How can this feature be enabled / disabled in a live cluster?
- [x] Feature gate (also fill in values in `kep.yaml`)
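
As an illustration, a minimal sketch of enabling the gate on a node; the gate
name `LocalStorageCapacityIsolationFSQuotaMonitoring` is the one this KEP
tracks, and the flag form is an assumption — adjust for how your kubelet is
configured:

```bash
# Sketch: enable the feature gate on the kubelet command line, then
# restart the kubelet for the change to take effect.
kubelet --feature-gates=LocalStorageCapacityIsolationFSQuotaMonitoring=true
```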
* **Are there any missing metrics that would be useful to have to improve observability of this feature?**

- Yes. There are no per-volume histogram metrics; the metric above is grouped by
  volume type because collecting it for every volume would be too expensive. As a
  result, users cannot tell directly from the metrics whether a workload uses the
  feature. A cluster admin can check the kubelet configuration on each node; if
  the feature gate is disabled, workloads on that node will not use it. For
  example, run `xfs_quota -x -c 'report -h' /dev/sdc` to check the quota settings
  on the device, then compare against
  `spec.containers[].resources.limits.ephemeral-storage` of each container, as
  shown below.

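A minimal sketch of that check; the device `/dev/sdc` and the pod name
`mypod` are placeholders:

```bash
# On the node: report project quota usage for the backing device.
xfs_quota -x -c 'report -h' /dev/sdc
# From the API: read each container's configured ephemeral-storage limit.
kubectl get pod mypod \
  -o jsonpath='{.spec.containers[*].resources.limits.ephemeral-storage}'
```
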
### Dependencies
* **Does this feature depend on any specific services running in the cluster?**

- Yes, the feature depends on project quotas. Once quotas are enabled, the
  `xfs_quota` tool can be used to set limits and report on disk usage.

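As an illustration, project quotas are typically enabled at mount time; a
sketch for an XFS-backed kubelet directory, where the device and mount point
are placeholders:

```bash
# Mount with project quotas enabled, then verify that quotas are active.
mount -o prjquota /dev/sdc /var/lib/kubelet
xfs_quota -x -c 'state' /dev/sdc
```
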
### Scalability

* **Will enabling / using this feature result in any new API calls?**

- No.

* **Will enabling / using this feature result in introducing new API types?**

- No.

* **Will enabling / using this feature result in any new calls to the cloud
  provider?**

- No.

* **Will enabling / using this feature result in increasing size or count of
  the existing API objects?**

- No.

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs]?**

- No.

* **Will enabling / using this feature result in non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**

- No; on the contrary, it will use less CPU time and IO during ephemeral-storage
  monitoring. `kubelet` now allows use of XFS quotas (on XFS and suitably
  configured ext4fs filesystems) to monitor storage consumption for ephemeral
  storage (currently for emptydir volumes only). This method of monitoring
  consumption is faster and more accurate than the old method of walking the
  filesystem tree. It does not enforce limits, only monitors consumption.

### Troubleshooting
<!--
This section must be completed when targeting beta to a release.

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->

###### How does this feature react if the API server and/or etcd is unavailable?
###### What are other known failure modes?

1. If the ephemeral-storage limit is reached, the pod will be evicted by the
   kubelet. (See the sketch after this list.)

2. When the node is not configured correctly (an unsupported filesystem, or
   quotas not enabled), quota monitoring is skipped.

3. For an "out of space" failure, kubelet eviction should be triggered.

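To illustrate the first failure mode, a minimal pod with an ephemeral-storage
limit; all names here are hypothetical. If the container writes more than the
limit to its writable layer or emptyDir volumes, the kubelet evicts the pod:

```bash
# Hypothetical pod: exceeding the 1Gi ephemeral-storage limit triggers eviction.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: quota-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      limits:
        ephemeral-storage: "1Gi"
EOF
```
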
###### What steps should be taken if SLOs are not being met to determine the problem?

If the metrics show a problem, check the logs and the quota directory with the
commands below.
- There will be warning logs ([once this PR is merged](https://github.com/kubernetes/kubernetes/pull/107490))
  if a volume calculation takes longer than 1 second.
- If quota is enabled, you can find the volume information and the processing
  time with `time repquota -P /var/lib/kubelet -s -v`.

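For example, a sketch assuming the default kubelet root directory and a
systemd-managed kubelet; adjust paths and the log query to your setup:

```bash
# Report per-project (per-volume) usage and time how long collection takes.
time repquota -P /var/lib/kubelet -s -v
# Scan kubelet logs for quota-related warnings (systemd-based nodes).
journalctl -u kubelet | grep -i quota
```
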
## Implementation History
### Version 1.15

- `LocalStorageCapacityIsolationFSQuotaMonitoring` implemented at Alpha

### Version 1.24
- The `kubelet_volume_metric_collection_duration_seconds` metric was added
- A bug where quotas did not work after the kubelet restarted was fixed

### Version 1.25
- Plan to promote `LocalStorageCapacityIsolationFSQuotaMonitoring` to Beta