Commit 34e93ca

Merge pull request #5200 from tallclair/ippr
KEP-1287: cleanup errors & add changes from implementation
2 parents 8e02558 + 55d40b3 commit 34e93ca

File tree

1 file changed: +53 −45 lines
  • keps/sig-node/1287-in-place-update-pod-resources

keps/sig-node/1287-in-place-update-pod-resources/README.md

Lines changed: 53 additions & 45 deletions
@@ -32,6 +32,7 @@
   - [Atomic Resizes](#atomic-resizes)
   - [Actuating Resizes](#actuating-resizes)
   - [Memory Limit Decreases](#memory-limit-decreases)
+  - [Swap](#swap)
   - [Sidecars](#sidecars)
   - [QOS Class](#qos-class)
   - [Resource Quota](#resource-quota)
@@ -298,7 +299,7 @@ The `ResizePolicy` field is immutable.
 
 #### Resize Status
 
-Resize status will be tracked via 2 new pod conditions: `PodResizePending` and `PodResizing`.
+Resize status will be tracked via 2 new pod conditions: `PodResizePending` and `PodResizeInProgress`.
 
 **PodResizePending** will track states where the spec has been resized, but the Kubelet has not yet
 allocated the resources. There are two reasons associated with this condition:
@@ -313,8 +314,8 @@ admitted. `lastTransitionTime` will be populated with the time the condition was
 will always be `True` when the condition is present - if there is no longer a pending resize
 (either the resize was allocated or reverted), the condition will be removed.
 
-**PodResizing** will track in-progress resizes, and should be present whenever allocated resources
-!= acknowledged resources (see [Resource States](#resource-states)). For successful synchronous
+**PodResizeInProgress** will track in-progress resizes, and should be present whenever allocated resources
+!= actuated resources (see [Resource States](#resource-states)). For successful synchronous
 resizes, this condition should be short-lived, and `reason` and `message` will be left blank. If an
 error occurs while actuating the resize, the `reason` will be set to `Error`, and `message` will be
 populated with the error message. In the future, this condition will also be used for long-running
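The condition semantics in this hunk can be sketched as follows — a minimal illustration where resources are plain dicts rather than Kubernetes types, and the function and reason names are simplifications of the behavior described above:

```python
# Sketch of the resize-condition logic: PodResizePending while desired !=
# allocated, PodResizeInProgress while allocated != actuated. Simplified
# dict-based resources; not the Kubelet's actual types.

def resize_conditions(desired, allocated, actuated, pending_reason="Deferred"):
    """Return the pod resize conditions implied by the three resource states."""
    conditions = []
    if desired != allocated:
        # Spec was resized but the Kubelet has not allocated it yet:
        # reason is "Deferred" (may fit later) or "Infeasible" (never fits).
        conditions.append({"type": "PodResizePending",
                           "status": "True",
                           "reason": pending_reason})
    if allocated != actuated:
        # Allocated but not yet actuated through the runtime.
        conditions.append({"type": "PodResizeInProgress", "status": "True"})
    return conditions
```

Both conditions can be present at once (a new resize arrives while a previous one is still being actuated), which matches the walkthrough later in this diff.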
@@ -364,11 +365,6 @@ message UpdatePodSandboxResourcesRequest {
   LinuxContainerResources overhead = 2;
   // Optional resources represents the sum of container resources for this sandbox
   LinuxContainerResources resources = 3;
-
-  // Unstructured key-value map holding arbitrary additional information for
-  // sandbox resources updating. This can be used for specifying experimental
-  // resources to update or other options to use when updating the sandbox.
-  map<string, string> annotations = 4;
 }
 
 message UpdatePodSandboxResourcesResponse {}
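Per the comment on the `resources` field above, the sandbox-level value carries the sum of the per-container resources. A rough sketch of that aggregation (hypothetical helper with simplified dict-based resources, not the CRI's protobuf types):

```python
# Illustrative aggregation for the sandbox-level `resources` field: the sum
# of each container's resource configuration. Field names mimic
# LinuxContainerResources but are plain dict keys here.

def sum_sandbox_resources(containers):
    """Aggregate per-container resources into the sandbox-level total."""
    total = {"cpu_quota": 0, "memory_limit_in_bytes": 0}
    for c in containers:
        for key in total:
            total[key] += c.get(key, 0)
    return total
```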
@@ -419,7 +415,7 @@ The Kubelet now tracks 4 sets of resources for each pod/container:
    - Reported in the API through the `.status.containerStatuses[i].allocatedResources` field
      (allocated requests only)
    - Persisted locally on the node (requests + limits) in a checkpoint file
-3. Acknowledged resources
+3. Actuated resources
    - The resource configuration that the Kubelet passed to the runtime to actuate
    - Not reported in the API
    - Persisted locally on the node in a checkpoint file
@@ -428,11 +424,12 @@ The Kubelet now tracks 4 sets of resources for each pod/container:
    - The actual resource configuration the containers are running with, reported by the runtime,
      typically read directly from the cgroup configuration
    - Reported in the API via the `.status.containerStatuses[i].resources` field
+     - _Note: for non-running containers `.status.containerStatuses[i].resources` will be the Allocated resources._
 
 Changes are always propagated through these 4 resource states in order:
 
 ```
-Desired --> Allocated --> Acknowledged --> Actual
+Desired --> Allocated --> Actuated --> Actual
 ```
 
 
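The ordering invariant in the diagram above — changes flow Desired → Allocated → Actuated → Actual, one state at a time — can be sketched like this (illustrative names, not the Kubelet's actual data structures):

```python
# Sketch of the four resource states and their strict propagation order.
# Each sync advances a value one state forward; it never skips a state.

ORDER = ["desired", "allocated", "actuated", "actual"]

def advance(states):
    """Propagate one step: update the first state that lags its predecessor,
    and return the name of the state that was updated (None if converged)."""
    for prev, nxt in zip(ORDER, ORDER[1:]):
        if states[nxt] != states[prev]:
            states[nxt] = states[prev]
            return nxt
    return None
```

Repeatedly calling `advance` converges every state to the desired value, mirroring how a resize ripples through allocation, actuation, and finally the observed status.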
@@ -512,7 +509,7 @@ This is intentionally hitting various edge-cases for demonstration.
 1. kubelet runs the pod and updates the API
    - `spec.containers[0].resources.requests[cpu]` = 1
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1
-   - `acknowledged[cpu]` = 1
+   - `actuated[cpu]` = 1
    - `status.containerStatuses[0].resources.requests[cpu]` = 1
    - actual CPU shares = 1024
 
@@ -521,100 +518,100 @@ This is intentionally hitting various edge-cases for demonstration.
      `requests`, ResourceQuota not exceeded, etc) and accepts the operation
    - `spec.containers[0].resources.requests[cpu]` = 1.5
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1
-   - `acknowledged[cpu]` = 1
+   - `actuated[cpu]` = 1
    - `status.containerStatuses[0].resources.requests[cpu]` = 1
    - actual CPU shares = 1024
 
 1. Kubelet Restarts!
-   - The allocated & acknowledged resources are read back from checkpoint
+   - The allocated & actuated resources are read back from checkpoint
    - Pods are resynced from the API server, but admitted based on the allocated resources
    - `spec.containers[0].resources.requests[cpu]` = 1.5
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1
-   - `acknowledged[cpu]` = 1
+   - `actuated[cpu]` = 1
    - `status.containerStatuses[0].resources.requests[cpu]` = 1
    - actual CPU shares = 1024
 
 1. Kubelet syncs the pod, sees resize #1 and admits it
    - `spec.containers[0].resources.requests[cpu]` = 1.5
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
-   - `acknowledged[cpu]` = 1
+   - `actuated[cpu]` = 1
    - `status.containerStatuses[0].resources.requests[cpu]` = 1
-   - `status.conditions[type==PodResizing]` added
+   - `status.conditions[type==PodResizeInProgress]` added
    - actual CPU shares = 1024
 
 1. Resize #2: cpu = 2
    - apiserver validates the request and accepts the operation
    - `spec.containers[0].resources.requests[cpu]` = 2
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
    - `status.containerStatuses[0].resources.requests[cpu]` = 1
-   - `status.conditions[type==PodResizing]`
+   - `status.conditions[type==PodResizeInProgress]`
    - actual CPU shares = 1024
 
 1. Container runtime applied cpu=1.5
    - `spec.containers[0].resources.requests[cpu]` = 2
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
-   - `acknowledged[cpu]` = 1.5
+   - `actuated[cpu]` = 1.5
    - `status.containerStatuses[0].resources.requests[cpu]` = 1
-   - `status.conditions[type==PodResizing]`
+   - `status.conditions[type==PodResizeInProgress]`
    - actual CPU shares = 1536
 
 1. kubelet syncs the pod, and sees resize #2 (cpu = 2)
    - kubelet decides this is feasible, but currently insufficient available resources
    - `spec.containers[0].resources.requests[cpu]` = 2
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
-   - `acknowledged[cpu]` = 1.5
+   - `actuated[cpu]` = 1.5
    - `status.containerStatuses[0].resources.requests[cpu]` = 1.5
    - `status.conditions[type==PodResizePending].type` = `"Deferred"`
-   - `status.conditions[type==PodResizing]` removed
+   - `status.conditions[type==PodResizeInProgress]` removed
    - actual CPU shares = 1536
 
 1. Resize #3: cpu = 1.6
    - apiserver validates the request and accepts the operation
    - `spec.containers[0].resources.requests[cpu]` = 1.6
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
-   - `acknowledged[cpu]` = 1.5
+   - `actuated[cpu]` = 1.5
    - `status.containerStatuses[0].resources.requests[cpu]` = 1.5
    - `status.conditions[type==PodResizePending].type` = `"Deferred"`
    - actual CPU shares = 1536
 
 1. Kubelet syncs the pod, and sees resize #3 and admits it
    - `spec.containers[0].resources.requests[cpu]` = 1.6
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
-   - `acknowledged[cpu]` = 1.5
+   - `actuated[cpu]` = 1.5
    - `status.containerStatuses[0].resources.requests[cpu]` = 1.5
    - `status.conditions[type==PodResizePending]` removed
-   - `status.conditions[type==PodResizing]` added
+   - `status.conditions[type==PodResizeInProgress]` added
    - actual CPU shares = 1536
 
 1. Container runtime applied cpu=1.6
    - `spec.containers[0].resources.requests[cpu]` = 1.6
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
-   - `acknowledged[cpu]` = 1.6
+   - `actuated[cpu]` = 1.6
    - `status.containerStatuses[0].resources.requests[cpu]` = 1.5
-   - `status.conditions[type==PodResizing]`
+   - `status.conditions[type==PodResizeInProgress]`
    - actual CPU shares = 1638
 
 1. Kubelet syncs the pod
    - `spec.containers[0].resources.requests[cpu]` = 1.6
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
-   - `acknowledged[cpu]` = 1.6
+   - `actuated[cpu]` = 1.6
    - `status.containerStatuses[0].resources.requests[cpu]` = 1.6
-   - `status.conditions[type==PodResizing]` removed
+   - `status.conditions[type==PodResizeInProgress]` removed
    - actual CPU shares = 1638
 
 1. Resize #4: cpu = 100
    - apiserver validates the request and accepts the operation
    - `spec.containers[0].resources.requests[cpu]` = 100
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
-   - `acknowledged[cpu]` = 1.6
+   - `actuated[cpu]` = 1.6
    - `status.containerStatuses[0].resources.requests[cpu]` = 1.6
    - actual CPU shares = 1638
 
 1. Kubelet syncs the pod, and sees resize #4
    - this node does not have 100 CPUs, so kubelet cannot admit it
    - `spec.containers[0].resources.requests[cpu]` = 100
    - `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
-   - `acknowledged[cpu]` = 1.6
+   - `actuated[cpu]` = 1.6
    - `status.containerStatuses[0].resources.requests[cpu]` = 1.6
    - `status.conditions[type==PodResizePending].type` = `"Infeasible"`
    - actual CPU shares = 1638
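The "actual CPU shares" values in the walkthrough follow the usual cgroup v1 conversion of a CPU request to `cpu.shares` — request × 1024, truncated to an integer (the real kubelet helper also clamps to a minimum of 2, which doesn't matter at these values):

```python
# cgroup v1 conversion used by the walkthrough: 1 CPU request -> 1024 shares.

def cpu_shares(cpu_request):
    """Convert a CPU request (in cores) to cgroup v1 cpu.shares."""
    return max(2, int(cpu_request * 1024))
```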
@@ -707,7 +704,7 @@ Impacts of a restart outside of resource configuration are out of scope.
    - Restart before checkpointing: pod goes through admission again as if new
    - Restart after checkpointing: pod goes through admission using the allocated resources
 1. Kubelet creates a container
-   - Resources acknowledged after CreateContainer call succeeds
+   - Resources actuated after CreateContainer call succeeds
    - Restart before acknowledgement: Kubelet issues a superfluous UpdatePodResources request
    - Restart after acknowledgement: No resize needed
 1. Container starts, triggering a pod sync event
@@ -721,19 +718,19 @@ Impacts of a restart outside of resource configuration are out of scope.
 1. Updated pod is synced: Check if pod can be admitted
    - No: add `PodResizePending` condition with type `Deferred`, no change to allocated resources
      - Restart: redo admission check, still deferred.
-   - Yes: add `PodResizing` condition, update allocated checkpoint
+   - Yes: add `PodResizeInProgress` condition, update allocated checkpoint
      - Restart before update: readmit, then update allocated
-     - Restart after update: allocated != acknowledged --> proceed with resize
-1. Allocated != Acknowledged
-   - Trigger an `UpdateContainerResources` CRI call, then update Acknowledged resources on success
-   - Restart before CRI call: allocated != acknowledged, will still trigger the update call
-   - Restart after CRI call, before acknowledged update: will redo update call
-   - Restart after acknowledged update: allocated == acknowledged, condition removed
-   - In all restart cases, `LastTransitionTime` is propagated from the old pod status `PodResizing`
+     - Restart after update: allocated != actuated --> proceed with resize
+1. Allocated != Actuated
+   - Trigger an `UpdateContainerResources` CRI call, then update Actuated resources on success
+   - Restart before CRI call: allocated != actuated, will still trigger the update call
+   - Restart after CRI call, before actuated update: will redo update call
+   - Restart after actuated update: allocated == actuated, condition removed
+   - In all restart cases, `LastTransitionTime` is propagated from the old pod status `PodResizeInProgress`
     condition, and remains unchanged.
 1. PLEG updates PodStatus cache, triggers pod sync
-   - Pod status updated with actual resources, `PodResizing` condition removed
-   - Desired == Allocated == Acknowledged, no resize changes needed.
+   - Pod status updated with actual resources, `PodResizeInProgress` condition removed
+   - Desired == Allocated == Actuated, no resize changes needed.
 
 #### Notes
 
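The allocated-vs-actuated comparison driving these restart cases can be sketched as a level-triggered sync. This is a hedged simplification — `sync_pod` and the injected CRI callback are hypothetical names — but it shows why a replayed sync after a restart is harmless as long as `UpdateContainerResources` is idempotent:

```python
# Level-triggered resize sync: on every pod sync, compare allocated vs.
# actuated and (re)issue the CRI update when they differ. A Kubelet restart
# just replays this comparison from the checkpoint.

def sync_pod(allocated, actuated, update_container_resources):
    """Run one sync; returns the actuated resources after the sync."""
    if allocated != actuated:
        update_container_resources(allocated)  # CRI call; must be idempotent
        return dict(allocated)                 # checkpoint actuated on success
    return actuated                            # converged: nothing to do
```

A restart between the CRI call and the actuated-checkpoint write simply causes the next sync to repeat the same (idempotent) call.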
@@ -793,10 +790,10 @@ a pod or container. Examples include:
 Therefore the Kubelet cannot reliably compare desired & actual resources to know whether to trigger
 a resize (a level-triggered approach).
 
-To accommodate this, the Kubelet stores the set of "acknowledged" resources per container.
-Acknowledged resources represent the resource configuration that was passed to the runtime (either
+To accommodate this, the Kubelet stores the set of "actuated" resources per container.
+Actuated resources represent the resource configuration that was passed to the runtime (either
 via a CreateContainer or UpdateContainerResources call) and received a successful response. The
-acknowledged resources are checkpointed alongside the allocated resources to persist across
+actuated resources are checkpointed alongside the allocated resources to persist across
 restarts. There is the possibility that a poorly timed restart could lead to a resize request being
 repeated, so `UpdateContainerResources` must be idempotent.
 
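A rough sketch of checkpointing the actuated resources alongside the allocated ones so both survive a restart — the file layout and names here are illustrative, not the Kubelet's actual checkpoint format:

```python
# Illustrative checkpoint: persist allocated + actuated resources together so
# a restarted Kubelet can resume the level-triggered resize comparison.
import json
import os
import tempfile

def save_checkpoint(path, allocated, actuated):
    """Write both resource sets to a single JSON checkpoint file."""
    with open(path, "w") as f:
        json.dump({"allocated": allocated, "actuated": actuated}, f)

def load_checkpoint(path):
    """Read the checkpoint back; returns (allocated, actuated)."""
    with open(path) as f:
        data = json.load(f)
    return data["allocated"], data["actuated"]
```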
@@ -819,6 +816,15 @@ future, but the design of how limit decreases will be approached is still undeci
 
 Memory limit decreases with `RestartRequired` are still allowed.
 
+### Swap
+
+Currently (v1.33), if swap is enabled & configured, burstable pods are allocated swap based on their
+memory requests. Since resizing swap requires more thought and additional design, we will forbid
+resizing memory requests of such containers for now. Since the API server is not privy to the node's
+swap configuration, this will be surfaced as resizes being marked `Infeasible`.
+
+We may try to relax this restriction in the future.
+
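For context, the per-container swap allocation mentioned in the added Swap section (LimitedSwap behavior from the node swap support work) is roughly proportional to the container's memory request relative to node memory capacity. This is a hedged sketch of that proportion; the actual kubelet computation may differ in rounding and edge cases:

```python
# Approximation of LimitedSwap: a burstable container's swap share scales
# with its memory request as a fraction of node memory. Illustrative only.

def container_swap_limit(memory_request, node_memory_capacity, node_swap_capacity):
    """Proportional swap share for a burstable container (approximation)."""
    return int((memory_request / node_memory_capacity) * node_swap_capacity)
```

Because this depends on node-level memory and swap capacity, the API server cannot evaluate it, which is why the restriction surfaces as an `Infeasible` resize rather than an apiserver validation error.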
 ### Sidecars
 
 Sidecars, a.k.a. restartable InitContainers, can be resized the same as regular containers. There are
@@ -900,6 +906,8 @@ This will be reconsidered post-beta as a future enhancement.
 1. Handle pod-scoped resources (https://github.com/kubernetes/enhancements/pull/1592)
 1. Explore periodic resyncing of resources. That is, periodically issue resize requests to the
    runtime even if the allocated resources haven't changed.
+1. Allow resizing containers with swap allocated.
+1. Prioritize resizes when resources are freed, or at least make ordering deterministic.
 
 #### Mutable QOS Class "Shape"
 
@@ -1537,7 +1545,7 @@ _This section must be completed when targeting beta graduation to a release._
 - Rename ResizeRestartPolicy `NotRequired` to `PreferNoRestart`,
   and update CRI `UpdateContainerResources` contract
 - Add back `AllocatedResources` field to resolve a scheduler corner case
-- Introduce Acknowledged resources for actuation
+- Introduce Actuated resources for actuation
 
 ## Drawbacks
 