Commit fc42cb6
kep: multi-cluster workload
1 parent 5b46511 commit fc42cb6

1 file changed: +220 -0
  • enhancements/sig-architecture/18-workload-scheduling

# Multi-Cluster Workload Scheduling

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

This proposal adds a new multi-cluster workload functionality to the OCM
platform, as either a built-in module or a pluggable addon, together with a new
multi-cluster workload API under a new API group `workload.open-cluster-management.io`
as the manipulating interface for users. Note that the only requirement on
the adopted local workload (e.g. Deployment, ReplicaSet) in the spoke cluster is
that it implements the generic [scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource)
subresource, so the new multi-cluster workload controller can scale the local
workloads up and down regardless of whether they are built-in workload APIs or
custom workloads developed via CRD.

## Motivation

### Goals

#### Controlling Replicas Distribution

In some cases, we may want to specify a total number of replicas for a multi-cluster
workload and let the controller do the rest of the replicas distribution for us according
to different strategies such as (1) even (max-min) distribution or (2) weighted
(proportional) distribution. The distribution should be updated reactively by
watching the selected list of clusters via the output of the `PlacementDecision` API.
Note that the computed distribution here is an "expected" number, while the
actual distribution may diverge from the expectation depending on the allocatable
resources or the liveness of the replicas, as elaborated in the following sections.

#### Dynamic Replicas Balancing

The term "balance" or "re-schedule" here refers to the process of temporarily
transferring replicas from one cluster to another. There are some cases where we need
to trigger the process of replicas balancing:

- When the distributed local workload fails to provision effective instances over a
  period of time.
- When the distributed local workload is manually scaled down on purpose.

The process of replicas transferring can be either "bursty" or "conservative":

- __Bursty__: Increase the replicas in one cluster, then decrease them in the other.
- __Conservative__: Decrease first, then increase.

#### Adopting Arbitrary Workload Types

Given the fact that more and more third-party extended workload APIs are emerging beyond
the Kubernetes community, our multi-cluster workload controller should not impose any
additional requirement on the managed workload API except for enabling the standard
[scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource)
subresource via the CRD. Hence, to scale the local workload up or down, the controller
will simply update/patch the scale subresource regardless of its concrete type.

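As a rough illustration (not part of this proposal's API), the snippet below sketches the only requirement on a custom workload type: its CRD enables the standard scale subresource. The `FooSet` kind, the `example.io` group, and the field paths are hypothetical.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foosets.example.io
spec:
  group: example.io
  names:
    kind: FooSet
    plural: foosets
    singular: fooset
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
            status:
              type: object
              properties:
                replicas:
                  type: integer
      subresources:
        status: {}
        # Enabling /scale is the only prerequisite: the multi-cluster workload
        # controller can then adjust replicas without knowing the concrete type.
        scale:
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.replicas
```
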
### Non Goals

- This KEP will not cover the distribution of special workloads such as `Job`.
- This KEP will not cover the distribution of satellite resources around the workload
  such as `ConfigMap`, `ServiceAccount`.

## Proposal

### Abstraction

To understand the functionalities of the multi-cluster workload more easily, we can
start by defining the boundary of the controller's abstraction as a black box.

#### "ClusterSet" and "Placement"

#### Controller "Input"

The sources of information input for the multi-cluster workload controller will be:

- __Cluster Topology__: The `PlacementDecision` is dynamically computed according to
  the following knowledge from OCM's built-in APIs (see the sample `Placement` and
  `PlacementDecision` after this list):
  - The existence and availability of the managed clusters.
  - The "hub-defined" attributes attached to the cluster model via labelling, e.g. the
    clusterset, the feature label or other custom labels, which will be read by the
    placement controller.
  - The "spoke-reported" attributes, i.e. "ClusterClaim", which are collected and
    reported by the spoke agent.

- __API Prescription__: There will be a new API named "ElasticWorkload" or
  "ManagedWorkload" that prescribes the necessary information for workload distribution,
  such as the content of the target workload, the expected total number of replicas, etc.

- __Effective Distributed Local Workload__: The new controller also needs to capture the
  events from local clusters so that it can take actions, e.g. when an instance crashes
  or is tainted unexpectedly.

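For reference, below is a minimal sketch of the cluster-topology input, assuming OCM's existing `Placement` and `PlacementDecision` APIs; the placement name, cluster set, and cluster names are illustrative, and the API version may differ across OCM releases.

```yaml
# Hypothetical placement selecting 2 clusters from a cluster set.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: workload-placement
  namespace: default
spec:
  numberOfClusters: 2
  clusterSets:
    - prod
---
# The corresponding decision computed by the placement controller; the
# multi-cluster workload controller only consumes status.decisions.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  name: workload-placement-decision-1
  namespace: default
  labels:
    cluster.open-cluster-management.io/placement: workload-placement
status:
  decisions:
    - clusterName: cluster1
      reason: ""
    - clusterName: cluster2
      reason: ""
```
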
#### Controller "Output"

The new controller will apply the latest state of the workload towards the selected
clusters and tune its replicas on demand. As a matter of implementation, the workload
applying will be executed via the stable `ManifestWork` API.

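As a rough sketch of this output, the controller could emit one `ManifestWork` per selected cluster into that cluster's namespace on the hub. The namespace `cluster1`, the embedded Deployment, and the per-cluster replicas below are illustrative only:

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  # ManifestWork lives in the hub-side namespace of the target managed cluster.
  namespace: cluster1
  name: elastic-workload-nginx
spec:
  workload:
    manifests:
      # The adopted local workload with its per-cluster replicas already
      # filled in by the multi-cluster workload controller.
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx
          namespace: default
        spec:
          replicas: 3
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.21
```
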
### API Spec

```yaml
apiVersion: scheduling.open-cluster-management.io/v1
kind: ElasticWorkload
spec:
  # The target namespace to deploy the workload in the spoke cluster.
  spokeNamespace: default
  # The content of the target workload, supporting:
  # - Inline: Embedding a static manifest.
  # - Import: Referencing an existing workload resource. (Note that
  #           the replicas should always be set to 0 to avoid wasting
  #           capacities in the hub cluster.)
  target:
    type: [ Inline | Import ]
    inline: ...
    import: ...
  # Referencing an OCM placement policy in the same namespace as where
  # this elastic workload resource lives.
  placementRef:
    name: ...
  # DistributionStrategy controls the expected replicas distribution
  # across the selected clusters from the placement API above. The supported
  # distributing strategies will be:
  # - Even: Filling the min replicas upon every round, i.e. max-min.
  # - Weighted: Setting a default weight and overriding the weight for a
  #             few clusters on demand.
  distributionStrategy:
    totalReplicas: 10
    type: [ Even | Proportional ]
  # BalanceStrategy prescribes the balancing/re-scheduling behavior of the
  # controller when the effective distributed replicas don't meet the
  # expectation within a period of "hesitation" time. The supported types
  # will be:
  # - None: Do not reschedule at any time.
  # - LimitRange: The reschedule is allowed within a range of numbers. The
  #               replicas scheduler will try its best to keep the
  #               managed replicas within the range:
  #               * "min": when the controller is attempting to transfer replicas,
  #                 those clusters under the "min" will be the primary choices.
  #               * "max": the controller will exclude the clusters exceeding the "max"
  #                 from the list of candidates upon re-scheduling.
  # - Classful: A classful prioritized rescheduling policy.
  #             * "assured": similar to "min" above.
  #             * "softLimit": those clusters (assured < # of replicas <= softLimit)
  #               will be considered as secondary choices among the candidates.
  #               Generally the "softLimit" can be considered as a
  #               recommended watermark of replicas upon re-scheduling.
  #             * "hardLimit": similar to "max" above.
  balanceStrategy:
    type: [ None | LimitRange | Classful ]
    limitRange:
      min: ...
      max: ...
    classful:
      assured: ...
      softLimit: ...
      hardLimit: ...
status:
  # Distributed to each managed cluster in the form of ManifestWork resources.
  manifestWorks: ...
```

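To make the spec concrete, here is a hypothetical filled-in sample; the embedded Deployment, the placement name, and the numbers are illustrative only, and the exact shape of the `inline` field is not settled by the spec above:

```yaml
apiVersion: scheduling.open-cluster-management.io/v1
kind: ElasticWorkload
metadata:
  name: nginx-elastic
  namespace: default
spec:
  spokeNamespace: default
  target:
    type: Inline
    inline:
      # A static manifest embedded as-is; per-cluster replicas are set by the
      # controller via the scale subresource.
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nginx
      spec:
        selector:
          matchLabels:
            app: nginx
        template:
          metadata:
            labels:
              app: nginx
          spec:
            containers:
              - name: nginx
                image: nginx:1.21
  placementRef:
    name: workload-placement
  distributionStrategy:
    totalReplicas: 10
    type: Even
  balanceStrategy:
    type: LimitRange
    limitRange:
      min: 2
      max: 6
```
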
### Details

#### When "Distribution strategy" and "Balance strategy" conflict

The "Distribution strategy" is applied prior to the "Balance strategy", so the latter can
be considered as an overriding patch upon the former. The controller will always
honor the balance strategy. The following list shows a few possible examples of
how the two fields can conflict when combining the "Weighted" distributor and the
"LimitRange" re-balancer:

###### Some expected replicas exceed the max watermark

- Conditions:
  - Selected # of clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 2-3

Result: The initial expected distribution shall be 2:4, while the re-balancing will
reset the distribution to 3:3 in the end.

###### All expected replicas exceed the max watermark

- Conditions:
  - Selected # of clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 1-2

Result: The initial expected distribution shall be 2:4, while the re-balancing will
reset the distribution to 2:2 even if the sum can't reach the total replicas.

###### All expected replicas don't reach the min watermark

- Conditions:
  - Selected # of clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 5-10

Result: The initial expected distribution shall be 2:4, while the re-balancing will
reset the distribution to 5:5 regardless of the distribution strategy.

#### Workload Manifest Status Collection

Overall in OCM, there are 3 feasible ways of collecting the status from the spoke
clusters:

1. List-Watching: The kubefed-fashion status collection. It conflicts with OCM's
   pull-based architecture and will likely become the scalability bottleneck
   when the # of managed clusters grows.
2. Polling Get: Getting resources at a fixed interval, which costs less but loses
   promptness on the other hand.
3. Delegate to `ManifestWork`: See the new status collection functionalities WIP in
   [#30](https://github.com/open-cluster-management-io/enhancements/pull/30).

This proposal will support (2) and (3) in the end and leave the choice to users.

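For option (3), the following is a rough sketch of what delegating status collection to `ManifestWork` might look like, assuming the status feedback mechanism proposed in [#30](https://github.com/open-cluster-management-io/enhancements/pull/30); the field names shown here are illustrative and may differ from whatever that proposal finally lands:

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: cluster1
  name: elastic-workload-nginx
spec:
  workload:
    manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx
          namespace: default
        # (workload spec elided)
  manifestConfigs:
    - resourceIdentifier:
        group: apps
        resource: deployments
        namespace: default
        name: nginx
      feedbackRules:
        # Ask the work agent to report back well-known status fields (such as
        # ready replicas), so the hub-side controller can detect ineffective
        # replicas without listing or watching the spoke cluster directly.
        - type: WellKnownStatus
```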
