# Multi-Cluster Workload Scheduling

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

This proposal adds a new multi-cluster workload functionality to the OCM
platform, as either a built-in module or a pluggable addon, and a new multi-cluster
workload API will be added under a new API group `workload.open-cluster-management.io`
as the manipulating interface for users. Note that the only requirement for
an adopted local workload (e.g. Deployment, ReplicaSet) in the spoke cluster is
that it implements the generic [scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource)
subresource, so the new multi-cluster workload controller can scale the local
workloads up or down regardless of whether they are built-in workload APIs or custom
workloads developed via CRD.
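
For reference, the scale subresource requirement on a custom workload is just a matter of
enabling it in the CRD. The following is a minimal sketch assuming a hypothetical
`FooWorkload` custom workload (the CRD and its field paths are illustrative, not part of
this proposal):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: fooworkloads.example.io
spec:
  group: example.io
  names:
    kind: FooWorkload
    plural: fooworkloads
    singular: fooworkload
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
            status:
              type: object
              properties:
                replicas:
                  type: integer
                selector:
                  type: string
      subresources:
        status: {}
        # Enabling the scale subresource is the only requirement for the
        # multi-cluster workload controller to adopt this workload type.
        scale:
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.replicas
          labelSelectorPath: .status.selector
```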


## Motivation

### Goals

#### Controlling Replicas Distribution

In some cases, we may want to specify a total number of replicas for a multi-cluster
workload and let the controller handle the replica distribution for us according
to different strategies such as (1) even (max-min) distribution or (2) weighted
(proportional) distribution. The distribution should be updated reactively by
watching the selected list of clusters via the output from the `PlacementDecision` API.
Note that the computed distribution here is an "expected" number, while the
actual distribution may diverge from the expectation depending on the allocatable
resources or the liveness of the replicas, which is elaborated in the following section.
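
As an illustration (cluster names, weights, and the rounding behavior are made up for this
sketch), suppose the referenced placement selects three clusters and the total number of
replicas is 10:

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  name: placement-demo-decision-1
  namespace: default
status:
  decisions:
    - clusterName: cluster-a
      reason: ""
    - clusterName: cluster-b
      reason: ""
    - clusterName: cluster-c
      reason: ""
# With a total of 10 replicas, the "expected" distribution would be:
#   Even (max-min):   cluster-a=4, cluster-b=3, cluster-c=3
#   Weighted (2:1:1): cluster-a=5, cluster-b=3, cluster-c=2
```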

#### Dynamic Replicas Balancing

The term "balance" or "re-schedule" here refers to the process of temporarily
transferring replicas from one cluster to another. There are some cases when we need to
trigger the process of replica balancing:

- When the distributed local workload fails to provision effective instances over a
  period of time.
- When the distributed local workload is manually scaled down on purpose.

The process of replica transfer can be either "bursty" or "conservative":

- __Bursty__: Increase the replicas on one cluster first, then decrease them on the other.
- __Conservative__: Decrease first, then increase.
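
As a minimal sketch of the difference (cluster names and replica counts are made up),
transferring one replica from `cluster-a` to `cluster-b`:

```yaml
# Bursty: scale up the destination first, then scale down the source.
#   step 1: cluster-b  spec.replicas 3 -> 4
#   step 2: cluster-a  spec.replicas 3 -> 2
# Conservative: scale down the source first, then scale up the destination.
#   step 1: cluster-a  spec.replicas 3 -> 2
#   step 2: cluster-b  spec.replicas 3 -> 4
```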

#### Adopting Arbitrary Workload Types

Given the fact that more and more third-party extended workload APIs are emerging beyond
the Kubernetes community, our multi-cluster workload controller should not raise any
additional requirement on the managed workload API except for enabling the standard
[scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource)
subresource via the CRD. Hence, to scale the local workload up or down, the controller
will simply update/patch the subresource regardless of its concrete type.
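
In other words, regardless of the concrete workload kind, the scaling operation the
controller performs could be reduced to writing the generic `Scale` representation served
by the `/scale` endpoint. A minimal sketch of such a payload (the Deployment name and
namespace are illustrative):

```yaml
# PUT or PATCH against e.g.
#   /apis/apps/v1/namespaces/default/deployments/nginx/scale
# works identically for any CRD-based workload that enables the subresource.
apiVersion: autoscaling/v1
kind: Scale
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 4
```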

### Non Goals

- This KEP will not cover the distribution of special workloads such as `Job`.
- This KEP will not cover the distribution of satellite resources around the workload
  such as `ConfigMap` and `ServiceAccount`.

## Proposal

### Abstraction

To make the functionality of the multi-cluster workload easier to understand, we can start by
defining the boundary of the controller's abstraction as a black box.

#### "ClusterSet" and "Placement"

#### Controller "Input"

The sources of input information for the multi-cluster workload controller will be
(a sketch of sample inputs follows this list):

- __Cluster Topology__: The `PlacementDecision` is dynamically computed according to
  the following knowledge from OCM's built-in APIs:
  - The existence and availability of the managed clusters.
  - The "hub-defined" attributes attached to the cluster model via labelling, e.g. the
    clusterset, the feature label or other custom labels which will be read by the
    placement controller.
  - The "spoke-reported" attributes, i.e. the "ClusterClaim", which is collected and reported
    by the spoke agent.

- __API Prescription__: There will be a new API named "ElasticWorkload" or
  "ManagedWorkload" that prescribes the necessary information for workload distribution
  such as the content of the target workload, the expected total number of replicas, etc.

- __Effective Distributed Local Workload__: The new controller also needs to capture the
  events from local clusters so that it can take actions, e.g. when an instance crashes
  or is tainted unexpectedly.
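
For example, the "hub-defined" and "spoke-reported" attributes above roughly correspond to
resources like the following (the custom label key and the claim value are illustrative):

```yaml
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: cluster-a
  labels:
    # Hub-defined attributes attached via labelling.
    cluster.open-cluster-management.io/clusterset: prod
    feature.example.io/gpu: "true"
spec:
  hubAcceptsClient: true
---
# Spoke-reported attribute collected by the spoke agent.
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: ClusterClaim
metadata:
  name: platform.open-cluster-management.io
spec:
  value: AWS
```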

#### Controller "Output"

The new controller will apply the latest state of the workload to the selected
clusters and tune its replicas on demand. As a matter of implementation, the workload
will be applied via the stable `ManifestWork` API.
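
As a rough sketch of the output, each selected cluster would receive a `ManifestWork` in its
cluster namespace that embeds the local workload with the per-cluster share of replicas (the
Deployment content below is illustrative):

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: elastic-workload-nginx
  namespace: cluster-a   # one ManifestWork per selected cluster namespace
spec:
  workload:
    manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx
          namespace: default
        spec:
          replicas: 4    # the share computed by the distribution/balance strategies
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.21
```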

### API Spec

```yaml
apiVersion: scheduling.open-cluster-management.io/v1
kind: ElasticWorkload
spec:
  # The target namespace to deploy the workload in the spoke cluster.
  spokeNamespace: default
  # The content of the target workload, supporting:
  # - Inline: Embedding a static manifest.
  # - Import: Referencing an existing workload resource. (Note that
  #           the replicas should always be set to 0 to avoid wasting
  #           capacity in the hub cluster.)
  target:
    type: [ Inline | Import ]
    inline: ...
    import: ...
  # Referencing an OCM placement policy in the same namespace as where
  # this elastic workload resource lives.
  placementRef:
    name: ...
  # DistributionStrategy controls the expected replicas distribution
  # across the selected clusters from the placement API above. The supported
  # distribution strategies will be:
  # - Even: Filling the min replicas upon every round, i.e. max-min.
  # - Weighted: Setting a default weight and overriding the weight for a
  #             few clusters on demand.
  distributionStrategy:
    totalReplicas: 10
    type: [ Even | Weighted ]
  # BalanceStrategy prescribes the balancing/re-scheduling behavior of the
  # controller when the effective distributed replicas don't meet the
  # expectation within a period of "hesitation" time. The supported types
  # will be:
  # - None: Do not reschedule at any time.
  # - LimitRange: The reschedule is allowed within a range of numbers. The
  #               replicas scheduler will try its best to keep the
  #               managed replicas within the range:
  #     * "min": when the controller is attempting to transfer a replica,
  #       those clusters under the "min" will be the primary choices.
  #     * "max": the controller will exclude clusters exceeding the "max"
  #       from the list of candidates upon re-scheduling.
  # - Classful: A classful prioritized rescheduling policy.
  #     * "assured": similar to "min" above.
  #     * "softLimit": those clusters (assured < # of replicas <= softLimit)
  #       will be considered as secondary choices among the candidates.
  #       Generally the "softLimit" can be considered as a
  #       recommended watermark of replicas upon re-scheduling.
  #     * "hardLimit": similar to "max" above.
  balanceStrategy:
    type: [ None | LimitRange | Classful ]
    limitRange:
      min: ...
      max: ...
    classful:
      assured: ...
      softLimit: ...
      hardLimit: ...
status:
  # The manifest works distributed to each managed cluster.
  manifestWorks: ...
```
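
For illustration, a concrete instance that imports an existing Deployment and bounds every
cluster to 2-3 replicas might look like the following. Note that the exact shape of the
`import` reference is elided in the spec above, so the kind/name reference here is only an
assumption:

```yaml
apiVersion: scheduling.open-cluster-management.io/v1
kind: ElasticWorkload
metadata:
  name: nginx-elastic
  namespace: default
spec:
  spokeNamespace: default
  target:
    type: Import
    # Assumed reference shape, purely for illustration.
    import:
      apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placementRef:
    name: placement-demo
  distributionStrategy:
    totalReplicas: 10
    type: Even
  balanceStrategy:
    type: LimitRange
    limitRange:
      min: 2
      max: 3
```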

### Details

#### When "Distribution strategy" and "Balance strategy" conflict

The "Distribution strategy" is applied before the "Balance strategy", so the latter can
be considered as an overriding patch upon the former. The controller will always
honor the balance strategy. The following list shows a few possible examples of
the two fields conflicting when combining the "Weighted" distributor and the
"LimitRange" re-balancer:

###### Some expected replicas exceed the max watermark
- Conditions:
  - Selected # of Clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 2-3

  Result: The initial expected distribution shall be 2:4 while the re-balancing will
  reset the distribution to 3:3 in the end.

###### All expected replicas exceed the max watermark
- Conditions:
  - Selected # of Clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 1-2

  Result: The initial expected distribution shall be 2:4 while the re-balancing will
  reset the distribution to 2:2 even if the sum can't reach the total replicas.

###### All expected replicas don't reach the min watermark
- Conditions:
  - Selected # of Clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 5-10

  Result: The initial expected distribution shall be 2:4 while the re-balancing will
  reset the distribution to 5:5 regardless of the distribution strategy.
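
Expressed against the API spec above, the first scenario would roughly look like the snippet
below. The per-cluster weight override fields are hypothetical since their shape is not yet
defined in this proposal:

```yaml
distributionStrategy:
  totalReplicas: 6
  type: Weighted
  # hypothetical weights: cluster-a=1, cluster-b=2
balanceStrategy:
  type: LimitRange
  limitRange:
    min: 2
    max: 3
# Expected distribution: cluster-a=2, cluster-b=4
# After re-balancing:    cluster-a=3, cluster-b=3
```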

#### Workload Manifest Status Collection

Overall in OCM, there are 3 feasible ways of collecting the status from the spoke
clusters:

1. List-Watching: The kubefed-fashion status collection. It violates OCM's
   pull-based architecture and will likely become the bottleneck of
   scalability when the # of managed clusters grows.
2. Polling Get: Getting resources at a fixed interval, which costs less but loses
   promptness on the other hand.
3. Delegate to `ManifestWork`: See the new status collection functionalities WIP in
   [#30](https://github.com/open-cluster-management-io/enhancements/pull/30)

This proposal will support (2) and (3) in the end and leave the choice to users.
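
For option (3), a sketch of what delegating the status collection to `ManifestWork` might
look like, based on the feedback rules being proposed in the enhancement linked above (the
exact field names may change as that proposal evolves):

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: elastic-workload-nginx
  namespace: cluster-a
spec:
  workload:
    manifests: []   # the distributed workload manifests, omitted here
  manifestConfigs:
    - resourceIdentifier:
        group: apps
        resource: deployments
        namespace: default
        name: nginx
      feedbackRules:
        # Report replica-related status fields back to the hub so the
        # replicas scheduler can react without list-watching the spokes.
        - type: WellKnownStatus
        - type: JSONPaths
          jsonPaths:
            - name: readyReplicas
              path: .status.readyReplicas
```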