LeaderWorkerSet doesn't support gang-scheduling #167

xgchena · 2024-06-28T04:05:34Z

What happened:

It seems that LeaderWorkerSet doesn't support gang-scheduling of a group of pods. If several replicas are scheduled at the same time and there is not enough capacity to host them all, the scheduler may prioritize the leader pods and leave their worker pods pending forever.

What you expected to happen:

LeaderWorkerSet should support gang-scheduling, i.e. the pods of a group are either all scheduled together or not scheduled at all.

How to reproduce it (as minimally and precisely as possible):

I tried the vllm example on an EKS cluster with 4 nodes; each node has 1 GPU and enough resources to meet the pods' requests. The example manifest uses size 2 and replicas 2, for 4 pods in total.

It works fine for the initial deployment, as each node can host one pod.

$ kubectl apply -f lws.yaml
$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          10m
vllm-0-1   1/1     Running   0          10m
vllm-1     1/1     Running   0          3m43s
vllm-1-1   1/1     Running   0          3m43s

But a scheduling problem shows up if the lws is scaled in and then scaled out to more replicas than the available nodes can host. See below: the first group was fine, but the remaining nodes were used to schedule two leader pods whose workers were pending forever due to lack of capacity. Eventually only one group was working.

$ kubectl scale --replicas=0 lws/vllm

$ kubectl get pods
No resources found in default namespace.

$ kubectl scale --replicas=4 lws/vllm
$ k get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          7s
vllm-0-1   1/1     Running   0          7s
vllm-1     0/1     Running   0          7s
vllm-1-1   0/1     Pending   0          7s
vllm-2     0/1     Running   0          7s
vllm-2-1   0/1     Pending   0          7s
vllm-3     0/1     Pending   0          7s
vllm-3-1   0/1     Pending   0          7s

The problem is more obvious in the following example, where the scheduler prioritized the first 4 leader pods. Eventually no group was working.

$ kubectl scale --replicas=0 lws/vllm

$ kubectl get pods
No resources found in default namespace.

$ kubectl scale --replicas=10 lws/vllm
$ k get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          5s
vllm-0-1   0/1     Pending   0          4s
vllm-1     0/1     Running   0          5s
vllm-1-1   0/1     Pending   0          4s
vllm-2     0/1     Running   0          5s
vllm-2-1   0/1     Pending   0          4s
vllm-3     0/1     Running   0          5s
vllm-3-1   0/1     Pending   0          4s
vllm-4     0/1     Pending   0          4s
vllm-4-1   0/1     Pending   0          4s
vllm-5     0/1     Pending   0          4s
vllm-5-1   0/1     Pending   0          4s
vllm-6     0/1     Pending   0          4s
vllm-6-1   0/1     Pending   0          4s
vllm-7     0/1     Pending   0          4s
vllm-7-1   0/1     Pending   0          3s
vllm-8     0/1     Pending   0          4s
vllm-8-1   0/1     Pending   0          3s
vllm-9     0/1     Pending   0          4s
vllm-9-1   0/1     Pending   0          4s

The expected behavior is that the first two groups should be scheduled (pods vllm-0, vllm-0-1, vllm-1, and vllm-1-1).
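
To make the difference concrete, here is a minimal Python sketch that compares the two placement strategies for the 10-replica case above. It is a toy model and not the scheduler's actual algorithm: it assumes one pod per node (each pod needs the node's single GPU), and it models the observed behavior as strictly leaders-first, which matches the 10-replica output but is only an approximation of the 4-replica one. The function names are hypothetical.

```python
def schedule_pod_by_pod(free_nodes, replicas, size):
    """Toy model of the observed behavior: pods are placed one at a time, leaders first."""
    leaders = [(group, 0) for group in range(replicas)]
    workers = [(group, idx) for group in range(replicas) for idx in range(1, size)]
    placed = set()
    for pod in leaders + workers:
        if free_nodes == 0:
            break
        placed.add(pod)
        free_nodes -= 1  # assumption: one pod occupies one node
    return placed


def schedule_gang(free_nodes, replicas, size):
    """All-or-nothing: a group is placed only if every pod in it fits."""
    placed = set()
    for group in range(replicas):
        if free_nodes < size:
            break
        placed.update((group, idx) for idx in range(size))
        free_nodes -= size
    return placed


def complete_groups(placed, replicas, size):
    """Indices of groups whose leader and all workers were placed."""
    return [g for g in range(replicas)
            if all((g, idx) in placed for idx in range(size))]


if __name__ == "__main__":
    nodes, replicas, size = 4, 10, 2
    for name, strategy in (("pod-by-pod", schedule_pod_by_pod),
                           ("gang", schedule_gang)):
        placed = strategy(nodes, replicas, size)
        print(name, "complete groups:", complete_groups(placed, replicas, size))
```

Under these assumptions the pod-by-pod strategy fills all 4 nodes with leaders and completes no group, while the gang strategy completes groups 0 and 1.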

Anything else we need to know?:

Also tried with the co-scheduling plugin, but grouping all the Pods by the same static pod-group label is the same as no PodGroup.

Environment:

Kubernetes version (use kubectl version): v1.29.3
LWS version (use git describe --tags --dirty --always): v0.3.0-8-ga4c468e
Cloud provider or hardware configuration: AWS EKS (server version v1.29.4-eks-036c24b), node instance type g4dn.2xlarge
OS (e.g: cat /etc/os-release): Amazon Linux 2
Kernel (e.g. uname -a): 5.10.218
Install tools: N/A
Others: N/A

liurupeng · 2024-07-08T18:37:23Z

@xgchena sorry for the late reply; I was on vacation last week. It's expected that when you have 2 replicas with size 4 and only 4 nodes, two leaders can be scheduled, leaving some workers unscheduled. In this case, it's recommended to scale the pod groups based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible and avoid the case you described, but it wouldn't be simple, and we need a real use case that requires gang-scheduling support. Since there is a workaround, I will add a feature label and wait for a use case before prioritizing this.

kerthcet · 2024-07-09T02:41:46Z

Generally, gang scheduling needs support from the scheduler. Upstream has the co-scheduling plugin and an ongoing gang-scheduling proposal, kubernetes/enhancements#4671, which I will try to push in the next release.

xgchena · 2024-07-10T21:42:56Z

Thank you both for the responses.

xgchena · 2024-07-10T21:44:08Z

Hi Rupeng, regarding your comments,

It's expected that when you have 2 replicas with size 4 and only 4 nodes, two leaders can be scheduled, leaving some workers unscheduled. In this case, it's recommended to scale the pod groups based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible and avoid the case you described, but it wouldn't be simple, and we need a real use case that requires gang-scheduling support. Since there is a workaround, I will add a feature label and wait for a use case before prioritizing this.

Multi-host inference is often used when a model is too large to deploy on a single instance, even one of the most advanced instance types (like those with 8 GPUs). In the real world, there are capacity constraints on advanced instance types.

Example user scenario 1: As a user, I have created a cluster using an advanced instance type. To save cost, I chose the on-demand pool (shared by all users). I have deployed a large model using LWS and set up the Horizontal Autoscaler and Cluster Autoscaler. Initially the cluster has 2 nodes, which host one replica of the model. Then, due to a traffic increase, the Horizontal Autoscaler sets the replicas to 3, and in turn the Cluster Autoscaler requests 4 more nodes. However, the on-demand pool only has 2 nodes available. If the controller schedules the 2 new leader pods onto the 2 new nodes, the 2 worker pods will be pending until other users release 2 nodes to the on-demand pool, which may happen late and impact availability.
Example user scenario 2: As a user, I have reserved hosts to get a high level of assurance in obtaining capacity, and the reservation pool is shared by multiple clusters, i.e. I have created my own "on-demand pool" for my clusters. The same issue can happen when the pool is fully utilized.

Both are real use cases.

xgchena · 2024-07-11T01:02:20Z

Hi Kante, thank you for sharing. I'm glad to know there is already a solution on the way.

Regarding the co-scheduling plugin, I have actually tried it. Copied from the issue description:

Anything else we need to know?:
Also tried with the co-scheduling plugin, but grouping all the Pods by the same static pod-group label is the same as no PodGroup.

Based on the vllm example, see the screenshot below. The problem with this approach is that only one PodGroup can be defined and used to group all the pods.

By "next release" I guess you mean the next release of Kubernetes. Before it is available, I'm wondering if it is doable to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround.
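
As a rough illustration of the per-replica idea, the following Python sketch generates one PodGroup manifest per replica plus the label each pod of that replica would need to carry. It is only a sketch under assumptions: the `scheduling.x-k8s.io/v1alpha1` API version and the `scheduling.x-k8s.io/pod-group` label key reflect my understanding of the scheduler-plugins co-scheduling plugin and should be checked against the installed version, and the helper names and the `<lws-name>-<group-index>` naming are hypothetical. How a forked controller would attach the label to each pod is not shown.

```python
import json

# Assumption: label key used by the co-scheduling plugin to tie a pod to a PodGroup.
POD_GROUP_LABEL = "scheduling.x-k8s.io/pod-group"


def pod_group_name(lws_name, group_index):
    """Hypothetical naming scheme: one PodGroup per replica, e.g. vllm-0."""
    return f"{lws_name}-{group_index}"


def pod_group_manifests(lws_name, replicas, size, namespace="default"):
    """Build one PodGroup per replica; minMember covers the leader plus its workers."""
    return [
        {
            "apiVersion": "scheduling.x-k8s.io/v1alpha1",  # assumption, verify locally
            "kind": "PodGroup",
            "metadata": {
                "name": pod_group_name(lws_name, group_index),
                "namespace": namespace,
            },
            "spec": {"minMember": size},
        }
        for group_index in range(replicas)
    ]


def pod_labels(lws_name, group_index):
    """Label that every pod of the given replica would need to carry."""
    return {POD_GROUP_LABEL: pod_group_name(lws_name, group_index)}


if __name__ == "__main__":
    for manifest in pod_group_manifests("vllm", replicas=4, size=2):
        print(json.dumps(manifest))
    print(pod_labels("vllm", 1))
```

The point of the design is that the group label differs per replica, in contrast to the single static label tried above, so each replica can be admitted or held back on its own.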

kerthcet · 2024-07-11T03:09:56Z

Thanks for your feedback, it is very valuable.

By "next release" I guess you mean the next release of Kubernetes.

Yes, there are still some gaps that need to be fixed.

I'm wondering if it is doable to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround.

I guess this is one available approach, because under the co-scheduling design the PodGroup needs to be created manually.

However, LWS still doesn't work smoothly with the co-scheduling plugin. Some of our features, such as startup policy and exclusive placement, require the worker pods to be created only once the leader pod is ready. This leads to a deadlock with gang scheduling, because the leader pod will not be scheduled until minMember is met. This is a valid use case for the gang scheduling design.
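
The deadlock described above can be sketched with a small Python simulation. It is a toy model and not LWS or scheduler code: the `simulate` function and its parameters are hypothetical, and it simplifies "leader is ready" to "leader is scheduled".

```python
def simulate(size, min_member, leader_first, max_steps=10):
    """Return True if the whole group gets scheduled, False if it stalls."""
    # With a leader-first startup policy only the leader pod exists at the start.
    created = 1 if leader_first else size
    scheduled = 0
    for _ in range(max_steps):
        progressed = False
        # Gang rule: no pod of the group is scheduled until min_member pods exist.
        if created >= min_member and scheduled < created:
            scheduled = created
            progressed = True
        # Leader-first rule: workers are created only after the leader is up.
        if leader_first and scheduled >= 1 and created < size:
            created = size
            progressed = True
        if scheduled == size:
            return True
        if not progressed:
            return False  # neither rule can fire: deadlock
    return False


if __name__ == "__main__":
    print("leader-first, minMember=size:", simulate(2, 2, leader_first=True))
    print("all pods created up front:   ", simulate(2, 2, leader_first=False))
```

In this model the leader-first case with minMember equal to the group size stalls immediately: the leader waits for the gang, and the workers wait for the leader. Lowering minMember to 1 would let the group progress, but it would also give up the all-or-nothing guarantee.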

dims · 2024-09-16T16:33:36Z

@kerthcet still thinking of making progress for the kubernetes/enhancements#4671 KEP?

kerthcet · 2024-09-18T03:11:17Z

@kerthcet still thinking of making progress for the kubernetes/enhancements#4671 KEP?

Yes, but maybe in the next Kubernetes release cycle. I'm rushing toward a new milestone for my project, which may take two or three weeks, so I'm sure I'll miss the KEP code freeze deadline. 🥶

k8s-triage-robot · 2024-12-17T03:12:51Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kerthcet · 2024-12-26T11:30:18Z

/remove-lifecycle stale

xgchena added the kind/bug label on Jun 28, 2024

liurupeng added the kind/feature label on Jul 8, 2024

vie-serendipity mentioned this issue on Aug 6, 2024: add kep-162 Colocated Placement #168 (Open)

k8s-ci-robot added the lifecycle/stale label on Dec 17, 2024

k8s-ci-robot removed the lifecycle/stale label on Dec 26, 2024
