
LeaderWorkerSet doesn't support gang-scheduling #167

Open
xgchena opened this issue Jun 28, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@xgchena
Contributor

xgchena commented Jun 28, 2024

What happened:

It seems that the LeaderWorkerSet doesn't support gang-scheduling of a group of pods. If multiple replicas are scheduled at the same time and there is not enough capacity to host them all, the scheduler may prioritize scheduling the leader pods and leave their worker pods pending forever.

What you expected to happen:

LeaderWorkerSet should support gang-scheduling, i.e. the pods of a group are either all scheduled together or none are.

How to reproduce it (as minimally and precisely as possible):

I tried the vLLM example on an EKS cluster with 4 nodes; each node has 1 GPU and enough resources to meet the pods' requests. The example manifest uses size 2 and replicas 2, for a total of 4 pods.
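For reference, an abridged sketch of the manifest shape being deployed (field values taken from the description above; container images, commands, and probes are omitted, so treat this as illustrative rather than the exact upstream example):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                 # number of leader/worker groups
  leaderWorkerTemplate:
    size: 2                   # pods per group: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          # image and command omitted
          resources:
            limits:
              nvidia.com/gpu: "1"   # one GPU per pod, matching the 1-GPU nodes
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          # image and command omitted
          resources:
            limits:
              nvidia.com/gpu: "1"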

  • It works fine for the initial deployment, as each node can host one pod.
$ kubectl apply -f lws.yaml
$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          10m
vllm-0-1   1/1     Running   0          10m
vllm-1     1/1     Running   0          3m43s
vllm-1-1   1/1     Running   0          3m43s
  • But a scheduling problem shows up if the lws is scaled in and then scaled out to more replicas than the available nodes can host. See below: the first group was fine, but the remaining nodes were used to schedule two leader pods whose workers were pending forever due to lack of capacity. Eventually only one group was working.
$ kubectl scale --replicas=0 lws/vllm

$ kubectl get pods
No resources found in default namespace.

$ kubectl scale --replicas=4 lws/vllm
$ k get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          7s
vllm-0-1   1/1     Running   0          7s
vllm-1     0/1     Running   0          7s
vllm-1-1   0/1     Pending   0          7s
vllm-2     0/1     Running   0          7s
vllm-2-1   0/1     Pending   0          7s
vllm-3     0/1     Pending   0          7s
vllm-3-1   0/1     Pending   0          7s
  • The problem is more obvious in the following example, where the scheduler prioritized scheduling the first 4 leader pods. Eventually no group was working.
$ kubectl scale --replicas=0 lws/vllm

$ kubectl get pods
No resources found in default namespace.

$ kubectl scale --replicas=10 lws/vllm
$ k get pods
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          5s
vllm-0-1   0/1     Pending   0          4s
vllm-1     0/1     Running   0          5s
vllm-1-1   0/1     Pending   0          4s
vllm-2     0/1     Running   0          5s
vllm-2-1   0/1     Pending   0          4s
vllm-3     0/1     Running   0          5s
vllm-3-1   0/1     Pending   0          4s
vllm-4     0/1     Pending   0          4s
vllm-4-1   0/1     Pending   0          4s
vllm-5     0/1     Pending   0          4s
vllm-5-1   0/1     Pending   0          4s
vllm-6     0/1     Pending   0          4s
vllm-6-1   0/1     Pending   0          4s
vllm-7     0/1     Pending   0          4s
vllm-7-1   0/1     Pending   0          3s
vllm-8     0/1     Pending   0          4s
vllm-8-1   0/1     Pending   0          3s
vllm-9     0/1     Pending   0          4s
vllm-9-1   0/1     Pending   0          4s

The expected behavior is that the first two groups should be scheduled (pods vllm-0, vllm-0-1, vllm-1, and vllm-1-1).

Anything else we need to know?:

I also tried the co-scheduling plugin, but grouping all the Pods under the same static pod-group label is effectively the same as having no PodGroup.
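For concreteness, a minimal sketch of that attempt (the PodGroup name and minMember value are illustrative; the label key follows the scheduler-plugins coscheduling convention and may vary by version):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: vllm-all
spec:
  minMember: 4    # every pod of every group counts toward one gate
---
# Both the leader and worker pod templates carry the same static label:
#   labels:
#     scheduling.x-k8s.io/pod-group: vllm-all
# Because the single group spans all replicas, it cannot express
# "schedule group 0 atomically and group 1 atomically"; it only gates on the
# total pod count, which is why it behaves like having no PodGroup at all.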

Environment:

  • Kubernetes version (use kubectl version): v1.29.3
  • LWS version (use git describe --tags --dirty --always): v0.3.0-8-ga4c468e
  • Cloud provider or hardware configuration: AWS EKS (server version v1.29.4-eks-036c24b), node instance type g4dn.2xlarge
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 5.10.218
  • Install tools: N/A
  • Others: N/A
@xgchena xgchena added the kind/bug Categorizes issue or PR as related to a bug. label Jun 28, 2024
@liurupeng
Collaborator

liurupeng commented Jul 8, 2024

@xgchena sorry for the late reply, I was on vacation last week. It's expected that when you have 2 replicas with size 4 and only 4 nodes, the two leaders can be scheduled, causing some workers not to be scheduled. In this case, it's recommended to scale the pod group based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible to avoid the case you described, but it wouldn't be simple, and we would need a real use case showing that we must add gang-scheduling support. Since there is a workaround for this, I will add a feature label and wait for a use case to prioritize.

@liurupeng liurupeng added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 8, 2024
@kerthcet
Contributor

kerthcet commented Jul 9, 2024

Generally, gang scheduling needs support from the scheduler. Upstream has the co-scheduling plugin and an ongoing proposal about gang scheduling, kubernetes/enhancements#4671, which I will try to push in the next release.

@xgchena
Contributor Author

xgchena commented Jul 10, 2024

Thank you both for the responses.

@xgchena
Contributor Author

xgchena commented Jul 10, 2024

Hi Rupeng, regarding your comments,

It's expected that when you have 2 replicas with size 4 and only 4 nodes, the two leaders can be scheduled, causing some workers not to be scheduled. In this case, it's recommended to scale the pod group based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible to avoid the case you described, but it wouldn't be simple, and we would need a real use case showing that we must add gang-scheduling support. Since there is a workaround for this, I will add a feature label and wait for a use case to prioritize.

Multi-host inference is often used when a model is too large to be deployed on a single instance, even one of the most advanced instance types (such as those with 8 GPUs). In the real world, there are capacity constraints on advanced instance types.

  • Example user scenario 1: As a user, I have created a cluster using an advanced instance type. To save cost, I chose the on-demand pool (shared by all users). I have deployed a large model using LWS and set up the Horizontal Pod Autoscaler and the Cluster Autoscaler. Initially the cluster has 2 nodes (which host one replica of the model). Then, due to a traffic increase, the Horizontal Pod Autoscaler sets the replicas to 3, and in turn the Cluster Autoscaler requests 4 more nodes. However, the on-demand pool only has 2 nodes available. If the controller schedules the 2 new leader pods onto the 2 new nodes, then the 2 worker pods will be pending until other users release 2 nodes back to the on-demand pool, which may happen late and impact availability.

  • Example user scenario 2: As a user, I have reserved hosts to ensure a high level of assurance in obtaining capacity, and the reservation pool is shared by multiple clusters, i.e. I have created my own "on-demand pool" for my clusters. The same issue can happen when that pool is fully utilized.

Both are real use cases.

@xgchena
Contributor Author

xgchena commented Jul 11, 2024

Hi Kante, thank you for sharing, and I'm glad to know there is already a solution on the way.

Regarding the co-scheduling plugin, I have actually tried it; copied from the issue description:

Anything else we need to know?:
I also tried the co-scheduling plugin, but grouping all the Pods under the same static pod-group label is effectively the same as having no PodGroup.

Based on the vLLM example, see the screenshot below. The problem with this approach is that only one PodGroup can be defined and used to group all the pods.

[Screenshot: lws-podgroup]

By "next release" I guess you mean the next release of Kubernetes. Before that is available, I'm wondering whether it would be feasible to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround (see the sketch below).
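For illustration, a sketch of what such a forked controller might create per replica (assuming the scheduler-plugins PodGroup API; the resource names and minMember value below are hypothetical):

# One PodGroup per LWS replica, e.g. for group index 0:
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: vllm-group-0
spec:
  minMember: 2    # = leaderWorkerTemplate.size, so the whole group schedules or none of it does
---
# The controller would also have to label the group's pods (vllm-0 and vllm-0-1) with
#   scheduling.x-k8s.io/pod-group: vllm-group-0
# so the coscheduling plugin only admits the group when both pods fit.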

@kerthcet
Contributor

kerthcet commented Jul 11, 2024

Thanks for your feedback, it's very valuable.

By "next release" I guess you mean the next release of Kubernetes.

Yes, there are still some gaps that need to be fixed.

I'm wondering if it is doable to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround.

I guess this is one viable approach, because based on the co-scheduling design the PodGroup needs to be created manually.

However, we still don't work entirely smoothly with the co-scheduling plugin. We have features like startup policy and exclusive placement that require creating the worker pods only after the leader pod is ready; with gang scheduling this leads to a deadlock, because the leader pod will never be scheduled if minMember is not met. This is a valid use case for the gang scheduling design.
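To spell out the deadlock with a hypothetical timeline (assuming a per-group PodGroup with minMember equal to the group size, and a startup policy that creates workers only after the leader is ready):

# t0: only the leader pod exists; the worker is intentionally not created yet.
# t1: the coscheduling plugin sees 1 of minMember=2 pods, so the leader stays Pending.
# t2: the worker is never created, because the leader never becomes Ready.
# Neither side can make progress, which is the deadlock described above.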

@dims
Member

dims commented Sep 16, 2024

@kerthcet still thinking of making progress for the kubernetes/enhancements#4671 KEP?

@kerthcet
Contributor

@kerthcet still thinking of making progress for the kubernetes/enhancements#4671 KEP?

Yes, but maybe in the next Kubernetes release cycle. I'm rushing toward a new milestone for my project, which may take two or three weeks, so I'm sure I'll miss the KEP code-freeze deadline. 🥶

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2024
@kerthcet
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2024