
[proposal]Two better resource scheduling and allocation plugins #2298

Open
LY-today opened this issue Dec 18, 2024 · 9 comments
Labels
kind/proposal Create a report to help us improve

Comments

@LY-today (Contributor) commented Dec 18, 2024

What is your proposal:
The native Kubernetes NodeResourcesFit plugin can only apply a single scoring strategy, such as MostRequestedPriority or LeastRequestedPriority, to all resources. In industrial practice, this design does not fit some scenarios. For example, in AI scenarios, workloads that request GPUs prefer to fill whole GPU machines first, to prevent GPU fragmentation, while workloads that request only CPU and memory should preferentially be spread across non-GPU machines, so that they do not consume too much CPU and memory on GPU machines and leave GPU-requesting tasks Pending due to insufficient non-GPU resources. It is therefore hoped that both strategies can be supported at the same time to address this business need.

Why is this needed:
See the scenario described above.

Is there a suggested solution, if so, please add it:

plugin-one

config:

resources: 
  nvidia.com/gpu:
    type: MostAllocated
    weight: 2
  cpu:
    type: LeastAllocated
    weight: 1
  memory:
    type: LeastAllocated
    weight: 1

config description:
[image: field-by-field description of the config above]

node score:

finalScoreNode = [(weight1 * resource1) + (weight2 * resource2) + … + (weightN * resourceN)] / (weight1 + weight2 + … + weightN)
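
For illustration, a minimal runnable sketch of this weighted average under the example config above; the per-resource scores are hypothetical values in [0, 100], as the MostAllocated/LeastAllocated strategies would produce for one node:

package main

import "fmt"

func main() {
    // Hypothetical per-resource scores in [0, 100] for one node, as produced
    // by the MostAllocated / LeastAllocated strategies.
    scores := map[string]int64{"nvidia.com/gpu": 80, "cpu": 60, "memory": 70}
    // Weights taken from the example config above.
    weights := map[string]int64{"nvidia.com/gpu": 2, "cpu": 1, "memory": 1}

    var nodeScore, weightSum int64
    for name, score := range scores {
        nodeScore += score * weights[name]
        weightSum += weights[name]
    }
    // (2*80 + 1*60 + 1*70) / (2+1+1) = 290/4 = 72 with integer division.
    fmt.Println(nodeScore / weightSum)
}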

plugin-two

config:

resources: 
- nvidia.com/gpu 

config description:
[image: field-by-field description of the config above]

node score:

finalScoreNode = (allocatablesResourcesNum - requestsResourcesNum) * framework.MaxNodeScore / allocatablesResourcesNum
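
A minimal sketch of how this formula plays out, assuming framework.MaxNodeScore is 100 and reading the counts as the node's surplus resource types (total vs. configured-scarce); the helper name resourceTypesScore is taken from the implementation sketch later in this thread, but its body here is a hypothetical reconstruction:

package main

import "fmt"

const maxNodeScore = 100 // assumed value of framework.MaxNodeScore

// resourceTypesScore is a hypothetical reconstruction of the formula above:
// the more of a node's surplus resource types are scarce, the lower its score.
func resourceTypesScore(scarceSurplus, totalSurplus int64) int64 {
    if totalSurplus == 0 {
        return maxNodeScore
    }
    return (totalSurplus - scarceSurplus) * maxNodeScore / totalSurplus
}

func main() {
    // GPU node, CPU-only pod: the surplus is {nvidia.com/gpu}, which is scarce.
    fmt.Println(resourceTypesScore(1, 1)) // 0 -> steer CPU pods away from GPU nodes
    // Non-GPU node, CPU-only pod: no scarce surplus.
    fmt.Println(resourceTypesScore(0, 1)) // 100 -> prefer non-GPU nodes
}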
LY-today added the kind/proposal label on Dec 18, 2024
@LY-today (Contributor, Author) commented Dec 18, 2024

/assign @LY-today

@songtao98 (Contributor) commented:

Nice proposal. Maybe we can discuss whether to take this as a plugin or a strategy? PTAL @ZiMengSheng @saintube

@LY-today (Contributor, Author) commented:

> Nice proposal. Maybe we can discuss whether to take this as a plugin or a strategy? PTAL @ZiMengSheng @saintube

If the community approves, I can contribute an MR.

@ZiMengSheng (Contributor) commented:

What do these plugins look like in implementation detail?

@LY-today (Contributor, Author) commented:

> What do these plugins look like in implementation detail?

plugin-one

func (s *Sample) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    // Get the node snapshot.
    nodeInfo, _ := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)

    // Sum of weighted scores, an integer multiple of 100, e.g. 100*1 + 100*2 + 100*3 = 600.
    var nodeScore int64
    // Sum of weights, e.g. 1 + 2 + 3 = 6.
    var weightSum int64

    // Get the list of resource types requested by the pod.
    podRequest, _ := fitsRequest(computePodResourceRequest(p).Resource, nodeInfo)

    /*
    resources:
      gpu:
        type: MostAllocated
        weight: 1
      cpu:
        type: LeastAllocated
        weight: 1
      memory:
        type: LeastAllocated
        weight: 1
    */
    // Iterate over the resource types requested by the pod, not over the
    // resource types configured in the scheduler.
    for _, requestSourceName := range podRequest {
        // Look up the configured strategy for this resource type.
        v, ok := s.args.Resources[requestSourceName]
        if !ok {
            continue
        }
        fit, err := noderesources.NewFit(
            &config.NodeResourcesFitArgs{
                ScoringStrategy: &config.ScoringStrategy{
                    Type: v.Type, // MostAllocated or LeastAllocated
                    Resources: []config.ResourceSpec{
                        // Resource type and weight. Since we only reuse the two native
                        // scoring strategies and apply our own weights in the outer logic,
                        // the inner weight is always 1.
                        {Name: string(requestSourceName), Weight: 1},
                    },
                },
            }, s.handle, plfeature.Features{})

        if err != nil {
            return 0, framework.NewStatus(framework.Error, err.Error())
        }

        // Delegate to the wrapped native strategy.
        resourceScore, _ := fit.(framework.ScorePlugin).Score(ctx, state, p, nodeName)
        // Multiply by the configured weight.
        nodeScore += resourceScore * v.Weight
        // Accumulate the weights.
        weightSum += v.Weight
    }

    // None of the pod's requested resource types are covered by this plugin's config.
    if weightSum == 0 {
        return framework.MaxNodeScore, framework.NewStatus(framework.Success, "")
    }

    // Guaranteed to be at most 100, since each per-resource score is capped at 100.
    scores := nodeScore / weightSum

    return scores, framework.NewStatus(framework.Success, "")
}
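
For context, a hypothetical sketch of the plugin args that s.args.Resources above would read from (the actual type definitions are not shown in this thread); field names mirror the example config:

// Hypothetical args types; the real definitions are not part of this thread.
type ResourceStrategy struct {
    Type   config.ScoringStrategyType // MostAllocated or LeastAllocated
    Weight int64
}

type SampleArgs struct {
    // Keyed by resource name, e.g. "nvidia.com/gpu", "cpu", "memory".
    Resources map[v1.ResourceName]ResourceStrategy
}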

plugin-two

func (s *Sample) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    // Based on the native logic in pkg/scheduler/framework/plugins/noderesources/fit.go (the
    // Filter function, which checks whether a node's remaining resources satisfy the pending
    // pod's requests); that function cannot be reused directly.
    nodeInfo, _ := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)

    podRequest := computePodResourceRequest(p)
    // Get the resource types requested by the pod and the resource types allocatable on the node.
    podRequestResource, nodeAllocatableResource := fitsRequest(podRequest.Resource, nodeInfo)
    // The node's surplus resource types = the node's allocatable resource types minus the
    // pod's requested resource types.
    diffNames := difference(nodeAllocatableResource, podRequestResource)
    // Count how many of the surplus resource types are configured as scarce, e.g.:
    /*
    resources:
    - nvidia.com/gpu // scarce resource
    */
    intersectNames := intersection(diffNames, s.args.Resources)
    // The more scarce resource types a node would leave unused, the lower its score.
    scores := resourceTypesScore(int64(len(intersectNames)), int64(len(diffNames)))
    return scores, framework.NewStatus(framework.Success, "")
}
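
The difference and intersection helpers are not shown in the thread; a hypothetical implementation over resource-name sets could look like this:

// Hypothetical set helpers over resource names (v1 is k8s.io/api/core/v1).
func difference(a, b []v1.ResourceName) []v1.ResourceName {
    inB := make(map[v1.ResourceName]struct{}, len(b))
    for _, name := range b {
        inB[name] = struct{}{}
    }
    var out []v1.ResourceName
    for _, name := range a {
        if _, ok := inB[name]; !ok {
            out = append(out, name) // in a but not in b
        }
    }
    return out
}

func intersection(a, b []v1.ResourceName) []v1.ResourceName {
    inB := make(map[v1.ResourceName]struct{}, len(b))
    for _, name := range b {
        inB[name] = struct{}{}
    }
    var out []v1.ResourceName
    for _, name := range a {
        if _, ok := inB[name]; ok {
            out = append(out, name) // in both a and b
        }
    }
    return out
}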

@ZiMengSheng (Contributor) commented:

OK

  1. plugin1 is like an enhancement of NodeResourcesFit, and the Koordinator DeviceShare plugin has implemented a similar strategy.
  2. The plugin2 strategy can be achieved with preferred nodeAffinity, but that adds an extra affinity to the pod (see the sketch below). Maybe we can modify the DeviceShare strategy so that GPU nodes prefer not to take CPU tasks.
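
For reference, the preferred-nodeAffinity alternative mentioned in point 2 would look roughly like this on a CPU-only pod, assuming GPU nodes carry a hypothetical label such as node.example.com/gpu=true:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node.example.com/gpu   # hypothetical GPU-node label
          operator: NotIn
          values: ["true"]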

@LY-today (Contributor, Author) commented Dec 19, 2024

> OK
>
>   1. plugin1 is like an enhancement of NodeResourcesFit, and the Koordinator DeviceShare plugin has implemented a similar strategy.
>   2. The plugin2 strategy can be achieved with preferred nodeAffinity, but that adds an extra affinity to the pod. Maybe we can modify the DeviceShare strategy so that GPU nodes prefer not to take CPU tasks.

Should I understand that the community does not intend to integrate these two plugins, and instead suggests adjusting the DeviceShare policy?

@LY-today (Contributor, Author) commented Dec 19, 2024

@ZiMengSheng I took a look at koordinator-sh/website#187. The DeviceShare plugin seems to have a high adoption cost for traditional nvidia.com/gpu and early vGPU setups, and it does not seem to be very mature yet. Extending these two plugins may be the fastest way to deliver value for AI scenarios at this stage.

@LY-today (Contributor, Author) commented:

MR: #2302
