
[proposal]Two better resource scheduling and allocation plugins #2298

Open
LY-today opened this issue Dec 18, 2024 · 9 comments
Labels
kind/proposal Create a report to help us improve

Comments

@LY-today (Contributor) commented Dec 18, 2024

What is your proposal:
The native Kubernetes NodeResourcesFit plugin can only apply a single scoring strategy, such as MostRequestedPriority or LeastRequestedPriority, to all resources. In industrial practice, this design does not fit some scenarios. For example, in AI scenarios, workloads that request GPUs prefer to fill whole GPU machines first, to prevent GPU fragmentation, while workloads that request only CPU and memory should preferentially be spread across non-GPU machines, so that they do not consume too much CPU and memory on GPU machines and leave GPU-requesting tasks Pending due to insufficient non-GPU resources. It is therefore hoped that both strategies can be supported at the same time to address this business need.

Why is this needed:
See the scenario described above.

Is there a suggested solution, if so, please add it:

plugin-one

config:

resources: 
  nvidia.com/gpu:
    type: MostAllocated
    weight: 2
  cpu:
    type: LeastAllocated
    weight: 1
  memory:
    type: LeastAllocated
    weight: 1

config description:
[image: field-by-field description of the config above]

node score:

finalScoreNode = [(weight1 * resource1) + (weight2 * resource2) + … + (weightN * resourceN)] / (weight1 + weight2 + … + weightN)
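
For illustration, a minimal runnable sketch of this weighted average under the example config above; the per-resource scores are hypothetical values in [0, 100], as the MostAllocated/LeastAllocated strategies would produce for one node:

package main

import "fmt"

func main() {
    // Hypothetical per-resource scores in [0, 100] for one node, as produced
    // by the MostAllocated / LeastAllocated strategies.
    scores := map[string]int64{"nvidia.com/gpu": 80, "cpu": 60, "memory": 70}
    // Weights taken from the example config above.
    weights := map[string]int64{"nvidia.com/gpu": 2, "cpu": 1, "memory": 1}

    var nodeScore, weightSum int64
    for name, score := range scores {
        nodeScore += score * weights[name]
        weightSum += weights[name]
    }
    // (2*80 + 1*60 + 1*70) / (2+1+1) = 290/4 = 72 with integer division.
    fmt.Println(nodeScore / weightSum)
}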

plugin-two

config:

resources: 
- nvidia.com/gpu 

config description:
[image: field-by-field description of the config above]

node score:

finalScoreNode = (allocatablesResourcesNum - requestsResourcesNum) * framework.MaxNodeScore / allocatablesResourcesNum
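
A minimal sketch of how this formula plays out, assuming framework.MaxNodeScore is 100 and reading the counts as the node's surplus resource types (total vs. configured-scarce); the helper name resourceTypesScore is taken from the implementation sketch later in this thread, but its body here is a hypothetical reconstruction:

package main

import "fmt"

const maxNodeScore = 100 // assumed value of framework.MaxNodeScore

// resourceTypesScore is a hypothetical reconstruction of the formula above:
// the more of a node's surplus resource types are scarce, the lower its score.
func resourceTypesScore(scarceSurplus, totalSurplus int64) int64 {
    if totalSurplus == 0 {
        return maxNodeScore
    }
    return (totalSurplus - scarceSurplus) * maxNodeScore / totalSurplus
}

func main() {
    // GPU node, CPU-only pod: the surplus is {nvidia.com/gpu}, which is scarce.
    fmt.Println(resourceTypesScore(1, 1)) // 0 -> steer CPU pods away from GPU nodes
    // Non-GPU node, CPU-only pod: no scarce surplus.
    fmt.Println(resourceTypesScore(0, 1)) // 100 -> prefer non-GPU nodes
}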
LY-today added the kind/proposal label on Dec 18, 2024
@LY-today (Contributor, Author) commented Dec 18, 2024

/assign @LY-today

@songtao98 (Contributor) commented:

Nice proposal. Maybe we can discuss whether to take this as a plugin or a strategy? PTAL @ZiMengSheng @saintube

@LY-today (Contributor, Author) commented:

> Nice proposal. Maybe we can discuss whether to take this as a plugin or a strategy? PTAL @ZiMengSheng @saintube

If the community approves, I can contribute an MR.

@ZiMengSheng (Contributor) commented:

What do these plugins look like in implementation detail?

@LY-today (Contributor, Author) commented:

> What do these plugins look like in implementation detail?

plugin-one

func (s *Sample) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    // Get the node snapshot.
    nodeInfo, _ := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)

    // Sum of weighted scores, an integer multiple of 100, e.g. 100*1 + 100*2 + 100*3 = 600.
    var nodeScore int64
    // Sum of weights, e.g. 1 + 2 + 3 = 6.
    var weightSum int64

    // Get the list of resource types requested by the pod.
    podRequest, _ := fitsRequest(computePodResourceRequest(p).Resource, nodeInfo)

    /*
    resources:
      gpu:
        type: MostAllocated
        weight: 1
      cpu:
        type: LeastAllocated
        weight: 1
      memory:
        type: LeastAllocated
        weight: 1
    */
    // Iterate over the resource types requested by the pod, not over the
    // resource types configured in the scheduler.
    for _, requestSourceName := range podRequest {
        // Look up the configured strategy for this resource type.
        v, ok := s.args.Resources[requestSourceName]
        if !ok {
            continue
        }
        fit, err := noderesources.NewFit(
            &config.NodeResourcesFitArgs{
                ScoringStrategy: &config.ScoringStrategy{
                    Type: v.Type, // MostAllocated or LeastAllocated
                    Resources: []config.ResourceSpec{
                        // Resource type and weight. Since we only reuse the two native
                        // scoring strategies and apply our own weights in the outer logic,
                        // the inner weight is always 1.
                        {Name: string(requestSourceName), Weight: 1},
                    },
                },
            }, s.handle, plfeature.Features{})

        if err != nil {
            return 0, framework.NewStatus(framework.Error, err.Error())
        }

        // Delegate to the wrapped native strategy.
        resourceScore, _ := fit.(framework.ScorePlugin).Score(ctx, state, p, nodeName)
        // Multiply by the configured weight.
        nodeScore += resourceScore * v.Weight
        // Accumulate the weights.
        weightSum += v.Weight
    }

    // None of the pod's requested resource types are covered by this plugin's config.
    if weightSum == 0 {
        return framework.MaxNodeScore, framework.NewStatus(framework.Success, "")
    }

    // Guaranteed to be at most 100, since each per-resource score is capped at 100.
    scores := nodeScore / weightSum

    return scores, framework.NewStatus(framework.Success, "")
}
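
For context, a hypothetical sketch of the plugin args that s.args.Resources above would read from (the actual type definitions are not shown in this thread); field names mirror the example config:

// Hypothetical args types; the real definitions are not part of this thread.
type ResourceStrategy struct {
    Type   config.ScoringStrategyType // MostAllocated or LeastAllocated
    Weight int64
}

type SampleArgs struct {
    // Keyed by resource name, e.g. "nvidia.com/gpu", "cpu", "memory".
    Resources map[v1.ResourceName]ResourceStrategy
}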

plugin-two

func (s *Sample) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    // Based on the native logic in pkg/scheduler/framework/plugins/noderesources/fit.go (the
    // Filter function, which checks whether a node's remaining resources satisfy the pending
    // pod's requests); that function cannot be reused directly.
    nodeInfo, _ := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)

    podRequest := computePodResourceRequest(p)
    // Get the resource types requested by the pod and the resource types allocatable on the node.
    podRequestResource, nodeAllocatableResource := fitsRequest(podRequest.Resource, nodeInfo)
    // The node's surplus resource types = the node's allocatable resource types minus the
    // pod's requested resource types.
    diffNames := difference(nodeAllocatableResource, podRequestResource)
    // Count how many of the surplus resource types are configured as scarce, e.g.:
    /*
    resources:
    - nvidia.com/gpu // scarce resource
    */
    intersectNames := intersection(diffNames, s.args.Resources)
    // The more scarce resource types a node would leave unused, the lower its score.
    scores := resourceTypesScore(int64(len(intersectNames)), int64(len(diffNames)))
    return scores, framework.NewStatus(framework.Success, "")
}
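
The difference and intersection helpers are not shown in the thread; a hypothetical implementation over resource-name sets could look like this:

// Hypothetical set helpers over resource names (v1 is k8s.io/api/core/v1).
func difference(a, b []v1.ResourceName) []v1.ResourceName {
    inB := make(map[v1.ResourceName]struct{}, len(b))
    for _, name := range b {
        inB[name] = struct{}{}
    }
    var out []v1.ResourceName
    for _, name := range a {
        if _, ok := inB[name]; !ok {
            out = append(out, name) // in a but not in b
        }
    }
    return out
}

func intersection(a, b []v1.ResourceName) []v1.ResourceName {
    inB := make(map[v1.ResourceName]struct{}, len(b))
    for _, name := range b {
        inB[name] = struct{}{}
    }
    var out []v1.ResourceName
    for _, name := range a {
        if _, ok := inB[name]; ok {
            out = append(out, name) // in both a and b
        }
    }
    return out
}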

@ZiMengSheng (Contributor) commented:

OK

  1. plugin1 is like an enhancement of NodeResourcesFit, and the Koordinator DeviceShare plugin has implemented a similar strategy.
  2. The plugin2 strategy can be achieved with preferred nodeAffinity, but that adds an extra affinity to the pod (see the sketch below). Maybe we can modify the DeviceShare strategy so that GPU nodes prefer not to take CPU tasks.
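
For reference, the preferred-nodeAffinity alternative mentioned in point 2 would look roughly like this on a CPU-only pod, assuming GPU nodes carry a hypothetical label such as node.example.com/gpu=true:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node.example.com/gpu   # hypothetical GPU-node label
          operator: NotIn
          values: ["true"]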

@LY-today (Contributor, Author) commented Dec 19, 2024

> OK
>
>   1. plugin1 is like an enhancement of NodeResourcesFit, and the Koordinator DeviceShare plugin has implemented a similar strategy.
>   2. The plugin2 strategy can be achieved with preferred nodeAffinity, but that adds an extra affinity to the pod. Maybe we can modify the DeviceShare strategy so that GPU nodes prefer not to take CPU tasks.

Should I understand that the community does not intend to integrate these two plugins, and instead suggests adjusting the DeviceShare policy?

@LY-today (Contributor, Author) commented Dec 19, 2024

@ZiMengSheng I took a look at koordinator-sh/website#187. The DeviceShare plugin seems to have a high adoption cost for traditional nvidia.com/gpu and early vGPU setups, and it does not seem to be very mature yet. Extending these two plugins may be the fastest way to deliver value for AI scenarios at this stage.

@LY-today (Contributor, Author) commented:

MR: #2302
