
Conversation

@mitdesai
Contributor

What is this PR for?

Added a new feature that parallelizes tryNode evaluations.

By default this runs with a single thread, keeping it backward compatible. If required, the parallelism can be configured to run the node evaluations concurrently and significantly reduce latency.

We saw a 2-3x performance gain when running with parallelism turned on.

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

  • Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-3118

How should this be tested?

Screenshots (if appropriate)

Questions:

  • The license files need updating.
  • There are breaking changes for older versions.
  • It needs documentation.

@codecov

codecov bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 51.47059% with 66 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.17%. Comparing base (22d82c1) to head (6a96e4c).

Files with missing lines               Patch %   Lines
pkg/scheduler/objects/application.go   50.00%    47 Missing and 12 partials ⚠️
pkg/scheduler/objects/queue.go          0.00%    7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1043      +/-   ##
==========================================
- Coverage   81.46%   81.17%   -0.29%     
==========================================
  Files         101      101              
  Lines       13282    13411     +129     
==========================================
+ Hits        10820    10887      +67     
- Misses       2204     2254      +50     
- Partials      258      270      +12     


@wilfred-s wilfred-s requested review from manirajv06, pbacsko and wilfred-s and removed request for manirajv06 and pbacsko October 31, 2025 04:34
Contributor

@pbacsko pbacsko left a comment


In the first round, my main concern (which might be invalid) is the performance overhead of creating too many goroutines for unschedulable pods.

That's something I'd ask you to investigate using the recommended test case below, but I'm open to collaborating on this.

Comment on lines +1493 to +1509
for idx, node := range batch {
	wg.Add(1)
	semaphore <- struct{}{}
	go func(idx int, node *Node) {
		defer wg.Done()
		defer func() { <-semaphore }()
		dryRunResult, err := sa.tryNodeDryRun(node, ask)

		mu.Lock()
		defer mu.Unlock()
		if err != nil {
			errors[idx] = err
		} else if dryRunResult != nil {
			candidateNodes[idx] = node
		}
	}(idx, node)
}
Contributor

@pbacsko pbacsko Oct 31, 2025


Interesting, but I do have a concern with this approach. What if, in a large cluster (e.g. 5000 nodes), we have unschedulable pods? In that case, we'd create 5000 goroutines for a single request in every scheduling cycle. If we have 10 unschedulable pods, that's 50,000 goroutines per cycle; overall, that could be 500k goroutines per second.
Goroutines are cheap, but not free.

This might be an extreme case, but we have to think about extremes, even if they're less common.

I'd definitely think about some sort of pooling solution: essentially worker goroutines which are always running and waiting for asks to evaluate (see the sketch below). It shouldn't be hard to implement.

Anyway, I do have a simple test case which checks performance in the shim under pkg/shim/scheduling_perf_test.go. It's called BenchmarkSchedulingThroughPut(). This could be modified to submit unschedulable pods (e.g. ones with a node selector that never matches) to see how it affects performance.
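For illustration, here is a minimal sketch of the kind of long-lived worker pool suggested above. The names (evalTask, nodeEvalPool) and the callback shape are assumptions for the sketch, not code from this PR; it only shows the pattern of reusing workers across scheduling cycles instead of spawning one goroutine per node.

package objects

import "sync"

// evalTask is a hypothetical unit of work: one node to try for one ask.
type evalTask struct {
	idx  int
	node *Node
}

// nodeEvalPool keeps a fixed set of worker goroutines alive across
// scheduling cycles instead of spawning one goroutine per node.
type nodeEvalPool struct {
	tasks chan evalTask
	wg    sync.WaitGroup
}

// newNodeEvalPool starts the given number of workers; each blocks on the
// task channel and runs tryNode for every task submitted.
func newNodeEvalPool(workers int, tryNode func(evalTask)) *nodeEvalPool {
	p := &nodeEvalPool{tasks: make(chan evalTask)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for t := range p.tasks { // idle workers park here; no per-ask spawn cost
				tryNode(t)
			}
		}()
	}
	return p
}

// submit hands one node evaluation to the pool.
func (p *nodeEvalPool) submit(t evalTask) { p.tasks <- t }

// close drains the pool: no new tasks are accepted and the call returns
// once in-flight evaluations finish.
func (p *nodeEvalPool) close() {
	close(p.tasks)
	p.wg.Wait()
}

With a pool like this, the per-cycle cost becomes a channel send per node rather than a goroutine spawn, which directly addresses the 500k-goroutines-per-second scenario above.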

Contributor Author

@mitdesai mitdesai Oct 31, 2025


Hi @pbacsko, we do not create goroutines for each node. The parallelism is driven by the configuration 'tryNodesThreadCount'.

We evaluate the nodes in batches based on the thread count, and we use the same node ordering that YuniKorn already has, so the least used nodes are evaluated first (a rough sketch follows below).

In our existing setup we use 100 threads for a 700-node cluster. This decreases the time spent on node evaluation at the cost of extra CPU.

Regarding an unschedulable pod in the system: even without parallelism we try each node for it in every cycle before bailing out, so this change simply reduces the time taken for node evaluation in that case, provided the thread count is set appropriately. Too many threads means more overhead for managing them; too few means more time spent evaluating the nodes.
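To make the batching concrete, here is a rough sketch of the shape described above. It is not the PR's actual code, and tryNodesBatched/tryNode are illustrative names; it walks the already-sorted node list in batches of the configured thread count and short-circuits on the first batch that yields a candidate (assumes "sync" is imported):

func tryNodesBatched(nodes []*Node, threadCount int, tryNode func(*Node) *Node) *Node {
	for start := 0; start < len(nodes); start += threadCount {
		end := start + threadCount
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]
		results := make([]*Node, len(batch))
		var wg sync.WaitGroup
		for i, n := range batch {
			wg.Add(1)
			go func(i int, n *Node) {
				defer wg.Done()
				results[i] = tryNode(n) // nil when the node does not fit
			}(i, n)
		}
		wg.Wait()
		// scan results in order so the least used node in the batch still wins
		for _, r := range results {
			if r != nil {
				return r
			}
		}
	}
	return nil
}

With threadCount = 1 this degenerates to the existing sequential scan, which is what keeps the default backward compatible.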

@pbacsko
Contributor

pbacsko commented Oct 31, 2025

OK, I couldn't resist and tried this locally with BenchmarkSchedulingThroughPut().

There is a significant decrease in throughput, which baffled me a bit, but then I realized what was going on. This test is a bit synthetic, so not really realistic. The first node in the B-tree is always a "hit", so this piece of code slows things down because it always has to construct at least a single batch (e.g. 24 nodes) and then walk through the nodes even when that's not needed. The first node of the batch is always the one we need, so we perform unnecessary predicate verification for the remaining 23 nodes, essentially wasting CPU cycles.

I'm thinking about how to make this more efficient for this case. Even when it's not the first node that fits (say it's the 5th), we're still evaluating too many nodes. So we need some sort of threshold to make this improvement justifiable. I think the "first node fit" case is one whose throughput is hard (or even impossible) to match with a parallel approach, simply due to the extra overhead. But I still suggest exploring different ways to make this more performant.

@mitdesai
Contributor Author

  • You’re right: in this synthetic test the first node is always a hit, so batching/parallel evaluation adds overhead. We construct a batch (e.g. 24 nodes) and then validate 23 unnecessary candidates, which burns CPU and reduces throughput.
  • Parallelism is unlikely to beat the “first-node-fit” path because of coordination and per-candidate predicate checks. In these cases the optimal strategy is a single-threaded approach.

In short: match the thread count to your hit pattern. If the first candidate wins most of the time (≈90%), parallelism just adds overhead; run with a single thread and short-circuit on the first match.

Why and when parallelism helps:

  • In our environment, nodes are split across pools with taints/tolerations. A node may meet resource requirements but still be unschedulable due to taints, so sequential scans waste time on many ineligible nodes.
  • Evaluating multiple candidates concurrently reduces time-to-first-valid-node when viable nodes sit deeper in the list.

Concrete example (from the attached cluster snapshot):

[Screenshot: cluster utilization snapshot, 2025-10-31]
  • ~700 nodes total.
  • ~310 nodes (~44%) idle at ~10% CPU/memory; the remainder sit at 50–70% utilization.
  • Spark workloads are pinned to the more utilized pool via taints.
  • With sequential evaluation, the first ~300 tryNodes are effectively dead ends; useful work begins only once we reach the tainted pool.
  • Increasing parallelism (e.g., checking 10 nodes at once) cuts selection latency by roughly 10x because the checks are CPU-bound and don’t call the Kubernetes API.

More sophisticated future enhancements

  • For synthetic or “first-node-fit” scenarios, prefer batch size = 1, single thread, and immediate exit on success.
  • If hits typically occur deeper (e.g., around the 5th candidate or later), introduce a miss threshold before enabling parallel evaluation. Only pay the parallel overhead when early probes fail consistently.
  • Consider an adaptive policy: start single-threaded; after N misses (tunable), switch to K-way parallel checks with cancellation on first success; decay back to single-threaded when top-of-list hits return. A sketch of this policy follows below.
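A sketch of that adaptive idea, reusing the hypothetical tryNodesBatched helper from the earlier comment. The names and thresholds are illustrative, and per-node cancellation on first success is omitted here (the batched helper simply stops launching further batches once a candidate is found):

func adaptiveTryNodes(nodes []*Node, missThreshold, parallelism int, tryNode func(*Node) *Node) *Node {
	// phase 1: cheap sequential probes cover the "first-node-fit" case
	probes := missThreshold
	if probes > len(nodes) {
		probes = len(nodes)
	}
	for i := 0; i < probes; i++ {
		if n := tryNode(nodes[i]); n != nil {
			return n // early hit: the parallel overhead is never paid
		}
	}
	// phase 2: all early probes missed, so fan out to parallel batches
	return tryNodesBatched(nodes[probes:], parallelism, tryNode)
}

A fuller version would also track hit depth over time and decay back to the sequential path once top-of-list hits return.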

