
Conversation

@mitdesai
Contributor

What is this PR for?

Added a new feature that parallelizes tryNode evaluations.

By default this runs with a single thread, keeping it backward compatible. If required, the parallelism can be configured to run the node evaluations concurrently and significantly reduce latency.

We saw a 2-3x performance gain when running with parallelism turned on.

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

  • Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-3118

How should this be tested?

Screenshots (if appropriate)

Questions:

  • The license files need updating.
  • There are breaking changes for older versions.
  • It needs documentation.

@codecov

codecov bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 51.47059% with 66 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.17%. Comparing base (22d82c1) to head (6a96e4c).

Files with missing lines               Patch %   Lines
pkg/scheduler/objects/application.go   50.00%    47 Missing and 12 partials ⚠️
pkg/scheduler/objects/queue.go          0.00%    7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1043      +/-   ##
==========================================
- Coverage   81.46%   81.17%   -0.29%     
==========================================
  Files         101      101              
  Lines       13282    13411     +129     
==========================================
+ Hits        10820    10887      +67     
- Misses       2204     2254      +50     
- Partials      258      270      +12     


@wilfred-s wilfred-s requested review from manirajv06, pbacsko and wilfred-s and removed request for manirajv06 and pbacsko October 31, 2025 04:34
Contributor

@pbacsko pbacsko left a comment


In the first round, my main concern (which might be invalid) is the performance overhead of creating too many goroutines for unschedulable pods.

That's something I'd ask you to investigate using the recommended test case below, but I'm open to collaborating on this.

Comment on lines +1493 to +1509
for idx, node := range batch {
	wg.Add(1)
	semaphore <- struct{}{}
	go func(idx int, node *Node) {
		defer wg.Done()
		defer func() { <-semaphore }()
		dryRunResult, err := sa.tryNodeDryRun(node, ask)

		mu.Lock()
		defer mu.Unlock()
		if err != nil {
			errors[idx] = err
		} else if dryRunResult != nil {
			candidateNodes[idx] = node
		}
	}(idx, node)
}
Contributor

@pbacsko pbacsko Oct 31, 2025


Interesting, but I do have a concern with this approach. What if, in a large cluster (e.g. 5000 nodes), we have unschedulable pods? In that case, we'd create 5000 goroutines for a single request in every scheduling cycle. If we have 10 unschedulable pods, that's 50,000 goroutines per cycle; overall, that could be 500k goroutines per second.
Goroutines are cheap, but not free.

This might be an extreme case, but we have to think about extremes, even if they're less common.

I'd definitely think about some sort of pooling solution: essentially worker goroutines which are always running and waiting for asks to evaluate (see the sketch below). It shouldn't be hard to implement.

Anyway, I do have a simple test case which checks performance in the shim under pkg/shim/scheduling_perf_test.go. It's called BenchmarkSchedulingThroughPut(). This could be modified to submit unschedulable pods (e.g. ones with a node selector that never matches) to see how it affects performance.
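For illustration, here is a minimal sketch of the kind of long-lived worker pool suggested above. The names (evalTask, nodeEvalPool) and the callback shape are assumptions for the sketch, not code from this PR; it only shows the pattern of reusing workers across scheduling cycles instead of spawning one goroutine per node.

package objects

import "sync"

// evalTask is a hypothetical unit of work: one node to try for one ask.
type evalTask struct {
	idx  int
	node *Node
}

// nodeEvalPool keeps a fixed set of worker goroutines alive across
// scheduling cycles instead of spawning one goroutine per node.
type nodeEvalPool struct {
	tasks chan evalTask
	wg    sync.WaitGroup
}

// newNodeEvalPool starts the given number of workers; each blocks on the
// task channel and runs tryNode for every task submitted.
func newNodeEvalPool(workers int, tryNode func(evalTask)) *nodeEvalPool {
	p := &nodeEvalPool{tasks: make(chan evalTask)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for t := range p.tasks { // idle workers park here; no per-ask spawn cost
				tryNode(t)
			}
		}()
	}
	return p
}

// submit hands one node evaluation to the pool.
func (p *nodeEvalPool) submit(t evalTask) { p.tasks <- t }

// close drains the pool: no new tasks are accepted and the call returns
// once in-flight evaluations finish.
func (p *nodeEvalPool) close() {
	close(p.tasks)
	p.wg.Wait()
}

With a pool like this, the per-cycle cost becomes a channel send per node rather than a goroutine spawn, which directly addresses the 500k-goroutines-per-second scenario above.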

Contributor Author

@mitdesai mitdesai Oct 31, 2025


Hi @pbacsko, we do not create goroutines for each node. The parallelism is driven by the configuration 'tryNodesThreadCount'.

We evaluate the nodes in batches based on the thread count, and we use the same node ordering that YuniKorn already has, so the least used nodes are evaluated first (a rough sketch follows below).

In our existing setup we use 100 threads for a 700-node cluster. This decreases the time spent on node evaluation at the cost of extra CPU.

Regarding an unschedulable pod in the system: even without parallelism we try each node for it in every cycle before bailing out, so this change simply reduces the time taken for node evaluation in that case, provided the thread count is set appropriately. Too many threads means more overhead for managing them; too few means more time spent evaluating the nodes.
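To make the batching concrete, here is a rough sketch of the shape described above. It is not the PR's actual code, and tryNodesBatched/tryNode are illustrative names; it walks the already-sorted node list in batches of the configured thread count and short-circuits on the first batch that yields a candidate (assumes "sync" is imported):

func tryNodesBatched(nodes []*Node, threadCount int, tryNode func(*Node) *Node) *Node {
	for start := 0; start < len(nodes); start += threadCount {
		end := start + threadCount
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]
		results := make([]*Node, len(batch))
		var wg sync.WaitGroup
		for i, n := range batch {
			wg.Add(1)
			go func(i int, n *Node) {
				defer wg.Done()
				results[i] = tryNode(n) // nil when the node does not fit
			}(i, n)
		}
		wg.Wait()
		// scan results in order so the least used node in the batch still wins
		for _, r := range results {
			if r != nil {
				return r
			}
		}
	}
	return nil
}

With threadCount = 1 this degenerates to the existing sequential scan, which is what keeps the default backward compatible.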

@pbacsko
Contributor

pbacsko commented Oct 31, 2025

OK, I couldn't resist and tried this locally with BenchmarkSchedulingThroughPut().

There is a significant decrease in throughput, which baffled me a bit, but then I realized what was going on. This test is a bit synthetic, so not really realistic. The first node in the B-tree is always a "hit", so this piece of code slows things down because it always has to construct at least a single batch (e.g. 24 nodes) and then walk through the nodes even when that's not needed. The first node of the batch is always the one we need, so we perform unnecessary predicate verification for the remaining 23 nodes, essentially wasting CPU cycles.

I'm thinking about how to make this more efficient for this case. Even when it's not the first node that fits (say it's the 5th), we're still evaluating too many nodes. So we need some sort of threshold to make this improvement justifiable. I think the "first node fit" case is one whose throughput is hard (or even impossible) to match with a parallel approach, simply due to the extra overhead. But I still suggest exploring different ways to make this more performant.

@mitdesai
Contributor Author

  • You’re right: in this synthetic test the first node is always a hit, so batching/parallel evaluation adds overhead. We construct a batch (e.g. 24 nodes) and then validate 23 unnecessary candidates, which burns CPU and reduces throughput.
  • Parallelism is unlikely to beat the “first-node-fit” path because of coordination and per-candidate predicate checks. In these cases the optimal strategy is a single-threaded approach.

In short: match the thread count to your hit pattern. If the first candidate wins most of the time (≈90%), parallelism just adds overhead; run with a single thread and short-circuit on the first match.

Why and when parallelism helps:

  • In our environment, nodes are split across pools with taints/tolerations. A node may meet resource requirements but still be unschedulable due to taints, so sequential scans waste time on many ineligible nodes.
  • Evaluating multiple candidates concurrently reduces time-to-first-valid-node when viable nodes sit deeper in the list.

Concrete example (from the attached cluster snapshot):

[Screenshot: cluster utilization snapshot, 2025-10-31]
  • ~700 nodes total.
  • ~310 nodes (~44%) idle at ~10% CPU/memory; the remainder sit at 50–70% utilization.
  • Spark workloads are pinned to the more utilized pool via taints.
  • With sequential evaluation, the first ~300 tryNodes are effectively dead ends; useful work begins only once we reach the tainted pool.
  • Increasing parallelism (e.g., checking 10 nodes at once) cuts selection latency by roughly 10x because the checks are CPU-bound and don’t call the Kubernetes API.

More sophisticated future enhancements

  • For synthetic or “first-node-fit” scenarios, prefer batch size = 1, single thread, and immediate exit on success.
  • If hits typically occur deeper (e.g., around the 5th candidate or later), introduce a miss threshold before enabling parallel evaluation. Only pay the parallel overhead when early probes fail consistently.
  • Consider an adaptive policy: start single-threaded; after N misses (tunable), switch to K-way parallel checks with cancellation on first success; decay back to single-threaded when top-of-list hits return. A sketch of this policy follows below.
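A sketch of that adaptive idea, reusing the hypothetical tryNodesBatched helper from the earlier comment. The names and thresholds are illustrative, and per-node cancellation on first success is omitted here (the batched helper simply stops launching further batches once a candidate is found):

func adaptiveTryNodes(nodes []*Node, missThreshold, parallelism int, tryNode func(*Node) *Node) *Node {
	// phase 1: cheap sequential probes cover the "first-node-fit" case
	probes := missThreshold
	if probes > len(nodes) {
		probes = len(nodes)
	}
	for i := 0; i < probes; i++ {
		if n := tryNode(nodes[i]); n != nil {
			return n // early hit: the parallel overhead is never paid
		}
	}
	// phase 2: all early probes missed, so fan out to parallel batches
	return tryNodesBatched(nodes[probes:], parallelism, tryNode)
}

A fuller version would also track hit depth over time and decay back to the sequential path once top-of-list hits return.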

