[YUNIKORN-3118] Parallelize TryNode evaluations #1043
Conversation
Codecov Report

```
@@ Coverage Diff @@
## master    #1043      +/- ##
==========================================
- Coverage   81.46%   81.17%   -0.29%
==========================================
  Files         101      101
  Lines       13282    13411    +129
==========================================
+ Hits        10820    10887     +67
- Misses       2204     2254     +50
- Partials      258      270     +12
```
In the first round, my main concern (which might be unfounded) is the performance overhead of creating too many goroutines for unschedulable pods.
That's something I'd ask you to investigate using the recommended test case, but I'm open to collaborating on this.
```go
for idx, node := range batch {
	wg.Add(1)
	semaphore <- struct{}{}
	go func(idx int, node *Node) {
		defer wg.Done()
		defer func() { <-semaphore }()
		dryRunResult, err := sa.tryNodeDryRun(node, ask)

		mu.Lock()
		defer mu.Unlock()
		if err != nil {
			errors[idx] = err
		} else if dryRunResult != nil {
			candidateNodes[idx] = node
		}
	}(idx, node)
}
```
Interesting, but I do have a concern with this approach. What if we have unschedulable pods in a large cluster (e.g. 5000 nodes)? In that case we'd create 5000 goroutines for a single request in every scheduling cycle. With 10 unschedulable pods, that's 50,000 goroutines per cycle; overall, around 500k goroutines per second.
Goroutines are cheap, but not free.
This might be an extreme case, but we have to think about extremes, even if they're less common.
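To make the "cheap, but not free" point concrete, here is a minimal sketch (not YuniKorn code; all names are illustrative) that times spawning N trivial goroutines and waiting for them. The cost scales with N, which matters when N is roughly nodes × unschedulable pods per cycle.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spawnCost launches n goroutines that do nothing and measures how long
// it takes to create them all and wait for completion.
func spawnCost(n int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() { wg.Done() }()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// The absolute numbers are machine-dependent; the point is the
	// cost grows with n and is paid once per scheduling cycle.
	for _, n := range []int{1000, 50000} {
		fmt.Printf("%d goroutines: %v\n", n, spawnCost(n))
	}
}
```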
I'd definitely think about some sort of pooling solution: worker goroutines which are always running and waiting for asks to evaluate. It shouldn't be hard to implement.
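The pooling idea above could look roughly like this minimal sketch. The types and functions (`node`, `evaluate`, `evalPool`) are hypothetical stand-ins, not YuniKorn APIs: a fixed set of long-lived workers pulls nodes from a channel, instead of spawning one goroutine per node per cycle.

```go
package main

import (
	"fmt"
	"sync"
)

// node is an illustrative stand-in for a cluster node.
type node struct{ id int }

// evaluate simulates a predicate check; here, even-numbered nodes "fit".
func evaluate(n *node) bool { return n.id%2 == 0 }

// evalPool holds long-lived workers that keep pulling nodes to evaluate.
type evalPool struct {
	tasks chan *node
	wg    sync.WaitGroup
}

// newEvalPool starts `workers` goroutines once; fitting nodes are sent
// to the results channel.
func newEvalPool(workers int, results chan<- *node) *evalPool {
	p := &evalPool{tasks: make(chan *node)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for n := range p.tasks {
				if evaluate(n) {
					results <- n
				}
			}
		}()
	}
	return p
}

func (p *evalPool) submit(n *node) { p.tasks <- n }

// close stops accepting work and waits for the workers to drain.
func (p *evalPool) close() { close(p.tasks); p.wg.Wait() }

func main() {
	results := make(chan *node, 100)
	pool := newEvalPool(4, results)
	for i := 0; i < 100; i++ {
		pool.submit(&node{id: i})
	}
	pool.close()
	close(results)
	fmt.Println(len(results)) // 50: the even-numbered nodes fit
}
```

The key property is that goroutine creation cost is paid once at startup, not once per ask per cycle.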
Anyway, I do have a simple test case that checks performance in the shim under pkg/shim/scheduling_perf_test.go, called BenchmarkSchedulingThroughPut(). It could be modified to submit unschedulable pods (e.g. ones with a node selector that never matches) to see how that affects performance.
Hi @pbacsko, we do not create goroutines for each node. The parallelism is driven by the configuration option `tryNodesThreadCount`.
We evaluate the nodes in batches based on the thread count. We also use the same node ordering that YuniKorn already has, so the least used nodes are evaluated first.
Currently, in our existing setup, we use 100 threads for a 700-node cluster. This decreases the time for node evaluation at the cost of extra CPU.
Regarding an unschedulable pod in the system: even without the parallelism we try each node for it in every cycle before bailing out. This change just reduces the time taken for node evaluation in that case, provided the thread count is set appropriately. Too many threads = more overhead for managing the threads; too few = more time spent evaluating the nodes.
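The batching scheme described above can be sketched as follows. This is a simplified illustration, not the PR's actual code: nodes are split into batches of `threadCount`, each batch is evaluated concurrently, and results are collected in the original order so the "least used nodes first" ordering survives.

```go
package main

import (
	"fmt"
	"sync"
)

// evalBatches checks nodes in batches of size threadCount, running the
// fit predicate concurrently within each batch, and returns the fitting
// nodes in their original order.
func evalBatches(nodes []int, threadCount int, fits func(int) bool) []int {
	var candidates []int
	for start := 0; start < len(nodes); start += threadCount {
		end := start + threadCount
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]
		hits := make([]bool, len(batch))
		var wg sync.WaitGroup
		for i, n := range batch {
			wg.Add(1)
			go func(i, n int) {
				defer wg.Done()
				hits[i] = fits(n) // each goroutine writes its own slot
			}(i, n)
		}
		wg.Wait()
		// Collect in index order to preserve the node ordering.
		for i, ok := range hits {
			if ok {
				candidates = append(candidates, batch[i])
			}
		}
	}
	return candidates
}

func main() {
	nodes := []int{1, 2, 3, 4, 5, 6, 7}
	out := evalBatches(nodes, 3, func(n int) bool { return n%2 == 0 })
	fmt.Println(out) // [2 4 6]
}
```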
OK, I couldn't resist and tried this locally. There is a significant decrease in throughput, which baffled me a bit, but then I realized what was going on.

This test is a bit synthetic, so not really realistic. The first node in the B-tree is always a "hit", so this piece of code slows things down: it always has to construct at least a single batch (e.g. 24 nodes), then walk through the nodes even when that's not needed. The first node of the batch is always the one that we need, so we perform unnecessary predicate verification on the remaining 23 nodes, essentially wasting CPU cycles.

I'm thinking about how to make this more efficient for this case. Even if it's not always the first node that is good for us (e.g. it's the 5th node), we're still evaluating too many nodes. So we have to have some sort of threshold to make this improvement justifiable.

I think the "first node fit" case is one whose throughput is hard (or even impossible) to match with a parallel approach, simply due to the extra overhead. But I still suggest exploring different ways to make this more performant.
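One shape the threshold idea could take is a "first-fit fast path": probe the first few nodes serially before paying the cost of building a parallel batch. The sketch below is hypothetical (the `serialProbe` parameter and `findNode` helper are illustrative, not part of the PR); when early nodes usually fit, as in the benchmark, the fast path avoids wasted predicate evaluations on the rest of the batch.

```go
package main

import "fmt"

// findNode checks the first serialProbe nodes one by one; only if none
// of them fit does it fall back to scanning the remainder (which is
// where a parallel batch evaluation would kick in; shown serially here
// for brevity).
func findNode(nodes []int, serialProbe int, fits func(int) bool) (int, bool) {
	probe := serialProbe
	if probe > len(nodes) {
		probe = len(nodes)
	}
	// Fast path: cheap serial probe, no batch construction.
	for i := 0; i < probe; i++ {
		if fits(nodes[i]) {
			return nodes[i], true
		}
	}
	// Slow path: evaluate the remaining nodes.
	for _, n := range nodes[probe:] {
		if fits(n) {
			return n, true
		}
	}
	return 0, false
}

func main() {
	nodes := []int{10, 11, 12}
	n, ok := findNode(nodes, 2, func(n int) bool { return n%2 == 0 })
	fmt.Println(n, ok) // 10 true: the first node fits, so no batch is built
}
```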

What is this PR for?
Added a new feature that implements parallelism in tryNode evaluations.
By default this runs with a single thread, keeping it backward compatible. If required, the parallelism can be configured so that node evaluations run in parallel, significantly reducing latency.
We saw a 2-3x performance gain when running with parallelism turned on.
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-3118
How should this be tested?
Screenshots (if appropriate)
Questions: