
Conversation

Contributor

@adrian-lin-1-0-0 adrian-lin-1-0-0 commented Oct 23, 2025

What is this PR for?

  • Replace the DFS-based FindQueueByAppID (worst-case O(n) over all leaves) with an O(1) cached lookup.
  • Introduce a thread-safe AppQueueMapping owned by the partition and shared by all queues.

Motivation

  • The previous FindQueueByAppID could traverse all leaves in the worst case (O(n)).
  • Scheduling and preemption paths call this frequently; reducing it to O(1) significantly lowers overhead.

Key changes

  • objects/app_queue_mapping.go: new RWMutex-protected appID → *Queue index.
  • objects/queue.go:
    • Queue holds a shared AppQueueMapping.
    • NewConfiguredQueue/NewDynamicQueue/NewRecoveryQueue accept and propagate the mapping.
    • FindQueueByAppID uses the index (no tree walk).
  • scheduler/partition.go:
    • PartitionContext creates a single AppQueueMapping and passes it to queue constructors.
    • AddApplication/RemoveApplication update the mapping when apps are added/removed.
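
As a rough illustration of the index described above, here is a minimal sketch of an RWMutex-protected appID → *Queue map. It is not the code from this PR: `Queue` is reduced to a single field, `sync.RWMutex` stands in for the project's own locking wrapper, and the method names `AddAppQueueMapping` and `RemoveAppQueueMapping` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is a stand-in for the scheduler's queue type; only the path matters here.
type Queue struct {
	Path string
}

// AppQueueMapping is a thread-safe index from application ID to its queue.
// sync.RWMutex is an assumption; the real code uses the project's locking package.
type AppQueueMapping struct {
	mu      sync.RWMutex
	byAppID map[string]*Queue
}

// NewAppQueueMapping returns an empty, ready-to-use mapping.
func NewAppQueueMapping() *AppQueueMapping {
	return &AppQueueMapping{byAppID: make(map[string]*Queue)}
}

// AddAppQueueMapping (hypothetical name) records which queue owns appID.
func (m *AppQueueMapping) AddAppQueueMapping(appID string, q *Queue) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.byAppID[appID] = q
}

// RemoveAppQueueMapping (hypothetical name) forgets appID; unknown IDs are a no-op.
func (m *AppQueueMapping) RemoveAppQueueMapping(appID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.byAppID, appID)
}

// FindQueueByAppID returns the owning queue, or nil if appID is not indexed.
func (m *AppQueueMapping) FindQueueByAppID(appID string) *Queue {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.byAppID[appID]
}

func main() {
	m := NewAppQueueMapping()
	m.AddAppQueueMapping("app-1", &Queue{Path: "root.default"})
	fmt.Println(m.FindQueueByAppID("app-1").Path) // root.default

	m.RemoveAppQueueMapping("app-1")
	fmt.Println(m.FindQueueByAppID("app-1") == nil) // true
}
```

Readers take only the read lock, so concurrent lookups from scheduling paths do not serialize against each other; only add and remove take the write lock.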

Compatibility and API notes

  • Internal constructor signatures changed:
    • NewConfiguredQueue(conf, parent, silence, appQueueMapping)
    • NewDynamicQueue(name, leaf, parent, appQueueMapping)
    • NewRecoveryQueue(parent, appQueueMapping)

Performance

  • About 50x faster (4143 ns/op → 82.68 ns/op) with far fewer allocations (32 → 1 allocs/op, 4630 → 48 B/op).
goos: linux
goarch: amd64
pkg: github.com/apache/yunikorn-core/pkg/scheduler/objects
cpu: AMD Ryzen 7 9800X3D 8-Core Processor           
BenchmarkFindQueueByAppID_Cached-16    	15605854	        82.68 ns/op	      48 B/op	       1 allocs/op
BenchmarkFindQueueByAppID_Origin-16    	  290314	      4143 ns/op	    4630 B/op	      32 allocs/op
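
To make the gap in the numbers above easier to picture, the following toy program contrasts a depth-first walk with a map lookup and counts how many queues the walk inspects. It is only a sketch: the `queue` type, `findByDFS`, and `buildTree` are hypothetical, and the flat tree of 100 leaves is an assumption, not the shape used in the benchmark.

```go
package main

import "fmt"

// queue is a toy queue node (hypothetical, not the YuniKorn type).
type queue struct {
	path     string
	children []*queue
	apps     map[string]bool // populated on leaves only
}

// findByDFS walks the tree until it reaches a leaf that holds appID.
// visited is incremented once per inspected queue.
func findByDFS(q *queue, appID string, visited *int) *queue {
	*visited++
	if len(q.children) == 0 {
		if q.apps[appID] {
			return q
		}
		return nil
	}
	for _, c := range q.children {
		if found := findByDFS(c, appID, visited); found != nil {
			return found
		}
	}
	return nil
}

// buildTree creates a root with the given number of leaves (must be >= 1),
// places appID in the last leaf, and returns the root plus an appID index.
func buildTree(leaves int, appID string) (*queue, map[string]*queue) {
	root := &queue{path: "root"}
	for i := 0; i < leaves; i++ {
		leaf := &queue{path: fmt.Sprintf("root.leaf-%d", i), apps: map[string]bool{}}
		root.children = append(root.children, leaf)
	}
	last := root.children[leaves-1]
	last.apps[appID] = true
	return root, map[string]*queue{appID: last}
}

func main() {
	root, index := buildTree(100, "app-1")

	visited := 0
	viaDFS := findByDFS(root, "app-1", &visited)
	viaIndex := index["app-1"] // a single map access, independent of tree size

	fmt.Println(viaDFS == viaIndex, visited) // true 101
}
```

The walk touches the root and every leaf before it finds the application, while the indexed lookup costs the same regardless of how many queues exist.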

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

[YUNIKORN-2057] FindQueueByAppID is slow

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

@baconYao

cc @chenyulin0719, need your help. Thanks

@codecov

codecov bot commented Oct 24, 2025

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.91%. Comparing base (446c1b3) to head (19702d2).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
pkg/scheduler/objects/application.go 0.00% 0 Missing and 1 partial ⚠️
pkg/scheduler/partition.go 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1037      +/-   ##
==========================================
+ Coverage   80.78%   80.91%   +0.13%     
==========================================
  Files          98       99       +1     
  Lines       15765    12882    -2883     
==========================================
- Hits        12735    10424    -2311     
+ Misses       2771     2201     -570     
+ Partials      259      257       -2     


Comment on lines 98 to 102
// appQueueMapping is a thread-safe mapping from applicationID to queuePath
type appQueueMapping struct {
	byAppID map[string]string
	locking.RWMutex
}
Contributor

@pbacsko pbacsko Oct 24, 2025

This should be maintained globally, not on a per-queue basis. I think the ideal place is PartitionContext. However, the data type has to stay in pkg/scheduler/objects to avoid circular references.

Here is my take:

  1. Make it public (AppQueueMapping)
  2. Extract this type to a separate file, e.g. pkg/scheduler/objects/app_queue_mapping.go, and create simple unit tests for it
  3. Create a single instance inside PartitionContext when the context is created
  4. When a Queue is created, inject the instance into the Queue. Extend NewConfiguredQueue(), NewRecoveryQueue() and NewDynamicQueue() with an extra argument.
  5. Have a reference inside Queue to AppQueueMapping
  6. You can add/remove mappings in PartitionContext.AddApplication() and PartitionContext.RemoveApplication()

With this approach, everything is much simpler and we don't have to walk the Queue hierarchy or find the root queue at all.
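
The steps above could be sketched roughly as follows. This is a simplified illustration, not the PR's code: the single `NewQueue` constructor stands in for `NewConfiguredQueue`, `NewRecoveryQueue`, and `NewDynamicQueue`, the `AddApplication`/`RemoveApplication` signatures are reduced to the bare minimum, and `sync.RWMutex` replaces the project's locking wrapper.

```go
package main

import (
	"fmt"
	"sync"
)

// AppQueueMapping is the shared appID -> queue index (simplified).
type AppQueueMapping struct {
	mu      sync.RWMutex
	byAppID map[string]*Queue
}

func NewAppQueueMapping() *AppQueueMapping {
	return &AppQueueMapping{byAppID: make(map[string]*Queue)}
}

func (m *AppQueueMapping) add(appID string, q *Queue) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.byAppID[appID] = q
}

func (m *AppQueueMapping) remove(appID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.byAppID, appID)
}

func (m *AppQueueMapping) find(appID string) *Queue {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.byAppID[appID]
}

// Queue keeps a reference to the mapping that was injected at construction.
type Queue struct {
	name            string
	parent          *Queue
	appQueueMapping *AppQueueMapping
}

// NewQueue is a hypothetical stand-in for the real queue constructors;
// the extra last argument is the injected mapping.
func NewQueue(name string, parent *Queue, mapping *AppQueueMapping) *Queue {
	return &Queue{name: name, parent: parent, appQueueMapping: mapping}
}

// FindQueueByAppID consults the shared index; no tree walk, no root lookup.
func (q *Queue) FindQueueByAppID(appID string) *Queue {
	return q.appQueueMapping.find(appID)
}

// PartitionContext owns the single mapping instance.
type PartitionContext struct {
	root            *Queue
	appQueueMapping *AppQueueMapping
}

func NewPartitionContext() *PartitionContext {
	m := NewAppQueueMapping()
	return &PartitionContext{root: NewQueue("root", nil, m), appQueueMapping: m}
}

// AddApplication and RemoveApplication keep the index in sync (simplified signatures).
func (pc *PartitionContext) AddApplication(appID string, leaf *Queue) {
	pc.appQueueMapping.add(appID, leaf)
}

func (pc *PartitionContext) RemoveApplication(appID string) {
	pc.appQueueMapping.remove(appID)
}

func main() {
	pc := NewPartitionContext()
	a := NewQueue("a", pc.root, pc.appQueueMapping)
	b := NewQueue("b", pc.root, pc.appQueueMapping)

	pc.AddApplication("app-1", a)
	fmt.Println(b.FindQueueByAppID("app-1").name) // a

	pc.RemoveApplication("app-1")
	fmt.Println(pc.root.FindQueueByAppID("app-1") == nil) // true
}
```

Because every queue holds the same pointer, a lookup issued from any queue in the hierarchy sees the same data without navigating to the root first.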

Contributor Author

I initially tried managing the global state in PartitionContext,
but FindQueueByAppID is often called together with other PartitionContext operations.
When PartitionContext holds a lock and the queue calls any getXXXX function, it can cause a deadlock.

As you suggested, making AppQueueMapping public, limiting the lock scope to AppQueueMapping itself, and injecting it into the queue should prevent this issue.

I’ll confirm the scope of changes and push an updated version.
Thanks for the review!
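
For illustration, the locking concern can be reduced to the sketch below. It assumes plain `sync.RWMutex` semantics (not reentrant) and uses hypothetical, simplified types: `partition`, `findQueuePath`, and `addApplication` are names made up for the example. The lookup acquires only the mapping's own lock, so it remains safe to call while the partition lock is held.

```go
package main

import (
	"fmt"
	"sync"
)

// appQueueMapping guards its map with its own lock and calls nothing else
// while holding it, so it can be used from any locking context.
type appQueueMapping struct {
	mu      sync.RWMutex
	byAppID map[string]string // appID -> queue path
}

func newAppQueueMapping() *appQueueMapping {
	return &appQueueMapping{byAppID: make(map[string]string)}
}

func (m *appQueueMapping) add(appID, path string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.byAppID[appID] = path
}

func (m *appQueueMapping) find(appID string) string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.byAppID[appID]
}

// partition has its own lock, separate from the mapping's lock.
type partition struct {
	mu      sync.RWMutex
	mapping *appQueueMapping
}

// findQueuePath models what a queue would call. It never touches partition.mu.
func findQueuePath(m *appQueueMapping, appID string) string {
	return m.find(appID)
}

// addApplication holds the partition lock for the whole call and performs a
// lookup while still holding it.
func (p *partition) addApplication(appID, path string) string {
	p.mu.Lock()
	defer p.mu.Unlock()

	p.mapping.add(appID, path)

	// Safe: the lookup needs mapping.mu only. If it had to take p.mu.RLock()
	// instead, this goroutine would block forever on its own write lock.
	return findQueuePath(p.mapping, appID)
}

func main() {
	p := &partition{mapping: newAppQueueMapping()}
	fmt.Println(p.addApplication("app-1", "root.default")) // root.default
}
```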

Contributor Author

The structure would look like this:

type PartitionContext struct {
    appQueueMapping *objects.AppQueueMapping
}

type Queue struct {
    appQueueMapping *AppQueueMapping
}

I'll try to inject PartitionContext.appQueueMapping when creating the queue.

Contributor

@pbacsko pbacsko left a comment

This can be enhanced & simplified further.

Contributor

@pbacsko pbacsko left a comment

Some minor nits, otherwise good.

@pbacsko pbacsko closed this in fde27e5 Oct 27, 2025
@pbacsko pbacsko reopened this Oct 27, 2025
@pbacsko
Contributor

pbacsko commented Oct 27, 2025

+1

@pbacsko pbacsko closed this Oct 27, 2025


3 participants