Get fair-share priority scheduling working #159
Merged
shiv-tyagi merged 7 commits intoROCm:mainfrom May 8, 2026
Merged
Conversation
…duler Add GetFairshareFactors RPC to SlurmAccounting service so spurctld can periodically fetch precomputed fair-share factors from spurdbd. The controller caches these factors and uses them in pending_jobs() priority computation instead of the previous hardcoded neutral value. - Add GetFairshareFactors RPC + messages to proto - Implement handler in spurdbd wiring compute_fairshare to the new RPC - Add FairshareCache in spurctld with eager initial fetch and periodic background refresh (configurable via fairshare_refresh_secs, default 300s) - Replace hardcoded 1.0 in pending_jobs() with live cache lookup - Remove redundant fair_share_factor() from spur-sched (spurdbd owns this) - Simplify priority tests to table-driven style
PostgreSQL SUM() on BIGINT columns returns NUMERIC, causing a runtime panic when sqlx tries to decode into i64.
The get_usage RPC returns cpu_hours keyed as "user:account" but sshare looked up by bare account or user name, resulting in zero usage shown. Pre-aggregate the map by account and (user, account) before display.
The AccountingConfig struct expects `host` but the configmap had `address` and a non-existent `enabled` field, both silently ignored by serde causing spurctld to default to localhost:6819.
…messages GetUsageResponse and GetFairshareFactorsResponse used map<string, double> with "user:account" composite keys, forcing consumers to split strings and aggregate manually. Replace both with repeated message types (UsageEntry, FairshareEntry) carrying user, account, and metrics as distinct fields. Updates spurdbd server, spurctld fairshare cache, sshare, and sreport.
- Normalize target_share by dividing by total account weight sum so users consuming exactly their allocation get factor ~1.0 - Aggregate usage entries server-side by (user, account) in get_usage handler; simplify sshare/sreport clients to use direct inserts - Clamp halflife_days to 1..=365 in get_fairshare_factors - Enforce minimum 10s refresh interval in fairshare cache - Wrap refresh loop fetch in 10s timeout to avoid hangs - Fix test to use two accounts for meaningful cross-account assertions
The debug_fail helper used journalctl which only works with systemd units. The CI cluster runs processes via nohup with logs in ~/spur/log/. Also show full scontrol output, job stdout content, and remote agent logs from mi300-2.
There was a problem hiding this comment.
Pull request overview
Fixes end-to-end fair-share scheduling by addressing accounting query decoding, correcting CLI usage aggregation, and wiring spurctld to periodically fetch fair-share factors from spurdbd.
Changes:
- Updates the accounting protobuf API:
GetUsageResponsenow returns structured entries; addsGetFairshareFactorsRPC. - Fixes spurdbd usage aggregation query decoding by casting
SUM()results toBIGINT, and exposes a fair-share factors RPC. - Adds a spurctld fair-share cache with a refresh loop and uses it during priority recomputation; updates
sshare/sreportaggregation to match the new usage response shape; fixes K8s config to useaccounting.host.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| proto/slurm.proto | Adds fair-share RPC and changes usage response schema. |
| deploy/k8s/configmap.yaml | Fixes accounting config key to host for in-cluster spurdbd. |
| crates/spurdbd/src/server.rs | Returns usage as entries; adds GetFairshareFactors endpoint. |
| crates/spurdbd/src/db.rs | Casts aggregate SUM() outputs to BIGINT to avoid sqlx decode panic. |
| crates/spurctld/src/main.rs | Starts fair-share factor refresh loop during controller startup. |
| crates/spurctld/src/fairshare_cache.rs | New in-memory cache + background refresh from spurdbd RPC. |
| crates/spurctld/src/cluster.rs | Applies cached fair-share factor when recomputing effective priority. |
| crates/spur-tests/src/t24_priority.rs | Removes tests for the deleted local fair-share factor function. |
| crates/spur-sched/src/priority.rs | Removes local fair-share factor calculation; keeps effective priority + tests. |
| crates/spur-core/src/config.rs | Adds accounting.fairshare_refresh_secs config option + default. |
| crates/spur-cli/src/sshare.rs | Aggregates usage entries correctly by account and (user, account). |
| crates/spur-cli/src/sreport.rs | Aggregates usage entries correctly for account/user reporting. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
powderluv
approved these changes
May 8, 2026
powderluv
pushed a commit
that referenced
this pull request
May 8, 2026
Add -D unused to the CI clippy invocation so that dead code, unused imports, unused variables, and other unused-item warnings are promoted to hard errors. This prevents unused code from silently accumulating in the codebase over time through AI-assisted or manual contributions. The codebase was cleaned up in PRs #155 through #159, so the lint group passes cleanly today and this gate can now be enforced going forward.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three bugs prevented the fair-share pipeline from working correctly end-to-end.
The
get_usagequery in spurdbd panicked at runtime because PostgreSQL'sSUM()on BIGINT columns returns NUMERIC, which sqlx cannot decode into Rust's i64. Fixed by casting aggregates explicitly to BIGINT.The
sshareCLI displayed zero usage for all accounts and users. TheGetUsageRPC returns cpu_hours keyed as "user:account" composites, but sshare was looking up by bare account name or user name. Now pre-aggregates the response map correctly before display.The K8s ConfigMap used
addressandenabledfields in the[accounting]section, but theAccountingConfigstruct only recognizeshost. Serde silently ignored the unknown fields and defaulted to localhost:6819, so the controller never reached spurdbd in-cluster.Test plan
cargo test -p spurdbd -p spur-clipasses, zero warnings with-D warningssshareoutput now correctly reflects per-account and per-user CPU hours and fair-share factors