
Conversation

@smoogipoo
Contributor

Ongoing discussion is taking place in dotnet/runtime#96213

I've added ppy/Satori with GHA builds for the GC. The deploy script is now able to download the latest release and attach Satori based on the installation steps provided by the author.

This is exposed via the environment variable (NOTE: as opposed to App.config) USE_SATORI_GC=true.

$ USE_SATORI_GC=true dotnet run -- "" 2025.514.0 macOS arm64

Ready to deploy version 2025.514.0 on platform macOS!
19      Using working directory  /Users/smgi/Repos/osu-deploy...
19      Running dotnet tool restore...
179     Restoring previous build...
180     Running build...
180     Using working directory  /Users/smgi/Repos/osu/...
180     Running cp -r "/Users/smgi/Repos/osu-deploy/templates/osu!.app" "/Users/smgi/Repos/osu-deploy/staging/osu!.app"...
204     Using working directory  /Users/smgi/Repos/osu/...
204     Running dotnet publish -f net8.0 -r osx-arm64 -c Release -o "/Users/smgi/Repos/osu-deploy/staging/osu!.app/Contents/MacOS" -p:Version=2025.514.0 --self-contained  osu.Desktop...
2325    Downloading Satori GC release...
[network] 2025-05-14 09:04:58 [verbose]: Request to https://github.com/ppy/Satori/releases/latest/download/osx-arm64.zip successfully completed!
3439    Extracting Satori GC into staging folder...
3497    Using working directory  /Users/smgi/Repos/osu-deploy...
3497    Running touch "/Users/smgi/Repos/osu-deploy/staging/osu!.app" /Users/smgi/Repos/osu-deploy/staging...
3520    Creating release...
3521    Using working directory  /Users/smgi/Repos/osu-deploy...
3521    Running dotnet vpk [osx] pack --packTitle="osu!" --packAuthors="ppy Pty Ltd" --packId="osulazer" --packVersion="2025.514.0" --runtime="osx-arm64" --outputDir="/Users/smgi/Repos/osu-deploy/releases" --mainExe="osu!" --packDir="/Users/smgi/Repos/osu-deploy/staging/osu!.app" --channel="osx-arm64" --verbose  --signEntitlements="/Users/smgi/Repos/osu-deploy/osu.entitlements" --noInst...
18912   Done!
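
For illustration, here is a minimal sketch of the kind of step the deploy script performs when `USE_SATORI_GC=true` is set. This is not the actual osu-deploy code; the staging path and the hard-coded `osx-arm64` asset name are assumptions based on the log above.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net.Http;

// Hypothetical sketch only: download the latest Satori release asset and
// extract it over the published runtime when USE_SATORI_GC is set.
if (Environment.GetEnvironmentVariable("USE_SATORI_GC") == "true")
{
    // RID-specific asset name as seen in the log above.
    const string rid = "osx-arm64";
    string url = $"https://github.com/ppy/Satori/releases/latest/download/{rid}.zip";
    string zipPath = Path.Combine(Path.GetTempPath(), $"{rid}.zip");

    using var http = new HttpClient();
    File.WriteAllBytes(zipPath, await http.GetByteArrayAsync(url));

    // Overwrite the published runtime files with Satori's build output,
    // following the installation steps from the Satori repository.
    ZipFile.ExtractToDirectory(zipPath, "staging/osu!.app/Contents/MacOS", overwriteFiles: true);
}
```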

@smoogipoo smoogipoo requested a review from peppy May 14, 2025 09:31
@huoyaoyuan

I wonder about your measurement results, especially around working set.

In a real-world application there are many more surviving objects than in a synthesized stress test, so the difference should be smaller.

@smoogipoo
Contributor Author

smoogipoo commented May 14, 2025

It's hard to quantify because everything's so dynamic, but I'm seeing probably a ~20% increase in RSS.

Rough results look like (64GB total system memory, linux-x64):

Scenario                                               WKS             Satori
menu                                                   600MB           700MB
song select                                            875MB           1GB
start of gameplay                                      885MB           1.1GB
end of gameplay                                        900MB           1.1GB
results                                                860MB (gc'd)    1.1GB
back to menu                                           960MB           1.2GB
continuously selecting maps in song select (max RSS)   1.3GB           2.2GB

This metric isn't very important for us though.
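
For anyone wanting to reproduce this kind of measurement, here is a rough in-process sketch using public .NET APIs. It is not osu! code; the helper name and scenario label are made up for illustration.

```csharp
using System;

// Rough sketch: sample the process working set and GC heap figures at a
// given point in a scenario (e.g. entering song select).
static void LogMemory(string scenario)
{
    var info = GC.GetGCMemoryInfo();

    Console.WriteLine($"{scenario}: " +
                      $"RSS={Environment.WorkingSet / 1024 / 1024}MB, " +
                      $"heap={info.HeapSizeBytes / 1024 / 1024}MB, " +
                      $"committed={info.TotalCommittedBytes / 1024 / 1024}MB");
}

LogMemory("song select");
```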

@AlgorithmsAreCool

Are you using SVR or SVR-DATAS on master?

@smoogipoo
Contributor Author

smoogipoo commented May 14, 2025

WKS on master (I'll clarify in the table)

@huoyaoyuan

I recall 3GB of memory consumption under WKS during debugging, but maybe I was misremembering. ~2GB with Satori is definitely good enough.

@peppy
Member

peppy commented May 14, 2025

continuously selecting maps in song select (max RSS)

seems pretty high, but that's probably on us to some extent.

@smoogipoo
Contributor Author

It's high, but that only represents ~3% of total system memory. I expect this to behave differently on a more limited system, but I'm not able to test that right now.

Besides that, the difference in raw performance is staggering. Here's what I would say is a "simple" case of song select (not exactly what I tested above, but leads to similar results):

WKS:

2025-05-15.02-43-57.mp4

Satori:

2025-05-15.02-44-49.mp4

Though I say simple, this is still seemingly allocating on the order of ~500MB/sec according to dotnet-counters 🤔 (haven't taken a profiler to it yet).
Also, for reference, this is SustainedLowLatency here (not sure how that is implemented in Satori) but gameplay runs in LowLatency. I'm 99.9% sure this test fails pretty hard on Interactive, from testing in the recent past. Need to expose more knobs to simplify testing GC modes...
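
For context, the modes mentioned here are the standard `GCSettings.LatencyMode` values from `System.Runtime`. A minimal sketch of switching them around a session, illustrative only and not osu!'s actual code:

```csharp
using System;
using System.Runtime;

// Illustrative only: the standard latency-mode knob being discussed,
// switched around a gameplay session.
GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency; // menus / song select
Console.WriteLine($"Current mode: {GCSettings.LatencyMode}");

GCSettings.LatencyMode = GCLatencyMode.LowLatency;          // entering gameplay
// ... gameplay ...
GCSettings.LatencyMode = GCLatencyMode.Interactive;         // back to the workstation GC default
```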

@AlgorithmsAreCool

Am I reading this correctly that you are almost doubling the average framerate?!

@smoogipoo
Contributor Author

smoogipoo commented May 14, 2025

Yeah, but this is, as I've found out now, a pretty extreme case. During gameplay we're only allocating ~2MB/sec, so the GC isn't taking much away from the average, but Satori is smoothing out the P99 frame times.

I've still seen some concerning behaviours that don't align with the generally super-low pause times (still not as bad as WKS), but I haven't been able to put them into words yet or dig deeper. It's something along the lines of:

  • We're seeing no Gen0s during gameplay. I believe this is because we're allocating so little (<2MB/sec), which is why we're able to use LowLatency during gameplay in the first place with WKS.
  • When we see a GC, it usually comes in as a Gen1.
  • That Gen1 ends up taking a significant amount of time - let's say ~3ms, somewhat comparable to WKS.
  • But those Gen1s are very hard to capture because they're happening once every several full gameplay sessions.

I'm not sure if any of this is a problem, or expected behaviour. I would need to test SustainedLowLatency or Interactive, though I'm concerned about compactions.
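
A rough console sketch (not osu! code) of the kind of periodic logging that could help catch those rare Gen1s, using only public .NET APIs:

```csharp
using System;
using System.Threading;

// Periodically log allocation rate, collection counts and the last GC pause.
long lastAllocated = GC.GetTotalAllocatedBytes();
int lastGen0 = GC.CollectionCount(0), lastGen1 = GC.CollectionCount(1);

while (true)
{
    Thread.Sleep(1000);

    long allocated = GC.GetTotalAllocatedBytes();
    int gen0 = GC.CollectionCount(0), gen1 = GC.CollectionCount(1);

    var pauses = GC.GetGCMemoryInfo().PauseDurations;
    double lastPauseMs = pauses.Length > 0 ? pauses[pauses.Length - 1].TotalMilliseconds : 0;

    Console.WriteLine($"alloc {(allocated - lastAllocated) / 1024 / 1024}MB/s, " +
                      $"gen0 +{gen0 - lastGen0}, gen1 +{gen1 - lastGen1}, " +
                      $"last pause {lastPauseMs:F2}ms");

    (lastAllocated, lastGen0, lastGen1) = (allocated, gen0, gen1);
}
```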

@AlgorithmsAreCool

If you are talking about SustainedLowLatency mode for Satori, I don't think it is supported, unless I did something wrong in my testing. When I would set the mode, it wouldn't update the actual value. Based on this, I think Satori only supports Interactive and LowLatency.

WKS supports all 4 modes as far as I understand.

I also observed zero Gen0 collections in my synthetic benchmarks for Satori in both modes, but hez2010 and huoyaoyuan both do show plenty of Gen0s, so I don't know what to make of this.
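
For what it's worth, the check described above can be reproduced with a few lines by setting each mode and reading it back (a sketch only, not code from this PR):

```csharp
using System;
using System.Runtime;

// Set each latency mode and read it back to see which ones the GC accepts.
foreach (var mode in new[]
         {
             GCLatencyMode.Batch,
             GCLatencyMode.Interactive,
             GCLatencyMode.LowLatency,
             GCLatencyMode.SustainedLowLatency,
         })
{
    GCSettings.LatencyMode = mode;
    Console.WriteLine($"requested {mode}, actual {GCSettings.LatencyMode}");
}
```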

@smoogipoo
Contributor Author

If you are talking about SustainedLowLatency mode for Satori, i don't think it is supported

Yeah, you're right. I wasn't sure what the default behaviour would be; it makes sense that it acts as LowLatency.

Here's the same test as above with Interactive:

2025-05-15.03-56-39.mp4

Looks like working set is reduced while keeping performance about the same, as expected? 👍

@AlgorithmsAreCool

AlgorithmsAreCool commented May 14, 2025

Keeping my eye on the FPS meters at the bottom, Interactive mode seems to produce even higher FPS, although I did see some dips in there. But these are still two great options to have!

  • WKS working set = ~1300MB
  • Satori Interactive = ~2400MB
  • Satori LowLatency increased up to ~2700MB

The memory growth of Satori LL is a potential concern since it is >2x WKS. But you said you're testing on a large-memory machine?

@smoogipoo
Contributor Author

That was on a 64GB system. I'll have to find some time to test at lower limits but the easiest path is to get it into more people's hands in any case.

@peppy peppy merged commit e28c5c3 into ppy:master May 15, 2025
2 checks passed
@VSadov

VSadov commented May 16, 2025

If you are talking about SustainedLowLatency mode for Satori, i don't think it is supported

Yeah, you're right. I wasn't sure what the default behaviour would be - makes sense that the default behaviour is to act as LowLatency.

Satori can generally run in low latency mode with no ill effects, other than that compaction being turned off may result in a higher heap watermark. So I do not know in which way a "sustainable" mode would be different.

Right now there is only one low latency mode internally and both LowLatency and SustainedLowLatency turn it on.

@VSadov

VSadov commented May 17, 2025

  • We're seeing no Gen0s during gameplay. I believe this is because we're allocating so little (<2MB/sec), which is why we're able to use LowLatency during gameplay in the first place with WKS.

There are some heuristics that may decide that gen0 is not worth using. A low rate of allocations is one such case. Allocating below roughly 160 MB/sec is a low-allocation scenario (no big science behind this threshold, just had to pick something reasonable for starters).
Low-allocating threads will try sharing one nursery region. That can be good for heap size, and sharing will not impact throughput since the allocation rate is low. The shared region is gen1 (gen0 regions have an owning thread and thus can't be shared).

  • When we see a GC, it usually comes in as a Gen1.

That could be normal for a low-allocation scenario.

  • That Gen1 ends up taking a significant amount of time - let's say ~3ms, somewhat comparable to WKS.

3ms does not seem too bad. I'd expect it to be < 1ms for a low-allocation scenario though.
It is not very alarming (10ms would be), but maybe there are ways to figure out what happens there and improve.

In low latency mode the blocking stage mostly deals with incremental work created by the app while the concurrent GC was doing its thing. There is not a lot of incremental work in general, and in a low-allocating scenario there would be even less; it would be mostly just a validation that everything that had to be done has been done.
There are also some chores that need to be done in blocking mode; most are quick. Perhaps the fact that there is a long time between collections plays some part.

If you are very curious about what happens, you can disable Gen1, as in export DOTNET_gcGen1=0. Then we will be doing Gen2 GCs instead of Gen1 GCs. It is a bit of a brute-force mode, but it may actually have smaller pauses: there would be way more work for the concurrent GC, but for the blocking stage there could be fewer "chores". If the Gen2-only mode has much lower pauses, some insights could be gained from that.

  • But those Gen1s are very hard to capture because they're happening once every several full gameplay sessions.

I'm not sure if any of this is a problem, or expected behaviour. I would need to test SustainedLowLatency or Interactive, though I'm concerned about compactions.

There is only one kind of low latency mode internally. And in that mode compactions do not happen, so no worries here.

https://github.com/VSadov/Satori/blob/51785a44675893aed84d67a1a0ea50ca90010a5f/src/coreclr/gc/satori/SatoriRecycler.cpp#L1202-L1205
