
Conversation

@smoogipoo
Contributor

Ongoing discussion is taking place in dotnet/runtime#96213

I've added ppy/Satori with GHA builds for the GC. The deploy script is now able to download the latest release and attach Satori based on the installation steps provided by the author.

This is exposed via the environment variable (NOTE: as opposed to App.config) USE_SATORI_GC=true.

$ USE_SATORI_GC=true dotnet run -- "" 2025.514.0 macOS arm64

Ready to deploy version 2025.514.0 on platform macOS!
19      Using working directory  /Users/smgi/Repos/osu-deploy...
19      Running dotnet tool restore...
179     Restoring previous build...
180     Running build...
180     Using working directory  /Users/smgi/Repos/osu/...
180     Running cp -r "/Users/smgi/Repos/osu-deploy/templates/osu!.app" "/Users/smgi/Repos/osu-deploy/staging/osu!.app"...
204     Using working directory  /Users/smgi/Repos/osu/...
204     Running dotnet publish -f net8.0 -r osx-arm64 -c Release -o "/Users/smgi/Repos/osu-deploy/staging/osu!.app/Contents/MacOS" -p:Version=2025.514.0 --self-contained  osu.Desktop...
2325    Downloading Satori GC release...
[network] 2025-05-14 09:04:58 [verbose]: Request to https://github.com/ppy/Satori/releases/latest/download/osx-arm64.zip successfully completed!
3439    Extracting Satori GC into staging folder...
3497    Using working directory  /Users/smgi/Repos/osu-deploy...
3497    Running touch "/Users/smgi/Repos/osu-deploy/staging/osu!.app" /Users/smgi/Repos/osu-deploy/staging...
3520    Creating release...
3521    Using working directory  /Users/smgi/Repos/osu-deploy...
3521    Running dotnet vpk [osx] pack --packTitle="osu!" --packAuthors="ppy Pty Ltd" --packId="osulazer" --packVersion="2025.514.0" --runtime="osx-arm64" --outputDir="/Users/smgi/Repos/osu-deploy/releases" --mainExe="osu!" --packDir="/Users/smgi/Repos/osu-deploy/staging/osu!.app" --channel="osx-arm64" --verbose  --signEntitlements="/Users/smgi/Repos/osu-deploy/osu.entitlements" --noInst...
18912   Done!
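
For illustration, here is a minimal sketch of the kind of step the deploy script performs when `USE_SATORI_GC=true` is set. This is not the actual osu-deploy code; the staging path and the hard-coded `osx-arm64` asset name are assumptions based on the log above.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net.Http;

// Hypothetical sketch only: download the latest Satori release asset and
// extract it over the published runtime when USE_SATORI_GC is set.
if (Environment.GetEnvironmentVariable("USE_SATORI_GC") == "true")
{
    // RID-specific asset name as seen in the log above.
    const string rid = "osx-arm64";
    string url = $"https://github.com/ppy/Satori/releases/latest/download/{rid}.zip";
    string zipPath = Path.Combine(Path.GetTempPath(), $"{rid}.zip");

    using var http = new HttpClient();
    File.WriteAllBytes(zipPath, await http.GetByteArrayAsync(url));

    // Overwrite the published runtime files with Satori's build output,
    // following the installation steps from the Satori repository.
    ZipFile.ExtractToDirectory(zipPath, "staging/osu!.app/Contents/MacOS", overwriteFiles: true);
}
```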

@smoogipoo smoogipoo requested a review from peppy May 14, 2025 09:31
@huoyaoyuan

I wonder about your measurement results, especially around working set.

In a real-world application there are many more surviving objects than in a synthesized stress test, so the difference should be smaller.

@smoogipoo
Contributor Author

smoogipoo commented May 14, 2025

It's hard to quantify because everything's so dynamic, but I'm seeing probably a ~20% increase in RSS.

Rough results look like (64GB total system memory, linux-x64):

Scenario                                               WKS             Satori
menu                                                   600MB           700MB
song select                                            875MB           1GB
start of gameplay                                      885MB           1.1GB
end of gameplay                                        900MB           1.1GB
results                                                860MB (gc'd)    1.1GB
back to menu                                           960MB           1.2GB
continuously selecting maps in song select (max RSS)   1.3GB           2.2GB

This metric isn't very important for us though.
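
For anyone wanting to reproduce this kind of measurement, here is a rough in-process sketch using public .NET APIs. It is not osu! code; the helper name and scenario label are made up for illustration.

```csharp
using System;

// Rough sketch: sample the process working set and GC heap figures at a
// given point in a scenario (e.g. entering song select).
static void LogMemory(string scenario)
{
    var info = GC.GetGCMemoryInfo();

    Console.WriteLine($"{scenario}: " +
                      $"RSS={Environment.WorkingSet / 1024 / 1024}MB, " +
                      $"heap={info.HeapSizeBytes / 1024 / 1024}MB, " +
                      $"committed={info.TotalCommittedBytes / 1024 / 1024}MB");
}

LogMemory("song select");
```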

@AlgorithmsAreCool

Are you using SVR or SVR-DATAS on master?

@smoogipoo
Contributor Author

smoogipoo commented May 14, 2025

WKS on master (I'll clarify in the table)

@huoyaoyuan

I recall 3GB of memory consumption under WKS during debugging, but maybe I was misremembering. ~2GB with Satori is definitely good enough.

@peppy
Member

peppy commented May 14, 2025

continuously selecting maps in song select (max RSS)

seems pretty high, but that's probably on us to some extent.

@smoogipoo
Contributor Author

It's high, but that only represents ~3% of total system memory. I expect this to behave differently on a more limited system, but I'm not able to test that right now.

Besides that, the difference in raw performance is staggering. Here's what I would say is a "simple" case of song select (not exactly what I tested above, but leads to similar results):

WKS:

2025-05-15.02-43-57.mp4

Satori:

2025-05-15.02-44-49.mp4

Though I say simple, this is still seemingly allocating on the order of ~500MB/sec according to dotnet-counters 🤔 (haven't taken a profiler to it yet).
Also, for reference, this is SustainedLowLatency here (not sure how that is implemented in Satori) but gameplay runs in LowLatency. I'm 99.9% sure this test fails pretty hard on Interactive, from testing in the recent past. Need to expose more knobs to simplify testing GC modes...
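
For context, the modes mentioned here are the standard `GCSettings.LatencyMode` values from `System.Runtime`. A minimal sketch of switching them around a session, illustrative only and not osu!'s actual code:

```csharp
using System;
using System.Runtime;

// Illustrative only: the standard latency-mode knob being discussed,
// switched around a gameplay session.
GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency; // menus / song select
Console.WriteLine($"Current mode: {GCSettings.LatencyMode}");

GCSettings.LatencyMode = GCLatencyMode.LowLatency;          // entering gameplay
// ... gameplay ...
GCSettings.LatencyMode = GCLatencyMode.Interactive;         // back to the workstation GC default
```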

@AlgorithmsAreCool

Am I reading this correctly that you are almost doubling the average framerate?!

@smoogipoo
Contributor Author

smoogipoo commented May 14, 2025

Yeah, but this is, as I've found out now, a pretty extreme case. During gameplay we're only allocating ~2MB/sec, so the GC isn't taking much away from the average, but Satori is smoothing out the P99 frame times.

I've still seen some concerning behaviours that don't align with the generally super-low pause times (still not as bad as WKS), but I haven't been able to put them into words yet or dig deeper. It's something along the lines of:

  • We're seeing no Gen0s during gameplay. I believe this is because we're allocating so little (<2MB/sec), which is why we're able to use LowLatency during gameplay in the first place with WKS.
  • When we see a GC, it usually comes in as a Gen1.
  • That Gen1 ends up taking a significant amount of time - let's say ~3ms, somewhat comparable to WKS.
  • But those Gen1s are very hard to capture because they're happening once every several full gameplay sessions.

I'm not sure if any of this is a problem, or expected behaviour. I would need to test SustainedLowLatency or Interactive, though I'm concerned about compactions.
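
A rough console sketch (not osu! code) of the kind of periodic logging that could help catch those rare Gen1s, using only public .NET APIs:

```csharp
using System;
using System.Threading;

// Periodically log allocation rate, collection counts and the last GC pause.
long lastAllocated = GC.GetTotalAllocatedBytes();
int lastGen0 = GC.CollectionCount(0), lastGen1 = GC.CollectionCount(1);

while (true)
{
    Thread.Sleep(1000);

    long allocated = GC.GetTotalAllocatedBytes();
    int gen0 = GC.CollectionCount(0), gen1 = GC.CollectionCount(1);

    var pauses = GC.GetGCMemoryInfo().PauseDurations;
    double lastPauseMs = pauses.Length > 0 ? pauses[pauses.Length - 1].TotalMilliseconds : 0;

    Console.WriteLine($"alloc {(allocated - lastAllocated) / 1024 / 1024}MB/s, " +
                      $"gen0 +{gen0 - lastGen0}, gen1 +{gen1 - lastGen1}, " +
                      $"last pause {lastPauseMs:F2}ms");

    (lastAllocated, lastGen0, lastGen1) = (allocated, gen0, gen1);
}
```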

@AlgorithmsAreCool

If you are talking about SustainedLowLatency mode for Satori, I don't think it is supported, unless I did something wrong in my testing. When I would set the mode, it wouldn't update the actual value. Based on this, I think Satori only supports Interactive and LowLatency.

WKS supports all 4 modes as far as I understand.

I also observed zero Gen0 collections in my synthetic benchmarks for Satori in both modes, but hez2010 and huoyaoyuan both do show plenty of Gen0s, so I don't know what to make of this.
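
For what it's worth, the check described above can be reproduced with a few lines by setting each mode and reading it back (a sketch only, not code from this PR):

```csharp
using System;
using System.Runtime;

// Set each latency mode and read it back to see which ones the GC accepts.
foreach (var mode in new[]
         {
             GCLatencyMode.Batch,
             GCLatencyMode.Interactive,
             GCLatencyMode.LowLatency,
             GCLatencyMode.SustainedLowLatency,
         })
{
    GCSettings.LatencyMode = mode;
    Console.WriteLine($"requested {mode}, actual {GCSettings.LatencyMode}");
}
```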

@smoogipoo
Contributor Author

If you are talking about SustainedLowLatency mode for Satori, i don't think it is supported

Yeah, you're right. I wasn't sure what the default behaviour would be; it makes sense that it acts as LowLatency.

Here's the same test as above with Interactive:

2025-05-15.03-56-39.mp4

Looks like working set is reduced while keeping performance about the same, as expected? 👍

@AlgorithmsAreCool

AlgorithmsAreCool commented May 14, 2025

Keeping my eye on the FPS meters at the bottom, Interactive mode seems to produce even higher FPS, although I did see some dips in there. But these are still two great options to have!

  • WKS working set = ~1300MB
  • Satori Interactive = ~2400MB
  • Satori LowLatency increased up to ~2700MB

The memory growth of Satori LL is a potential concern since it is >2x WKS. But you said you're testing on a large-memory machine?

@smoogipoo
Contributor Author

That was on a 64GB system. I'll have to find some time to test at lower limits but the easiest path is to get it into more people's hands in any case.

@peppy peppy merged commit e28c5c3 into ppy:master May 15, 2025
2 checks passed
@VSadov

VSadov commented May 16, 2025

If you are talking about SustainedLowLatency mode for Satori, i don't think it is supported

Yeah, you're right. I wasn't sure what the default behaviour would be - makes sense that the default behaviour is to act as LowLatency.

Satori can generally run in low latency mode with no ill effects, other than that compaction being turned off may result in a higher heap watermark. So I do not know in which way a "sustainable" mode would be different.

Right now there is only one low latency mode internally and both LowLatency and SustainedLowLatency turn it on.

@VSadov

VSadov commented May 17, 2025

  • We're seeing no Gen0s during gameplay. I believe this is because we're allocating so little (<2MB/sec), which is why we're able to use LowLatency during gameplay in the first place with WKS.

There are some heuristics that may decide that gen0 is not worth using. A low rate of allocations is one such case. Allocating below roughly 160 MB/sec is a low-allocation scenario (no big science behind this threshold, just had to pick something reasonable for starters).
Low-allocating threads will try sharing one nursery region. That can be good for heap size, and sharing will not impact throughput since the allocation rate is low. The shared region is gen1 (gen0 regions have an owning thread and thus can't be shared).

  • When we see a GC, it usually comes in as a Gen1.

That could be normal for a low-allocation scenario.

  • That Gen1 ends up taking a significant amount of time - let's say ~3ms, somewhat comparable to WKS.

3ms does not seem too bad. I'd expect it to be < 1ms for a low-allocation scenario though.
It is not very alarming (10ms would be), but maybe there are ways to figure out what happens there and improve.

In low latency mode the blocking stage mostly deals with incremental work created by the app while the concurrent GC was doing its thing. There is not a lot of incremental work in general, and in a low-allocating scenario there would be even less; it would be mostly just a validation that everything that had to be done has been done.
There are also some chores that need to be done in blocking mode; most are quick. Perhaps the fact that there is a long time between collections plays some part.

If you are very curious about what happens, you can disable Gen1, as in export DOTNET_gcGen1=0. Then we will be doing Gen2 GCs instead of Gen1 GCs. It is a bit of a brute-force mode, but it may actually have smaller pauses: there would be way more work for the concurrent GC, but for the blocking stage there could be fewer "chores". If the Gen2-only mode has much lower pauses, some insights could be gained from that.

  • But those Gen1s are very hard to capture because they're happening once every several full gameplay sessions.

I'm not sure if any of this is a problem, or expected behaviour. I would need to test SustainedLowLatency or Interactive, though I'm concerned about compactions.

There is only one kind of low latency mode internally. And in that mode compactions do not happen, so no worries here.

https://github.com/VSadov/Satori/blob/51785a44675893aed84d67a1a0ea50ca90010a5f/src/coreclr/gc/satori/SatoriRecycler.cpp#L1202-L1205
