perf(ci): use self-hosted macos runner if available, fallback to public runners if not #587

Merged
mikehardy merged 5 commits into main from tartelet-test
Oct 17, 2025

@mikehardy
Member

@mikehardy mikehardy commented Oct 17, 2025

GitHub macos runners are performing incredibly badly at the moment

This change lets us use Tartelet / Tart to provide self-hosted runners, with fallback to public runners if the self-hosted ones are offline

https://josephduffy.co.uk/posts/self-hosting-macos-github-runners

If this should be unwound:

  • remove the github app for it in ankidroid org app settings
  • remove the org level secret with token used to access the github API to query for runner status

@mikehardy
Member Author

Success on the first run, how about that.

The macOS tests in the build-quick workflow took 22 minutes, of which 3'23" was spent restoring the rust cache, a 2GB network transfer (the largest transfer in the entire workflow)

Going to eliminate the rust cache for self-hosted builds, and also alter the release workflow to use the macos-tartelet runner, and see how it goes

@mikehardy
Member Author

mikehardy commented Oct 17, 2025

mild hiccup - expanded my local Tartelet ephemeral self-hosted runner manager to do 2 at once for this push / CI run

  • it did spawn 2 local Tart VMs, but one keeps trying to self-update the github runner from 2.328.0 to 2.329.0, which resulted in a disconnect of the second runner and a failure of the job assigned to it

Additionally, the holy grail here is to use the self-hosted runner if I make it available, but fallback to a public runner if it is offline. This does not appear to be built-in functionality but there is an API endpoint that lists runners and their status, and you can use that in a pre-step to select what runner to use: https://github.com/orgs/community/discussions/20019#discussioncomment-13391511 (or this action, with this PR for organization runner support https://github.com/jimmygchen/runner-fallback-action/pull/28/files)
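
A minimal sketch of that pre-step selection logic, assuming the JSON shape returned by the GitHub REST endpoint that lists an organization's self-hosted runners (a `runners` array whose entries carry `status` and `labels`). The function name `pick_runner` and the fallback label `macos-latest` are illustrative assumptions, not what this PR or the linked action necessarily uses; `macos-tartelet` is the runner name mentioned earlier in this conversation.

```python
from typing import Any


def pick_runner(
    api_response: dict[str, Any],
    preferred_label: str = "macos-tartelet",  # self-hosted runner label (from this thread)
    fallback: str = "macos-latest",  # assumed public runner label
) -> str:
    """Return the preferred label if at least one runner carrying it is online."""
    for runner in api_response.get("runners", []):
        labels = {label["name"] for label in runner.get("labels", [])}
        if preferred_label in labels and runner.get("status") == "online":
            return preferred_label
    # No matching runner is online: fall back to the public runner.
    return fallback


# Hypothetical response with one offline self-hosted runner.
response = {
    "total_count": 1,
    "runners": [
        {
            "name": "tart-vm-1",
            "status": "offline",
            "labels": [{"name": "self-hosted"}, {"name": "macos-tartelet"}],
        }
    ],
}
print(pick_runner(response))  # prints "macos-latest"
```

In a workflow, the returned label would typically be written to a job output and consumed by the `runs-on` of the dependent job; the token used for the API call needs permission to read the organization's runners.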

2nd run initial findings are that the JDK download can stall. JDK17 is in the image I'm using as a base, but 21 is not, so it has to download. I should update my base VM to have what we want by default:

  • rustup and toolchain from rust-toolchain.toml (currently 1.89.0)
  • jdk from workflow definition (currently 21)(8'40" download stall)
  • brew install mingw-w64 (15min download stall)
  • brew install MaterializeInc/crosstools/x86_64-unknown-linux-gnu (another long download stall...)
  • gradle version from source tree (currently 9)(6min download stall)

@mikehardy
Member Author

Okay, some optimization thoughts above, as a proof of concept experiment I was able to take this from an idea to ... working in just about 90 minutes, which is promising. I don't see any technical difficulties really preventing me from dynamically using the self-hosted runner if available and fallback if not - I'm the one that implemented dynamic OS matrix construction in the first place and we have scripts that use the GitHub API in workflows already, it's all tech we're familiar with.

So I'm going to sleep now, but this seems like a valid way to un-bottleneck macOS runners in this repo, which is a good thing because their unavailability is effectively blocking all development at the moment.

@mikehardy
Member Author

mikehardy commented Oct 17, 2025

100% CPU bound now after implementation and execution of a "VM prep" script on the local Tart VM image that serves as the clone base for Tartelet ephemeral build runners.

Performance now acceptable IMHO, just a touch slower than windows now

build-quick results:

  • self-hosted macOS 19'3"
  • windows 16'27"
  • ubuntu 12'18"
  • public macOS usually 12-13'
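
To put "a touch slower than windows" into numbers, the build-quick timings above can be converted to seconds and compared. This is a small sketch; the helper `to_seconds` is hypothetical and only understands the `M'S"` notation used in this comment.

```python
import re


def to_seconds(stamp: str) -> int:
    """Convert a duration like 19'3" (minutes and optional seconds) to seconds."""
    match = re.fullmatch(r'''(\d+)'(?:(\d+)")?''', stamp)
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    minutes = int(match.group(1))
    seconds = int(match.group(2) or 0)
    return minutes * 60 + seconds


self_hosted = to_seconds("19'3\"")  # 1143 seconds
windows = to_seconds("16'27\"")  # 987 seconds
print(round(self_hosted / windows, 2))  # prints 1.16
```

By this measure the self-hosted macOS run took roughly 16% longer than the windows run.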

build-release results:

  • self-hosted macOS 43'38"
  • public macOS between 47' (fastest ever) and 3'36" (slowest ever), typical ~57'

Run parameters:

  • rust uncached (at least 12' of the time was spent compiling rust in build-quick, much more in build-release)
  • both a build-quick and build-release working concurrently/parallel
  • Apple M2 laptop that's a bit RAM-starved

Very positive result. The public runners are clearly faster but there is still room for improvement on self-hosted perf, and it's already usefully quick.

Still a couple items to resolve

  • the rust cache itself (perhaps uncached for build-quick, but cached for build-release? Perhaps a local cache and tell rust where to find it on the base image? perhaps a self-hosted-specific lower cache size? or just lower size in general?)
  • the ability to fallback to public runners if my self-hosted runners are offline so I can maintain control over my local compute without blocking all CI

@mikehardy
Member Author

Rust caching was a nest of bugs, so I peeled it into a separate PR (#588) that is merging now; then I'll rebase this and re-push before moving forward

this makes them more or less machine-parseable for workflow runner
preparation scripts
this will install all the build pre-requisites for a Tartelet/Tart
self-hosted VM and warm up the cache with one build run

should be used on the persistent base Tart VM image, prior to starting
Tartelet which will clone that into ephemeral runners
@mikehardy mikehardy changed the title from "WIP - experiment with a self-hosted macos runner" to "perf(ci): use self-hosted macos runner if available, fallback to public runners if not" Oct 17, 2025
@mikehardy
Member Author

mikehardy commented Oct 17, 2025

Okay - ready to go

Has a couple little fixes in separate commits, wasn't worth separating IMHO

If CI goes green, this current run represents the failure / runners offline test, so that means the new worst case works

And the best case is 2 more macos runners available sometimes, that work a bit faster than the public ones

They are at the organization level so are available (as is the fallback action token) in all ankidroid repos as desired.

@mikehardy mikehardy merged commit eeaded1 into main Oct 17, 2025
9 checks passed
@mikehardy mikehardy deleted the tartelet-test branch October 17, 2025 20:54