👷🐧🍎🏁 Extensive CI testing #60
Conversation
Signed-off-by: burgholzer <burgholzer@me.com>
Signed-off-by: burgholzer <burgholzer@me.com>
@burgholzer I think this can prove a valuable addition to the workflow package. Particularly with respect to our recent discussions on a CI unification across @cda-tum.
Good point. The build matrix just scales extremely quickly. Given that most of our packages support all currently non-EOL Python versions, the current setup with this PR as it stands results in … runs on the Python side and … runs on the C++ side. In total, that means at least … runs. Each of these runs has to build the C++ part of a project. The Python runs additionally have to install all Python dependencies. Then, all of these runs have to run the respective tests.
Setting all of the above aside for a moment, we could go one step further and also run different compilers for each operating system (similar to how fiction handles this). I could imagine it would make sense to test at least …
Technically, this applies to both the Python and the C++ builds. So even with the above, it would amount to another factor of 2x in the number of runs.
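To make the multiplication concrete, here is a minimal, hypothetical GitHub Actions matrix; the runner images, Python versions, and job names are illustrative placeholders rather than the actual configuration of these workflows. Every additional dimension multiplies the total number of jobs.

```yaml
# Illustrative only -- not the actual workflow of this repository.
# 3 operating systems x 5 Python versions = 15 jobs; adding a second
# compiler per OS would double that to 30.
name: python-ci-matrix-sketch
on: pull_request

jobs:
  python-tests:
    strategy:
      fail-fast: false
      matrix:
        runs-on: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
    runs-on: ${{ matrix.runs-on }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      # Placeholder for the real work: build the C++ core, install the
      # Python dependencies, and run the test suite.
      - run: echo "build, install, test"
```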
I see that this setup quickly grows beyond reason. Naturally, there might not be an immediate end-all-be-all solution to a complex topic like this one. Let me try to list some thoughts on the matter anyway (in no particular order):
I'm always open to discussing this situation further.
Thanks for your thoughts! 🙏🏼 I'll just add mine in a similar fashion.

I agree that one particular compiler version should be sufficient per platform and compiler.

I am still a bit on the fence when it comes to caching. First of all, from experience, it is rather opaque and at times hard to judge whether the compiler caches really work the way they are intended to. Especially on the Python side, it can be quite hard to get that to work properly on all operating systems. Additionally, these caches have to be restored and saved in each run, which also takes quite some time because this has to go through the network. Given that some of the caches are more than a couple of hundred megabytes in size, this definitely takes non-zero time.

In general, there is a conflict of interest that we will probably never be able to resolve: …

What could really make a lot of sense is to allow more customization of the individual workflows so that it can be configured on demand which jobs to run, either automatically based on the files that changed (as we already do at the moment) or via explicit opt-outs (meaning someone would modify the CI.yml in their PR to only enable certain checks they want to run).

Self-hosted runners would really go a long way. Especially macOS ones.
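For reference, a compiler-cache step and a path-based trigger in GitHub Actions typically look like the sketch below; the cache directory, key layout, and path filters are assumptions for illustration, not the configuration of these workflows. `actions/cache` restores the cache at the start of the job and saves it in an automatic post-job step, and both transfers go over the network, which is where the overhead mentioned above comes from.

```yaml
# Sketch under assumptions: paths, keys, and filters are placeholders.
name: cpp-ci-caching-sketch
on:
  pull_request:
    paths:               # only trigger when relevant files change
      - "src/**"
      - "include/**"
      - "CMakeLists.txt"

jobs:
  cpp-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # actions/cache restores here and saves in a post step; caches of
      # several hundred megabytes make both transfers noticeably slow.
      - uses: actions/cache@v4
        with:
          path: ~/.cache/ccache
          key: ccache-${{ runner.os }}-${{ github.sha }}
          restore-keys: |
            ccache-${{ runner.os }}-
      - run: echo "configure and build with ccache enabled, then run tests"
```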
Signed-off-by: burgholzer <burgholzer@me.com>
Signed-off-by: burgholzer <burgholzer@me.com>
Signed-off-by: burgholzer <burgholzer@me.com>
Added inputs for specifying runner images, compilers, and configurations across Ubuntu, macOS, and Windows workflows. Simplified matrix generation with dynamic configuration, allowing more flexibility in build environments and testing setups. Signed-off-by: burgholzer <burgholzer@me.com>
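A minimal sketch of the pattern this commit describes, assuming hypothetical input names (`runs-on`, `cmake-build-type`) and a hypothetical file name rather than the ones actually used in the workflow: the caller passes JSON strings, and the reusable workflow expands them into the job matrix with `fromJSON`.

```yaml
# reusable-cpp-tests.yml (hypothetical name) -- callee side of the pattern.
on:
  workflow_call:
    inputs:
      runs-on:
        description: "JSON list of runner images to test on"
        type: string
        default: '["ubuntu-latest", "macos-latest", "windows-latest"]'
      cmake-build-type:
        description: "JSON list of CMake build configurations"
        type: string
        default: '["Release"]'

jobs:
  cpp-tests:
    strategy:
      fail-fast: false
      matrix:
        runs-on: ${{ fromJSON(inputs.runs-on) }}
        build-type: ${{ fromJSON(inputs.cmake-build-type) }}
    runs-on: ${{ matrix.runs-on }}
    steps:
      - uses: actions/checkout@v4
      - run: cmake -S . -B build -DCMAKE_BUILD_TYPE=${{ matrix.build-type }}
      - run: cmake --build build --config ${{ matrix.build-type }}
      - run: ctest --test-dir build -C ${{ matrix.build-type }} --output-on-failure
```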
The workflow is now expected to be called separately like the `reusable-cpp-linter` workflow. Signed-off-by: burgholzer <burgholzer@me.com>
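On the caller side, a downstream project's `CI.yml` would then reference each reusable workflow explicitly, roughly as sketched below; the workflow file paths, the `@v1` ref, and the input values are assumptions for illustration.

```yaml
# Hypothetical excerpt from a downstream project's CI.yml.
name: CI
on: pull_request

jobs:
  cpp-linter:
    uses: munich-quantum-toolkit/workflows/.github/workflows/reusable-cpp-linter.yml@v1
  cpp-tests:
    uses: munich-quantum-toolkit/workflows/.github/workflows/reusable-cpp-tests.yml@v1
    with:
      runs-on: '["ubuntu-latest", "macos-latest"]'
      cmake-build-type: '["Release", "Debug"]'
```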
Signed-off-by: burgholzer <burgholzer@me.com>
Signed-off-by: burgholzer <burgholzer@me.com>
Signed-off-by: burgholzer <burgholzer@me.com>
Signed-off-by: burgholzer <burgholzer@me.com>
Totally agreed.
We've had similar experiences in the past. While this is unfortunate, your story validates our findings. At least we can now say that it wasn't entirely on us 😉 Obviously, that doesn't help with the situation, though.
Here, too, I totally agree. In the end, it boils down to a trade-off. You and I probably cannot make this decision on a general basis, but only at the repository level. What is good for one repo might be a poor choice for another.
Yes, this goes in the same direction as what I meant to say above: repository-level design decisions should be taken into account when configuring the workflows. It's unfortunate that this then generates code overhead in this repo.
Yessir 🍎
Signed-off-by: burgholzer <burgholzer@me.com>
Signed-off-by: burgholzer <burgholzer@me.com>
Alright, I played around with this a little more and updated the PR description here accordingly. I believe this is a nice start in the right direction.
Python linting should be called by users. `hynek/build-and-inspect-python-package@v2` is removed as we do test our Python packages extensively anyway. To simplify the configuration, the option to run Python tests individually per Python version is removed as well. Signed-off-by: burgholzer <burgholzer@me.com>
Signed-off-by: burgholzer <burgholzer@me.com>
Great stuff! Looks very promising and not even too convoluted 🙂
## Description

This PR adopts the latest version of the MQT workflows, which contains the changes from munich-quantum-toolkit/workflows#60.

By default, C++ CI is now only run in Release mode for all major platforms. However, when a PR is tagged with `extensive-cpp-ci`, a large set of test runs will be started on various operating system versions using various compilers. Similarly, adding an `extensive-python-ci` label will trigger additional Python test runs on different OS versions. Note that the labels need to be present _before_ CI starts running, or the already started/completed CI may need to be rerun. Most small PRs should be fine with the default set of runs. The extensive CI should be enabled for infrastructure-critical PRs or, more generally, at the end of the PR cycle before merging.

In addition, this update brings support for natively building `aarch64` wheels using GitHub's new Ubuntu ARM runners, which are now enabled by default.

Finally, this PR drops the TestPyPI uploads, given how little value they have provided over the last couple of months and their substantial runner demand.

## Checklist:

<!--- This checklist serves as a reminder of a couple of things that ensure your pull request will be merged swiftly. -->

- [x] The pull request only contains commits that are related to it.
- [x] I have added appropriate tests and documentation.
- [x] I have made sure that all CI jobs on GitHub pass.
- [x] The pull request introduces no new warnings and follows the project's style guidelines.

---------

Signed-off-by: burgholzer <burgholzer@me.com>
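A hedged sketch of how such a label gate can be expressed; only the label names come from the description above, while the job layout, workflow paths, and runner lists are assumptions. Because `github.event.pull_request.labels` is read from the event payload that triggered the run, a label added after CI has started only takes effect on a re-run, which is the caveat mentioned above.

```yaml
# Sketch only: hypothetical jobs gated on the labels described above.
name: extensive-ci-sketch
on: pull_request

jobs:
  extensive-cpp-ci:
    # Runs only if the PR carried the label when the run was triggered.
    if: contains(github.event.pull_request.labels.*.name, 'extensive-cpp-ci')
    uses: munich-quantum-toolkit/workflows/.github/workflows/reusable-cpp-tests.yml@v1
    with:
      runs-on: '["ubuntu-24.04", "ubuntu-22.04", "macos-14", "macos-13", "windows-2022"]'
  extensive-python-ci:
    if: contains(github.event.pull_request.labels.*.name, 'extensive-python-ci')
    uses: munich-quantum-toolkit/workflows/.github/workflows/reusable-python-tests.yml@v1
```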
This PR experiments with broadening the type of runners that we run CI on.
In particular, it allows one to dynamically configure the workflow runs that should be enabled.
As a result, projects can decide on their own under which configurations they want to test.
On the C++ side, the currently available workflows are:
with the following combinations being turned on by default:
On the Python side, the available workflows are:
with the following runs being turned on by default:
A first prototype of how to use this new functionality is being explored in cda-tum/mqt-core#803, where an `extensive-cpp-ci` and an `extensive-python-ci` option are added to conditionally run more-than-default testing on demand.

Old Description: Due to the number of runs that this creates, I am rather hesitant to merge this directly. Hence, I will keep this open for now and keep the branch itself. This way, we can test the real-world consequences of this and decide then.