Backport ccm main #1296

Lorak-mmk · 2025-03-25T14:43:36Z

@muzarski @wprzytula This is a draft of the CCM backport into the main branch.

I mostly took the current state from branch-hackathon. One important change I made is to remove the separate ccm-integration test target. Instead, ccm tests are now a module in integration.
This avoids the problem with sharing utils. It should also be quicker to compile, since there is no need to link 2 separate binaries.

Apart from that I did not really modify the CCM integration. Now the question is: what do we do with it?
Are we satisfied with the API? Probably not.
If not, what should the API be like?

This is fully internal to the crate, so we can change it freely and there is no need to spend too much time on it: we can always improve it later.
Still, we should retain some reasonable level of quality, so I'd like to discuss this a bit.

On the above matter @muzarski : How should I adapt CCM given that we now support multiple TLS backends?
I see we have in mod.rs DB_TLS_CERT_PATH, DB_TLS_KEY_PATH and CA_TLS_CERT_PATH (which I removed for now because it was guarded by old feature name).
I also see that on branch-hackathon you wrote a TLS test. We should probably make a test per backend, right?
What about those vars? Do they make sense for all the tests? In that case we just need to change feature guard on CA_TLS_CERT_PATH to be activated when any backend is active - or to even make it always active because why not.

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

github-actions · 2025-03-25T14:45:50Z

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: e157681

Lorak-mmk · 2025-03-25T15:48:20Z

Possible improvements / API changes after a brief glance at the code:

Should NodeStartOptions be a struct? In other words, does it make sense to enable e.g. no_wait and wait_other_notice at the same time? We need to know the exact semantics of the wait-related flags to know that (cc @fruch because I don't think this is documented anywhere in this cursed software).
Same for NodeStopOptions
We hold nodes in Arc<RwLock<>>, and methods that give user nodes return that. Maybe we could return refs / mut refs and get rid of Arc<RwLock<>>? I'm not sure.
In the future, when we have a custom test runner, we could make new structs (ClusterPreferences + NodePreferences) or extend the *Options structs. Why? Tests may not care about some parameters, and knowing that could let the test runner provision fewer clusters.
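To make the struct question concrete, here is a minimal sketch of the two shapes under discussion. Only the names NodeStartOptions, no_wait and wait_other_notice come from the comment above; the enum NodeStartWait, the to_args helpers and the exact flag spellings are assumptions made for illustration, since the real semantics of the wait flags are still unclear.

```rust
/// Hypothetical struct form of the start options discussed above.
/// With independent booleans, nothing stops a caller from setting both.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct NodeStartOptions {
    pub no_wait: bool,
    pub wait_other_notice: bool,
}

impl NodeStartOptions {
    /// Renders the options as CLI flags. The flag spellings are an
    /// assumption made for this sketch, not taken from the PR.
    pub fn to_args(self) -> Vec<&'static str> {
        let mut args = Vec::new();
        if self.no_wait {
            args.push("--no-wait");
        }
        if self.wait_other_notice {
            args.push("--wait-other-notice");
        }
        args
    }
}

/// Hypothetical enum alternative: exactly one wait mode can be chosen,
/// so a conflicting combination cannot be expressed at all.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeStartWait {
    NoWait,
    WaitOtherNotice,
}

impl NodeStartWait {
    /// Same assumed flag spellings as above.
    pub fn to_args(self) -> Vec<&'static str> {
        match self {
            NodeStartWait::NoWait => vec!["--no-wait"],
            NodeStartWait::WaitOtherNotice => vec!["--wait-other-notice"],
        }
    }
}
```

If the flags turn out to be mutually exclusive, the enum form would encode that in the type; if they compose, the struct form is the natural fit.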

muzarski · 2025-03-25T16:37:25Z

> On the above matter @muzarski : How should I adapt CCM given that we now support multiple TLS backends? I see we have in mod.rs DB_TLS_CERT_PATH, DB_TLS_KEY_PATH and CA_TLS_CERT_PATH (which I removed for now because it was guarded by old feature name). I also see that on branch-hackathon you wrote a TLS test. We should probably make a test per backend, right? What about those vars? Do they make sense for all the tests? In that case we just need to change feature guard on CA_TLS_CERT_PATH to be activated when any backend is active - or to even make it always active because why not.

Notice that I implemented all of this when we had old certificates in the repository. We had to update them, because of the errors thrown by rustls.

Why did the old certs work for `openssl` but not for `rustls`?

It's because rustls performs hostname verification by default, while openssl does not. The CN (common name) in the db certificate did not match the hostname, so rustls returned an error.

What changed in the certificates, compared to the previous version?

I generated the db certificates for a static IP (172.44.0.2, the one we currently use in CI for the TLS single-node cluster). In other words, the extensions to the certificate request in the openssl config looked like:

[v3_req]
# Extensions to add to a certificate request
basicConstraints = CA:FALSE
subjectAltName = IP:172.44.0.2
keyUsage = nonRepudiation, digitalSignature, keyEncipherment

Thanks to that, rustls is able to verify the hostname using node's IP. It checks whether the node that we try to connect to has the same IP as the one defined in certificate (under subjectAltName).

Current state

Currently, our SSL "tests" are limited to simply running the tls-openssl and tls-rustls examples in the SSL CI workflow.
On branch-hackathon, however, I removed the SSL workflow and migrated the example test to ccm. I obviously did it for openssl only, as rustls was not supported back then.

Implementing the corresponding ccm test for the rustls backend and removing the SSL CI workflow.

Well, this is a bit tricky. While it was not an issue with openssl, for the reasons stated above (no hostname verification), it won't work for rustls. This is because we use dynamic IPs in ccm tests.

The temporary solution I see: migrate the tests from SSL workflow to ccm only for openssl and limit rustls "tests" to just running the example against the global cluster (as it is currently done in CI).

If we ever decide that we want to have ccm tests for rustls backend, I list the possible solutions to the dynamic IP problem:

Generate certificates on the fly, during the test. We would first receive some IP from the IpAllocator (or some other mechanism in the future) and then generate a self-signed cert for this IP. Then we could run ccm updateconf and provide the path to the generated certificate. There are some crates we could use, e.g. https://docs.rs/rcgen/latest/rcgen/.
Disable hostname verification during the tests. I think this can be configured via the trait: rustls::client::danger::ServerCertVerifier. I'm not entirely sure, though. This needs some research - I think this is placed in danger module for a reason. OTOH, we already use it in scylla::cloud::config and implement it for our NoCertificateVerification struct.
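As a rough illustration of the first option, the per-node part of the openssl config shown earlier could be rendered from whatever IP the allocator hands out. This is only a sketch: the function name v3_req_section is hypothetical, and the actual certificate generation (via openssl, rcgen or anything else) is deliberately left out.

```rust
use std::net::IpAddr;

/// Renders the `[v3_req]` extension section for a node certificate,
/// using the dynamically allocated node IP as the subjectAltName.
/// The function name is hypothetical; the section contents mirror the
/// config quoted earlier in this thread.
pub fn v3_req_section(node_ip: IpAddr) -> String {
    format!(
        "[v3_req]\n\
         basicConstraints = CA:FALSE\n\
         subjectAltName = IP:{node_ip}\n\
         keyUsage = nonRepudiation, digitalSignature, keyEncipherment\n"
    )
}
```

A test would call this with the address obtained for its cluster, write the result into a temporary config file, and only then generate and install the certificate.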

Lorak-mmk · 2025-03-26T06:51:14Z

Moving such workflows to CCM has 2 advantages:

We get rid of custom images
We get rid of a GHA workflow

If we move only part of it, we get neither. So I'll cherry pick commits that move auth, and skip TLS for now. We can do that in the future.

Btw is it possible to use domain names instead of ip addresses with scylla? In other words, can we have domain names instead of ip addresses in system.peers in driver-relevant columns?
If it was possible, we could use certs with hostnames instead of ips.

fruch · 2025-03-26T07:26:40Z

> Moving such workflows to CCM has 2 advantages:
>
> We get rid of custom images
> We get rid of a GHA workflow
>
> If we move only part of it, we get neither. So I'll cherry pick commits that move auth, and skip TLS for now. We can do that in the future.
>
> Btw is it possible to use domain names instead of ip addresses with scylla? In other words, can we have domain names instead of ip addresses in system.peers in driver-relevant columns? If it was possible, we could use certs with hostnames instead of ips.

scylla can use hostnames, but then you need a dns server to map them.

I think generating certs as needed is the best approach, and also give the flexibility to try more variants as needed.
that's what we are doing in dtest, and in SCT.

Lorak-mmk · 2025-03-26T07:48:41Z

> scylla can use hostnames, but then you need a dns server to map them.
>
> I think generating certs as needed is the best approach, and also give the flexibility to try more variants as needed. that's what we are doing in dtest, and in SCT.

Is there functionality in CCM to generate certs? Or do we have to do it some other way?

If Scylla can use hostnames, then we should test it too.

@fruch one other question for you. Could you describe (or point to documentation if such exists) what exactly wait-related flags do in CCM, and how do they interact if I specify more than one?

fruch · 2025-03-26T08:06:55Z

> > scylla can use hostnames, but then you need a dns server to map them.
> > I think generating certs as needed is the best approach, and also give the flexibility to try more variants as needed. that's what we are doing in dtest, and in SCT.
>
> Is there functionality in CCM to generate certs? Or do we have to do it other way?
>
> If Scylla can use hostnames, then we should test it too.
>
> @fruch one other question for you. Could you describe (or point to documentation if such exists) what exactly wait-related flags do in CCM, and how do they interact if I specify more than one?

you are more than welcome to document it.

Lorak-mmk · 2025-03-26T08:17:06Z

> you are more than welcome to document it.

I'd be happy to make a PR that improves descriptions, but I would have to first understand those options myself.
I don't know ccm's codebase at all, and it is not really friendly to new contributors, that's why I asked you to explain those options.

Lorak-mmk · 2025-03-26T09:23:00Z

I backported the commits that move auth to CCM. I also removed TLS support from CCM for now.

Lorak-mmk · 2025-03-26T13:45:41Z

Marking as ready. The way I see it the only improvement I can make here is better CCM API - which needs input from others, which is basically a review.

muzarski

> We hold nodes in Arc<RwLock<>>, and methods that give user nodes return that. Maybe we could return refs / mut refs and get rid of Arc<RwLock<>>? I'm not sure.

As of now, there is no use case for Arc<RwLock<>> - I think we can return refs/mutrefs for now. We could always revert this in the future. It also simplifies the API - I believe NodeList is no longer necessary then. Instead, we can expose nodes_iter_[mut]() and get_node_[mut]_by_id methods on Cluster.

Currently, append_node() and add_node() methods return Arc<RwLock<>>. They could return node id instead.
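A minimal sketch of what that reference-based API could look like. The method names follow the nodes_iter_[mut]() and get_node_[mut]_by_id suggestion above; the Node fields, the id type and the add_node signature are placeholders invented for this example and do not reflect the real CCM integration.

```rust
/// Placeholder node type; the fields are illustrative only.
#[derive(Debug)]
pub struct Node {
    pub id: u16,
    pub running: bool,
}

/// Cluster owning its nodes directly, without Arc<RwLock<_>>.
#[derive(Debug, Default)]
pub struct Cluster {
    nodes: Vec<Node>,
}

impl Cluster {
    pub fn new() -> Self {
        Self::default()
    }

    /// Adds a node and returns its id rather than a shared handle.
    pub fn add_node(&mut self) -> u16 {
        let id = self.nodes.len() as u16 + 1;
        self.nodes.push(Node { id, running: false });
        id
    }

    /// Shared iteration over all nodes.
    pub fn nodes_iter(&self) -> std::slice::Iter<'_, Node> {
        self.nodes.iter()
    }

    /// Mutable iteration over all nodes.
    pub fn nodes_iter_mut(&mut self) -> std::slice::IterMut<'_, Node> {
        self.nodes.iter_mut()
    }

    /// Looks a node up by id, borrowing it immutably.
    pub fn get_node_by_id(&self, id: u16) -> Option<&Node> {
        self.nodes.iter().find(|node| node.id == id)
    }

    /// Looks a node up by id, borrowing it mutably.
    pub fn get_node_mut_by_id(&mut self, id: u16) -> Option<&mut Node> {
        self.nodes.iter_mut().find(|node| node.id == id)
    }
}
```

The borrow checker then enforces what the RwLock enforced at runtime: a test can hold either several shared node references or one mutable one, but not both.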

scylla/tests/integration/ccm/example.rs

scylla/tests/integration/ccm/lib/ip_allocator.rs

Lorak-mmk · 2025-03-28T18:57:31Z

Addressed @muzarski 's comments
Removed all the Arc<RwLock<>> stuff, now we just operate on Nodes.
I decided to retain NodeList because I welcome any kind of separation and structure in this code. I made its methods simpler using iterator methods.
The method that adds a node returns a mut reference to this node. I think it is more useful than an id.

Lorak-mmk · 2025-03-28T19:06:22Z

I have one more idea: we can split off another file from cluster.rs, I would call it ccm_cmd.rs.
It would be a simple wrapper over CCM, providing builder-style commands.
The purpose of this module would be to provide a convenient way to call CCM, and to encode all its commands and flags into Rust types.
cluster.rs would be responsible for providing the user-facing API, handling config dirs etc. Its code would hopefully become cleaner.
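A small sketch of the builder-style wrapper idea, under stated assumptions: ClusterCreate, its fields and the args method are hypothetical names, and the flag spellings (--ipprefix, --version) are assumed for illustration rather than checked against the CCM CLI. The sketch only assembles an argument list; actually spawning ccm is left out.

```rust
/// Hypothetical builder for a `ccm create` invocation.
#[derive(Debug, Clone)]
pub struct ClusterCreate {
    name: String,
    ip_prefix: Option<String>,
    version: Option<String>,
}

impl ClusterCreate {
    /// Starts a command for a cluster with the given name.
    pub fn new(name: impl Into<String>) -> Self {
        Self {
            name: name.into(),
            ip_prefix: None,
            version: None,
        }
    }

    /// Sets the IP prefix (assumed flag: --ipprefix).
    pub fn ip_prefix(mut self, prefix: impl Into<String>) -> Self {
        self.ip_prefix = Some(prefix.into());
        self
    }

    /// Sets the database version (assumed flag: --version).
    pub fn version(mut self, version: impl Into<String>) -> Self {
        self.version = Some(version.into());
        self
    }

    /// Collects the arguments that would be passed to the ccm binary.
    /// Unset options are simply omitted.
    pub fn args(&self) -> Vec<String> {
        let mut args = vec!["create".to_string(), self.name.clone()];
        if let Some(prefix) = &self.ip_prefix {
            args.push("--ipprefix".to_string());
            args.push(prefix.clone());
        }
        if let Some(version) = &self.version {
            args.push("--version".to_string());
            args.push(version.clone());
        }
        args
    }
}
```

With this split, cluster.rs would only chain typed setters, and the knowledge of how each option maps to a flag would live in one place.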

Lorak-mmk · 2025-03-28T20:40:22Z

I did this for 2 commands as an experiment, in an additional commit. I like the new version, so unless anyone has a different opinion I'll convert the rest of the commands to this.

@dkropachev I see that both ccm create and ccm populate accept ipprefix argument. Why? What are their respective semantics?

muzarski · 2025-03-31T14:17:09Z

> I did this for 2 commands as an experiment, in an additional commit. I like the new version, so unless anyone has a different opinion I'll convert the rest of the commands to this.

I love the idea. The code in cluster.rs looks much cleaner.

Lorak-mmk · 2025-05-10T16:02:18Z

New version of the PR. Finalized the move to separate command builders and made a lot of other changes.
I think there is a lot of room for improvement still, but I'd like to put this up for review now anyway - it will be easier to get it to something acceptable together.

muzarski

Looks much better.

scylla/tests/integration/ccm/lib/cluster.rs

muzarski · 2025-05-13T06:05:11Z

scylla/tests/integration/ccm/lib/cluster.rs

+    fn append_node(&mut self, node_options: NodeOptions) -> &mut Node {
+        let node_name = node_options.name();
+        let node = Node::new(node_options, self.ccm_cmd.for_node(node_name));
+
+        self.nodes.push(node);
+        self.nodes.0.last_mut().unwrap()
+    }


I like that we return &mut Node here. (previously that was Arc<RwLock<_>>, correct?)

I'm not sure tbh.

Lorak-mmk · 2025-05-13T15:12:07Z

Addressed Mikolaj's comment.

wprzytula · 2025-05-14T05:52:11Z

Makefile

+.PHONY: use_cargo_lock_msrv
+use_cargo_lock_msrv:
+	mv Cargo.lock Cargo.lock.bak
+	mv Cargo.lock.msrv Cargo.lock
+
+.PHONY: restore_cargo_lock
+restore_cargo_lock:
+	mv Cargo.lock Cargo.lock.msrv
+	mv Cargo.lock.bak Cargo.lock
+
+.PHONY: test_cargo_lock_msrv
+test_cargo_lock_msrv: use_cargo_lock_msrv check restore_cargo_lock


🔧 Can you please add comments explaining use case of these commands?

I'll remove this before merging. If we add it, it should be in a separate PR.

scylla/tests/integration/ccm/lib/cli_wrapper/cluster.rs

wprzytula · 2025-05-14T06:30:00Z

scylla/tests/integration/ccm/lib/cli_wrapper/cluster.rs

+    pub(crate) fn wait_options(mut self, options: Option<NodeStartOptions>) -> Self {
+        self.wait_opts = options;
+        self
+    }
+    pub(crate) fn scylla_smp(mut self, smp: u16) -> Self {
+        self.scylla_smp = Some(smp);
+        self
+    }
+
+    pub(crate) fn scylla_mem_megabytes(mut self, mem_megabytes: u32) -> Self {
+        self.scylla_mem_megabytes = Some(mem_megabytes);
+        self
+    }


❓ Why does wait_options() accept Options and the other two functions accept non-Options?

The way I understand it, being able to use None in the public interface (like Cluster::start) was a usability request from a hackathon participant, which I agree with.
Now comes the question: what should the internal interface (cli_wrapper) that you commented on use?
I chose to also use Option there in order to have only a single place that handles the default value (in the run method).

I'm not strongly attached to this, so I can change it if you disagree.
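The trade-off can be sketched as follows. The setter names mirror the diff above; ClusterStart, the single-field NodeStartOptions, the smp accessor and effective_wait_options are simplified stand-ins invented for this example (the PR resolves the default inside its run method).

```rust
/// Simplified stand-in for the real options type.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct NodeStartOptions {
    pub no_wait: bool,
}

/// Hypothetical start-command builder.
#[derive(Debug, Default)]
pub struct ClusterStart {
    wait_opts: Option<NodeStartOptions>,
    scylla_smp: Option<u16>,
}

impl ClusterStart {
    pub fn new() -> Self {
        Self::default()
    }

    /// Takes an Option so a public API that itself accepts
    /// `Option<NodeStartOptions>` can forward its argument unchanged.
    pub fn wait_options(mut self, options: Option<NodeStartOptions>) -> Self {
        self.wait_opts = options;
        self
    }

    /// Takes a plain value: callers that do not care just skip the call.
    pub fn scylla_smp(mut self, smp: u16) -> Self {
        self.scylla_smp = Some(smp);
        self
    }

    /// Returns the configured smp value, if any.
    pub fn smp(&self) -> Option<u16> {
        self.scylla_smp
    }

    /// The single place where a missing value is turned into the default.
    pub fn effective_wait_options(&self) -> NodeStartOptions {
        self.wait_opts.unwrap_or_default()
    }
}
```

Either style works; the Option-taking setter mainly saves an if-let at every call site that already holds an Option.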

scylla/tests/integration/ccm/lib/cluster.rs

scylla/tests/integration/ccm/lib/ip_allocator.rs

wprzytula · 2025-05-14T12:09:28Z

scylla/tests/integration/ccm/lib/ip_allocator.rs

+            _ => return Err(anyhow::anyhow!("Ipv6 addresses are not yet supported!")),
+        };
+        let subnet_id: LocalSubnetIdentifier = ipv4.into();
+
+        if !self.used_ips.remove(&subnet_id) {
+            return Err(anyhow::anyhow!(
+                "IP prefix {} was not allocated - something gone wrong!",
+                ip_prefix
+            ));


♻️ consider using anyhow::bail! as an idiomatic way.

I'm not sure if using this macro is a good idea.
Normally when glancing over a function you need to look for return and ? to find early exit points. Such macros break this assumption.

scylla/tests/integration/ccm/lib/logged_cmd.rs

wprzytula · 2025-05-14T12:26:05Z

scylla/tests/integration/ccm/lib/logged_cmd.rs

+                tracing::info!(
+                    "{:15} -> failed to wait on child process: = {}",
+                    format!("exited[{}]", run_id),
+                    e
+                );


❓ Just to make sure I understand correctly: would the following be equivalent?

Suggested change

-                tracing::info!(
-                    "{:15} -> failed to wait on child process: = {}",
-                    format!("exited[{}]", run_id),
-                    e
-                );
+                tracing::info!(
+                    "exited[{:7}] -> failed to wait on child process: = {}",
+                    run_id,
+                    e
+                );

I think so, and your version is more readable imo. I'll change it.

Turns out it is not the same. The difference is in where the whitespace is. In the former version, it will be added after exited[run_id]. In your version, it will be added before the closing bracket, which looks weird (and fails tests).

If you replace format! with format_args! you can perhaps avoid the allocation (I suppose it doesn't allocate?), but I checked and it does not honor the directive to add the padding whitespace.
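The difference is easy to reproduce with plain std formatting. In this sketch run_id is assumed to be a string, which matches the behaviour described above (padding before the closing bracket); an integer run_id would be right-aligned by default, so the padding would land after the opening bracket instead. The function names are made up for the example.

```rust
/// Pads the whole label: the inner `format!` builds `exited[<id>]`
/// first, then `{:15}` pads that string on the right.
pub fn padded_label(run_id: &str) -> String {
    format!("{:15}", format!("exited[{}]", run_id))
}

/// Pads only the id, so the fill ends up inside the brackets.
pub fn padded_id(run_id: &str) -> String {
    format!("exited[{:7}]", run_id)
}

/// Same as `padded_label`, but with `format_args!`: no intermediate
/// String is built, and the width directive is not applied.
pub fn padded_label_format_args(run_id: &str) -> String {
    format!("{:15}", format_args!("exited[{}]", run_id))
}
```

For run_id "42" the three functions return "exited[42]     ", "exited[42     ]" and "exited[42]" respectively, so only the first keeps the log columns aligned.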

ccm module will contain tests that require ccm. Its lib submodule will contain the ccm integration.

Why do it this way, which is different from what we did during the hackathon?
- The old way required ugly hacks to share test utils between integration test targets, and those hacks did not work well with rust-analyzer.
- One target means better compilation time.

CCM tests will be guarded by a cfg, so we will still be able to run the subset that we want:
- All tests: run integration tests with the required cfg
- Only CCM tests: as above, but filter by ccm folder
- Only non-CCM tests: run without the cfg

Lorak-mmk · 2025-05-29T11:44:00Z

Addressed most @wprzytula comments, and responded to rest.
Rebased on main.


Lorak-mmk force-pushed the backport-ccm-main branch 3 times, most recently from ad0c3aa to 6e14276 Compare March 25, 2025 15:20

Lorak-mmk force-pushed the backport-ccm-main branch from 6e14276 to aeabc5e Compare March 26, 2025 08:56

Lorak-mmk marked this pull request as ready for review March 26, 2025 13:45

Lorak-mmk requested review from wprzytula and muzarski and removed request for wprzytula March 26, 2025 13:45

muzarski assigned Lorak-mmk Mar 26, 2025

muzarski suggested changes Mar 26, 2025


Lorak-mmk force-pushed the backport-ccm-main branch from aeabc5e to d546b53 Compare March 28, 2025 19:01

Lorak-mmk force-pushed the backport-ccm-main branch from d546b53 to f05e4dc Compare March 28, 2025 20:38

Lorak-mmk force-pushed the backport-ccm-main branch 3 times, most recently from b605f97 to a02254c Compare March 28, 2025 20:51

wprzytula mentioned this pull request Apr 14, 2025

Merge the hackathon branch and PRs #1271

Open

Lorak-mmk force-pushed the backport-ccm-main branch 2 times, most recently from db70e01 to 8253f0d Compare May 10, 2025 15:58

Lorak-mmk requested review from wprzytula and muzarski May 10, 2025 16:02

muzarski approved these changes May 13, 2025


Lorak-mmk force-pushed the backport-ccm-main branch from 8253f0d to 3e4b32a Compare May 13, 2025 15:11

wprzytula requested changes May 14, 2025


Lorak-mmk force-pushed the backport-ccm-main branch 4 times, most recently from 7c6d135 to a939123 Compare May 14, 2025 22:05

Lorak-mmk mentioned this pull request May 15, 2025

Bump MSRV to 1.81 + minor fixes #1356

Merged

8 tasks

wprzytula added this to the 1.3.0 milestone May 28, 2025

Lorak-mmk force-pushed the backport-ccm-main branch 2 times, most recently from 2f483e0 to eadb748 Compare May 29, 2025 11:40

Lorak-mmk added 2 commits May 29, 2025 13:41

Add helpful msrv-related stuff to Makefile

4caf939

Lorak-mmk force-pushed the backport-ccm-main branch from eadb748 to b34fd43 Compare May 29, 2025 11:42

Lorak-mmk force-pushed the backport-ccm-main branch from b34fd43 to e157681 Compare May 29, 2025 11:45

Lorak-mmk and others added 5 commits May 29, 2025 13:45

Add CCM integration and example test

3a83653

Co-authored-by: Mikołaj Uzarski
Co-authored-by: Dmitry Kropachev

Makefile: Add target to run CCM tests

21441d7

Add CCM CI pipeline

13d609a

For now it will run on each PR. If at some point it becomes too slow we can switch it to running manually and before release.

IT: Move authenticate tests to CCM

494f2b2

Co-authored-by: Mikołaj Uzarski

ci: remove auth workflow and related dockerfile

e157681

Auth tests are now run as a part of CCM test suite.

Lorak-mmk requested review from wprzytula and muzarski May 29, 2025 11:52
