Description
Describe the problem
Filing this based on a discussion with @knz and @jeffswenson. There is a situation where a virtual cluster can end up two versions behind the storage cluster. At that point the virtual cluster is inaccessible until the storage cluster is restored from a backup taken on an appropriate version.
To Reproduce
A use case was discovered that we hadn't previously discussed:

1. The customer initializes a cluster with virtualization on v23.2.
2. Later, the customer upgrades the storage layer to v24.1 (including version finalization in the system interface).
3. At that point, the customer does not upgrade the application tenant, either because they forget, they don't know they have to, or they have set `preserve_downgrade_option`.
4. Later still, they upgrade the storage layer to v24.2.

The virtual cluster is now two versions behind the storage cluster, which we don't aim to support, and nothing prevented the customer from getting there.
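For concreteness, a rough SQL walkthrough of the sequence, run from the system interface. The virtual cluster name `application` and the exact version strings are illustrative assumptions, not taken from the discussion:

```sql
-- Connect to the system interface (e.g. with options=-ccluster=system).

-- After swapping binaries from v23.2 to v24.1, finalize the storage cluster:
SHOW CLUSTER SETTING version;            -- still reports 23.2
SET CLUSTER SETTING version = '24.1';    -- finalizes the storage cluster only

-- The 'application' virtual cluster is never finalized; it may even be
-- pinned from a SQL session on that tenant:
--   SET CLUSTER SETTING cluster.preserve_downgrade_option = '23.2';

-- Later, after swapping binaries again to v24.2:
SET CLUSTER SETTING version = '24.2';    -- nothing blocks this today

-- The virtual cluster is still at 23.2: two versions behind the storage cluster.
```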
Expected behavior
The storage cluster should be prevented from upgrading if the cluster version of any of its secondary tenants is lower than the minimum supported version of the new storage cluster binary.
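As a sketch of the desired guardrail (the statement is real, but the error and its wording are hypothetical; no such check exists today):

```sql
-- From the system interface, with the 'application' VC still at 23.2:
SET CLUSTER SETTING version = '24.2';
-- ERROR (hypothetical): cannot finalize upgrade to 24.2: virtual cluster
-- 'application' is at 23.2, below the new binary's minimum supported
-- version 24.1
```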
Additional context
There is a potential wrinkle here and some nice-to-haves.
Wrinkle: What should happen if the binaries have already been replaced with an inappropriate version and the storage cluster upgrade is therefore blocked? Could the cluster run in a "degraded upgrade" mode, where the new binary runs in compatibility mode at a version that still allows all VCs to run?
Nice-to-haves:
- A dry-run upgrade option that reports whether any VCs have too much version skew relative to the proposed upgrade version.
- A SQL command to show VCs that are already behind the storage cluster version, and VCs that aren't yet finalized (see the hypothetical sketch after this list).
- An upgrade mechanism where the running cluster, given a path to a candidate binary, could test whether it can upgrade to that binary.
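One possible shape for the SQL command mentioned above; the syntax and output columns are purely hypothetical:

```sql
-- Hypothetical syntax; no such statement exists today.
SHOW VIRTUAL CLUSTERS WITH VERSIONS;
--   id |    name     | version | storage_version | finalized
-- -----+-------------+---------+-----------------+-----------
--    3 | application |  23.2   |      24.2       |   false
```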
Epic: CRDB-26691
Jira issue: CRDB-31323