Skip to content

Conversation

@shayasoolin
Copy link

@shayasoolin shayasoolin commented Nov 13, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements the ClusterTopology controller with deletion prevention logic. The controller:

  • Automatically manages the grove.io/clustertopology finalizer on all ClusterTopology resources
  • Prevents deletion when the topology is in use by PodCliqueSet resources (via grove.io/cluster-topology-name label)
  • Prevents deletion when the topology is actively configured in the operator settings
  • Requeues deletion attempts every 30s until blockers are cleared

Follows existing Grove controller patterns and achieves 90.6% test coverage.

Special notes for your reviewer:

Dependencies:

  • Requires the TopologyConfiguration API changes (separate PR)

Deletion Prevention Logic:
The controller uses a conservative approach - it blocks deletion if any PodCliqueSet references the topology OR if the topology is configured in operator settings. The default topology name is grove-topology if not specified.

Test Coverage:
25 test cases covering creation, deletion, edge cases, and error handling. Uses fake clients following existing Grove test patterns.

Additional documentation e.g., enhancement proposals, usage docs, etc.:

To reference a ClusterTopology from a PodCliqueSet, add the grove.io/topology-name label.

To delete a ClusterTopology, ensure no PodCliqueSet resources reference it and it's not configured in the operator settings. The controller will automatically retry deletion every 30 seconds.

@shayasoolin shayasoolin force-pushed the cluster-topology-deletion-prevention branch from c09e62f to eda230e Compare November 13, 2025 08:49
@shayasoolin shayasoolin marked this pull request as ready for review November 13, 2025 12:22
"podCliqueSetCount", len(pcsList.Items))
r.eventRecorder.Eventf(ct, corev1.EventTypeWarning, groveconstants.ReasonClusterTopologyDeleteBlocked,
"Cannot delete ClusterTopology %s: referenced by %d PodCliqueSet resource(s)", ct.Name, len(pcsList.Items))
return ctrlcommon.ReconcileAfter(30*time.Second, "topology referenced by PodCliqueSet resources")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to reconcile in that case? It could lead to a infinite loop

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I changed it to be done using a watch on PCS, so we'll get an event when a PCS is deleted or changed. Thanks

newHasValue := newExists && newLabel != ""

// Label removed (had value, now doesn't) or changed (both have values but different)
return (oldHasValue && !newHasValue) || (oldHasValue && newHasValue && oldLabel != newLabel)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the event is on the new obj, so if newHasValue is false => the map function will drop the event

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get you. There are two cases in which the event is NOT dropped:

  1. Label removed (had value, now doesn't) --> (oldHasValue && !newHasValue)
  2. Label changed (both have values but different) --> (oldHasValue && newHasValue && oldLabel != newLabel)
    Which case are you referring to with the newHasValue==false?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first one -
you will get to the mapping function with the new obj => read the topology name => its empty => no reconcile

actually, the second one is also problematic -
you will get to the mapping function with the new obj => read the topology name => its the new one => if the old one is a topology that is being deleted and now no PCS using it we will not reconcile it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the implementation to use custom handler. This way, both the predicate and the mapping have both old and new objects to work with. So we return a list of both topologies - old and new, so both will be reconciled.

3. Controller reconciles:
- Detects deletion request (deletion timestamp set)
- Detects deletion request (deletion timestamp set)
- Checks if any PodCliqueSet (in any namespace) references this ClusterTopology (via `grove.io/topology-name` label)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this grove.io/cluster-topology-name now?

if err := ctrlutils.RemoveAndPatchFinalizer(ctx, r.client, ct, constants.FinalizerClusterTopology); err != nil {
logger.Error(err, "failed to remove finalizer", "ClusterTopology", ct.Name, "finalizerName", constants.FinalizerClusterTopology)
r.eventRecorder.Eventf(ct, corev1.EventTypeWarning, groveconstants.ReasonClusterTopologyDeleteFailed,
"Cannot delete ClusterTopology %s: error removing finalizer: %s from ClusterTopology: %s: %w", constants.FinalizerClusterTopology, ct.Name, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%w is only supported by fmt.Errorf unfortunately. Pretty sure the underlying code is using fmt.Sprintf so this won't work properly. So you'll need to use a %v I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check the arguments. ct.Name isn't first, but looks like that what you want for the %s.

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if tt.topologyConfig.Name == "" {
tt.topologyConfig.Name = configv1alpha1.DefaultClusterTopologyName // TODO remove this after taking care of registering the defaults properly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, create a github issue or something else to track this outside of code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still hope to figure this out before merging this PR. Otherwise will do.

WithFinalizer(constants.FinalizerClusterTopology),
topologyConfig: configv1alpha1.ClusterTopologyConfiguration{
Enabled: true,
Name: configv1alpha1.DefaultClusterTopologyName, // TODO remove when defaults are registered properly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here...

@gflarity
Copy link
Contributor

gflarity commented Nov 14, 2025

A couple things minor things.


- Admin must satisfy BOTH conditions before deletion:
- Delete all PodCliqueSet resources that reference this ClusterTopology (check `grove.io/topology-name` label)
- Delete all PodCliqueSet resources that reference this ClusterTopology (check `grove.io/cluster-topology-name` label)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can revert this changes as I created a PR with this changes

4. Operator validates ClusterTopology CR exists
5. For existing workloads:
- Workloads with topology constraints: operator uses topology from PodCliqueSet label (`grove.io/topology-name`)
- Workloads with topology constraints: operator uses topology from PodCliqueSet label (`grove.io/cluster-topology-name`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

6. For new workloads:
- Topology constraints validated against ClusterTopology CR
- Mutation webhook adds `grove.io/topology-name` label to PodCliqueSet (see Mutation Webhook section)
- Mutation webhook adds `grove.io/cluster-topology-name` label to PodCliqueSet (see Mutation Webhook section)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

r.removeFinalizer,
}

for _, fn := range deleteStepFns {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for _, fn := range deleteStepFns {
for _, deleteStepFn := range deleteStepFns {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to be aligned with all other usages of this pattern

// Deletion is blocked if:
// 1. Any PodCliqueSet references this topology (via grove.io/cluster-topology-name label)
// 2. Topology is enabled AND this specific topology is configured
func (r *Reconciler) checkDeletionConditions(ctx context.Context, logger logr.Logger, ct *grovecorev1alpha1.ClusterTopology) ctrlcommon.ReconcileStepResult {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (r *Reconciler) checkDeletionConditions(ctx context.Context, logger logr.Logger, ct *grovecorev1alpha1.ClusterTopology) ctrlcommon.ReconcileStepResult {
func (r *Reconciler) checkDeletionConditions(ctx context.Context, logger logr.Logger, cTopology *grovecorev1alpha1.ClusterTopology) ctrlcommon.ReconcileStepResult {

}

// Condition 2: Check if topology is enabled and configured to use this ClusterTopology
if r.config.ClusterTopology.Enabled && r.config.ClusterTopology.Name == ct.Name {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will first check this and then go for checking if there is pcs ref it, as we prefer non api calls first

// ReasonClusterTopologyDeleteBlocked is an event reason which represents that the deletion of a ClusterTopology is blocked.
ReasonClusterTopologyDeleteBlocked = "ClusterTopologyDeleteBlocked"
// ReasonClusterTopologyDeleteSuccessful is an event reason which represents a successful deletion of a ClusterTopology.
ReasonClusterTopologyDeleteSuccessful = "ClusterTopologyDeleteSuccessful"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the controller only look on it self, there might another finlizer created by someone else, so it better to stay a way deleted successful, just finlizer removed successfully

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I'm aligned with other similar consts. When we add another finalizer we can change that, for now we're the only controller looking at our own CRD

Comment on lines +62 to +63
// Update handles PodCliqueSet update events.
// Reconciles both old and new topologies if the topology label was changed or removed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont allow to change this label, so it might redundant to look for this, so no reconcile for this either

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to be resilient and independent of such design decisions. The price is low, so I think it's better to support this case as well. We may decide to allow changing topology for a PCS (it's actually not so well defined yet, as I see it.)


// getTopologyName extracts the topology name from a PodCliqueSet's labels.
// Returns empty string if the label doesn't exist or is empty.
func getTopologyName(pcs *grovecorev1alpha1.PodCliqueSet) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this to util as it usefull to the PCS controller as well


// RegisterControllers registers all controllers with the manager.
func RegisterControllers(mgr ctrl.Manager, controllerConfig configv1alpha1.ControllerConfiguration) error {
func RegisterControllers(mgr ctrl.Manager, controllerConfig configv1alpha1.ControllerConfiguration, topologyConfig configv1alpha1.OperatorConfiguration) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not only the topology config, so you need to change the name topologyConfig

@danbar2 danbar2 self-requested a review November 18, 2025 06:54
danbar2
danbar2 previously approved these changes Nov 18, 2025
@sanjaychatterjee
Copy link
Collaborator

@shayasoolin can you please rebase your changes and resolve conflicts? Also, please address the comments where @Ronkahn21 mentions that those changes are done in a separate PR.

@shayasoolin shayasoolin closed this Dec 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants