-
Notifications
You must be signed in to change notification settings - Fork 26
Cluster topology deletion prevention #259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster topology deletion prevention #259
Conversation
c09e62f to
eda230e
Compare
| "podCliqueSetCount", len(pcsList.Items)) | ||
| r.eventRecorder.Eventf(ct, corev1.EventTypeWarning, groveconstants.ReasonClusterTopologyDeleteBlocked, | ||
| "Cannot delete ClusterTopology %s: referenced by %d PodCliqueSet resource(s)", ct.Name, len(pcsList.Items)) | ||
| return ctrlcommon.ReconcileAfter(30*time.Second, "topology referenced by PodCliqueSet resources") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to reconcile in that case? It could lead to a infinite loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I changed it to be done using a watch on PCS, so we'll get an event when a PCS is deleted or changed. Thanks
| newHasValue := newExists && newLabel != "" | ||
|
|
||
| // Label removed (had value, now doesn't) or changed (both have values but different) | ||
| return (oldHasValue && !newHasValue) || (oldHasValue && newHasValue && oldLabel != newLabel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the event is on the new obj, so if newHasValue is false => the map function will drop the event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't get you. There are two cases in which the event is NOT dropped:
- Label removed (had value, now doesn't) --> (oldHasValue && !newHasValue)
- Label changed (both have values but different) --> (oldHasValue && newHasValue && oldLabel != newLabel)
Which case are you referring to with the newHasValue==false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first one -
you will get to the mapping function with the new obj => read the topology name => its empty => no reconcile
actually, the second one is also problematic -
you will get to the mapping function with the new obj => read the topology name => its the new one => if the old one is a topology that is being deleted and now no PCS using it we will not reconcile it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the implementation to use custom handler. This way, both the predicate and the mapping have both old and new objects to work with. So we return a list of both topologies - old and new, so both will be reconciled.
docs/designs/topology.md
Outdated
| 3. Controller reconciles: | ||
| - Detects deletion request (deletion timestamp set) | ||
| - Detects deletion request (deletion timestamp set) | ||
| - Checks if any PodCliqueSet (in any namespace) references this ClusterTopology (via `grove.io/topology-name` label) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this grove.io/cluster-topology-name now?
| if err := ctrlutils.RemoveAndPatchFinalizer(ctx, r.client, ct, constants.FinalizerClusterTopology); err != nil { | ||
| logger.Error(err, "failed to remove finalizer", "ClusterTopology", ct.Name, "finalizerName", constants.FinalizerClusterTopology) | ||
| r.eventRecorder.Eventf(ct, corev1.EventTypeWarning, groveconstants.ReasonClusterTopologyDeleteFailed, | ||
| "Cannot delete ClusterTopology %s: error removing finalizer: %s from ClusterTopology: %s: %w", constants.FinalizerClusterTopology, ct.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%w is only supported by fmt.Errorf unfortunately. Pretty sure the underlying code is using fmt.Sprintf so this won't work properly. So you'll need to use a %v I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also check the arguments. ct.Name isn't first, but looks like that what you want for the %s.
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| if tt.topologyConfig.Name == "" { | ||
| tt.topologyConfig.Name = configv1alpha1.DefaultClusterTopologyName // TODO remove this after taking care of registering the defaults properly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, create a github issue or something else to track this outside of code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still hope to figure this out before merging this PR. Otherwise will do.
| WithFinalizer(constants.FinalizerClusterTopology), | ||
| topologyConfig: configv1alpha1.ClusterTopologyConfiguration{ | ||
| Enabled: true, | ||
| Name: configv1alpha1.DefaultClusterTopologyName, // TODO remove when defaults are registered properly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here...
|
A couple things minor things. |
|
|
||
| - Admin must satisfy BOTH conditions before deletion: | ||
| - Delete all PodCliqueSet resources that reference this ClusterTopology (check `grove.io/topology-name` label) | ||
| - Delete all PodCliqueSet resources that reference this ClusterTopology (check `grove.io/cluster-topology-name` label) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can revert this changes as I created a PR with this changes
| 4. Operator validates ClusterTopology CR exists | ||
| 5. For existing workloads: | ||
| - Workloads with topology constraints: operator uses topology from PodCliqueSet label (`grove.io/topology-name`) | ||
| - Workloads with topology constraints: operator uses topology from PodCliqueSet label (`grove.io/cluster-topology-name`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
| 6. For new workloads: | ||
| - Topology constraints validated against ClusterTopology CR | ||
| - Mutation webhook adds `grove.io/topology-name` label to PodCliqueSet (see Mutation Webhook section) | ||
| - Mutation webhook adds `grove.io/cluster-topology-name` label to PodCliqueSet (see Mutation Webhook section) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
| r.removeFinalizer, | ||
| } | ||
|
|
||
| for _, fn := range deleteStepFns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for _, fn := range deleteStepFns { | |
| for _, deleteStepFn := range deleteStepFns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to be aligned with all other usages of this pattern
| // Deletion is blocked if: | ||
| // 1. Any PodCliqueSet references this topology (via grove.io/cluster-topology-name label) | ||
| // 2. Topology is enabled AND this specific topology is configured | ||
| func (r *Reconciler) checkDeletionConditions(ctx context.Context, logger logr.Logger, ct *grovecorev1alpha1.ClusterTopology) ctrlcommon.ReconcileStepResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func (r *Reconciler) checkDeletionConditions(ctx context.Context, logger logr.Logger, ct *grovecorev1alpha1.ClusterTopology) ctrlcommon.ReconcileStepResult { | |
| func (r *Reconciler) checkDeletionConditions(ctx context.Context, logger logr.Logger, cTopology *grovecorev1alpha1.ClusterTopology) ctrlcommon.ReconcileStepResult { |
| } | ||
|
|
||
| // Condition 2: Check if topology is enabled and configured to use this ClusterTopology | ||
| if r.config.ClusterTopology.Enabled && r.config.ClusterTopology.Name == ct.Name { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will first check this and then go for checking if there is pcs ref it, as we prefer non api calls first
| // ReasonClusterTopologyDeleteBlocked is an event reason which represents that the deletion of a ClusterTopology is blocked. | ||
| ReasonClusterTopologyDeleteBlocked = "ClusterTopologyDeleteBlocked" | ||
| // ReasonClusterTopologyDeleteSuccessful is an event reason which represents a successful deletion of a ClusterTopology. | ||
| ReasonClusterTopologyDeleteSuccessful = "ClusterTopologyDeleteSuccessful" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the controller only look on it self, there might another finlizer created by someone else, so it better to stay a way deleted successful, just finlizer removed successfully
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I'm aligned with other similar consts. When we add another finalizer we can change that, for now we're the only controller looking at our own CRD
| // Update handles PodCliqueSet update events. | ||
| // Reconciles both old and new topologies if the topology label was changed or removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we dont allow to change this label, so it might redundant to look for this, so no reconcile for this either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to be resilient and independent of such design decisions. The price is low, so I think it's better to support this case as well. We may decide to allow changing topology for a PCS (it's actually not so well defined yet, as I see it.)
|
|
||
| // getTopologyName extracts the topology name from a PodCliqueSet's labels. | ||
| // Returns empty string if the label doesn't exist or is empty. | ||
| func getTopologyName(pcs *grovecorev1alpha1.PodCliqueSet) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this to util as it usefull to the PCS controller as well
|
|
||
| // RegisterControllers registers all controllers with the manager. | ||
| func RegisterControllers(mgr ctrl.Manager, controllerConfig configv1alpha1.ControllerConfiguration) error { | ||
| func RegisterControllers(mgr ctrl.Manager, controllerConfig configv1alpha1.ControllerConfiguration, topologyConfig configv1alpha1.OperatorConfiguration) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not only the topology config, so you need to change the name topologyConfig
|
@shayasoolin can you please rebase your changes and resolve conflicts? Also, please address the comments where @Ronkahn21 mentions that those changes are done in a separate PR. |
f98e035 to
8db89d1
Compare
What type of PR is this?
/kind feature
What this PR does / why we need it:
Implements the ClusterTopology controller with deletion prevention logic. The controller:
grove.io/clustertopologyfinalizer on all ClusterTopology resourcesgrove.io/cluster-topology-namelabel)Follows existing Grove controller patterns and achieves 90.6% test coverage.
Special notes for your reviewer:
Dependencies:
Deletion Prevention Logic:
The controller uses a conservative approach - it blocks deletion if any PodCliqueSet references the topology OR if the topology is configured in operator settings. The default topology name is
grove-topologyif not specified.Test Coverage:
25 test cases covering creation, deletion, edge cases, and error handling. Uses fake clients following existing Grove test patterns.
Additional documentation e.g., enhancement proposals, usage docs, etc.: