Skip to content

Commit f98e035

Browse files
committed
cluster topology controller - deletion prevention
1 parent d3bdf93 commit f98e035

File tree

3 files changed

+14
-13
lines changed

3 files changed

+14
-13
lines changed

docs/designs/topology.md

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -229,7 +229,7 @@ Deletion Workflow:
229229
2. Kubernetes blocks deletion (finalizer `grove.io/clustertopology` present)
230230
3. Controller reconciles:
231231
- Detects deletion request (deletion timestamp set)
232-
- Checks if any PodCliqueSet (in any namespace) references this ClusterTopology (via `grove.io/topology-name` label)
232+
- Checks if any PodCliqueSet (in any namespace) references this ClusterTopology (via `grove.io/cluster-topology-name` label)
233233
- Checks if topology is enabled in operator config (`topology.enabled: true`) and this specific topology is configured (considering both the default topology name and the `topology.name` setting)
234234
- If ANY PodCliqueSet references this topology OR topology is enabled and set to this topology: Keeps finalizer, deletion blocked
235235
- If NO PodCliqueSet references this topology AND topology is disabled or not set to this topology: Removes finalizer, deletion proceeds
@@ -238,7 +238,7 @@ Deletion Workflow:
238238
Key Points:
239239

240240
- Admin must satisfy BOTH conditions before deletion:
241-
- Delete all PodCliqueSet resources that reference this ClusterTopology (check `grove.io/topology-name` label)
241+
- Delete all PodCliqueSet resources that reference this ClusterTopology (check `grove.io/cluster-topology-name` label)
242242
- Disable TAS in operator config (`topology.enabled: false`) and restart operator
243243
- Controller checks both conditions before allowing deletion
244244
- Controller continuously reconciles deletion requests
@@ -280,11 +280,11 @@ topology:
280280
3. Admin restarts operator (see Startup Behavior section for details)
281281
4. Operator validates ClusterTopology CR exists
282282
5. For existing workloads:
283-
- Workloads with topology constraints: operator uses topology from PodCliqueSet label (`grove.io/topology-name`)
283+
- Workloads with topology constraints: operator uses topology from PodCliqueSet label (`grove.io/cluster-topology-name`)
284284
- Workloads without topology constraints: no impact (checked via label presence)
285285
6. For new workloads:
286286
- Topology constraints validated against ClusterTopology CR
287-
- Mutation webhook adds `grove.io/topology-name` label to PodCliqueSet (see Mutation Webhook section)
287+
- Mutation webhook adds `grove.io/cluster-topology-name` label to PodCliqueSet (see Mutation Webhook section)
288288

289289
**Disabling Topology (topology.enabled: true → false):**
290290

@@ -383,7 +383,7 @@ type PodCliqueTemplateSpec struct {
383383

384384
The mutation webhook automatically modifies PodCliqueSet resources on CREATE operations:
385385

386-
- **Label Addition**: Automatically adds `grove.io/topology-name` label to PodCliqueSet
386+
- **Label Addition**: Automatically adds `grove.io/cluster-topology-name` label to PodCliqueSet
387387
- **Label Value**: Set to the configured ClusterTopology name from operator config (e.g., "grove-topology")
388388
- **Condition**: Label only added when `topology.enabled: true` in operator configuration
389389
- **Purpose**: Enables workload-to-topology mapping for controller lifecycle management and deletion protection
@@ -407,7 +407,7 @@ The mutation webhook automatically modifies PodCliqueSet resources on CREATE ope
407407

408408
**Label Immutability:**
409409

410-
- Webhook rejects changes to `grove.io/topology-name` label on PodCliqueSet after creation
410+
- Webhook rejects changes to `grove.io/cluster-topology-name` label on PodCliqueSet after creation
411411
- Label can only be set during PodCliqueSet creation (see Mutation Webhook section above)
412412
- Ensures topology reference remains consistent throughout resource lifecycle
413413

@@ -450,7 +450,7 @@ The operator adds topology information to PodGang metadata via annotation:
450450
# Annotation added to PodGang
451451
metadata:
452452
annotations:
453-
grove.io/topology-name: "<user-configured-name>"
453+
grove.io/cluster-topology-name: "<user-configured-name>"
454454
```
455455
456456
This annotation allows the scheduler to locate the Kueue Topology resource without requiring a spec field, providing
@@ -530,7 +530,7 @@ Fields Added:
530530

531531
Annotations Added:
532532

533-
- `grove.io/topology-name: "<user-configured-name>"` - Annotation on PodGang metadata referencing topology name
533+
- `grove.io/cluster-topology-name: "<user-configured-name>"` - Annotation on PodGang metadata referencing topology name
534534

535535
Fields Removed:
536536

@@ -544,7 +544,7 @@ The operator translates Grove operator API to Grove Scheduler API with three-lev
544544

545545
**Topology Annotation:**
546546

547-
- Operator adds annotation `grove.io/topology-name: "<topology-name>"` to PodGang metadata
547+
- Operator adds annotation `grove.io/cluster-topology-name: "<topology-name>"` to PodGang metadata
548548
- Annotation value matches the ClusterTopology name from operator configuration
549549
- KAI scheduler uses this annotation to locate the corresponding Kueue Topology CRD
550550
- Annotation approach provides API flexibility for future changes without breaking spec

operator/internal/controller/clustertopology/reconciledelete.go

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ package clustertopology
1818

1919
import (
2020
"context"
21+
"fmt"
2122

2223
apicommon "github.com/ai-dynamo/grove/operator/api/common"
2324
"github.com/ai-dynamo/grove/operator/api/common/constants"
@@ -97,11 +98,11 @@ func (r *Reconciler) removeFinalizer(ctx context.Context, logger logr.Logger, ct
9798
if err := ctrlutils.RemoveAndPatchFinalizer(ctx, r.client, ct, constants.FinalizerClusterTopology); err != nil {
9899
logger.Error(err, "failed to remove finalizer", "ClusterTopology", ct.Name, "finalizerName", constants.FinalizerClusterTopology)
99100
r.eventRecorder.Eventf(ct, corev1.EventTypeWarning, groveconstants.ReasonClusterTopologyDeleteFailed,
100-
"Cannot delete ClusterTopology %s: error removing finalizer: %s from ClusterTopology: %s: %w", constants.FinalizerClusterTopology, ct.Name, err)
101-
return ctrlcommon.ReconcileWithErrors("failed to remove finalizer", err)
101+
"Failed to remove finalizer %s from ClusterTopology %s: %v", constants.FinalizerClusterTopology, ct.Name, err)
102+
return ctrlcommon.ReconcileWithErrors("error removing finalizer", fmt.Errorf("failed to remove finalizer %s from ClusterTopology %s: %v", constants.FinalizerClusterTopology, ct.Name, err))
102103
}
103104
logger.Info("Finalizer removed", "ClusterTopology", ct.Name, "finalizerName", constants.FinalizerClusterTopology)
104105
r.eventRecorder.Eventf(ct, corev1.EventTypeNormal, groveconstants.ReasonClusterTopologyDeleteSuccessful,
105-
"ClusterTopology %s deleted successfully", ct.Name)
106+
"Successfully removed finalizer %s from ClusterTopology %s", constants.FinalizerClusterTopology, ct.Name)
106107
return ctrlcommon.ContinueReconcile()
107108
}

operator/internal/controller/clustertopology/reconcilespec.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -56,7 +56,7 @@ func (r *Reconciler) ensureFinalizer(ctx context.Context, logger logr.Logger, ct
5656
if err := ctrlutils.AddAndPatchFinalizer(ctx, r.client, ct, constants.FinalizerClusterTopology); err != nil {
5757
r.eventRecorder.Eventf(ct, corev1.EventTypeWarning, groveconstants.ReasonClusterTopologyCreateOrUpdateFailed,
5858
"Failed to add finalizer %s to ClusterTopology %s: %v", constants.FinalizerClusterTopology, ct.Name, err)
59-
return ctrlcommon.ReconcileWithErrors("error adding finalizer", fmt.Errorf("failed to add finalizer: %s to ClusterTopology: %s: %w", constants.FinalizerClusterTopology, ct.Name, err))
59+
return ctrlcommon.ReconcileWithErrors("error adding finalizer", fmt.Errorf("failed to add finalizer %s to ClusterTopology %s: %v", constants.FinalizerClusterTopology, ct.Name, err))
6060
}
6161
r.eventRecorder.Eventf(ct, corev1.EventTypeNormal, groveconstants.ReasonClusterTopologyCreateOrUpdateSuccessful,
6262
"Successfully added finalizer %s to ClusterTopology %s", constants.FinalizerClusterTopology, ct.Name)

0 commit comments

Comments
 (0)