-
Notifications
You must be signed in to change notification settings - Fork 66
verify pvcs for ownerreferences pointing to VM before cleanup during DR #2345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
9735efb to
9506f80
Compare
9506f80 to
dae9610
Compare
dae9610 to
c8fd4fd
Compare
internal/controller/util/vm_util.go
Outdated
| } | ||
|
|
||
| // Fetch the owner object | ||
| owner, err := dynamicClient.Resource(gvr). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sigs.k8s.io/controller-runtime Client can be used to get the resources(including CRs) and is readily available as part of VRG controller. Please check if that can used instead of dynamic client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! Agreed—using sigs.k8s.io/controller-runtime’s typed Client is the better approach since it leverages the shared cache and provides strong typing. While the dynamic client works without scheme registration by using unstructured.Unstructured, it lacks type safety, caching benefits, and requires more manual handling. I’ll refactor this to use the controller-runtime client and ensure the necessary schemes are registered.
Additionally, for ownership checks where we only need metadata (like ownerReferences), I’ll consider using PartialObjectMetadata with the typed client. This will reduce overhead compared to fetching full objects while still benefiting from caching and typed interactions.
c8fd4fd to
3e93726
Compare
| "go.uber.org/zap/zapcore" | ||
| apiextensions "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1" | ||
| "k8s.io/apimachinery/pkg/runtime" | ||
| utilruntime "k8s.io/apimachinery/pkg/util/runtime" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file effectively has no changes. Better to do git restore and continue.
3aeb241 to
36a82ac
Compare
Signed-off-by: pruthvitd <[email protected]>
36a82ac to
2d314d2
Compare
internal/controller/vrg_volrep.go
Outdated
| func (v *VRGInstance) ShouldCleanupVMForSecondary() bool { | ||
| if !v.IsDRActionInProgress() { | ||
| v.log.Info("Skip VM cleanup; reconcile as secondary") | ||
|
|
||
| return false | ||
| } | ||
|
|
||
| v.log.Info( | ||
| "DR action progressing", | ||
| "component", "VRGController", | ||
| "action", "reconcile", | ||
| "vrgName", v.instance.GetName(), | ||
| "namespace", v.instance.GetNamespace(), | ||
| "desiredState", v.instance.Spec.ReplicationState, | ||
| "currentState", v.instance.Status.State, | ||
| ) | ||
|
|
||
| vmNamespaceList := v.instance.Spec.KubeObjectProtection.RecipeParameters[recipecore.ProtectedVMNamespace] | ||
| vmList := v.instance.Spec.KubeObjectProtection.RecipeParameters[recipecore.VMList] | ||
|
|
||
| if len(vmList) > 0 { | ||
| if rmnutil.IsVMDeletionInProgress(v.ctx, v.reconciler.Client, vmList, vmNamespaceList, v.log) { | ||
| v.log.Info("VM deletion is in progress, skipping ownerreferences check") | ||
|
|
||
| return true | ||
| } | ||
|
|
||
| foundVMs, err := rmnutil.ListVMsByVMNamespace(v.ctx, v.reconciler.APIReader, | ||
| v.log, vmNamespaceList, vmList) | ||
| if len(foundVMs) == 0 || err != nil { | ||
| v.log.Info( | ||
| "No VirtualMachines found for cleanup; deletion appears complete", | ||
| "vmList", vmList, | ||
| "namespaceList", vmNamespaceList, | ||
| ) | ||
|
|
||
| return false | ||
| } | ||
| } | ||
|
|
||
| if v.IsAllProtectedPVCsOwnedByProtectedVMs() { | ||
| v.log.Info("all protected PVCs have ownerreferences to protected list of VMs") | ||
| // Cleanup VM resources | ||
| err := rmnutil.DeleteVMs(v.ctx, v.reconciler.Client, vmList, vmNamespaceList, v.log) | ||
| if err != nil { | ||
| v.log.Error(err, "Failed to delete VMs", | ||
| "vmList", vmList, | ||
| ) | ||
|
|
||
| return false | ||
| } | ||
|
|
||
| return true | ||
| } | ||
|
|
||
| v.log.Info("not all protected PVCs have ownerReferences to the protected list of VMs") | ||
|
|
||
| return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be broken down into smaller functions: checking deletion in progress, then listing VMs by namespace, then checking all PVCs’ ownerRefs. It will ensure early exits are obvious
internal/controller/util/vm_util.go
Outdated
| log.Info("Failed to fetch owner", "error", err) | ||
|
|
||
| continue // skip if not found |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skipping silently avoids surfacing of issues which I'm not sure is the best practice.
| log.Info("Failed to fetch owner", "error", err) | |
| continue // skip if not found | |
| if k8serrors.IsNotFound(err) {{ | |
| log.V(1).Info("Owner not found during traversal", "owner", key) | |
| continue | |
| } | |
| log.Error(err, "Failed fetching owner during VM ownership resolution", "owner", key) | |
| return "", fmt.Errorf("failed to fetch owner %s: %w", key, err) |
internal/controller/util/vm_util.go
Outdated
| // IsOwnedByVM recursively traverses ownerReferences until it finds a VirtualMachine. | ||
| func IsOwnedByVM(ctx context.Context, c client.Client, obj client.Object, | ||
| owners []metav1.OwnerReference, log logr.Logger, | ||
| ) (string, error) { | ||
| for _, owner := range owners { | ||
| if owner.Kind == KindVirtualMachine && owner.APIVersion == KubeVirtAPIVersion { | ||
| return owner.Name, nil // Found VM root | ||
| } | ||
|
|
||
| // Fetch only metadata of the owner | ||
| ownerMeta := &metav1.PartialObjectMetadata{} | ||
| ownerMeta.SetGroupVersionKind(schema.FromAPIVersionAndKind(owner.APIVersion, owner.Kind)) | ||
|
|
||
| if err := c.Get(ctx, client.ObjectKey{Namespace: obj.GetNamespace(), Name: owner.Name}, ownerMeta); err != nil { | ||
| log.Info("Failed to fetch owner", "error", err) | ||
|
|
||
| continue // skip if not found | ||
| } | ||
|
|
||
| // Continue traversal with the owner | ||
| // Recursively check its ownerReferences | ||
| obj = ownerMeta | ||
|
|
||
| nestedOwners := obj.GetOwnerReferences() | ||
| if len(nestedOwners) > 0 { | ||
| vmName, err := IsOwnedByVM(ctx, c, obj, nestedOwners, log) | ||
| if err == nil { | ||
| return vmName, nil | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return "", fmt.Errorf("no VM owner found") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would implement this as an iterative function. OwnerReferences in k8s are structured as flat lists that are singly linked, so recursion doesn’t add much here and can introduce avoidable issues. K8s controllers typically traverse owner chains iteratively. It avoids stack depth issues and makes cycle detection straightforward.
| // +kubebuilder:rbac:groups="kubevirt.io",resources=virtualmachines,verbs=get;list;watch;delete | ||
| // +kubebuilder:rbac:groups="cdi.kubevirt.io",resources=datavolumes,verbs=get;list;watch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best to keep these to the bare minimum. I think list verb for data volumes isn't required, as I don't see you're listing those via the k8s API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list and watch are required because the controller-runtime client uses a cache backed by informers, even for Get calls on PartialObjectMetadata. Without list, the cache cannot initialize, and the client fails. For the calls used here, these RBACs are required.
Signed-off-by: pruthvitd <[email protected]>
56d611e to
100ac46
Compare
This PR introduces improvements to the VM cleanup process as part of DR action and addresses a critical issue where Secondary VRG reconciliation stopped prematurely, leaving PVCs stuck in a Terminating state.
Key Changes
Safe VM Deletion Validation
Conditional Cleanup
Fix for Secondary VRG Reconciliation
Resolved an issue where Secondary VRG stopped reconciling before deleting associated VolumeGroupReplication (VGR) and VolumeReplication (VR) resources.
This fix ensures proper cleanup of VGR and VR resources, preventing PVCs from remaining in a Terminating state. Secondary will continue to reconcile until VRG desired state is achieved.