
Conversation

pruthvitd (Member) commented Nov 24, 2025

This PR improves the VM cleanup process performed as part of a DR action and addresses a critical issue where Secondary VRG reconciliation stopped prematurely, leaving PVCs stuck in the Terminating state.

Key Changes

Safe VM Deletion Validation
  • Added logic to validate whether VM resources can be safely deleted before initiating cleanup.
  • Prevents accidental deletion when PVC ownership conditions are not met.
Conditional Cleanup
  • If validation passes, VM resources are deleted.
  • If validation fails, the controller falls back to the previous reconciliation logic to maintain consistency; in that case, cleanup is left to the user.

Fix for Secondary VRG Reconciliation

Resolved an issue where the Secondary VRG stopped reconciling before deleting the associated VolumeGroupReplication (VGR) and VolumeReplication (VR) resources.
This fix ensures proper cleanup of VGR and VR resources, preventing PVCs from remaining in the Terminating state. The Secondary will continue to reconcile until the VRG's desired state is achieved.

pruthvitd force-pushed the verify-vm-ownership branch 4 times, most recently from 9735efb to 9506f80 on November 25, 2025 18:24
pruthvitd marked this pull request as ready for review on November 25, 2025 18:25
pruthvitd changed the title from "verify pvcs for ownerreferences pointing to VM" to "verify pvcs for ownerreferences pointing to VM before cleanup during DR" on Nov 25, 2025
}

// Fetch the owner object
owner, err := dynamicClient.Resource(gvr).
Member

The sigs.k8s.io/controller-runtime Client can be used to get the resources (including CRs) and is readily available as part of the VRG controller. Please check if that can be used instead of the dynamic client.

Member Author

Thanks for the suggestion! Agreed—using sigs.k8s.io/controller-runtime’s typed Client is the better approach since it leverages the shared cache and provides strong typing. While the dynamic client works without scheme registration by using unstructured.Unstructured, it lacks type safety, caching benefits, and requires more manual handling. I’ll refactor this to use the controller-runtime client and ensure the necessary schemes are registered.
Additionally, for ownership checks where we only need metadata (like ownerReferences), I’ll consider using PartialObjectMetadata with the typed client. This will reduce overhead compared to fetching full objects while still benefiting from caching and typed interactions.
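Roughly, the metadata-only lookup would look something like this (a sketch only: the function name is a placeholder, and the KubeVirt VirtualMachine GVK is just an illustrative target):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// vmOwnerRefs (hypothetical) fetches only the metadata of a VirtualMachine
// and returns its ownerReferences, avoiding a full-object Get.
func vmOwnerRefs(ctx context.Context, c client.Client, ns, name string) ([]metav1.OwnerReference, error) {
	// PartialObjectMetadata carries only TypeMeta + ObjectMeta, so the cached
	// client backs it with a metadata-only informer instead of full objects.
	vmMeta := &metav1.PartialObjectMetadata{}
	vmMeta.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "kubevirt.io",
		Version: "v1",
		Kind:    "VirtualMachine",
	})

	if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: name}, vmMeta); err != nil {
		return nil, err
	}

	return vmMeta.GetOwnerReferences(), nil
}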

pruthvitd requested a review from asn1809 on November 28, 2025 10:50
"go.uber.org/zap/zapcore"
apiextensions "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
"k8s.io/apimachinery/pkg/runtime"
utilruntime "k8s.io/apimachinery/pkg/util/runtime"
Member

This file effectively has no changes. Better to do git restore and continue.

pruthvitd force-pushed the verify-vm-ownership branch 3 times, most recently from 3aeb241 to 36a82ac on December 1, 2025 15:39
pruthvitd force-pushed the verify-vm-ownership branch from 36a82ac to 2d314d2 on December 2, 2025 12:56
Comment on lines 2900 to 2957
func (v *VRGInstance) ShouldCleanupVMForSecondary() bool {
	if !v.IsDRActionInProgress() {
		v.log.Info("Skip VM cleanup; reconcile as secondary")

		return false
	}

	v.log.Info(
		"DR action progressing",
		"component", "VRGController",
		"action", "reconcile",
		"vrgName", v.instance.GetName(),
		"namespace", v.instance.GetNamespace(),
		"desiredState", v.instance.Spec.ReplicationState,
		"currentState", v.instance.Status.State,
	)

	vmNamespaceList := v.instance.Spec.KubeObjectProtection.RecipeParameters[recipecore.ProtectedVMNamespace]
	vmList := v.instance.Spec.KubeObjectProtection.RecipeParameters[recipecore.VMList]

	if len(vmList) > 0 {
		if rmnutil.IsVMDeletionInProgress(v.ctx, v.reconciler.Client, vmList, vmNamespaceList, v.log) {
			v.log.Info("VM deletion is in progress, skipping ownerreferences check")

			return true
		}

		foundVMs, err := rmnutil.ListVMsByVMNamespace(v.ctx, v.reconciler.APIReader,
			v.log, vmNamespaceList, vmList)
		if len(foundVMs) == 0 || err != nil {
			v.log.Info(
				"No VirtualMachines found for cleanup; deletion appears complete",
				"vmList", vmList,
				"namespaceList", vmNamespaceList,
			)

			return false
		}
	}

	if v.IsAllProtectedPVCsOwnedByProtectedVMs() {
		v.log.Info("all protected PVCs have ownerreferences to protected list of VMs")
		// Cleanup VM resources
		err := rmnutil.DeleteVMs(v.ctx, v.reconciler.Client, vmList, vmNamespaceList, v.log)
		if err != nil {
			v.log.Error(err, "Failed to delete VMs",
				"vmList", vmList,
			)

			return false
		}

		return true
	}

	v.log.Info("not all protected PVCs have ownerReferences to the protected list of VMs")

	return false
}
Member

This could be broken down into smaller functions: one to check whether deletion is in progress, one to list VMs by namespace, and one to check all PVCs' ownerRefs. That would make the early exits obvious.
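For example, something along these lines (the helper names are hypothetical, not part of the PR; each wraps one of the existing steps):

func (v *VRGInstance) ShouldCleanupVMForSecondary() bool {
	if !v.IsDRActionInProgress() {
		return false
	}

	vmNamespaceList := v.instance.Spec.KubeObjectProtection.RecipeParameters[recipecore.ProtectedVMNamespace]
	vmList := v.instance.Spec.KubeObjectProtection.RecipeParameters[recipecore.VMList]

	if len(vmList) > 0 {
		if v.vmDeletionInProgress(vmList, vmNamespaceList) { // hypothetical: wraps IsVMDeletionInProgress
			return true
		}

		if !v.protectedVMsFound(vmList, vmNamespaceList) { // hypothetical: wraps ListVMsByVMNamespace
			return false
		}
	}

	// hypothetical: wraps the PVC ownership check plus DeleteVMs
	return v.deleteVMsIfAllPVCsOwned(vmList, vmNamespaceList)
}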

Comment on lines 179 to 181
log.Info("Failed to fetch owner", "error", err)

continue // skip if not found
Member

Silently skipping avoids surfacing issues, which I'm not sure is best practice.

Suggested change
log.Info("Failed to fetch owner", "error", err)
continue // skip if not found
if k8serrors.IsNotFound(err) {
	log.V(1).Info("Owner not found during traversal", "owner", key)
	continue
}
log.Error(err, "Failed fetching owner during VM ownership resolution", "owner", key)
return "", fmt.Errorf("failed to fetch owner %s: %w", key, err)

Comment on lines 165 to 198
// IsOwnedByVM recursively traverses ownerReferences until it finds a VirtualMachine.
func IsOwnedByVM(ctx context.Context, c client.Client, obj client.Object,
	owners []metav1.OwnerReference, log logr.Logger,
) (string, error) {
	for _, owner := range owners {
		if owner.Kind == KindVirtualMachine && owner.APIVersion == KubeVirtAPIVersion {
			return owner.Name, nil // Found VM root
		}

		// Fetch only metadata of the owner
		ownerMeta := &metav1.PartialObjectMetadata{}
		ownerMeta.SetGroupVersionKind(schema.FromAPIVersionAndKind(owner.APIVersion, owner.Kind))

		if err := c.Get(ctx, client.ObjectKey{Namespace: obj.GetNamespace(), Name: owner.Name}, ownerMeta); err != nil {
			log.Info("Failed to fetch owner", "error", err)

			continue // skip if not found
		}

		// Continue traversal with the owner
		// Recursively check its ownerReferences
		obj = ownerMeta

		nestedOwners := obj.GetOwnerReferences()
		if len(nestedOwners) > 0 {
			vmName, err := IsOwnedByVM(ctx, c, obj, nestedOwners, log)
			if err == nil {
				return vmName, nil
			}
		}
	}

	return "", fmt.Errorf("no VM owner found")
}
Member

Personally I would implement this as an iterative function. OwnerReferences in k8s are structured as flat lists that are singly linked, so recursion doesn’t add much here and can introduce avoidable issues. K8s controllers typically traverse owner chains iteratively. It avoids stack depth issues and makes cycle detection straightforward.
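For example (a sketch only: FindVMOwner is a placeholder name; KindVirtualMachine and KubeVirtAPIVersion are the constants already used in this PR):

import (
	"context"
	"fmt"

	"github.com/go-logr/logr"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// FindVMOwner walks the owner chain iteratively with a work queue, using a
// visited set keyed by UID to guard against ownerReference cycles.
func FindVMOwner(ctx context.Context, c client.Client, obj client.Object, log logr.Logger) (string, error) {
	ns := obj.GetNamespace()
	visited := map[types.UID]bool{}
	queue := obj.GetOwnerReferences()

	for len(queue) > 0 {
		owner := queue[0]
		queue = queue[1:]

		if owner.Kind == KindVirtualMachine && owner.APIVersion == KubeVirtAPIVersion {
			return owner.Name, nil // found the VM root
		}

		if visited[owner.UID] {
			continue // cycle guard
		}

		visited[owner.UID] = true

		// Fetch only the owner's metadata; owners always live in the same namespace.
		ownerMeta := &metav1.PartialObjectMetadata{}
		ownerMeta.SetGroupVersionKind(schema.FromAPIVersionAndKind(owner.APIVersion, owner.Kind))

		if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: owner.Name}, ownerMeta); err != nil {
			if k8serrors.IsNotFound(err) {
				log.V(1).Info("Owner not found during traversal", "owner", owner.Name)

				continue // owner already gone; try remaining references
			}

			return "", fmt.Errorf("failed to fetch owner %s/%s: %w", ns, owner.Name, err)
		}

		queue = append(queue, ownerMeta.GetOwnerReferences()...)
	}

	return "", fmt.Errorf("no VM owner found")
}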

Comment on lines 400 to 401
// +kubebuilder:rbac:groups="kubevirt.io",resources=virtualmachines,verbs=get;list;watch;delete
// +kubebuilder:rbac:groups="cdi.kubevirt.io",resources=datavolumes,verbs=get;list;watch
Member

Best to keep these to the bare minimum. I think the list verb for DataVolumes isn't required, as I don't see you listing those via the k8s API.

Member Author

list and watch are required because the controller-runtime client uses an informer-backed cache, even for Get calls on PartialObjectMetadata. Without list, the cache cannot initialize and the client fails. For the calls used here, these RBAC rules are required.
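If trimming RBAC further is desirable, one option (assuming controller-runtime v0.15 or newer) is to disable caching for specific types so Gets go straight to the API server and only the get verb is needed; whether that fits the metadata-only Gets used here would need checking, and it trades away cached reads for those types. A minimal sketch:

import (
	cdiv1beta1 "kubevirt.io/containerized-data-importer-api/pkg/apis/core/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newManager (hypothetical) wires a manager whose client bypasses the cache
// for CDI DataVolumes: Gets on that type hit the API server directly, so no
// list/watch RBAC or informer is needed for it.
func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&cdiv1beta1.DataVolume{}},
			},
		},
	})
}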

pruthvitd force-pushed the verify-vm-ownership branch from 56d611e to 100ac46 on December 5, 2025 16:24