-
-
Notifications
You must be signed in to change notification settings - Fork 225
fix(vm): detach and delete network device when its config block is removed #2260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
hsnprsd
wants to merge
4
commits into
bpg:main
Choose a base branch
from
hsnprsd:remove-network-device
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a failing test for cloned VM scenario @TobiPeterG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please elaborate why the behaviour is wrong. If I understand it correctly, you create a VM with one network interface and then clone it to a VM without one. I would expect opentofu to remove the network interface from the VM when it's not in the config for the new VM. If you wanted the new VM to have the interface, you would add it in the config. What happens in this case?
More interesting is the question what happens when you clone a VM with multiple network interfaces and configure the cloned VM to also have these interfaces. Are they correctly mapped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether we want to keep the inherited network devices in the cloned VM or we expect the user to specify them. That is not my decision to make as I don't know what the intent was. If network devices should be specified whether or not they are cloned from a source VM, this PR solves the issue of removing extra network devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that we want to define the state of the infrastructure through opentofu, everything that is in the opentofu config should be there in reality and everything that is there in reality should be configured through opentofu. As a user, I would be confused when my opentofu config doesn't mention a network interface, but it actually exists on my infrastructure. But I guess it's bpg's call to make.
I would also wonder: How would you remove a network interface from a VM you clone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but this PR introduces a breaking change — anyone updating the provider would suddenly see Terraform trying to remove all network devices from cloned VMs that don’t have network devices specified.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent was that if a clone has nothing except a reference to the template, it should create an exact copy with all devices inherited from the template. There are a few tests that capture this behavior, and it’s also somewhat documented here: https://registry.terraform.io/providers/bpg/proxmox/latest/docs/guides/clone-vm
And I don't think this behaviour should be changed.
So we can’t just delete a device if the clone doesn’t have it. Unfortunately, there’s no good workaround to capture this behaviour in the legacy VM resource, since devices are represented as a list, not a map as they should be.
In an ideal world (FWK provider, v1.+), we’d be able to define something like:
in the template, and then:
in the clone to explicitly remove it.
I understand the desire to have a “universal” template that includes everything, and then trim the excess when cloning to a specific VM. However, complex three-way merging of configurations (template + clone + defaults) when adding, changing, or deleting things is a non-trivial and error-prone task.
My go-to approach is to use cloud-init to configure the base OS image the way I need, skipping cloning altogether. I know there are tons of cases where this may not work, but I just wanted to provide some rationale for why cloning is, at best, a bit dicey.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do this too. As a result, I'm not really qualified to jump in on this discussion since I don't use cloning, nor have I looked into how the Proxmox API exposes cloning, but I've run into similar minor issues when importing resources from public cloud providers....
Occasionally when importing a public cloud resource, I find that running a plan immediately after results in a proposed "change". Most of the time I don't care and just run apply to commit that setting into state, but other times I find that I don't actually want that setting and I need to edit my Terraform config to explicitly add that setting so I can change it to something else. Since the import succeeded with the config I had written, I've been assuming these minor anomalies are just due to added return values from the API call, or perhaps a mismatch in default values stored in the provider vs. the API vs. the cloud provider's web UI.
To me, Proxmox cloning feels like a server-side import of a VM resource. I wonder if it's worth considering cloning in this way? Ideally, the
cloneblock would just signal to the provider to call the appropriate clone API to initially create the cloned vm, but the end result is a vm resource with all the required config a vm resource requires. Should some config be missing, modified, or added, those differences can either be applied at time of cloning, or on the next apply.I don't use cloning so I'm not really sure how the functionality works, but I suspect this matches how Proxmox works. But from a Terraform provider prospective, I think this inheritance is problematic. It seems tricky for both the provider and humans to reason about inherited config vs. explicit config and how to make changes to inherited config in the future. I've seen this issue come up occasionally in bug reports such as #1988. I think it's worth considering failing a cloned vm resource attempt if the vm resource block doesn't exactly match the source template (or with whatever minor changes that can be easily applied like a change in memory, or dropping a network interface).
Obviously this has massive backwards compatibility problems (and there are certainly nuances I'm unaware of), but I thought I'd throw out my idea that cloning could be considered a server-side resource "import". It certainly adds some friction when creating a cloned resource, but the end result would be a proper explicitly-defined vm resource that behaves like any other non-cloned vm resource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate why this shouldn't change? Why wouldn't you just create a opentofu config matching the setup of the cloned VM to have an exact copy?
Aren't explicit removings defined by not specifying an attribute? That's how it work with other providers I use as well. When it's not defined to be there, it's removed.