Fix the issue where rancher-machine does not clean up vSphere VMs that are in failed status #254

jiaqiluo · 2024-07-22T20:12:49Z

Issue

rancher/rancher#44212
Possibly: rancher/rancher#46340

Problem

In general, rancher-machine takes at least three steps to provision a vSphere machine: first, create the VM instance; then customize the VM's configuration; and finally bootstrap the VM.
Previously, rancher-machine saves the VM's ID only after the bootstrapping step succeeds. If rancher-machine creates but fails to start the VM, it will not destroy the VM because the VM's uuid is missing.

Solution

Now, rancher-machine retrieves and saves the uuid as machineID as soon as the VM is created, and saves the updated Host object to the store even if the creation process fails. This ensures the machineID is always preserved so that rancher-machine can find and destroy the VM on the vSphere server side properly.

Validation

Steps:

build rancher-machine for Linux amd64 from this branch
run Rancher v2.8.5 in docker-install mode
replace the rancher-machine binary in the container by using the docker scp command
create vSphere node template with four different creation methods: Deploy from template: Data Center, Deploy from template: Content Library, Clone an existing virtual machine, and Install from boot2docker ISO(Legacy)
on the node template set the CPUs to 36 to ensure the VM can NOT be assigned to any vSpherer server
provision RKE1 vSphere node-driver clusters with those 4 node templates
provision RKE2 vSphere node-driver clusters with those 4 node templates

Results:

During the cluster provisioning process, Rancher keeps creating new nodes and removing old ones from the DS clusters, meanwhile creating new VMs and removing old ones on the vSphere server.

We also see the following messages in Rancher logs:

2024/07/23 15:33:37 [INFO] [node-controller] Removing node jiaqi-a5
2024/07/23 15:33:37 [INFO] [node-controller] About to remove jiaqi-a5
2024/07/23 15:33:38 [INFO] [node-controller] (jiaqi-a5) Using datacenter /Datacenter
2024/07/23 15:33:39 [INFO] [node-controller] (jiaqi-a5) Using network /Datacenter/network/VM Network
2024/07/23 15:33:44 [INFO] [node-controller] (jiaqi-a5) Using hostsystem /Datacenter/host/Cluster/xx.xx.xx.xx
2024/07/23 15:33:44 [INFO] [node-controller] (jiaqi-a5) Using ResourcePool /Datacenter/host/Cluster/Resources/jiaqi-resource-pool
2024/07/23 15:33:48 [INFO] [node-controller] (jiaqi-a5) Removing user-data.iso
2024/07/23 15:33:49 [INFO] [node-controller] (jiaqi-a5) Destroying VM jiaqi-a5 (ID: 42183429-a34e-3fce-7857-a912af50c0af)
2024/07/23 15:33:51 [INFO] [node-controller] Successfully removed jiaqi-a5
2024/07/23 15:33:51 [INFO] [node-controller] (jiaqi-a5) Closing plugin on server side
2024/07/23 15:33:51 [INFO] [node-controller] (jiaqi-a5) Closing plugin on server side
2024/07/23 15:33:51 [INFO] [node-controller] Removing node jiaqi-a5 done
2024/07/23 15:33:51 [DEBUG] Cleaning up [management-state/node/nodes/jiaqi-a5]

… the provision fails. In general, rancher-machine takes at least three steps to provision a vSphere machine: first, create the VM instance; then customize the VM's configuration; and finally bootstrap the VM. Previously, rancher-machine saves the VM's ID only after the bootstrapping step succeeds. If rancher-machine creates but fails to start the VM, it will not destroy the VM because the VM's uuid is missing. Now, rancher-machine retrieves and saves the uuid as machineID as soon as the VM is created, and saves the updated Host object to the store even if the creation process fails. This ensures the machineID is always preserved so that rancher-machine can find and destroy the VM on the vSphere server side properly.

Oats87 · 2024-07-25T20:08:16Z

libmachine/libmachine.go

@@ -136,6 +136,11 @@ func (api *Client) Create(h *host.Host) error {
 	log.Info("Creating machine...")

 	if err := api.performCreate(h); err != nil {
+		// it is possible that the VM is instantiated but fails to bootstrap,
+		// save the machine to the store, so the VM and associated resources can be found and destroyed later
+		if err := api.Save(h); err != nil {


I think we can be a little more nuanced with this. We should try to analyze the error we've received downstream (i.e. in the driver) to determine whether we can safely save the data we have retrieved.

Notably, I believe this will cause a subsequent rm to accidentally tear down a machine that shouldn't be torn down, if it matches the machine name specified on the create command and the local rancher-machine state has no record of it.

If we can determine what type of error we've received from govmomi, then we can more safely determine whether we should retrieve/safely store the machine ID.

I realize that it is not a good idea to save the machine when api.performCreate(h) fails because the driver being invoked by api.performCreate(h) could be any supported driver depending on the --driver flag being passed to the rancher-machine rm command. Such change will affect all drives by theory.

Each driver has different requirements for creating and removing a VM, I don't see a universal pattern for the errors or error messages returned when the creation fails.

Even more interestingly, the Amazonec2 driver removes the failed VM and associated resources if the creation fails (code).

That being said, I think instead of the above change, I should keep the changes inside the vmwarevsphere driver.

I see 2 options to implement it:

option 1: change the creation function in the vmwarevsphere driver such that it removes the VM and associated resources automatically if the creation fails.
I tend to treat this as a (breaking) behavior change rather than a bug fixing, so we have option 2.

option 2: add a new driver-level flag --vmwarevsphere-cleanup-on-creation-failure which is false by default, if the flag is passed with the value true, then rancher-machine will remove the VM and associated resource if creation fails. On Rancher side, we need to expose the flag on the Vsphere node template. For reference, this is similar to how we add the support for graceful shutdown: rancher/rancher#42308.

hi @Oats87 @snasovich, I'd like to hear your thoughts about the above ^^

No new driver flags please, they are a headache to trickle through the various release lines.

I don't see the issue with adding logic to performCreate to check the type of error and save if it's a specific type of error. For all of the other drivers, it won't save. You can see how we do this in the planner here: https://github.com/rancher/rancher/blob/release/v2.9/pkg/capr/planner/errors.go

hi @Oats87,

Notably, I believe this will cause a subsequent rm to accidentally tear down a machine that shouldn't be torn down, if it matches the machine name specified on the create command and the local rancher-machine state has no record of it.

Further testing shows that the above scenario will not happen because rancher-machine needs the machineID to destroy the VM on the vSphere side, and it doesn't have the machineID if the VM is not created by it or if the creation fails due to name collision.

In the case of name collision, the rancher-machine create -d vmwarevsphere ... xxx command will return the following error:

"Error creating machine: Error in driver during machine creation: The name 'xxx' already exists."

And more importantly, no machineID will be returned from govmomi.

Now if we run rancher-machine rm -y xxx, it will not attempt to remove the VM from vSphere because it doesn't have the machineID, we will see the following message that proves that:

"(xxx) MachineID is empty, skipping removing VM"

rancher-machine will get and save the machineID only if it creates the VM successfully, no matter if it can start the VM at a later step or not.

That being said, I think we don't need to detect the error types.

Besides that, the PR is updated such that it attempts to save the machine only if the driver is vmwarevsphere, this way all other drives will not be affected.

I hope I didn't misunderstand anything. Please let me what you think.

For the record, @Oats87 suggested another more general approach: define a new error type, which is errResourcesCreated in this case, that represents the failure where external resources have been created, return this error at the appropriate places in the vSphere driver, and then check the error types instead of the driver name to determine if we should save the machine config when the creation fails.

I attempted to implement the approach but it turned out not to work. The function api.performCreate(h) returns an error whose type is rpc.ServerError which is a string that implements the error interface. it indicates that the rancher-machine does not invoke the underlying driver directly, instead, it calls an RPC server and the server handles the requests to the underlying driver. When any error is returned from the driver, the RPC server creates and returns an error of type rpc.ServerError and wipes out all other information, including the error type, about the original error. As a result, we could not use the errors.As(err, &errResourcesCreated) method to check if the error is of the type errResourcesCreated.

That being said, we decided to stay with the current approach.

janeczku · 2024-08-15T18:10:41Z

@jiaqiluo Could you please check if this is expected to fix the following issue: rancher/rancher#46210?

jiaqiluo · 2024-08-15T18:50:49Z

@jiaqiluo Could you please check if this is expected to fix the following issue: rancher/rancher#46210?

@janeczku, yes, this PR should fix the linked issue.

jiaqiluo force-pushed the fix-vsphere branch from 28abc54 to d82f559 Compare July 23, 2024 22:26

jiaqiluo marked this pull request as ready for review July 23, 2024 22:52

jiaqiluo requested a review from a team July 23, 2024 22:52

Oats87 requested changes Jul 25, 2024

View reviewed changes

address comments

05caa37

jiaqiluo requested a review from Oats87 July 26, 2024 22:46

jakefhyde self-requested a review July 30, 2024 16:24

Oats87 approved these changes Jul 30, 2024

View reviewed changes

jiaqiluo requested a review from a team July 30, 2024 23:53

jakefhyde approved these changes Jul 31, 2024

View reviewed changes

jiaqiluo merged commit 42a1241 into rancher:master Jul 31, 2024
1 check passed

jiaqiluo deleted the fix-vsphere branch July 31, 2024 23:31

janeczku mentioned this pull request Aug 15, 2024

[BUG] Misconfigured Node provisioned on VMWare fails to delete failed node rancher/rancher#46210

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix the issue where rancher-machine does not clean up vSphere VMs that are in failed status #254

Fix the issue where rancher-machine does not clean up vSphere VMs that are in failed status #254

jiaqiluo commented Jul 22, 2024 •

edited by snasovich

Loading

Oats87 Jul 25, 2024

jiaqiluo Jul 26, 2024 •

edited

Loading

jiaqiluo Jul 26, 2024

Oats87 Jul 26, 2024

jiaqiluo Jul 26, 2024 •

edited

Loading

jiaqiluo Jul 30, 2024

janeczku commented Aug 15, 2024

jiaqiluo commented Aug 15, 2024

Fix the issue where rancher-machine does not clean up vSphere VMs that are in failed status #254

Fix the issue where rancher-machine does not clean up vSphere VMs that are in failed status #254

Conversation

jiaqiluo commented Jul 22, 2024 • edited by snasovich Loading

Oats87 Jul 25, 2024

Choose a reason for hiding this comment

jiaqiluo Jul 26, 2024 • edited Loading

Choose a reason for hiding this comment

jiaqiluo Jul 26, 2024

Choose a reason for hiding this comment

Oats87 Jul 26, 2024

Choose a reason for hiding this comment

jiaqiluo Jul 26, 2024 • edited Loading

Choose a reason for hiding this comment

jiaqiluo Jul 30, 2024

Choose a reason for hiding this comment

janeczku commented Aug 15, 2024

jiaqiluo commented Aug 15, 2024

jiaqiluo commented Jul 22, 2024 •

edited by snasovich

Loading

jiaqiluo Jul 26, 2024 •

edited

Loading

jiaqiluo Jul 26, 2024 •

edited

Loading