-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the issue where rancher-machine does not clean up vSphere VMs that are in failed status #254
Conversation
… the provision fails. In general, rancher-machine takes at least three steps to provision a vSphere machine: first, create the VM instance; then customize the VM's configuration; and finally bootstrap the VM. Previously, rancher-machine saves the VM's ID only after the bootstrapping step succeeds. If rancher-machine creates but fails to start the VM, it will not destroy the VM because the VM's uuid is missing. Now, rancher-machine retrieves and saves the uuid as machineID as soon as the VM is created, and saves the updated Host object to the store even if the creation process fails. This ensures the machineID is always preserved so that rancher-machine can find and destroy the VM on the vSphere server side properly.
libmachine/libmachine.go
Outdated
@@ -136,6 +136,11 @@ func (api *Client) Create(h *host.Host) error { | |||
log.Info("Creating machine...") | |||
|
|||
if err := api.performCreate(h); err != nil { | |||
// it is possible that the VM is instantiated but fails to bootstrap, | |||
// save the machine to the store, so the VM and associated resources can be found and destroyed later | |||
if err := api.Save(h); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can be a little more nuanced with this. We should try to analyze the error we've received downstream (i.e. in the driver) to determine whether we can safely save the data we have retrieved.
Notably, I believe this will cause a subsequent rm
to accidentally tear down a machine that shouldn't be torn down, if it matches the machine name specified on the create
command and the local rancher-machine
state has no record of it.
If we can determine what type of error we've received from govmomi
, then we can more safely determine whether we should retrieve/safely store the machine ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that it is not a good idea to save the machine when api.performCreate(h)
fails because the driver being invoked by api.performCreate(h)
could be any supported driver depending on the --driver
flag being passed to the rancher-machine rm
command. Such change will affect all drives by theory.
Each driver has different requirements for creating and removing a VM, I don't see a universal pattern for the errors or error messages returned when the creation fails.
Even more interestingly, the Amazonec2
driver removes the failed VM and associated resources if the creation fails (code).
That being said, I think instead of the above change, I should keep the changes inside the vmwarevsphere driver.
I see 2 options to implement it:
option 1: change the creation function in the vmwarevsphere driver such that it removes the VM and associated resources automatically if the creation fails.
I tend to treat this as a (breaking) behavior change rather than a bug fixing, so we have option 2.
option 2: add a new driver-level flag --vmwarevsphere-cleanup-on-creation-failure
which is false by default, if the flag is passed with the value true, then rancher-machine will remove the VM and associated resource if creation fails. On Rancher side, we need to expose the flag on the Vsphere node template. For reference, this is similar to how we add the support for graceful shutdown: rancher/rancher#42308.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @Oats87 @snasovich, I'd like to hear your thoughts about the above ^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No new driver flags please, they are a headache to trickle through the various release lines.
I don't see the issue with adding logic to performCreate
to check the type of error and save if it's a specific type of error. For all of the other drivers, it won't save. You can see how we do this in the planner here: https://github.com/rancher/rancher/blob/release/v2.9/pkg/capr/planner/errors.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @Oats87,
Notably, I believe this will cause a subsequent rm to accidentally tear down a machine that shouldn't be torn down, if it matches the machine name specified on the create command and the local rancher-machine state has no record of it.
Further testing shows that the above scenario will not happen because rancher-machine needs the machineID
to destroy the VM on the vSphere side, and it doesn't have the machineID
if the VM is not created by it or if the creation fails due to name collision.
In the case of name collision, the rancher-machine create -d vmwarevsphere ... xxx
command will return the following error:
"Error creating machine: Error in driver during machine creation: The name 'xxx' already exists."
And more importantly, no machineID
will be returned from govmomi
.
Now if we run rancher-machine rm -y xxx
, it will not attempt to remove the VM from vSphere because it doesn't have the machineID, we will see the following message that proves that:
"(xxx) MachineID is empty, skipping removing VM"
rancher-machine will get and save the machineID
only if it creates the VM successfully, no matter if it can start the VM at a later step or not.
That being said, I think we don't need to detect the error types.
Besides that, the PR is updated such that it attempts to save the machine only if the driver is vmwarevsphere, this way all other drives will not be affected.
I hope I didn't misunderstand anything. Please let me what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, @Oats87 suggested another more general approach: define a new error type, which is errResourcesCreated
in this case, that represents the failure where external resources have been created, return this error at the appropriate places in the vSphere driver, and then check the error types instead of the driver name to determine if we should save the machine config when the creation fails.
I attempted to implement the approach but it turned out not to work. The function api.performCreate(h)
returns an error whose type is rpc.ServerError which is a string that implements the error interface. it indicates that the rancher-machine does not invoke the underlying driver directly, instead, it calls an RPC server and the server handles the requests to the underlying driver. When any error is returned from the driver, the RPC server creates and returns an error of type rpc.ServerError
and wipes out all other information, including the error type, about the original error. As a result, we could not use the errors.As(err, &errResourcesCreated)
method to check if the error is of the type errResourcesCreated.
That being said, we decided to stay with the current approach.
@jiaqiluo Could you please check if this is expected to fix the following issue: rancher/rancher#46210? |
@janeczku, yes, this PR should fix the linked issue. |
Issue
rancher/rancher#44212
Possibly: rancher/rancher#46340
Problem
In general, rancher-machine takes at least three steps to provision a vSphere machine: first, create the VM instance; then customize the VM's configuration; and finally bootstrap the VM.
Previously, rancher-machine saves the VM's ID only after the bootstrapping step succeeds. If rancher-machine creates but fails to start the VM, it will not destroy the VM because the VM's uuid is missing.
Solution
Now, rancher-machine retrieves and saves the uuid as machineID as soon as the VM is created, and saves the updated Host object to the store even if the creation process fails. This ensures the machineID is always preserved so that rancher-machine can find and destroy the VM on the vSphere server side properly.
Validation
Steps:
docker scp
commandDeploy from template: Data Center
,Deploy from template: Content Library
,Clone an existing virtual machine
, andInstall from boot2docker ISO(Legacy)
Results:
During the cluster provisioning process, Rancher keeps creating new nodes and removing old ones from the DS clusters, meanwhile creating new VMs and removing old ones on the vSphere server.
We also see the following messages in Rancher logs: