Scale-down issues with the talos_machine_configuration_apply resource #216

TomyLobo opened this issue Dec 9, 2024 · 2 comments

@TomyLobo

TomyLobo commented Dec 9, 2024

I'm trying to write a manifest to set up and scale a Talos cluster.
Set-up and scale-up work like a charm.
But I had trouble getting both clean scale-down and terraform destroy to work with the same manifest.
The problem is getting the talos_machine_configuration_apply resource to reset the nodes.

Here are the variants I tried:

  1. reset = false
  • nodes are not removed from the Talos cluster
  • nodes are not removed from the etcd cluster
  • nodes are not removed from the Kubernetes cluster
  • terraform destroy is possible
  • scale-down is always safe, since nodes aren't actually removed from the cluster
    • though there may be etcd quorum issues at some point; I haven't tested that
  2. reset = true, graceful = true
  • nodes are removed from the Talos cluster
  • nodes are removed from the etcd cluster
  • nodes are not removed from the Kubernetes cluster
  • terraform destroy is impossible, because Talos refuses to gracefully reset the last node
  • scale-down is always safe, as long as talos_machine_configuration_apply depends on the corresponding VM resource
  3. reset = true, graceful = false
  • nodes are removed from the Talos cluster
  • nodes are removed from the etcd cluster
  • nodes are not removed from the Kubernetes cluster
  • terraform destroy is possible
  • if all of the remaining nodes happen to be unavailable at that moment, scaling down can lead to loss of the cluster state

As you can see, each of those three options has issues.
And the Kubernetes Node objects aren't deleted by any of them.
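For reference, here's a trimmed-down sketch of the resource I'm varying. The surrounding resource/data names and the exact shape of the reset/graceful settings (shown here as an on_destroy block) are from my setup and may differ with provider version:

```hcl
resource "talos_machine_configuration_apply" "controlplane" {
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.controlplane.machine_configuration
  node                        = "10.0.0.10"

  # the knobs toggled in the three variants above
  on_destroy = {
    graceful = true
    reset    = true
    reboot   = false
  }
}
```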

My workaround currently is as follows (rough sketch after the list):

  • reset = false
  • additional terraform_data resource
    • containing a local-exec provisioner that runs:
      • talosctl reset with --graceful flag set to false on node #0 and to true on all other nodes
      • kubectl delete node
    • depends on:
      • a local_sensitive_file resource for the talosconfig, created from the talos_client_configuration data source
      • a local_sensitive_file resource for the kubeconfig, created from the talos_cluster_kubeconfig resource
      • the corresponding VM resource
      • NOT the talos_machine_configuration_apply resource
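
A rough sketch of that wiring (var.node_count, var.node_ips, var.node_names and the proxmox_virtual_environment_vm.node resource are stand-ins from my setup, and some attribute names may differ with provider version, so treat this as an outline rather than drop-in code):

```hcl
resource "local_sensitive_file" "talosconfig" {
  content  = data.talos_client_configuration.this.talos_config
  filename = "${path.module}/talosconfig"
}

resource "local_sensitive_file" "kubeconfig" {
  content  = talos_cluster_kubeconfig.this.kubeconfig_raw
  filename = "${path.module}/kubeconfig"
}

resource "terraform_data" "node_reset" {
  count = var.node_count

  # Destroy-time provisioners may only reference self (and count.index),
  # so everything the commands need is captured in input at creation time.
  input = {
    talosconfig = local_sensitive_file.talosconfig.filename
    kubeconfig  = local_sensitive_file.kubeconfig.filename
    node_ip     = var.node_ips[count.index]
    node_name   = var.node_names[count.index]
    # node #0 is reset non-gracefully so that destroying the last node works
    graceful    = count.index != 0
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      talosctl --talosconfig '${self.input.talosconfig}' --nodes ${self.input.node_ip} reset --graceful=${self.input.graceful}
      kubectl --kubeconfig '${self.input.kubeconfig}' delete node ${self.input.node_name}
    EOT
  }

  depends_on = [
    local_sensitive_file.talosconfig,
    local_sensitive_file.kubeconfig,
    proxmox_virtual_environment_vm.node,
    # deliberately NOT talos_machine_configuration_apply, to avoid a dependency cycle
  ]
}
```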

Results:

  • nodes are removed from the Talos cluster
  • nodes are removed from the etcd cluster
  • nodes are removed from the Kubernetes cluster
  • terraform destroy is possible, but can still fail if there is more than just node #0 left
  • scale-down can't lead to data loss (I think)

The dependency graph that makes all of this possible (while avoiding dependency cycles) is a bit wonky and has a corner case:
Recreating the terraform_data resource won't also recreate the talos_machine_configuration_apply resource.
This means the machine isn't reinstalled even though terraform reports completion.

So this is really just a provisional workaround.
Ideally, the provider would do both the talosctl reset and the kubectl delete node steps itself, and also offer a way to reset the last node with --graceful=false.
That would eliminate the need for the terraform_data resource and its incomplete dependency graph.

Btw, I can create a test case if needed, but I could use some advice on how to build a self-contained one first: my manifest depends on Proxmox and I don't know which local provider would be a good stand-in for it :)

@smira
Member

smira commented Dec 10, 2024

Just to answer one of your questions: Kubernetes nodes are never removed automatically, as Talos itself can't even do that (e.g. for worker nodes), and a single node can't decide on Node resource removal on its own (the Node might be in use by another machine due to a duplicate hostname). So removing the Kubernetes Node should be the job of another, higher-level component (e.g. the Terraform provider for Kubernetes).

reset = true & graceful = true are the correct options, but you need to distinguish in your scripts between scale-down and cluster removal. If you remove a cluster, in fact you can just destroy your VMs completely.

@TomyLobo
Author

The only way I know to distinguish between destroy and apply is to always pass a -var argument on destroy, along the lines of the sketch below.
But I'm also brand-new to Terraform, so there may in fact be a way to do it properly.
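
Something like this is what I mean; the variable name and wiring are purely illustrative:

```hcl
# Illustrative only: a flag I would pass explicitly on teardown, e.g.
#   terraform destroy -var="cluster_teardown=true"
variable "cluster_teardown" {
  type    = bool
  default = false
}

locals {
  # scale-down keeps graceful resets; full teardown skips them
  graceful_reset = !var.cluster_teardown
}
```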
