Scale-down issues with the talos_machine_configuration_apply resource #216

TomyLobo opened this issue Dec 9, 2024 · 2 comments

@TomyLobo

TomyLobo commented Dec 9, 2024

I'm trying to write a manifest to set up and scale a Talos cluster.
Set-up and scale-up work like a charm.
But I had trouble getting both clean scale-down and terraform destroy to work with the same manifest.
The problem is getting the talos_machine_configuration_apply resource to reset the nodes.

Here are the variants I tried:

  1. reset = false
  • nodes are not removed from the Talos cluster
  • nodes are not removed from the etcd cluster
  • nodes are not removed from the Kubernetes cluster
  • terraform destroy is possible
  • scale-down is always safe, since nodes aren't actually removed from the cluster
    • though there may be etcd quorum issues at some point; I haven't tested that
  2. reset = true, graceful = true
  • nodes are removed from the Talos cluster
  • nodes are removed from the etcd cluster
  • nodes are not removed from the Kubernetes cluster
  • terraform destroy is impossible, because Talos refuses to gracefully reset the last node
  • scale-down is always safe, as long as talos_machine_configuration_apply depends on the corresponding VM resource
  3. reset = true, graceful = false
  • nodes are removed from the Talos cluster
  • nodes are removed from the etcd cluster
  • nodes are not removed from the Kubernetes cluster
  • terraform destroy is possible
  • if all of the remaining nodes happen to be unavailable at that moment, scaling down can lead to loss of the cluster state

As you can see, each of those three options has issues.
And the Kubernetes Node objects aren't deleted by any of them.
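For reference, here's a trimmed-down sketch of the resource I'm varying. The surrounding resource/data names and the exact shape of the reset/graceful settings (shown here as an on_destroy block) are from my setup and may differ with provider version:

```hcl
resource "talos_machine_configuration_apply" "controlplane" {
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.controlplane.machine_configuration
  node                        = "10.0.0.10"

  # the knobs toggled in the three variants above
  on_destroy = {
    graceful = true
    reset    = true
    reboot   = false
  }
}
```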

My workaround currently is as follows (rough sketch after the list):

  • reset = false
  • additional terraform_data resource
    • containing a local-exec provisioner that runs:
      • talosctl reset with --graceful flag set to false on node #0 and to true on all other nodes
      • kubectl delete node
    • depends on:
      • a local_sensitive_file resource for the talosconfig, created from the talos_client_configuration data source
      • a local_sensitive_file resource for the kubeconfig, created from the talos_cluster_kubeconfig resource
      • the corresponding VM resource
      • NOT the talos_machine_configuration_apply resource
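
A rough sketch of that wiring (var.node_count, var.node_ips, var.node_names and the proxmox_virtual_environment_vm.node resource are stand-ins from my setup, and some attribute names may differ with provider version, so treat this as an outline rather than drop-in code):

```hcl
resource "local_sensitive_file" "talosconfig" {
  content  = data.talos_client_configuration.this.talos_config
  filename = "${path.module}/talosconfig"
}

resource "local_sensitive_file" "kubeconfig" {
  content  = talos_cluster_kubeconfig.this.kubeconfig_raw
  filename = "${path.module}/kubeconfig"
}

resource "terraform_data" "node_reset" {
  count = var.node_count

  # Destroy-time provisioners may only reference self (and count.index),
  # so everything the commands need is captured in input at creation time.
  input = {
    talosconfig = local_sensitive_file.talosconfig.filename
    kubeconfig  = local_sensitive_file.kubeconfig.filename
    node_ip     = var.node_ips[count.index]
    node_name   = var.node_names[count.index]
    # node #0 is reset non-gracefully so that destroying the last node works
    graceful    = count.index != 0
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      talosctl --talosconfig '${self.input.talosconfig}' --nodes ${self.input.node_ip} reset --graceful=${self.input.graceful}
      kubectl --kubeconfig '${self.input.kubeconfig}' delete node ${self.input.node_name}
    EOT
  }

  depends_on = [
    local_sensitive_file.talosconfig,
    local_sensitive_file.kubeconfig,
    proxmox_virtual_environment_vm.node,
    # deliberately NOT talos_machine_configuration_apply, to avoid a dependency cycle
  ]
}
```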

Results:

  • nodes are removed from the Talos cluster
  • nodes are removed from the etcd cluster
  • nodes are removed from the Kubernetes cluster
  • terraform destroy is possible, but can still fail if there is more than just node #0 left
  • scale-down can't lead to data loss (I think)

The dependency graph that makes all of this possible (while avoiding dependency cycles) is a bit wonky and has a corner case:
Recreating the terraform_data resource won't also recreate the talos_machine_configuration_apply resource.
This means the machine isn't reinstalled even though terraform reports completion.

So this is really just a provisional workaround.
Ideally, the provider would do both the talosctl reset and the kubectl delete node steps itself, and also offer a way to reset the last node with --graceful=false.
That would eliminate the need for the terraform_data resource and its incomplete dependency graph.

Btw, I can create a test case if needed, but I could use some advice on how to build a self-contained one first: my manifest depends on Proxmox and I don't know which local provider would be a good stand-in for it :)

@smira
Member

smira commented Dec 10, 2024

Just to answer one of your questions: Kubernetes nodes are never removed automatically, as Talos itself can't even do that (e.g. for worker nodes), and a single node can't decide on Node resource removal on its own (the Node might be in use by another machine due to a duplicate hostname). So removing the Kubernetes Node should be the job of another, higher-level component (e.g. the Terraform provider for Kubernetes).

reset = true & graceful = true are the correct options, but you need to distinguish in your scripts between scale-down and cluster removal. If you remove a cluster, in fact you can just destroy your VMs completely.

@TomyLobo
Author

The only way I know to distinguish between destroy and apply is to always pass a -var argument on destroy, along the lines of the sketch below.
But I'm also brand-new to Terraform, so there may in fact be a way to do it properly.
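
Something like this is what I mean; the variable name and wiring are purely illustrative:

```hcl
# Illustrative only: a flag I would pass explicitly on teardown, e.g.
#   terraform destroy -var="cluster_teardown=true"
variable "cluster_teardown" {
  type    = bool
  default = false
}

locals {
  # scale-down keeps graceful resets; full teardown skips them
  graceful_reset = !var.cluster_teardown
}
```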
