I'm trying to write a manifest to set up and scale a Talos cluster.
Set-up and scale-up work like a charm.
But I had issues getting both clean scale-down and `terraform destroy` to work with the same manifest.
The issue is getting the `talos_machine_configuration_apply` resource to reset the nodes.
Here are the variants I tried:
- `reset = false`
  - nodes are not removed from the Talos cluster
  - nodes are not removed from the etcd cluster
  - nodes are not removed from the Kubernetes cluster
  - `terraform destroy` is possible
  - scale-down is always safe, since nodes aren't actually removed from the cluster
    - though there may be issues with etcd quorum at some point; I haven't tried that
- `reset = true`, `graceful = true`
  - nodes are removed from the Talos cluster
  - nodes are removed from the etcd cluster
  - nodes are not removed from the Kubernetes cluster
  - `terraform destroy` is impossible, because Talos refuses to gracefully reset the last node
  - scale-down is always safe, as long as `talos_machine_configuration_apply` depends on the corresponding VM resource
- `reset = true`, `graceful = false`
  - nodes are removed from the Talos cluster
  - nodes are removed from the etcd cluster
  - nodes are not removed from the Kubernetes cluster
  - `terraform destroy` is possible
  - if all of the remaining nodes happen to be unhealthy at that moment, scaling down can lead to loss of the cluster state
As you can see, each of those three options has issues, and the Kubernetes Node objects aren't deleted with any of them.
A sketch of how the variants are wired up follows.
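This is roughly the shape of the `talos_machine_configuration_apply` resource I'm toggling (a sketch, not my exact manifest: the resource names, variables, and node addresses are placeholders, and I'm writing the reset/graceful switches as the resource's `on_destroy` attributes, so adjust to whatever your provider version actually exposes):

```hcl
resource "talos_machine_configuration_apply" "controlplane" {
  count = var.controlplane_count

  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.controlplane.machine_configuration
  node                        = var.controlplane_ips[count.index]

  # Variant 2 from the list above; variants 1 and 3 just flip these switches.
  on_destroy = {
    reset    = true
    graceful = true
    reboot   = false
  }

  # Needed so a scale-down resets the node before its VM is destroyed.
  depends_on = [proxmox_virtual_environment_vm.controlplane]
}
```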
My current workaround is as follows (sketched after this list):
- `reset = false`
- an additional `terraform_data` resource
  - containing a `local-exec` provisioner that runs:
    - `talosctl reset` with the `--graceful` flag set to `false` on node #0 and to `true` on all other nodes
    - `kubectl delete node`
  - depending on:
    - a `local_sensitive_file` resource for the talosconfig, created from the `talos_client_configuration` resource
    - a `local_sensitive_file` resource for the kubeconfig, created from the `talos_cluster_kubeconfig` resource
    - the corresponding VM resource
    - NOT the `talos_machine_configuration_apply` resource
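Roughly, that wiring looks like this (again a sketch rather than my exact manifest; the IPs, hostnames, and the Proxmox VM resource name are placeholders, and the talosconfig/kubeconfig attribute names may differ slightly between provider versions):

```hcl
# Write the configs to disk so the destroy-time provisioner can use them.
resource "local_sensitive_file" "talosconfig" {
  content  = data.talos_client_configuration.this.talos_config
  filename = "${path.module}/talosconfig"
}

resource "local_sensitive_file" "kubeconfig" {
  content  = talos_cluster_kubeconfig.this.kubeconfig_raw
  filename = "${path.module}/kubeconfig"
}

resource "terraform_data" "node_teardown" {
  count = var.node_count

  # Destroy-time provisioners can only reference self, so capture everything here.
  input = {
    node     = var.node_ips[count.index]
    hostname = "talos-${count.index}"
    graceful = count.index == 0 ? "false" : "true" # node #0 gets --graceful=false
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      talosctl --talosconfig '${path.module}/talosconfig' --nodes ${self.input.node} \
        reset --graceful=${self.input.graceful}
      kubectl --kubeconfig '${path.module}/kubeconfig' \
        delete node ${self.input.hostname} --ignore-not-found
    EOT
  }

  depends_on = [
    local_sensitive_file.talosconfig,
    local_sensitive_file.kubeconfig,
    proxmox_virtual_environment_vm.node,
    # deliberately NOT the corresponding talos_machine_configuration_apply (see the caveat below)
  ]
}
```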
Results:
- nodes are removed from the Talos cluster
- nodes are removed from the etcd cluster
- nodes are removed from the Kubernetes cluster
- `terraform destroy` is possible, but can still fail if more than just node #0 is left
- scale-down can't lead to data loss (I think)
The dependency graph that makes all of this possible (while avoiding dependency cycles) is a bit wonky and has a corner case:
Recreating the `terraform_data` resource won't also recreate the `talos_machine_configuration_apply` resource.
This means the machine isn't reinstalled even though Terraform reports completion.
So this is really just a provisional workaround.
Ideally, the provider would perform both the `talosctl reset` and `kubectl delete node` steps itself and also offer a way to reset the last node with `--graceful=false`.
That would eliminate the need for the extra `terraform_data` resource and the incomplete dependency graph.
By the way, I can create a test case if needed, but I could use some advice on how to build a self-sufficient test case first, because my manifest depends on Proxmox and I don't know which local provider would be a good stand-in for it :)
Just to answer one of your questions: Kubernetes Node objects are never removed automatically, as Talos itself can't even do that (e.g. for worker nodes), and a single node can't decide on Node resource removal on its own (the Node might be in use by another machine due to a duplicate hostname). So removing the Kubernetes Node should be the job of another, higher-level component (e.g. the Terraform provider for Kubernetes).
`reset = true` and `graceful = true` are the correct options, but you need to distinguish in your scripts between scale-down and cluster removal. If you're removing the whole cluster, you can in fact just destroy your VMs completely.
The only way I know of to distinguish between destroy and apply is to always pass a `-var` argument on destroy.
But I'm also brand-new to Terraform, so there may in fact be a way to do it properly.