[BUG] - Inserting New Mappings into node_groups leads to destructive behavior #2808

Open
kenafoster opened this issue Oct 30, 2024 · 2 comments
@kenafoster (Contributor)

Describe the bug

Modify a Nebari AWS deployment node group definition like so:

# old
amazon_web_services:
  node_groups:
    grp_1:
      ...
    grp_2:
      ...
    grp_3:
      ...

# new
amazon_web_services:
  node_groups:
    grp_1:
      ...
    grp_2:
      ...
    grp_4:
      ...
    grp_5:
      ...
    grp_3:
      ...

Nebari destroyed and recreated the third item in the mapping (the resource that held grp_3 now holds grp_4's configuration), then created new fourth and fifth items.

Expected behavior

Create new grp_4 and grp_5 while leaving grp_3 unchanged

OS and architecture in which you are running Nebari

MacOS / ARM Apple Silicon

How to Reproduce the problem?

Make a change in a nebari-config such as:

# old
amazon_web_services:
  node_groups:
    grp_1:
      ...
    grp_2:
      ...
    grp_3:
      ...

# new
amazon_web_services:
  node_groups:
    grp_1:
      ...
    grp_2:
      ...
    grp_4:
      ...
    grp_5:
      ...
    grp_3:
      ...

Command output

In the example below, two new node groups were added in the middle of the mapping. Note the "forces replacement" annotations: the instance types shift between list positions, so existing groups are destroyed and recreated.



[terraform]: 
[terraform]:   # module.kubernetes.aws_eks_node_group.main[3] must be replaced
[terraform]: -/+ resource "aws_eks_node_group" "main" {
[terraform]:       ~ ami_type               = "AL2_x86_64_GPU" -> (known after apply)
[terraform]:       ~ arn                    = "arn:aws-us-gov:eks:us-gov-west-1:xxxxxxxxxxxxxx:nodegroup/nebari-xxxxxxxxxxxxxx/gpu-tesla-g4/6ec71764-4312-3023-2f8f-da6f2d90d0eb" -> (known after apply)
[terraform]:       ~ capacity_type          = "ON_DEMAND" -> (known after apply)
[terraform]:       ~ id                     = "nebari-xxxxxxxxxxxxxx:gpu-tesla-g4" -> (known after apply)
[terraform]:       ~ instance_types         = [ # forces replacement
[terraform]:           - "g4dn.xlarge",
[terraform]:           + "r5.xlarge",
[terraform]:         ]
[terraform]:       ~ labels                 = {
[terraform]:           ~ "dedicated" = "gpu-tesla-g4" -> "r5-xlarge"
[terraform]:         }
[terraform]:       ~ node_group_name        = "gpu-tesla-g4" -> "r5-xlarge" # forces replacement
[terraform]:       + node_group_name_prefix = (known after apply)
[terraform]:       ~ release_version        = "1.29.3-20240506" -> (known after apply)
[terraform]:       ~ resources              = [
[terraform]:           - {
[terraform]:               - autoscaling_groups              = [
[terraform]:                   - {
[terraform]:                       - name = "eks-gpu-tesla-g4-6ec71764-4312-3023-2f8f-da6f2d90d0eb"
[terraform]:                     },
[terraform]:                 ]
[terraform]:               - remote_access_security_group_id = ""
[terraform]:             },
[terraform]:         ] -> (known after apply)
[terraform]:       ~ status                 = "ACTIVE" -> (known after apply)
[terraform]:       ~ tags                   = {
[terraform]:             "Environment"                                             = "xxxxxx"
[terraform]:             "Owner"                                                   = "terraform"
[terraform]:             "Project"                                                 = "nebari-xxxxxx"
[terraform]:           ~ "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g4" -> "r5-xlarge"
[terraform]:             "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       ~ tags_all               = {
[terraform]:           ~ "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g4" -> "r5-xlarge"
[terraform]:             # (4 unchanged elements hidden)
[terraform]:         }
[terraform]:       ~ version                = "1.29" -> (known after apply)
[terraform]:         # (4 unchanged attributes hidden)
[terraform]: 
[terraform]:       - update_config {
[terraform]:           - max_unavailable            = 1 -> null
[terraform]:           - max_unavailable_percentage = 0 -> null
[terraform]:         }
[terraform]: 
[terraform]:         # (1 unchanged block hidden)
[terraform]:     }
[terraform]: 
[terraform]:   # module.kubernetes.aws_eks_node_group.main[4] must be replaced
[terraform]: -/+ resource "aws_eks_node_group" "main" {
[terraform]:       ~ ami_type               = "AL2_x86_64_GPU" -> (known after apply)
[terraform]:       ~ arn                    = "arn:aws-us-gov:eks:us-gov-west-1:xxxxxxxxxxxxxx:nodegroup/nebari-xxxxxxxxxxxxxx/gpu-tesla-g4-4x/bec71764-4314-8153-2845-0b9137a886ea" -> (known after apply)
[terraform]:       ~ capacity_type          = "ON_DEMAND" -> (known after apply)
[terraform]:       ~ id                     = "nebari-xxxxxxxxxxxxxx:gpu-tesla-g4-4x" -> (known after apply)
[terraform]:       ~ instance_types         = [ # forces replacement
[terraform]:           - "g4dn.12xlarge",
[terraform]:           + "r5.2xlarge",
[terraform]:         ]
[terraform]:       ~ labels                 = {
[terraform]:           ~ "dedicated" = "gpu-tesla-g4-4x" -> "r5-2xlarge"
[terraform]:         }
[terraform]:       ~ node_group_name        = "gpu-tesla-g4-4x" -> "r5-2xlarge" # forces replacement
[terraform]:       + node_group_name_prefix = (known after apply)
[terraform]:       ~ release_version        = "1.29.3-20240506" -> (known after apply)
[terraform]:       ~ resources              = [
[terraform]:           - {
[terraform]:               - autoscaling_groups              = [
[terraform]:                   - {
[terraform]:                       - name = "eks-gpu-tesla-g4-4x-bec71764-4314-8153-2845-0b9137a886ea"
[terraform]:                     },
[terraform]:                 ]
[terraform]:               - remote_access_security_group_id = ""
[terraform]:             },
[terraform]:         ] -> (known after apply)
[terraform]:       ~ status                 = "ACTIVE" -> (known after apply)
[terraform]:       ~ tags                   = {
[terraform]:             "Environment"                                             = "xxxxxx"
[terraform]:             "Owner"                                                   = "terraform"
[terraform]:             "Project"                                                 = "nebari-xxxxxx"
[terraform]:           ~ "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g4-4x" -> "r5-2xlarge"
[terraform]:             "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       ~ tags_all               = {
[terraform]:           ~ "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g4-4x" -> "r5-2xlarge"
[terraform]:             # (4 unchanged elements hidden)
[terraform]:         }
[terraform]:       ~ version                = "1.29" -> (known after apply)
[terraform]:         # (4 unchanged attributes hidden)
[terraform]: 
[terraform]:       - update_config {
[terraform]:           - max_unavailable            = 1 -> null
[terraform]:           - max_unavailable_percentage = 0 -> null
[terraform]:         }
[terraform]: 
[terraform]:         # (1 unchanged block hidden)
[terraform]:     }
[terraform]: 
[terraform]:   # module.kubernetes.aws_eks_node_group.main[5] must be replaced
[terraform]: -/+ resource "aws_eks_node_group" "main" {
[terraform]:       ~ ami_type               = "AL2_x86_64_GPU" -> (known after apply)
[terraform]:       ~ arn                    = "arn:aws-us-gov:eks:us-gov-west-1:xxxxxxxxxxxxxx:nodegroup/nebari-xxxxxxxxxxxxxx/gpu-tesla-g3-2x/eec71764-4316-dcb9-a3ee-94b286f4afff" -> (known after apply)
[terraform]:       ~ capacity_type          = "ON_DEMAND" -> (known after apply)
[terraform]:       ~ id                     = "nebari-xxxxxxxxxxxxxx:gpu-tesla-g3-2x" -> (known after apply)
[terraform]:       ~ instance_types         = [ # forces replacement
[terraform]:           - "g3.8xlarge",
[terraform]:           + "g4dn.xlarge",
[terraform]:         ]
[terraform]:       ~ labels                 = {
[terraform]:           ~ "dedicated" = "gpu-tesla-g3-2x" -> "gpu-tesla-g4"
[terraform]:         }
[terraform]:       ~ node_group_name        = "gpu-tesla-g3-2x" -> "gpu-tesla-g4" # forces replacement
[terraform]:       + node_group_name_prefix = (known after apply)
[terraform]:       ~ release_version        = "1.29.3-20240506" -> (known after apply)
[terraform]:       ~ resources              = [
[terraform]:           - {
[terraform]:               - autoscaling_groups              = [
[terraform]:                   - {
[terraform]:                       - name = "eks-gpu-tesla-g3-2x-eec71764-4316-dcb9-a3ee-94b286f4afff"
[terraform]:                     },
[terraform]:                 ]
[terraform]:               - remote_access_security_group_id = ""
[terraform]:             },
[terraform]:         ] -> (known after apply)
[terraform]:       ~ status                 = "ACTIVE" -> (known after apply)
[terraform]:       ~ tags                   = {
[terraform]:             "Environment"                                             = "xxxxxx"
[terraform]:             "Owner"                                                   = "terraform"
[terraform]:             "Project"                                                 = "nebari-xxxxxx"
[terraform]:           ~ "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g3-2x" -> "gpu-tesla-g4"
[terraform]:             "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       ~ tags_all               = {
[terraform]:           ~ "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g3-2x" -> "gpu-tesla-g4"
[terraform]:             # (4 unchanged elements hidden)
[terraform]:         }
[terraform]:       ~ version                = "1.29" -> (known after apply)
[terraform]:         # (4 unchanged attributes hidden)
[terraform]: 
[terraform]:       - update_config {
[terraform]:           - max_unavailable            = 1 -> null
[terraform]:           - max_unavailable_percentage = 0 -> null
[terraform]:         }
[terraform]: 
[terraform]:         # (1 unchanged block hidden)
[terraform]:     }
[terraform]: 
[terraform]:   # module.kubernetes.aws_eks_node_group.main[6] will be created
[terraform]:   + resource "aws_eks_node_group" "main" {
[terraform]:       + ami_type               = (known after apply)
[terraform]:       + arn                    = (known after apply)
[terraform]:       + capacity_type          = (known after apply)
[terraform]:       + cluster_name           = "nebari-xxxxxxxxxxxxxx"
[terraform]:       + disk_size              = 50
[terraform]:       + id                     = (known after apply)
[terraform]:       + instance_types         = [
[terraform]:           + "g4dn.12xlarge",
[terraform]:         ]
[terraform]:       + labels                 = {
[terraform]:           + "dedicated" = "gpu-tesla-g4-4x"
[terraform]:         }
[terraform]:       + node_group_name        = "gpu-tesla-g4-4x"
[terraform]:       + node_group_name_prefix = (known after apply)
[terraform]:       + node_role_arn          = "arn:aws-us-gov:iam::xxxxxxxxxxxxxx:role/nebari-xxxxxxxxxxxxxx-eks-node-group-role"
[terraform]:       + release_version        = (known after apply)
[terraform]:       + resources              = (known after apply)
[terraform]:       + status                 = (known after apply)
[terraform]:       + subnet_ids             = [
[terraform]:           + "subnet-xxxxxxxxxxxxxxxxxx",
[terraform]:           + "subnet-xxxxxxxxxxxxxxxxxx",
[terraform]:         ]
[terraform]:       + tags                   = {
[terraform]:           + "Environment"                                             = "xxxxxx"
[terraform]:           + "Owner"                                                   = "terraform"
[terraform]:           + "Project"                                                 = "nebari-xxxxxx"
[terraform]:           + "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g4-4x"
[terraform]:           + "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       + tags_all               = {
[terraform]:           + "Environment"                                             = "xxxxxx"
[terraform]:           + "Owner"                                                   = "terraform"
[terraform]:           + "Project"                                                 = "nebari-xxxxxx"
[terraform]:           + "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g4-4x"
[terraform]:           + "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       + version                = (known after apply)
[terraform]: 
[terraform]:       + scaling_config {
[terraform]:           + desired_size = 0
[terraform]:           + max_size     = 5
[terraform]:           + min_size     = 0
[terraform]:         }
[terraform]:     }
[terraform]: 
[terraform]:   # module.kubernetes.aws_eks_node_group.main[7] will be created
[terraform]:   + resource "aws_eks_node_group" "main" {
[terraform]:       + ami_type               = (known after apply)
[terraform]:       + arn                    = (known after apply)
[terraform]:       + capacity_type          = (known after apply)
[terraform]:       + cluster_name           = "nebari-xxxxxxxxxxxxxx"
[terraform]:       + disk_size              = 50
[terraform]:       + id                     = (known after apply)
[terraform]:       + instance_types         = [
[terraform]:           + "g3.8xlarge",
[terraform]:         ]
[terraform]:       + labels                 = {
[terraform]:           + "dedicated" = "gpu-tesla-g3-2x"
[terraform]:         }
[terraform]:       + node_group_name        = "gpu-tesla-g3-2x"
[terraform]:       + node_group_name_prefix = (known after apply)
[terraform]:       + node_role_arn          = "arn:aws-us-gov:iam::xxxxxxxxxxxxxx:role/nebari-xxxxxxxxxxxxxx-eks-node-group-role"
[terraform]:       + release_version        = (known after apply)
[terraform]:       + resources              = (known after apply)
[terraform]:       + status                 = (known after apply)
[terraform]:       + subnet_ids             = [
[terraform]:           + "subnet-xxxxxxxxxxxxxxxxxx",
[terraform]:           + "subnet-xxxxxxxxxxxxxxxxxx",
[terraform]:         ]
[terraform]:       + tags                   = {
[terraform]:           + "Environment"                                             = "xxxxxx"
[terraform]:           + "Owner"                                                   = "terraform"
[terraform]:           + "Project"                                                 = "nebari-xxxxxx"
[terraform]:           + "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g3-2x"
[terraform]:           + "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       + tags_all               = {
[terraform]:           + "Environment"                                             = "xxxxxx"
[terraform]:           + "Owner"                                                   = "terraform"
[terraform]:           + "Project"                                                 = "nebari-xxxxxx"
[terraform]:           + "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "gpu-tesla-g3-2x"
[terraform]:           + "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       + version                = (known after apply)
[terraform]: 
[terraform]:       + scaling_config {
[terraform]:           + desired_size = 0
[terraform]:           + max_size     = 5
[terraform]:           + min_size     = 0
[terraform]:         }
[terraform]:     }
[terraform]: 
[terraform]: Plan: 6 to add, 0 to change, 3 to destroy.


Versions and dependencies used

Nebari 2024.9.1

Compute environment

None

Integrations

_No response_

Anything else?

Digging into the code, it looks like var.node_groups is a Terraform list type and the node groups are created with a count meta-argument. Terraform's documentation [references exactly these "churn" problems as a reason](https://developer.hashicorp.com/terraform/language/meta-arguments/count#when-to-use-for_each-instead-of-count) to use for_each instead of count.

One major headache with taking this approach is that it would be difficult to avoid the migration itself becoming destructive: a moved block would have to map the old list indices to the new name-based keys, and I find it highly unlikely it could be generated on the fly. Perhaps an upgrade script that manually moves the state around, with some sanity checking to make sure each old->new mapping is as expected, would work.
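As a rough sketch of what that migration could look like (the index-to-name pairs below are hypothetical; a real upgrade would have to derive them from the existing state, e.g. via terraform state list), Terraform 1.1+ moved blocks can map old count-based addresses to for_each keys:

```hcl
# Hypothetical sketch: map old count-based addresses to name-keyed
# for_each addresses. The pairings shown are examples only; the real
# ones depend on the order of node_groups in the deployed config.
moved {
  from = aws_eks_node_group.main[0]
  to   = aws_eks_node_group.main["grp_1"]
}

moved {
  from = aws_eks_node_group.main[1]
  to   = aws_eks_node_group.main["grp_2"]
}

moved {
  from = aws_eks_node_group.main[2]
  to   = aws_eks_node_group.main["grp_3"]
}
```

Alternatively, the same renames can be done imperatively with terraform state mv, which is closer to the "upgrade script" idea above.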

The other way to address this would be through documentation warning users not to reorder node_groups entries.
@kenafoster kenafoster added type: bug 🐛 Something isn't working needs: triage 🚦 Someone needs to have a look at this issue and triage labels Oct 30, 2024
@kenafoster kenafoster changed the title Oct 30, 2024
@viniciusdc (Contributor)

viniciusdc commented Oct 30, 2024

Thanks for opening the issue, @kenafoster. The name of the issue is a bit misleading, though. What leads to the destruction and recreation of resources is not the inclusion of new nodes to the mapping but the rearranging of the order.

This is a curious finding that I think I might've stumbled across before as well. Indeed, it is due to us using the count object to create new node_groups dynamically under the Terraform HCL code.

resource "aws_eks_node_group" "main" {
  count        = length(var.node_groups)
  cluster_name = aws_eks_cluster.main.name
Hence, each node_group created by Terraform is referenced by its resource name plus list index, e.g. aws_eks_node_group.main[0]. The original example you provided would generate the following correspondence:

    grp_1 --> aws_eks_node_group.main[0]
    grp_2 --> aws_eks_node_group.main[1]
    grp_3 --> aws_eks_node_group.main[2]

However, as mentioned above, once you included more nodes and rearranged the order, this led to the following correspondence:

    grp_1 --> aws_eks_node_group.main[0]
    grp_2 --> aws_eks_node_group.main[1]
    grp_4 --> aws_eks_node_group.main[2]
    grp_5 --> aws_eks_node_group.main[3]
    grp_3 --> aws_eks_node_group.main[4]

The problem is that any already-created resource whose index now maps to a different group (e.g. aws_eks_node_group.main[2], which previously held grp_3 but now holds grp_4) will have its metadata changed (name, instance types, limits, etc.), leading to destruction and re-creation.

As you might have considered, the best approach would be to use the for_each meta-argument instead, since we can use the node_group names as the index rather than an arbitrary list position. The reason we started with count was most probably that for_each did not exist when that part of the code was first written, so it's mostly legacy.
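A minimal sketch of that change (assuming var.node_groups becomes a map keyed by group name; the attribute names inside each.value are illustrative, not Nebari's actual schema):

```hcl
# Sketch: key node groups by name instead of list position.
# Assumes var.node_groups is a map(object({...})) keyed by group name.
resource "aws_eks_node_group" "main" {
  for_each = var.node_groups

  cluster_name    = aws_eks_cluster.main.name
  node_group_name = each.key
  instance_types  = each.value.instance_types

  # Adding or reordering entries no longer shifts other groups'
  # addresses: each group is tracked in state as main["grp_1"],
  # main["grp_3"], etc., independent of its position in the config.
}
```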

@Adam-D-Lewis Adam-D-Lewis removed the needs: triage 🚦 Someone needs to have a look at this issue and triage label Nov 5, 2024
@Adam-D-Lewis (Member)

Adam-D-Lewis commented Nov 5, 2024

Discussed in the meeting today: we are open to fixing this by having Terraform reference node groups by name rather than by list position, and we think that is the better way to handle it, but no one is assigned at the moment. @kenafoster, are you interested in working on this?
