Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

320 automated ami upgrades + upgrade locking #327

Conversation

preflightsiren
Copy link
Contributor

@preflightsiren preflightsiren commented Aug 5, 2021

Fixes #320

@preflightsiren preflightsiren requested review from a team as code owners August 5, 2021 12:19
@codecov
Copy link

codecov bot commented Aug 5, 2021

Codecov Report

Merging #327 (21d9792) into master (19de2db) will decrease coverage by 4.78%.
The diff coverage is 50.70%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #327      +/-   ##
==========================================
- Coverage   55.87%   51.08%   -4.79%     
==========================================
  Files          27       33       +6     
  Lines        4041     4504     +463     
==========================================
+ Hits         2258     2301      +43     
- Misses       1640     2062     +422     
+ Partials      143      141       -2     
Impacted Files Coverage Δ
controllers/providers/aws/aws.go 0.00% <0.00%> (ø)
controllers/providers/aws/ssm.go 0.00% <0.00%> (ø)
controllers/providers/kubernetes/crd.go 0.00% <0.00%> (ø)
controllers/provisioners/eks/eks.go 87.23% <0.00%> (-1.90%) ⬇️
controllers/provisioners/eksfargate/eksfargate.go 93.77% <0.00%> (-0.46%) ⬇️
controllers/provisioners/eksmanaged/eksmanaged.go 84.39% <0.00%> (-0.61%) ⬇️
controllers/provisioners/provisioners.go 100.00% <ø> (ø)
controllers/provisioners/eks/cloud.go 86.07% <71.42%> (+2.09%) ⬆️
api/v1alpha1/instancegroup_types.go 18.69% <100.00%> (+1.16%) ⬆️
controllers/providers/kubernetes/utils.go 5.42% <100.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 19de2db...21d9792. Read the comment docs.

@preflightsiren
Copy link
Contributor Author

scenarios tested:

  • Adding locked annotation sets state to Locked
  • Removing locked annotation set state to Ready
  • Modifying IG in Locked state does not enter ReconcileModifying
  • Deleting IG in Locked state deletes AWS resources, but not the IG - whoops.
  • Creating IG in Locked state creates AWS resources; IG does not transition through states. - whoops.

@eytan-avisror
Copy link
Collaborator

Thanks for the PR @preflightsiren !
I think we need to consider the behavior/design we want here with locking - whether we lock all reconciles or just don't enter the upgrade path.

I think it would have less implications if we only avoid upgrades (e.g. a condition check before triggering upgrades) vs. a reconcile state.

Otherwise we need to consider all other states and transitions between them from Locked state.

@preflightsiren
Copy link
Contributor Author

Hey mate, thanks for taking a look. From observation each change to the IG sets the state back to Init meaning we only need to worry about the transitions into Locked, rather than worrying about how state transitions in and out of those final states? (I think we may be agreeing on this point)

I would want a Locked reconcile state to show operators why a change isn't being applied. Rather it show Ready with a log message or similar.

In its current form it only locks updates and upgrades - although currently implemented it's not a great experience with creates/deletes.

Next steps are to address the create/delete flows for locked IGs, and implement the imageID lookup.

@preflightsiren
Copy link
Contributor Author

Just looking through https://docs.aws.amazon.com/systems-manager/latest/userguide/parameter-store-public-parameters-eks.html

The ssm GetParameter response contains LastModifiedDate. This might be useful to create policies to allow teams to set different delays updating. Eg. Dev clusters immediately get new AMIs, staging clusters after 48h, prod after 1week.

What do you think?

@preflightsiren
Copy link
Contributor Author

success with multiple OS's

NAME                                          STATUS   ROLES                           AGE     VERSION
ip-172-29-72-68.us-west-2.compute.internal    Ready    <none>                          14d     v1.19.6-eks-49a6c0
ip-172-29-73-206.us-west-2.compute.internal   Ready    platform-us-west-2a             22h     v1.19.6-eks-49a6c0
ip-172-29-73-85.us-west-2.compute.internal    Ready    platform-us-west-2a             22h     v1.19.6-eks-49a6c0
ip-172-29-75-149.us-west-2.compute.internal   Ready    amazonlinux2-arm64-us-west-2b   5h57m   v1.19.13-eks-8df270
ip-172-29-75-175.us-west-2.compute.internal   Ready    bottlerocket-us-west-2b         5h57m   v1.19.13
ip-172-29-75-196.us-west-2.compute.internal   Ready    windows-us-west-2b              5h52m   v1.19.8-eks-fd6eea
ip-172-29-75-238.us-west-2.compute.internal   Ready    amazonlinux2-us-west-2b         5h57m   v1.19.13-eks-8df270
ip-172-29-77-46.us-west-2.compute.internal    Ready    <none>                          14d     v1.19.6-eks-49a6c0

@preflightsiren
Copy link
Contributor Author

Updating a locked instance group:

2021-08-19T21:23:06.496+1000    INFO    v1alpha1        state transition occured        {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "state": "InitUpdate", "previousState": "Init"}
2021-08-19T21:23:06.496+1000    INFO    v1alpha1        state transition occured        {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "state": "ReconcileModifying", "previousState": "InitUpdate"}
2021-08-19T21:23:06.496+1000    DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "ssm", "operation": "GetParameter"}
2021-08-19T21:23:06.496+1000    INFO    aws-provider    Latest EKS AMI ID:      {"value": "ami-0a8ad9f45d082a280"}
2021-08-19T21:23:06.497+1000    INFO    scaling drift not detected      {"instancegroup": "instance-manager/amazonlinux2-us-west-2b"}
2021-08-19T21:23:06.497+1000    INFO    controllers.instancegroup.eks   node rotation required  {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "scalingconfig": "uw2d-akp-scole-instance-manager-amazonlinux2-us-west-2b-20210819050930"}
2021-08-19T21:23:06.871+1000    DEBUG   aws-provider    AWS API call    {"cacheHit": false, "service": "autoscaling", "operation": "CreateOrUpdateTags"}
2021-08-19T21:23:06.871+1000    INFO    controllers.instancegroup.eks   updated scaling group tags      {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "scalinggroup": "uw2d-akp-scole-instance-manager-amazonlinux2-us-west-2b"}
2021-08-19T21:23:06.871+1000    INFO    controllers.instancegroup.eks   bootstrapping arn to aws-auth   {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "arn": "arn:aws:iam::026026080755:role/akp-eks-node-us-west-2"}
2021-08-19T21:23:07.229+1000    INFO    controllers.instancegroup.eks   waiting for node readiness conditions   {"instancegroup": "instance-manager/amazonlinux2-us-west-2b"}
2021-08-19T21:23:07.229+1000    INFO    controllers.instancegroup.eks   desired nodes are ready {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "instances": "i-0dd10ab2bbbfcbacc"}
2021-08-19T21:23:07.229+1000    INFO    v1alpha1        state transition occured        {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "state": "ReconcileModified", "previousState": "ReconcileModifying"}
2021-08-19T21:23:07.229+1000    INFO    v1alpha1        state transition occured        {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "state": "InitUpgrade", "previousState": "ReconcileModified"}
2021-08-19T21:23:07.229+1000    INFO    v1alpha1        state transition occured        {"instancegroup": "instance-manager/amazonlinux2-us-west-2b", "state": "Locked", "previousState": "InitUpgrade"}

@eytan-avisror
Copy link
Collaborator

@preflightsiren thanks, I am OOO until mid next week - will review as soon as I can. Suggest in the meanwhile you consider adding a functional test? Or at least make sure they are passing considering this is a fairly large change.

Copy link
Collaborator

@backjo backjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks pretty good! A few minor comments - and a +1 to Eytan's suggestion for a functional test.

Could also use some documentation of the lock down annotation.

controllers/providers/aws/aws.go Outdated Show resolved Hide resolved
controllers/providers/aws/aws.go Outdated Show resolved Hide resolved
controllers/providers/aws/ssm.go Show resolved Hide resolved
controllers/provisioners/eks/create.go Outdated Show resolved Hide resolved
controllers/provisioners/eks/helpers.go Outdated Show resolved Hide resolved
@preflightsiren
Copy link
Contributor Author

Good feedback; haven't had a chance yet, but I'll add functional tests, and I'll look at providing a ssm parameter path override annotation as well

@preflightsiren
Copy link
Contributor Author

woo, got the BDD tests running, all passed:

33 scenarios (33 passed)
189 steps (189 passed)
38m38.269353018s
testing: warning: no tests to run
PASS
ok      github.com/keikoproj/instance-manager/test-bdd  2318.760s [no tests to run]

@preflightsiren
Copy link
Contributor Author

Added BDD tests for locking the group

@preflightsiren
Copy link
Contributor Author

37 scenarios (37 passed)
214 steps (214 passed)
48m3.043404152s
testing: warning: no tests to run
PASS
ok      github.com/keikoproj/instance-manager/test-bdd  2883.513s [no tests to run]

now includes tests for locking and latest.

return false
}

func HasAnnotationWithValue(annotations map[string]string, key, value string) bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this method as this tests the value of the annotation, not just that it exists - happy for suggestions for method names :)

annotations := instanceGroup.GetAnnotations()

var OSFamily string
if kubeprovider.HasAnnotation(annotations, OsFamilyAnnotation) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous version, now HasAnnotationWithValue caused some unexpected logic problems; I've added a new method to just detect the presence of a key in the annotation

@preflightsiren preflightsiren changed the title WIP: 320 automated ami upgrades + upgrade locking 320 automated ami upgrades + upgrade locking Oct 1, 2021
@preflightsiren preflightsiren force-pushed the preflightsiren-latest-ami-update branch 2 times, most recently from a5a0b85 to a78d23b Compare October 1, 2021 05:32
@preflightsiren
Copy link
Contributor Author

@eytan-avisror @backjo this is ready for final review; I want to finish this off and get it ready to merge before improving/expanding it.

My thoughts for next steps are:

  1. Support custom SSM Param path
  2. Support custom Role assumption (primarily to support cross account ssm lookup)
  3. NodeUpgradeRequest ala. Upgrade approval for changes to Instance Groups #321

@preflightsiren
Copy link
Contributor Author

Do you prefer to have commits squashed before merge? so I can give it a nice commit message - or would you prefer to do it once merging the PR?

@backjo
Copy link
Collaborator

backjo commented Oct 1, 2021

Not sure if we have a policy for the project - I definitely prefer squashed commits before merge personally :)

@@ -1173,3 +1173,40 @@ func (ctx *EksInstanceGroupContext) GetDesiredMixedInstancesPolicy(name string)

return policy
}

func FilterSupportedArch(architectures []string) string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is FilterSupportedArch only meant to check if a slice of architectures is in slice of SupportedArchitectures ?
Could we just use common.StringSliceContains, or a different generic function?
In the below code it seems we return the first architecture that we find a match for, is there a condition where there are multiple supported architectures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FilterSupportedArch returns the first match between the SupportedArchitectures
from my investigation of the aws api most instance types either return

"arch": [
    "i386",
    "x86_64"
  ]

or

"arch": [
    "arm64"
  ]

since i386 isn't a supported arch, common.StringSliceContains would fail. Happy to use a different generic if one exists. If ever amazon provides an instance with multiple supported architectures, we would either need to explain the preference order, or provide some way of preferencing one.

@eytan-avisror
Copy link
Collaborator

This is looking pretty good to me, @backjo do you have any additional comments?
Great work on this @preflightsiren

@backjo
Copy link
Collaborator

backjo commented Oct 5, 2021

This is looking pretty good to me, @backjo do you have any additional comments? Great work on this @preflightsiren

Just the doc updates for SSM permissions - that's all that's left on my end

@preflightsiren
Copy link
Contributor Author

Cheers, I have that done locally, just trying to bump up the codecov - should have this finalised shortly.

@preflightsiren preflightsiren force-pushed the preflightsiren-latest-ami-update branch from a78d23b to de62999 Compare October 6, 2021 11:13
@preflightsiren
Copy link
Contributor Author

I've taken another look at the codecov, I don't get where the -5% comes from or where to expand our test coverage. If someone could help point it out I'll happily add in the testing functions.

docs/INSTALL.md Outdated Show resolved Hide resolved
Automated AMI management

InstanceGroups can now set the image configuration to "latest".
This will result the ami value being retrieved from a ssm parameter
(https://docs.aws.amazon.com/eks/latest/userguide/retrieve-ami-id.html).
This will ensure that nodes within an InstanceGroup are kept up-to-date
and is especially useful in development clusters.

Automated AMI management supports retrieving amazon amis for amazon
linux 2, bottlerocket and windows nodes. This can be configured using
the annotation
`instancemgr.keikoproj.io/os-family`.

Upgrade locking

InstanceGroups can now set an annotation
`instancemgr.keikoproj.io/lock-upgrades="true"`
which will prevent the InstanceGroup from entering the InitUpgrade
state.
This is useful for controlling when the nodes of an InstanceGroup can be
upgraded, pairing well with the automated AMI management feature.

Resolves: 320

Signed-off-by: Sebastian Cole <[email protected]>
@preflightsiren preflightsiren force-pushed the preflightsiren-latest-ami-update branch from de62999 to 21d9792 Compare October 7, 2021 22:02
Copy link
Collaborator

@eytan-avisror eytan-avisror left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thanks for adding this feature @preflightsiren

@eytan-avisror eytan-avisror merged commit 165095e into keikoproj:master Oct 7, 2021
@eytan-avisror
Copy link
Collaborator

@preflightsiren BDD has been failing with:

-- Failed steps:

  Scenario: Resources can be submitted # features/01_create.feature:6
    And I create a resource instance-group-latest-locked.yaml # features/01_create.feature:20
      Error: open templates/instance-group-latest-locked.yaml: no such file or directory

I think you forgot to check in the YAML files for BDD, can you take a look?

@preflightsiren
Copy link
Contributor Author

I'll check tonight.

@preflightsiren
Copy link
Contributor Author

@eytan-avisror can you try running the BDD tests with this patch #334 , I'm unable to run the tests at the moment to verify myself, but I'm confident this is the thing missing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automated AMI upgrades
3 participants