[ci.jenkins.io] Move ephemeral VM agents to AWS #4316

Open
Tracked by #4313
dduportal opened this issue Sep 28, 2024 · 20 comments

Comments

@dduportal
Contributor

dduportal commented Sep 28, 2024

  • Add back EC2 AMI builds in packer image
    • Will need to create EC2 resources to allow infra.ci to build VMs and create AMIs from it (check for old packer.tf in jenkins-infra/aws history)
      • We will need a dedicated VPC, IAM user with API key and limited permissions in this VPC to allow packer using EC2
    • Start from the PR which removed it
    • Don't forget the Garbage collector!
  • Add back the EC2 Jenkins plugin
    • Also need IAM user for ci.jenkins.io controller (distinct from packer - also check jenkins-infra/aws)
    • Need to check SSH vs. inbound
    • Need to check what replaces the current Azure VM init script in our older system (CloudInit?)
    • Need to check the VM retention to be ephemeral (or 1 min idle time max.)

github-actions bot commented Sep 28, 2024

Take a look at these similar issues to see if there isn't already a response to your problem:

  1. 74% [ci.jenkins.io] Move ephemeral Linux containers to AWS #4317

@dduportal dduportal changed the title Move ephemeral VM agents to AWS [ci.jenkins.io] Move ephemeral VM agents to AWS Sep 28, 2024
@dduportal dduportal added triage Incoming issues that need review ci.jenkins.io EC2 aws labels Sep 28, 2024
@smerle33 smerle33 added this to the infra-team-sync-2024-10-08 milestone Oct 2, 2024
@smerle33 smerle33 removed the triage Incoming issues that need review label Oct 2, 2024
@smerle33
Contributor

smerle33 commented Oct 3, 2024

To prepare this, we (Jay and I) need to create a dedicated user for packer-images, as was done for Azure (https://github.com/jenkins-infra/azure/blob/main/packer-resources.tf). I started creating it in the aws-sponsored repository:
jenkins-infra/terraform-aws-sponsorship#4
The plan step of this PR succeeded and the PR was merged, but the deploy then failed on main because of missing rights: https://infra.ci.jenkins.io/job/terraform-jobs/job/terraform-aws-sponsorship/job/main/9/

We then extended the policies of the infra-developer role, in the terraform-states repository, so that it can create the new user directly. After several rounds of trial and error, we arrived at the correct set of rights (private link: https://github.com/jenkins-infra/terraform-states/blob/2ba74f30dd02a497062ecd8d1e5b52a7554e66b2/aws-sponsored/role-infra-developers.tf#L193-L210)

But when replaying the job on infra.ci, we still got this error:

AccessDenied: User: arn:aws:iam::<redacted>:user/terraform-awssponsored-production is not authorized to perform: iam:GetUser

while the deploy works locally with the infra-developer role (terraform-developer):

aws_iam_user.terraform_packer_user: Refreshing state... [id=terraform-packer-user]
data.aws_iam_policy_document.packer: Reading...
data.aws_iam_policy_document.packer: Read complete after 0s [id=<redacted>]
aws_iam_policy.packer: Refreshing state... [id=arn:aws:iam::<redacted>:policy/packer.iam_policy]
aws_iam_access_key.terraform_packer_api_keys: Refreshing state... [id=<redacted>]
aws_iam_user_policy_attachment.allow_packer_user: Refreshing state... [id=terraform-packer-user-<redacted>]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_iam_user_policy_attachment.allow_packer_user is tainted, so must be replaced
-/+ resource "aws_iam_user_policy_attachment" "allow_packer_user" {
      ~ id         = "terraform-packer-user-<redacted>" -> (known after apply)
        # (2 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.
aws_iam_user_policy_attachment.allow_packer_user: Destroying... [id=terraform-packer-user-<redacted>]
aws_iam_user_policy_attachment.allow_packer_user: Destruction complete after 0s
aws_iam_user_policy_attachment.allow_packer_user: Creating...
aws_iam_user_policy_attachment.allow_packer_user: Creation complete after 1s [id=terraform-packer-user-<redacted>]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed.

When checking in the UI, we can see that terraform-awssponsored-production can assume the role role/infra-developer, so the changes should work for it...
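
The AccessDenied message quoted above names both the calling principal and the missing action. As a small debugging aid, here is a Python sketch that pulls those two fields out of such a message. The helper name `parse_access_denied` is hypothetical and the pattern only covers the message shape shown in this comment. One possible reading of the result is that the principal is still the plain IAM user rather than an assumed role, which would suggest the job on infra.ci did not assume role/infra-developer, but this is an assumption and not something the logs confirm.

```python
import re

# Pattern for the "AccessDenied" message shape quoted in this comment.
# It is an illustration only, not a full parser for every AWS error format.
_ACCESS_DENIED = re.compile(
    r"User: (?P<principal>\S+) is not authorized to perform: (?P<action>[\w-]+:\w+)"
)


def parse_access_denied(message):
    """Return (principal_arn, action) from an AccessDenied message, or None."""
    match = _ACCESS_DENIED.search(message)
    if match is None:
        return None
    return match.group("principal"), match.group("action")


sample = (
    "AccessDenied: User: arn:aws:iam::<redacted>:user/"
    "terraform-awssponsored-production is not authorized to perform: iam:GetUser"
)
print(parse_access_denied(sample))
```

Comparing the extracted principal with the identity that works locally is a quick way to confirm which credentials a job really used.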

@smerle33
Contributor

smerle33 commented Oct 4, 2024

The packer user creation was moved to terraform-states, so the IAM rights problem no longer applies.

@jayfranco999
Collaborator

Update:

The AWS credentials used by the 'packer' user to access packer-images are now available in sops. The PR below adds the credentials in infra-ci to build packer image templates.

jenkins-infra/kubernetes-management#5780

While testing the pipeline used to create the packer-images templates, @smerle33 and I encountered an error with the GC (garbage collector) scripts: https://infra.ci.jenkins.io/job/infra-tools/job/packer-images/job/PR-1430/11/pipeline-console/?selected-node=25

To overcome this, we granted executable permissions to the cleanup scripts: jenkins-infra/packer-images#1430

On further testing of the packer-images EC2 instances, the GC script ./cleanup/aws.sh passed, but the next script, ./cleanup/aws_images.sh, failed with exit code 1: https://infra.ci.jenkins.io/job/infra-tools/job/packer-images/job/PR-1430/12/pipeline-console/log?nodeId=58

Next steps will involve fixing the GC scripts and having at least one docker.ubuntu_22.04 amazon-ebs template created by the packer user.

@smerle33
Contributor

smerle33 commented Oct 7, 2024

We are trying to set up our environment to use this new packer user for our local packer runs.
We first tried as infra-developer, but we ran into hashicorp/packer#12110, which seems to prevent assuming a role through a profile.
We are updating the code to match the Azure way of working, so that we can provide the token to packer through environment variables and run the builds with the new user's token.
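
As a rough illustration of passing the token through environment variables, the Python sketch below builds the environment for a local packer run. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS credential variables. The helper name `packer_environment`, the placeholder values, and the choice to drop `AWS_PROFILE` (so that the static keys are the ones used) are assumptions made for this example, not the actual change made in packer-images.

```python
import os


def packer_environment(access_key_id, secret_access_key, base_env=None):
    """Return a copy of the environment carrying static AWS credentials."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: remove any profile so it cannot take precedence over the keys.
    env.pop("AWS_PROFILE", None)
    env["AWS_ACCESS_KEY_ID"] = access_key_id
    env["AWS_SECRET_ACCESS_KEY"] = secret_access_key
    return env


# Placeholder values; real keys would come from the secret store, never from code.
example_env = packer_environment(
    "AKIAEXAMPLE",
    "example-secret",
    base_env={"PATH": "/usr/bin", "AWS_PROFILE": "infra-developer"},
)
print(sorted(example_env))

# A local build could then be started with, for instance:
#   subprocess.run(["packer", "build", "."], env=example_env, check=True)
```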

@jayfranco999
Collaborator

jayfranco999 commented Oct 8, 2024

Update:

We created a user terraform-packer-user and exported the credentials to infra.ci. With this we were able to provide the necessary user policies required to create packer-images EC2 Ubuntu-22.04 arm64 and amd64 VM agents.
https://infra.ci.jenkins.io/job/infra-tools/job/packer-images/job/PR-1430/21/

Next steps involve:

  • Finding out how to use multiple aws_spot_instance_types for amd64 and arm64 VM agents.
  • Fixing the GC scripts for AWS instances.

@jayfranco999
Collaborator

jayfranco999 commented Oct 10, 2024

Update:

The GC script now works for our pipeline: we added handling that lets the AMI list be an empty array in case no AMI IDs are found. The dry-run worked as expected.

12:10:27  == DRY RUN:
12:10:27  aws ec2 deregister-image --dry-run --image-id ami-025bfc41ca974eb4f
12:10:27  
12:10:27  == DRY RUN:
12:10:27  aws ec2 deregister-image --dry-run --image-id ami-05585bfe905b5e6fb
12:10:27  
12:10:28  == DRY RUN:
12:10:28  aws ec2 deregister-image --dry-run --image-id ami-01031637bfcffa070
12:10:28  
12:10:28  + echo '== AWS Packer Cleanup IMAGES finished.'
12:10:28  == AWS Packer Cleanup IMAGES finished.
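
The empty-list handling described above can be sketched in a few lines of Python. This is not the actual ./cleanup/aws_images.sh script: the function name `deregister_commands` is hypothetical, and the sketch only builds the `aws ec2 deregister-image` command lines seen in the log, without running them.

```python
def deregister_commands(ami_ids, dry_run=True):
    """Build one `aws ec2 deregister-image` command per AMI ID.

    An empty list of IDs yields an empty list of commands, so the cleanup
    can finish successfully when there is nothing to delete.
    """
    commands = []
    for ami_id in ami_ids:
        command = ["aws", "ec2", "deregister-image"]
        if dry_run:
            command.append("--dry-run")
        command += ["--image-id", ami_id]
        commands.append(command)
    return commands


for command in deregister_commands(["ami-025bfc41ca974eb4f", "ami-05585bfe905b5e6fb"]):
    print("== DRY RUN:")
    print(" ".join(command))
```

Keeping the "nothing found" case as a normal, empty result avoids a non-zero exit code when no AMI matches.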

On further testing of our EC2 VMs, we discovered an issue that was preventing the packer-images build: the outdated git_linux_version: 2.46.2 used by packer-images was incompatible with the apt used by the agent VMs. This was rectified by the PR jenkins-infra/packer-images#1440

Packer-images now uses git_linux_version: 2.47.0

@smerle33
Contributor

  • Will need to create EC2 resources to allow infra.ci to build VMs and create AMIs from it (check for old packer.tf in jenkins-infra/aws history)

Nothing about that file in the history, but we found the PR that removed it, which helped us: jenkins-infra/packer-images#734

@jayfranco999
Collaborator

jayfranco999 commented Oct 14, 2024

Due to the complexity of this PR (jenkins-infra/packer-images#1430), we are splitting the tasks into 4 stages:

  • packer-images: Add linux AMI
  • packer-images: Add Garbage Collector for AWS
  • packer-images: Add Windows AMI
  • packer-images: Optimize AMI builds (Restricted network, spot instances)

@dduportal
Contributor Author

  • Set up the ci.jenkins.io VM to use instance metadata instead of a credential, and validate the cloud config

@dduportal
Contributor Author

First try at spinning up an ephemeral agent (with private IP in the private subnet):

  • The VM was started with success
  • The SSH connection is timing out.

=> Need to update and check the Network ACLs, as the inbound SSH and outbound JNLP rules are missing on the VM subnet
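
To reason about which rules are missing, here is a simplified Python model of how a Network ACL is evaluated: rules are checked in ascending rule number, the first match decides, and anything unmatched is denied. Network ACLs are stateless, so reply traffic needs its own rule in the opposite direction. The sketch ignores CIDR matching, and the rule numbers, the `AclRule` type, and the JNLP port 50000 are assumptions for illustration, not the real subnet configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    number: int
    egress: bool    # False = inbound rule, True = outbound rule
    protocol: str   # "tcp" or "all"
    from_port: int
    to_port: int
    allow: bool


def is_allowed(rules, egress, port, protocol="tcp"):
    """Return the decision of the first matching rule, lowest number first."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.egress != egress:
            continue
        if rule.protocol not in ("all", protocol):
            continue
        if rule.protocol == "all" or rule.from_port <= port <= rule.to_port:
            return rule.allow
    return False  # implicit deny when no rule matches


# Hypothetical rule set for the agent subnet.
rules = [
    AclRule(100, egress=False, protocol="tcp", from_port=22, to_port=22, allow=True),
    AclRule(100, egress=True, protocol="tcp", from_port=1024, to_port=65535, allow=True),
]
print(is_allowed(rules, egress=False, port=22))    # inbound SSH to the agent
print(is_allowed(rules, egress=True, port=50000))  # outbound to the assumed JNLP port
print(is_allowed(rules, egress=True, port=443))    # outbound HTTPS, not covered here
```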

@dduportal
Contributor Author

Update:

@dduportal
Contributor Author

Update: wip on the init script/cloud init.

The main issue is that our private subnet setup forbids internet access. We need to check whether a Network ACL is forbidding access to the NAT gateway, or whether a routing table is missing.

@dduportal
Contributor Author

Update:

  • We have a first successful EC2 agent template with its Puppet JCasc setup with the following features:
    • Linux, Ubuntu arm64. Tested with x86 as well (manually) with success.
    • SSH launcher (not a spot instance)
    • Distinct default Java (JDK21, both Maven and java binary in the PATH) and agent Java (JDK17) versions
    • Private subnet, no public IP
    • Cloud-init based, with an init script that starts the agent only once cloud-init has finished.
    • AWS SDK v1
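
The "start the agent only when cloud init is finished" check can be pictured as a wait on a marker file. The Python sketch below is an illustration only: the real init script is not shown in this thread, and the function name `wait_for_file`, the timings, and the marker path in the comment are hypothetical.

```python
import os
import time


def wait_for_file(path, timeout=300.0, interval=2.0):
    """Poll until `path` exists; return False if `timeout` seconds elapse first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# In an init script this could guard the agent start, for example:
#   if not wait_for_file("/path/to/cloud-init-done"):  # hypothetical marker path
#       raise SystemExit("cloud-init did not finish in time")
print(wait_for_file(os.getcwd(), timeout=0))
```

Failing after a bounded wait, instead of waiting forever, lets a broken VM be discarded quickly.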

Next steps:

@dduportal
Contributor Author

Update: we now have a working Windows 2019 template (JDK17 for agent, JDK21 for default java) since jenkins-infra/jenkins-infra#3804 and jenkins-infra/jenkins-infra#3805, with the following elements:

  • VM takes 2 to 6 min to start up before SSH => might be improved by:
    • EC2 Fast Launch, which requires setup on the AMI but does not seem to require anything on the EC2 plugin (to be verified!)
    • Try using the EC2 WinRM connection method instead of Unix SSH (is it faster?)
    • Analyze why SSH takes so much time before accepting the user key (1 to 3 minutes between the first accepted SSH connection and the valid authentication)
  • Test with Windows 2022
  • Set up the "cloud init" and "init script" as code in puppet with:
    • Use the YAML syntax for cloud init (as the <powershell> XML syntax fails when specified to the EC2 plugin => worth an RFE), reusing the Azure VM Windows set of instructions
    • Use the same technique as with EC2 Linux: create a token file at the end of cloud init and check for its presence in the init script, to make sure everything works as expected
  • Set up the PATH configuration with the new syntax (e.g. appending: key: PATH and value: ${PATH};xxxx for Windows)
    • Update Linux init script with the same technique
    • But keep the cloud init java selection in both cases
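
To make the intended result of the `${PATH};xxxx` syntax concrete, here is a small Python sketch that expands such a value against an existing PATH. It only illustrates the expected outcome. It is not how Jenkins or the EC2 plugin performs the substitution, and the function name `expand_path` and the directories are hypothetical.

```python
from string import Template


def expand_path(value, current_path):
    """Replace ${PATH} in `value` with the current PATH."""
    return Template(value).substitute(PATH=current_path)


# Windows uses ";" as the separator, Linux uses ":".
print(expand_path("${PATH};C:\\tools\\jdk-21\\bin", "C:\\Windows\\system32"))
print(expand_path("${PATH}:/opt/jdk-21/bin", "/usr/bin:/bin"))
```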

3 participants