Docker Autoscaler not working in AWS #1172

Open · chovdary123 opened this issue Aug 15, 2024 · 17 comments

@chovdary123 (Author)

Server: GitLab EE: v16.11.8-ee
Client: v16.10.0 (Also tried v16.11.3)

Describe the bug

When using the Docker Autoscaler executor, the Runner Manager is unable to SSH into the Worker; sshd on the worker reports a "key not found" error.

To Reproduce

Steps to reproduce the behavior:

  1. Use the following basic main.tf (the surrounding module block header is added here for completeness; the module name is illustrative) -
module "gitlab_runner" {
  source  = "cattle-ops/gitlab-runner/aws"
  version = "7.12.1"

  environment = "gitlab-runners-fleet"

  vpc_id    = data.aws_vpc.vpc.id
  subnet_id = data.aws_subnets.example_subnets.ids[0]

  iam_permissions_boundary = "POLICY-PERMISSION-BOUNDARY"

  runner_gitlab = {
    url            = "https://example.mycompany.com/"
    runner_version = "16.10.0"
    preregistered_runner_token_ssm_parameter_name = "example-gitlab-runners-fleet-preregistered-token"
  }

  runner_manager = {
    maximum_concurrent_jobs = 10
  }

  runner_instance = {
    name = "gitlab-run"
    root_device_config = {
      volume_size = 100
      volume_type = "gp3"
    }
    ssm_access = true
  }
  runner_worker = {
    ssm_access            = true
    max_jobs              = 10
    request_concurrency   = 10
    type                  = "docker-autoscaler"
    environment_variables = ["AWS_REGION=us-west-2", "AWS_SDK_LOAD_CONFIG=true", "DOCKER_AUTH_CONFIG={\"auths\":{\"https://index.docker.io/v1/\":{\"auth\":\"${var.docker_auth_token}\"}}, \"credHelpers\": {\"${data.aws_caller_identity.current.account_id}.dkr.ecr.us-west-2.amazonaws.com\": \"ecr-login\"}}"]
  }

  runner_worker_docker_autoscaler_instance = {
    root_size   = 100
    volume_type = "gp3"
    monitoring  = true
  }

  runner_worker_docker_autoscaler_role = {
    policy_arns = ["arn:aws:iam::1111111111:policy/somepolicy", "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess"]
  }
  runner_worker_docker_autoscaler_asg = {
    on_demand_percentage_above_base_capacity = 0
    spot_allocation_strategy                 = "on-demand-price"
    enable_mixed_instance_policy             = true
    idle_time                                = 600
    subnet_ids                               = data.aws_subnets.example_subnets.ids
    types                                    = ["c5.large", "c5.xlarge", "c5.2xlarge", "c5.4xlarge"]
    volume_type                              = "gp3"
    private_address_only                     = true
    ebs_optimized                            = true
    root_size                                = 100
    sg_ingresses = [
      {
        description = "Allow all traffic within VPC and across local (TEST PURPOSE)"
        from_port   = 0
        to_port     = 65535
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/8"]
      }
    ]
  }

  runner_worker_docker_options = {
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]
  }

  runner_worker_docker_autoscaler = {
    connector_config_user = "ubuntu"
  }

  runner_ami_owners = ["2222222222"]

  runner_ami_filter = {
    "tag:Name" = ["example-amazon-linux-ami"]
  }

  runner_worker_docker_autoscaler_ami_owners = ["1111111111"]
  runner_worker_docker_autoscaler_ami_filter = {
    "tag:Name" = ["example-ubuntu-ami"]
  }

  runner_worker_docker_autoscaler_autoscaling_options = [
    {
        periods = ["* * * * *"]
        timezone = "UTC"
        idle_count = 1
        idle_time = "600s"
        scale_factor = 2
    }
  ]
  debug = {
    trace_runner_user_data = true
    write_runner_config_to_file = true
    write_runner_user_data_to_file = true

  }

}
  2. Check a pipeline that uses this runner and you will see the following error -
ERROR: Failed to remove network for build 
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 30 times: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
  3. When you log in to both the Runner Manager and the Worker and look into the logs, you will find a "key not found" error while the manager tries to connect to the worker. (Note that the worker's authorized_keys file does contain the runner-worker public key; the key pair on the runner manager, from which the SSH connection is made, appears to be missing. A few checks are sketched after these logs.)
sshd[3781]: debug1: trying public key file /home/ubuntu/.ssh/authorized_keys
sshd[3781]: debug1: fd 9 clearing O_NONBLOCK
sshd[3781]: debug2: key not found
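
For reference, a few checks that narrow down where the key is missing (a sketch based on this thread, not on the module documentation; paths assume the default "ubuntu" connector user on the worker and the default layout on the manager):

# On the worker: confirm the worker public key was installed for the connector user
cat /home/ubuntu/.ssh/authorized_keys

# On the runner manager: check whether any private key exists for the manager to offer
# (either path may be missing, which is the point of the check)
sudo ls -l /root/.ssh/ /etc/gitlab-runner/private_key
sudo grep -F -A 4 '[runners.autoscaler.connector_config]' /etc/gitlab-runner/config.toml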

Expected behavior

The Runner Manager should connect to the Worker without errors and run the job.

@chovdary123 (Author)

The issue seems to be that tls_private_key.autoscaler[0].private_key_pem is never written to /etc/gitlab-runner/private_key, and config.toml is missing key_path; the connector section should look like this -

    [runners.autoscaler.connector_config]
      username          = "ubuntu"
      use_external_addr = false
      key_path          = "/etc/gitlab-runner/private_key"

When I manually extracted tls_private_key.autoscaler[0].private_key_pem, saved it under /etc/gitlab-runner/private_key, updated key_path in config.toml, and restarted the runner, the docker-autoscaler based runner worked successfully! (A sketch of these steps follows.)
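
A rough sketch of that manual workaround, assuming the key resource appears in the Terraform state as tls_private_key.autoscaler (the state filter and the gitlab-runner service name are assumptions based on this thread, not module documentation):

# Extract the generated private key from the Terraform state (tls_private_key stores it there in clear text)
terraform state pull \
  | jq -r '.resources[] | select(.type == "tls_private_key" and .name == "autoscaler") | .instances[0].attributes.private_key_pem' \
  > private_key

# Copy the key to the runner manager (e.g. via SSM), then on the manager:
sudo install -m 0600 private_key /etc/gitlab-runner/private_key
# Add key_path = "/etc/gitlab-runner/private_key" under [runners.autoscaler.connector_config]
# in /etc/gitlab-runner/config.toml, then restart:
sudo systemctl restart gitlab-runner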

@kayman-mk (Collaborator)

I have a working configuration here. Let me check tomorrow.

I also have some problems with the fleeting plugin, but no idea where to post them.

@kayman-mk (Collaborator)

  runner_worker = {
    ssm_access            = true
    request_concurrency   = 1
    type                  = "docker-autoscaler"
  }

  runner_worker_docker_autoscaler = {
    fleeting_plugin_version = "1.0.0"
    max_use_count           = 50
  }

  runner_worker_docker_autoscaler_ami_owners = ["my-account-id"]
  runner_worker_docker_autoscaler_ami_filter = {
    name = ["gitlab_runner_fleeting*"]
  }

  runner_worker_docker_autoscaler_instance = {
    root_size        = 72
    root_device_name = "/dev/xvda"
  }

  runner_worker_docker_autoscaler_asg = {
    subnet_ids                               = var.subnet_ids
    types                                    = var.runner_settings.worker_instance_types
    enable_mixed_instances_policy            = true
    on_demand_base_capacity                  = 1
    on_demand_percentage_above_base_capacity = 0
    max_growth_rate                          = 10
  }

  runner_worker_docker_autoscaler_autoscaling_options = var.runner_settings.worker_autoscaling != null ? var.runner_settings.worker_autoscaling : [
    {
      periods      = ["* * * * *"]
      timezone     = "Europe/Berlin"
      idle_count   = 0
      idle_time    = "30m"
      scale_factor = 2
    },
    {
      periods      = ["* 7-19 * * mon-fri"]
      timezone     = "Europe/Berlin"
      idle_count   = 3
      idle_time    = "30m"
      scale_factor = 2
    }
  ]

@chovdary123 (Author)

@kayman-mk tls_private_key.autoscaler[0].private_key_pem is never passed to the runner manager instance that initiates the SSH call to the runner worker, and for that reason it is not working. Your config and mine are almost identical.

@kayman-mk (Collaborator) commented Aug 16, 2024

It's working like a charm here. That's strange.

EDIT: Maybe it's the AMI used for the workers?

@chovdary123 (Author)

@kayman-mk Are you baking SSH keys into your AMIs? If so, that would explain it.

@chovdary123 (Author)

I'm using an Amazon Linux 2 AMI for the runner manager and an Ubuntu 22.04 AMI for the runner worker, as recommended.

@chovdary123 (Author)

I think the issue is somewhere here -

public_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].public_key_openssh : ""
use_fleet = var.runner_worker_docker_machine_fleet.enable
private_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].private_key_pem : ""
use_new_runner_authentication_gitlab_16 = var.runner_gitlab_registration_config.type != ""
Also, the docker-autoscaler template needs to be updated to set key_path under [runners.autoscaler.connector_config].

@chovdary123 (Author) commented Aug 16, 2024

The same logic is needed for docker-autoscaler; because it is missing, the generated key pair never ends up in the /root/.ssh/ directory on the runner manager, hence this issue.

@chovdary123 (Author) commented Aug 16, 2024

This issue would never occur if the runner were provisioned for both docker+machine and docker-autoscaler; the bug only shows up when the docker-autoscaler executor alone is provisioned. But combining them is not possible, as the runner_worker type doesn't support the two together.

@pokidovea commented Aug 19, 2024

I've faced the same issue, but using an AMI with Docker preinstalled for the worker nodes helped. It seems that the fleeting plugin on the agent node connects directly to Docker on the worker nodes.

@chovdary123 (Author)

@pokidovea The worker AMI has Docker preinstalled in my case.

@vpotap commented Oct 16, 2024

@chovdary123 I'm encountering the same error with a similar configuration. It looks like the key_path was not added under [runners.autoscaler.connector_config]. This might work with some custom AMI images, but the general example configuration does not seem to be functioning as expected.

@vpotap commented Oct 16, 2024

@chovdary123, @kayman-mk, I can confirm that the public/private key variables are not set when the fleet is disabled. Here are the issues regarding the key settings:

  1. Key Variables:

    • The public_key is defined as:
      public_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].public_key_openssh : ""
    • The use_fleet variable is set to:
      use_fleet = var.runner_worker_docker_machine_fleet.enable
    • The private_key is defined as:
      private_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].private_key_pem : ""
  2. Default Behavior:

    • Since var.runner_worker_docker_machine_fleet.enable is false by default, the public_key, private_key, and use_fleet variables are not set.
  3. Shell Template Condition:

    • In the shell template, we have the following conditional:
      if [[ "${use_fleet}" == "true" ]]; then
        echo "${public_key}" > /root/.ssh/id_rsa.pub
        echo "${private_key}" > /root/.ssh/id_rsa
        chmod 600 /root/.ssh/id_rsa
      fi
    • This block is never executed, because use_fleet is not set to true when the fleet is disabled (see the sketch after this list for one way the guard could be extended).
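
A minimal sketch of how that guard could also cover the docker-autoscaler case (the use_docker_autoscaler template variable is illustrative and does not exist in the module; it only shows the idea):

# Illustrative only: "use_docker_autoscaler" is an assumed template variable, not one the module defines
if [[ "${use_fleet}" == "true" || "${use_docker_autoscaler}" == "true" ]]; then
  mkdir -p /root/.ssh
  echo "${public_key}" > /root/.ssh/id_rsa.pub
  echo "${private_key}" > /root/.ssh/id_rsa
  chmod 600 /root/.ssh/id_rsa
fi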

@pysiekytel commented Oct 23, 2024

I had exactly the same problem, and for me yum install ec2-instance-connect -y fixed it - it seems that, for whatever reason, the package was not available in the al2023-ami-ecs-hvm-2023.0.20241003-kernel-6.1-x86_64 AMI (even though Amazon states it should be preinstalled).

To easily patch your code, just add this to your module inputs:

runner_worker_docker_autoscaler_instance = {
  # whatever else you have in this input
  start_script = file("worker_start_script.sh")
}

and inside worker_start_script.sh:

#!/bin/bash

yum install ec2-instance-connect -y

ec2-instance-connect is used by the AWS fleeting plugin; however, it's not clearly stated in the docs that this package is required.
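
A quick way to check on a worker instance whether the package is present (assuming an Amazon Linux 2023 based AMI, where yum is an alias for dnf):

rpm -q ec2-instance-connect || sudo dnf install -y ec2-instance-connect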

@vpotap commented Oct 25, 2024

@pysiekytel,
I just tested the solution by adding the following

#!/bin/bash
yum install ec2-instance-connect -y

to the al2023-ami-ecs-hvm-*-kernel-6.1-x86_64 AMI (as a worker start script)

Everything works as expected now! The workaround with private/public keys is no longer needed.

Thanks for the suggestion!

@kayman-mk @chovdary123
