Docker Autoscaler not working in AWS #1172

Open · chovdary123 opened this issue Aug 15, 2024 · 17 comments

@chovdary123 (Author)

Server: GitLab EE: v16.11.8-ee
Client: v16.10.0 (Also tried v16.11.3)

Describe the bug

When using the Docker Autoscaler executor, the Runner Manager is unable to SSH into the Worker; sshd on the worker reports a "key not found" error.

To Reproduce

Steps to reproduce the behavior:

  1. Use the following basic main.tf (the surrounding module block header is added here for completeness; the module name is illustrative) -
module "gitlab_runner" {
  source  = "cattle-ops/gitlab-runner/aws"
  version = "7.12.1"

  environment = "gitlab-runners-fleet"

  vpc_id    = data.aws_vpc.vpc.id
  subnet_id = data.aws_subnets.example_subnets.ids[0]

  iam_permissions_boundary = "POLICY-PERMISSION-BOUNDARY"

  runner_gitlab = {
    url            = "https://example.mycompany.com/"
    runner_version = "16.10.0"
    preregistered_runner_token_ssm_parameter_name = "example-gitlab-runners-fleet-preregistered-token"
  }

  runner_manager = {
    maximum_concurrent_jobs = 10
  }

  runner_instance = {
    name = "gitlab-run"
    root_device_config = {
      volume_size = 100
      volume_type = "gp3"
    }
    ssm_access = true
  }
  runner_worker = {
    ssm_access            = true
    max_jobs              = 10
    request_concurrency   = 10
    type                  = "docker-autoscaler"
    environment_variables = ["AWS_REGION=us-west-2", "AWS_SDK_LOAD_CONFIG=true", "DOCKER_AUTH_CONFIG={\"auths\":{\"https://index.docker.io/v1/\":{\"auth\":\"${var.docker_auth_token}\"}}, \"credHelpers\": {\"${data.aws_caller_identity.current.account_id}.dkr.ecr.us-west-2.amazonaws.com\": \"ecr-login\"}}"]
  }

  runner_worker_docker_autoscaler_instance = {
    root_size   = 100
    volume_type = "gp3"
    monitoring  = true
  }

  runner_worker_docker_autoscaler_role = {
    policy_arns = ["arn:aws:iam::1111111111:policy/somepolicy", "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess"]
  }
  runner_worker_docker_autoscaler_asg = {
    on_demand_percentage_above_base_capacity = 0
    spot_allocation_strategy                 = "on-demand-price"
    enable_mixed_instance_policy             = true
    idle_time                                = 600
    subnet_ids                               = data.aws_subnets.example_subnets.ids
    types                                    = ["c5.large", "c5.xlarge", "c5.2xlarge", "c5.4xlarge"]
    volume_type                              = "gp3"
    private_address_only                     = true
    ebs_optimized                            = true
    root_size                                = 100
    sg_ingresses = [
      {
        description = "Allow all traffic within VPC and across local (TEST PURPOSE)"
        from_port   = 0
        to_port     = 65535
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/8"]
      }
    ]
  }

  runner_worker_docker_options = {
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]
  }

  runner_worker_docker_autoscaler = {
    connector_config_user = "ubuntu"
  }

  runner_ami_owners = ["2222222222"]

  runner_ami_filter = {
    "tag:Name" = ["example-amazon-linux-ami"]
  }

  runner_worker_docker_autoscaler_ami_owners = ["1111111111"]
  runner_worker_docker_autoscaler_ami_filter = {
    "tag:Name" = ["example-ubuntu-ami"]
  }

  runner_worker_docker_autoscaler_autoscaling_options = [
    {
        periods = ["* * * * *"]
        timezone = "UTC"
        idle_count = 1
        idle_time = "600s"
        scale_factor = 2
    }
  ]
  debug = {
    trace_runner_user_data = true
    write_runner_config_to_file = true
    write_runner_user_data_to_file = true

  }

}
  2. Check a pipeline that uses this runner and you will see the following error -
ERROR: Failed to remove network for build 
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 30 times: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
  3. When you log in to both the Runner Manager and the Worker and look into the logs, you will find a "key not found" error while the manager tries to connect to the worker. (Note that the worker's authorized_keys file does contain the runner-worker public key; the key pair on the runner manager, from which the SSH connection is made, appears to be missing. A few checks are sketched after these logs.)
sshd[3781]: debug1: trying public key file /home/ubuntu/.ssh/authorized_keys
sshd[3781]: debug1: fd 9 clearing O_NONBLOCK
sshd[3781]: debug2: key not found
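
For reference, a few checks that narrow down where the key is missing (a sketch based on this thread, not on the module documentation; paths assume the default "ubuntu" connector user on the worker and the default layout on the manager):

# On the worker: confirm the worker public key was installed for the connector user
cat /home/ubuntu/.ssh/authorized_keys

# On the runner manager: check whether any private key exists for the manager to offer
# (either path may be missing, which is the point of the check)
sudo ls -l /root/.ssh/ /etc/gitlab-runner/private_key
sudo grep -F -A 4 '[runners.autoscaler.connector_config]' /etc/gitlab-runner/config.toml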

Expected behavior

The Runner Manager should connect to the Worker without errors and run the job.

@chovdary123 (Author)

The issue seems to be that tls_private_key.autoscaler[0].private_key_pem is never written to /etc/gitlab-runner/private_key, and config.toml is missing key_path; the connector section should look like this -

    [runners.autoscaler.connector_config]
      username          = "ubuntu"
      use_external_addr = false
      key_path          = "/etc/gitlab-runner/private_key"

When I manually extracted tls_private_key.autoscaler[0].private_key_pem, saved it under /etc/gitlab-runner/private_key, updated key_path in config.toml, and restarted the runner, the docker-autoscaler based runner worked successfully! (A sketch of these steps follows.)
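
A rough sketch of that manual workaround, assuming the key resource appears in the Terraform state as tls_private_key.autoscaler (the state filter and the gitlab-runner service name are assumptions based on this thread, not module documentation):

# Extract the generated private key from the Terraform state (tls_private_key stores it there in clear text)
terraform state pull \
  | jq -r '.resources[] | select(.type == "tls_private_key" and .name == "autoscaler") | .instances[0].attributes.private_key_pem' \
  > private_key

# Copy the key to the runner manager (e.g. via SSM), then on the manager:
sudo install -m 0600 private_key /etc/gitlab-runner/private_key
# Add key_path = "/etc/gitlab-runner/private_key" under [runners.autoscaler.connector_config]
# in /etc/gitlab-runner/config.toml, then restart:
sudo systemctl restart gitlab-runner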

@kayman-mk (Collaborator)

I have a working configuration here. Let me check tomorrow.

I also have some problems with the fleeting plugin, but no idea where to post them.

@kayman-mk (Collaborator)

  runner_worker = {
    ssm_access            = true
    request_concurrency   = 1
    type                  = "docker-autoscaler"
  }

  runner_worker_docker_autoscaler = {
    fleeting_plugin_version = "1.0.0"
    max_use_count           = 50
  }

  runner_worker_docker_autoscaler_ami_owners = ["my-account-id"]
  runner_worker_docker_autoscaler_ami_filter = {
    name = ["gitlab_runner_fleeting*"]
  }

  runner_worker_docker_autoscaler_instance = {
    root_size        = 72
    root_device_name = "/dev/xvda"
  }

  runner_worker_docker_autoscaler_asg = {
    subnet_ids                               = var.subnet_ids
    types                                    = var.runner_settings.worker_instance_types
    enable_mixed_instances_policy            = true
    on_demand_base_capacity                  = 1
    on_demand_percentage_above_base_capacity = 0
    max_growth_rate                          = 10
  }

  runner_worker_docker_autoscaler_autoscaling_options = var.runner_settings.worker_autoscaling != null ? var.runner_settings.worker_autoscaling : [
    {
      periods      = ["* * * * *"]
      timezone     = "Europe/Berlin"
      idle_count   = 0
      idle_time    = "30m"
      scale_factor = 2
    },
    {
      periods      = ["* 7-19 * * mon-fri"]
      timezone     = "Europe/Berlin"
      idle_count   = 3
      idle_time    = "30m"
      scale_factor = 2
    }
  ]

@chovdary123 (Author)

@kayman-mk tls_private_key.autoscaler[0].private_key_pem is never passed to the runner manager instance that initiates the SSH call to the runner worker, and for that reason it is not working. Your config and mine are almost identical.

@kayman-mk (Collaborator) commented Aug 16, 2024

It's working like a charm here. That's strange.

EDIT: Maybe it's the AMI used for the workers?

@chovdary123 (Author)

@kayman-mk Are you baking SSH keys into your AMIs? If so, that would explain it.

@chovdary123 (Author)

I'm using an Amazon Linux 2 AMI for the runner manager and an Ubuntu 22.04 AMI for the runner worker, as recommended.

@chovdary123 (Author)

I think the issue is somewhere here -

public_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].public_key_openssh : ""
use_fleet = var.runner_worker_docker_machine_fleet.enable
private_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].private_key_pem : ""
use_new_runner_authentication_gitlab_16 = var.runner_gitlab_registration_config.type != ""
Also, the docker-autoscaler template needs to be updated to set key_path under [runners.autoscaler.connector_config].

@chovdary123 (Author) commented Aug 16, 2024

The same logic is needed for docker-autoscaler; because it is missing, the generated key pair never ends up in the /root/.ssh/ directory on the runner manager, hence this issue.

@chovdary123 (Author) commented Aug 16, 2024

This issue would never occur if the runner were provisioned for both docker+machine and docker-autoscaler; the bug only shows up when the docker-autoscaler executor alone is provisioned. But combining them is not possible, as the runner_worker type doesn't support the two together.

@pokidovea commented Aug 19, 2024

I've faced the same issue, but using an AMI with Docker preinstalled for the worker nodes helped. It seems that the fleeting plugin on the agent node connects directly to Docker on the worker nodes.

@chovdary123 (Author)

@pokidovea The worker AMI has Docker preinstalled in my case.

@vpotap commented Oct 16, 2024

@chovdary123 I'm encountering the same error with a similar configuration. It looks like the key_path was not added under [runners.autoscaler.connector_config]. This might work with some custom AMI images, but the general example configuration does not seem to be functioning as expected.

@vpotap commented Oct 16, 2024

@chovdary123, @kayman-mk, I can confirm that the public/private key variables are not set when the fleet is disabled. Here are the issues regarding the key settings:

  1. Key Variables:

    • The public_key is defined as:
      public_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].public_key_openssh : ""
    • The use_fleet variable is set to:
      use_fleet = var.runner_worker_docker_machine_fleet.enable
    • The private_key is defined as:
      private_key = var.runner_worker_docker_machine_fleet.enable == true ? tls_private_key.fleet[0].private_key_pem : ""
  2. Default Behavior:

    • Since var.runner_worker_docker_machine_fleet.enable is false by default, the public_key, private_key, and use_fleet variables are not set.
  3. Shell Template Condition:

    • In the shell template, we have the following conditional:
      if [[ "${use_fleet}" == "true" ]]; then
        echo "${public_key}" > /root/.ssh/id_rsa.pub
        echo "${private_key}" > /root/.ssh/id_rsa
        chmod 600 /root/.ssh/id_rsa
      fi
    • This block is never executed, because use_fleet is not set to true when the fleet is disabled (see the sketch after this list for one way the guard could be extended).
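
A minimal sketch of how that guard could also cover the docker-autoscaler case (the use_docker_autoscaler template variable is illustrative and does not exist in the module; it only shows the idea):

# Illustrative only: "use_docker_autoscaler" is an assumed template variable, not one the module defines
if [[ "${use_fleet}" == "true" || "${use_docker_autoscaler}" == "true" ]]; then
  mkdir -p /root/.ssh
  echo "${public_key}" > /root/.ssh/id_rsa.pub
  echo "${private_key}" > /root/.ssh/id_rsa
  chmod 600 /root/.ssh/id_rsa
fi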

@pysiekytel commented Oct 23, 2024

I had exactly the same problem, and for me yum install ec2-instance-connect -y fixed it - it seems that, for whatever reason, the package was not available in the al2023-ami-ecs-hvm-2023.0.20241003-kernel-6.1-x86_64 AMI (even though Amazon states it should be preinstalled).

To easily patch your code, just add this to your module inputs:

runner_worker_docker_autoscaler_instance = {
  # whatever else you have in this input
  start_script = file("worker_start_script.sh")
}

and inside worker_start_script.sh:

#!/bin/bash

yum install ec2-instance-connect -y

ec2-instance-connect is used by the AWS fleeting plugin; however, it's not clearly stated in the docs that this package is required.
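
A quick way to check on a worker instance whether the package is present (assuming an Amazon Linux 2023 based AMI, where yum is an alias for dnf):

rpm -q ec2-instance-connect || sudo dnf install -y ec2-instance-connect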

@vpotap commented Oct 25, 2024

@pysiekytel,
I just tested the solution by adding the following

#!/bin/bash
yum install ec2-instance-connect -y

to the al2023-ami-ecs-hvm-*-kernel-6.1-x86_64 AMI (as a worker start script)

Everything works as expected now! The workaround with private/public keys is no longer needed.

Thanks for the suggestion!

@kayman-mk @chovdary123
