terraform-aws-dr-infra

Terraform module to create AWS Cloud infrastructure resources required to run DataRobot.

Usage

module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  create_network                  = true
  network_address_space           = "10.7.0.0/16"
  create_dns_zones                = false
  existing_public_route53_zone_id = "Z06110132R7HO9BLI64XY"
  create_acm_certificate          = false
  existing_acm_certificate_arn    = "arn:aws:acm:us-east-1:000000000000:certificate/00000000-0000-0000-0000-000000000000"
  create_encryption_key           = true
  create_storage                  = true
  create_container_registry       = true
  create_kubernetes_cluster       = true
  create_app_identity             = true

  cluster_autoscaler           = true
  descheduler                  = true
  ebs_csi_driver               = true
  aws_load_balancer_controller = true
  ingress_nginx                = true
  internet_facing_ingress_lb   = true
  cert_manager                 = true
  external_dns                 = true
  nvidia_device_plugin         = true
  metrics_server               = true

  tags = {
    application   = "datarobot"
    environment   = "dev"
    managed-by    = "terraform"
  }
}

Examples

  • Complete - Demonstrates all input variables
  • Partial - Demonstrates the use of existing resources
  • Minimal - Demonstrates the minimum set of input variables needed to deploy all infrastructure

Using an example directly from source

  1. Clone the repo
     git clone https://github.com/datarobot-oss/terraform-aws-dr-infra.git
  2. Change directories into the example that best suits your needs
     cd terraform-aws-dr-infra/examples/minimal
  3. Modify main.tf as needed with any changes to the input variables passed to the datarobot_infra module
  4. Run terraform commands
     terraform init
     terraform plan
     terraform apply
     terraform destroy

Module Descriptions

Network

Toggle

  • create_network to create a new VPC
  • existing_vpc_id to use an existing VPC

Description

Uses the terraform-aws-vpc module to create a new VPC with one public and one private subnet per Availability Zone, a NAT gateway with an Elastic IP, and an Internet Gateway.

An interface VPC endpoint for the S3 service is created by default. More can be specified by updating the network_private_endpoints input variable.
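As a rough sketch of that override: the type of network_private_endpoints is not stated in this section, so the example assumes it accepts a list of AWS service short names. The ecr.api and ecr.dkr entries are illustrative, not values taken from this module.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  create_network        = true
  network_address_space = "10.7.0.0/16"

  # Assumption: a list of service short names. Only the S3 endpoint is
  # documented as the default; the ECR entries are illustrative additions.
  network_private_endpoints = ["s3", "ecr.api", "ecr.dkr"]
}
```

Check the variable definition in the module's Inputs before relying on this shape.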

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowVPCActions",
            "Effect": "Allow",
            "Action": [
              "ec2:DescribeAvailabilityZones",
              "ec2:CreateVpc",
              "ec2:DescribeVpcs",
              "ec2:DescribeVpcAttribute",
              "ec2:ModifyVpcAttribute",
              "ec2:DeleteVpc",
              "ec2:CreateSubnet",
              "ec2:DescribeSubnets",
              "ec2:DeleteSubnet",
              "ec2:CreateRouteTable",
              "ec2:DescribeRouteTables",
              "ec2:AssociateRouteTable",
              "ec2:DisassociateRouteTable",
              "ec2:DeleteRouteTable",
              "ec2:CreateRoute",
              "ec2:DeleteRoute",
              "ec2:CreateInternetGateway",
              "ec2:DescribeInternetGateways",
              "ec2:AttachInternetGateway",
              "ec2:DetachInternetGateway",
              "ec2:DeleteInternetGateway",
              "ec2:CreateNatGateway",
              "ec2:DescribeNatGateways",
              "ec2:DeleteNatGateway",
              "ec2:AllocateAddress",
              "ec2:DescribeAddresses",
              "ec2:DescribeAddressesAttribute",
              "ec2:DisassociateAddress",
              "ec2:ReleaseAddress",
              "ec2:DescribeSecurityGroups",
              "ec2:DescribeSecurityGroupRules",
              "ec2:RevokeSecurityGroupEgress",
              "ec2:RevokeSecurityGroupIngress",
              "ec2:CreateNetworkAclEntry",
              "ec2:DescribeNetworkAcls",
              "ec2:DeleteNetworkAclEntry",
              "ec2:DescribeNetworkInterfaces",
              "ec2:CreateTags"
            ],
            "Resource": "*"
        }
    ]
}

DNS

Toggle

  • create_dns_zones to create new Route53 zones
  • existing_public_route53_zone_id / existing_private_route53_zone_id to use an existing Route53 zone

Description

Uses the terraform-aws-route53 module to create new public and/or private Route53 hosted zones named domain_name.

A public Route53 zone is used by external_dns to create records for the DataRobot ingress resources when internet_facing_ingress_lb is true. It is also used for DNS validation when creating a new ACM certificate.

A private Route53 zone is used by external_dns to create records for the DataRobot ingress resources when internet_facing_ingress_lb is false.
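The private-zone case can be sketched as follows, using only inputs named in this README. The zone ID and certificate ARN are placeholders. The example also supplies an existing ACM certificate, on the assumption that no public zone is available for DNS validation in this setup.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  # Reuse an existing private zone instead of creating new zones.
  create_dns_zones                 = false
  existing_private_route53_zone_id = "Z0000000000000EXAMPLE" # placeholder

  # Internal NLB, so external_dns writes its records to the private zone.
  internet_facing_ingress_lb = false
  external_dns               = true

  # No public zone here, so bring an existing certificate (placeholder ARN).
  create_acm_certificate       = false
  existing_acm_certificate_arn = "arn:aws:acm:us-east-1:000000000000:certificate/00000000-0000-0000-0000-000000000000"
}
```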

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRoute53Actions",
            "Effect": "Allow",
            "Action": [
                "route53:CreateHostedZone",
                "route53:GetHostedZone",
                "route53:DeleteHostedZone",
                "route53:ListResourceRecordSets",
                "route53:GetChange",
                "route53:GetDNSSEC",
                "route53:ListTagsForResource",
                "route53:ChangeTagsForResource"
            ],
            "Resource": "*"
        }
    ]
}

ACM

Toggle

  • create_acm_certificate to create a new ACM certificate
  • existing_acm_certificate_arn to use an existing ACM certificate

Description

Uses the terraform-aws-acm module to create a new ACM certificate with SANs of domain_name and *.domain_name. Validation is performed against either an existing Route53 hosted zone id specified in the existing_public_route53_zone_id input variable or the public zone created by the dns module.

This certificate will be used on the NLB deployed by the ingress-nginx helm chart.
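A minimal sketch of the first validation path, where the module creates the certificate and validates it against an existing public zone. The zone ID is the placeholder from the Usage section.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  # Validate the new certificate against an existing public hosted zone.
  create_dns_zones                = false
  existing_public_route53_zone_id = "Z06110132R7HO9BLI64XY"

  # Request a certificate covering domain_name and *.domain_name.
  create_acm_certificate = true
}
```

Setting create_dns_zones to true and omitting the zone ID would instead validate against the public zone created by the dns module.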

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowACMActions",
            "Effect": "Allow",
            "Action": [
                "acm:RequestCertificate",
                "acm:DescribeCertificate",
                "acm:DeleteCertificate",
                "acm:AddTagsToCertificate",
                "acm:ListTagsForCertificate",
                "route53:ChangeResourceRecordSets"
            ],
            "Resource": "*"
        }
    ]
}

Encryption Key

Toggle

  • create_encryption_key to create a new KMS key
  • existing_kms_key_arn to use an existing KMS key

Description

Uses the terraform-aws-kms module to create a new KMS encryption key. The current caller identity is set as a key administrator, and the autoscaling service role (autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling) is granted access to the key. The key is used to encrypt EBS volumes in the EKS cluster.
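A minimal sketch of the alternative toggle, reusing an existing key. The ARN is a placeholder. Because the key encrypts EBS volumes on EKS nodes, an existing key presumably needs a key policy that already lets the autoscaling service role use it; that is an assumption, not something this README states.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  # Skip key creation and point the module at an existing KMS key.
  create_encryption_key = false
  existing_kms_key_arn  = "arn:aws:kms:us-east-1:000000000000:key/00000000-0000-0000-0000-000000000000" # placeholder
}
```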

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowKMSActions",
            "Effect": "Allow",
            "Action": [
                "kms:TagResource",
                "kms:CreateKey",
                "kms:CreateAlias",
                "kms:ListAliases",
                "kms:DeleteAlias"
            ],
            "Resource": "*"
        }
    ]
}

Storage

Toggle

  • create_storage to create a new S3 bucket
  • existing_s3_bucket_id to use an existing S3 bucket

Description

Uses the terraform-aws-s3 module to create a new S3 storage bucket.

The DataRobot application will use this storage bucket for persistent file storage.

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowS3Actions",
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:ListBucket",
                "s3:ListBucketVersions",
                "s3:GetBucketPolicy",
                "s3:GetBucketAcl",
                "s3:GetBucketCORS",
                "s3:GetBucketWebsite",
                "s3:GetBucketVersioning",
                "s3:GetBucketLogging",
                "s3:GetBucketRequestPayment",
                "s3:GetBucketTagging",
                "s3:PutBucketTagging",
                "s3:GetBucketPublicAccessBlock",
                "s3:PutBucketPublicAccessBlock",
                "s3:GetBucketObjectLockConfiguration",
                "s3:GetAccelerateConfiguration",
                "s3:GetLifecycleConfiguration",
                "s3:GetReplicationConfiguration",
                "s3:GetEncryptionConfiguration",
                "s3:DeleteObjectVersion",
                "s3:DeleteBucket"
            ],
            "Resource": "*"
        }
    ]
}

Container Registry

Toggle

  • create_container_registry to create a new Amazon Elastic Container Registry

Description

Uses the terraform-aws-ecr module to create new ECR repositories used by the DataRobot application to host custom images created by various services.

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowECRActions",
            "Effect": "Allow",
            "Action": [
                "ecr:CreateRepository",
                "ecr:DescribeRepositories",
                "ecr:DeleteRepository",
                "ecr:TagResource",
                "ecr:ListTagsForResource"
            ],
            "Resource": "*"
        }
    ]
}

Kubernetes

Toggle

  • create_kubernetes_cluster to create a new Amazon Elastic Kubernetes Service Cluster
  • existing_eks_cluster_name to use an existing EKS cluster

Description

Uses the terraform-aws-eks module to create a new EKS cluster to host the DataRobot application and any other helm charts installed by this module.

Included EKS addons:

  • coredns
  • eks-pod-identity-agent
  • kube-proxy
  • vpc-cni

An access entry for the identity of the cluster creator is added as a cluster admin. More access entries can be created via the kubernetes_cluster_access_entries variable.
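The exact type of kubernetes_cluster_access_entries is not shown in this section. The sketch below assumes it is passed through to the access_entries input of the terraform-aws-eks module (~> 20.0) and borrows that module's map shape. The entry key and role ARN are placeholders.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  create_kubernetes_cluster = true

  # Assumption: same shape as access_entries in terraform-aws-eks ~> 20.0.
  kubernetes_cluster_access_entries = {
    platform_admins = {
      principal_arn = "arn:aws:iam::000000000000:role/platform-admins" # placeholder

      policy_associations = {
        admin = {
          policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
          access_scope = {
            type = "cluster"
          }
        }
      }
    }
  }
}
```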

Network access to the cluster's public API endpoint (via the public internet) is enabled by default. This access can be restricted to a specific set of public IP addresses using the kubernetes_cluster_endpoint_public_access_cidrs variable or disabled completely by setting the kubernetes_cluster_endpoint_public_access variable to false.

Network access to the cluster's private API endpoint is only allowed for the Kubernetes nodes by default. If the private API endpoint needs to be accessed from other hosts (such as a provisioner or bastion within the same VPC), the IP address of that host needs to be specified in the kubernetes_cluster_endpoint_private_access_cidrs variable.
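The two endpoint controls described above might be combined like this. The variable names come from this README; the list-of-CIDR-strings types and the addresses themselves are assumptions for illustration.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  create_kubernetes_cluster = true

  # Keep the public endpoint, but only for a known office range (example CIDR).
  kubernetes_cluster_endpoint_public_access       = true
  kubernetes_cluster_endpoint_public_access_cidrs = ["203.0.113.0/24"]

  # Allow a bastion host inside the VPC to reach the private endpoint
  # (example address within the 10.7.0.0/16 space from the Usage section).
  kubernetes_cluster_endpoint_private_access_cidrs = ["10.7.0.10/32"]
}
```

To disable public access completely, set kubernetes_cluster_endpoint_public_access to false and reach the API only through the private endpoint.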

Two node groups are created:

  • A primary node group intended to host the majority of the DataRobot pods
  • A gpu node group intended to host GPU workload pods

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEKSActions",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSecurityGroup",
                "ec2:DeleteSecurityGroup",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:CreateLaunchTemplate",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeLaunchTemplateVersions",
                "ec2:DeleteLaunchTemplate",
                "ec2:RunInstances",
                "ec2:DescribeTags",
                "ec2:DeleteTags",
                "eks:CreateCluster",
                "eks:DescribeCluster",
                "eks:DeleteCluster",
                "eks:CreateAccessEntry",
                "eks:DescribeAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:CreateNodegroup",
                "eks:DescribeNodegroup",
                "eks:DeleteNodegroup",
                "eks:AssociateAccessPolicy",
                "eks:ListAssociatedAccessPolicies",
                "eks:DisassociateAccessPolicy",
                "eks:CreateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "eks:DeleteAddon",
                "eks:TagResource",
                "iam:CreateRole",
                "iam:GetRole",
                "iam:GetRolePolicy",
                "iam:TagRole",
                "iam:PassRole",
                "iam:DeleteRole",
                "iam:CreatePolicy",
                "iam:GetPolicy",
                "iam:TagPolicy",
                "iam:GetPolicyVersion",
                "iam:ListPolicyVersions",
                "iam:DeletePolicy",
                "iam:AttachRolePolicy",
                "iam:ListRolePolicies",
                "iam:ListAttachedRolePolicies",
                "iam:PutRolePolicy",
                "iam:DetachRolePolicy",
                "iam:DeleteRolePolicy",
                "iam:ListInstanceProfilesForRole",
                "iam:CreateOpenIDConnectProvider",
                "iam:GetOpenIDConnectProvider",
                "iam:TagOpenIDConnectProvider",
                "iam:DeleteOpenIDConnectProvider",
                "logs:CreateLogGroup",
                "logs:DescribeLogGroups",
                "logs:DeleteLogGroup",
                "logs:PutRetentionPolicy",
                "logs:TagResource",
                "logs:ListTagsForResource"
            ],
            "Resource": "*"
        }
    ]
}

Helm Chart - aws-load-balancer-controller

Toggle

  • aws_load_balancer_controller to install the aws-load-balancer-controller helm chart

Description

Uses the terraform-aws-eks-pod-identity module to create a pod identity for the aws-load-balancer-controller service account in the aws-load-balancer-controller namespace with an IAM policy that allows the management of AWS load balancers.

Uses the terraform-helm-release module to install the aws-load-balancer-controller helm chart from the https://aws.github.io/eks-charts helm repo into the aws-load-balancer-controller namespace.

This helm chart provisions Network Load Balancers for Kubernetes Service resources. In the default use case, the AWS Load Balancer Controller creates an NLB that directs traffic to the ingress-nginx Kubernetes services.

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPodIdentityActions",
            "Effect": "Allow",
            "Action": [
                "eks:CreatePodIdentityAssociation",
                "eks:DescribePodIdentityAssociation",
                "eks:DeletePodIdentityAssociation"
            ],
            "Resource": "*"
        }
    ]
}

Helm Chart - cluster-autoscaler

Toggle

  • cluster_autoscaler to install the cluster-autoscaler helm chart

Description

Uses the terraform-aws-eks-pod-identity module to create a pod identity for the cluster-autoscaler-aws-cluster-autoscaler service account in the cluster-autoscaler namespace with an IAM policy that allows the creation and management of EC2 instances.

Uses the terraform-helm-release module to install the cluster-autoscaler helm chart from the https://kubernetes.github.io/autoscaler helm repo into the cluster-autoscaler namespace.

This helm chart allows for automatic horizontal scaling of EKS cluster nodes.

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPodIdentityActions",
            "Effect": "Allow",
            "Action": [
                "eks:CreatePodIdentityAssociation",
                "eks:DescribePodIdentityAssociation",
                "eks:DeletePodIdentityAssociation"
            ],
            "Resource": "*"
        }
    ]
}

Helm Chart - descheduler

Toggle

  • descheduler to install the descheduler helm chart

Description

Uses the terraform-helm-release module to install the descheduler helm chart from the https://kubernetes-sigs.github.io/descheduler/ helm repo into the descheduler namespace.

This helm chart allows for automatic rescheduling of pods for optimizing resource consumption.

IAM Policy

Not required

Helm Chart - aws-ebs-csi-driver

Toggle

  • ebs_csi_driver to install the aws-ebs-csi-driver helm chart

Description

Uses the terraform-aws-eks-pod-identity module to create a pod identity for the ebs-csi-controller-sa service account in the aws-ebs-csi-driver namespace with an IAM policy that allows the creation and management of EBS volumes.

Uses the terraform-helm-release module to install the aws-ebs-csi-driver helm chart from the https://kubernetes-sigs.github.io/aws-ebs-csi-driver/ repo into the aws-ebs-csi-driver namespace.

This helm chart creates default Delete and Retain storage classes called ebs-standard and ebs-standard-retain, respectively, of type gp3 using the encryption key passed in from the existing_kms_key_arn variable or the KMS key created in the encryption_key module. These storage classes are used by the DataRobot application Persistent Volume Claims.

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPodIdentityActions",
            "Effect": "Allow",
            "Action": [
                "eks:CreatePodIdentityAssociation",
                "eks:DescribePodIdentityAssociation",
                "eks:DeletePodIdentityAssociation"
            ],
            "Resource": "*"
        }
    ]
}

Helm Chart - ingress-nginx

Toggle

  • ingress_nginx to install the ingress-nginx helm chart

Description

Uses the terraform-helm-release module to install the ingress-nginx helm chart from the https://kubernetes.github.io/ingress-nginx repo into the ingress-nginx namespace.

The ingress-nginx helm chart will trigger the deployment of an AWS Network Load Balancer to act as ingress for the DataRobot application. When internet_facing_ingress_lb is true, the NLB will be of type internet-facing. When internet_facing_ingress_lb is false, the NLB will be of type internal.

By default, this NLB terminates TLS using either the certificate specified with the existing_acm_certificate_arn variable or the certificate created in the ACM module if create_acm_certificate is true. To avoid ACM entirely, set create_acm_certificate to false and override the controller.service.targetPorts.https setting as demonstrated in the complete example.
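One way to read the last sentence, sketched under two assumptions: that an ingress_nginx_values input exists and follows the same <chart>_values pattern as the other chart inputs, and that the values file path shown here is hypothetical. The complete example remains the authoritative reference.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  # Do not create or attach an ACM certificate.
  create_acm_certificate = false

  ingress_nginx = true

  # Assumption: the input name follows the <chart>_values pattern.
  # The referenced file (hypothetical path) would hold chart values that
  # override controller.service.targetPorts.https.
  ingress_nginx_values = "${path.module}/templates/ingress_nginx.yaml"
}
```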

IAM Policy

Not required

Helm Chart - cert-manager

Toggle

  • cert_manager to install the cert-manager helm chart

Description

Uses the terraform-aws-eks-pod-identity module to create a pod identity for the cert-manager service account in the cert-manager namespace with an IAM policy that allows the creation of DNS resources within the specified DNS zone.

Uses the terraform-helm-release module to install the cert-manager helm chart from the https://charts.jetstack.io repo into the cert-manager namespace.

cert-manager can be used by the DataRobot application to create and manage various certificates. When an ACM certificate is used on the ingress load balancer, cert-manager is typically used only to generate self-signed certificates for service-to-service communication.

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPodIdentityActions",
            "Effect": "Allow",
            "Action": [
                "eks:CreatePodIdentityAssociation",
                "eks:DescribePodIdentityAssociation",
                "eks:DeletePodIdentityAssociation"
            ],
            "Resource": "*"
        }
    ]
}

Helm Chart - external-dns

Toggle

  • external_dns to install the external-dns helm chart

Description

Uses the terraform-aws-eks-pod-identity module to create a pod identity for the external-dns service account in the external-dns namespace with an IAM policy that allows the creation of DNS resources within the specified DNS zone.

Uses the terraform-helm-release module to install the external-dns helm chart from the https://charts.bitnami.com/bitnami repo into the external-dns namespace.

external-dns is used to automatically create DNS records for ingress resources in the Kubernetes cluster. When the DataRobot application is installed and the ingress resources are created, external-dns will automatically create a DNS record pointing at the ingress resource.

IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPodIdentityActions",
            "Effect": "Allow",
            "Action": [
                "eks:CreatePodIdentityAssociation",
                "eks:DescribePodIdentityAssociation",
                "eks:DeletePodIdentityAssociation"
            ],
            "Resource": "*"
        }
    ]
}

Helm Chart - nvidia-device-plugin

Toggle

  • nvidia_device_plugin to install the nvidia-device-plugin helm chart

Description

Uses the terraform-helm-release module to install the nvidia-device-plugin helm chart from the https://nvidia.github.io/k8s-device-plugin repo into the nvidia-device-plugin namespace.

This helm chart is used to expose GPU resources on nodes intended for GPU workloads such as the default gpu node group.

IAM Policy

Not required

Helm Chart - metrics-server

Toggle

  • metrics_server to install the metrics-server helm chart

Description

Uses the terraform-helm-release module to install the metrics-server helm chart from the https://kubernetes-sigs.github.io/metrics-server repo into the metrics-server namespace.

This helm chart is used to expose CPU and memory metrics to the Kubernetes cluster.

IAM Policy

Not required

Comprehensive IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowVPCActions",
            "Effect": "Allow",
            "Action": [
              "ec2:DescribeAvailabilityZones",
              "ec2:CreateVpc",
              "ec2:DescribeVpcs",
              "ec2:DescribeVpcAttribute",
              "ec2:ModifyVpcAttribute",
              "ec2:DeleteVpc",
              "ec2:CreateSubnet",
              "ec2:DescribeSubnets",
              "ec2:DeleteSubnet",
              "ec2:CreateRouteTable",
              "ec2:DescribeRouteTables",
              "ec2:AssociateRouteTable",
              "ec2:DisassociateRouteTable",
              "ec2:DeleteRouteTable",
              "ec2:CreateRoute",
              "ec2:DeleteRoute",
              "ec2:CreateInternetGateway",
              "ec2:DescribeInternetGateways",
              "ec2:AttachInternetGateway",
              "ec2:DetachInternetGateway",
              "ec2:DeleteInternetGateway",
              "ec2:CreateNatGateway",
              "ec2:DescribeNatGateways",
              "ec2:DeleteNatGateway",
              "ec2:AllocateAddress",
              "ec2:DescribeAddresses",
              "ec2:DescribeAddressesAttribute",
              "ec2:DisassociateAddress",
              "ec2:ReleaseAddress",
              "ec2:DescribeSecurityGroups",
              "ec2:DescribeSecurityGroupRules",
              "ec2:RevokeSecurityGroupEgress",
              "ec2:RevokeSecurityGroupIngress",
              "ec2:CreateNetworkAclEntry",
              "ec2:DescribeNetworkAcls",
              "ec2:DeleteNetworkAclEntry",
              "ec2:DescribeNetworkInterfaces",
              "ec2:CreateTags"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowRoute53Actions",
            "Effect": "Allow",
            "Action": [
                "route53:CreateHostedZone",
                "route53:GetHostedZone",
                "route53:DeleteHostedZone",
                "route53:ListResourceRecordSets",
                "route53:GetChange",
                "route53:GetDNSSEC",
                "route53:ListTagsForResource",
                "route53:ChangeTagsForResource"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowACMActions",
            "Effect": "Allow",
            "Action": [
                "acm:RequestCertificate",
                "acm:DescribeCertificate",
                "acm:DeleteCertificate",
                "acm:AddTagsToCertificate",
                "acm:ListTagsForCertificate",
                "route53:ChangeResourceRecordSets"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowKMSActions",
            "Effect": "Allow",
            "Action": [
                "kms:TagResource",
                "kms:CreateKey",
                "kms:CreateAlias",
                "kms:ListAliases",
                "kms:DeleteAlias"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowS3Actions",
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:ListBucket",
                "s3:ListBucketVersions",
                "s3:GetBucketPolicy",
                "s3:GetBucketAcl",
                "s3:GetBucketCORS",
                "s3:GetBucketWebsite",
                "s3:GetBucketVersioning",
                "s3:GetBucketLogging",
                "s3:GetBucketRequestPayment",
                "s3:GetBucketTagging",
                "s3:PutBucketTagging",
                "s3:GetBucketPublicAccessBlock",
                "s3:PutBucketPublicAccessBlock",
                "s3:GetBucketObjectLockConfiguration",
                "s3:GetAccelerateConfiguration",
                "s3:GetLifecycleConfiguration",
                "s3:GetReplicationConfiguration",
                "s3:GetEncryptionConfiguration",
                "s3:DeleteObjectVersion",
                "s3:DeleteBucket"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowECRActions",
            "Effect": "Allow",
            "Action": [
                "ecr:CreateRepository",
                "ecr:DescribeRepositories",
                "ecr:DeleteRepository",
                "ecr:TagResource",
                "ecr:ListTagsForResource"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowEKSActions",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSecurityGroup",
                "ec2:DeleteSecurityGroup",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:CreateLaunchTemplate",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeLaunchTemplateVersions",
                "ec2:DeleteLaunchTemplate",
                "ec2:RunInstances",
                "ec2:DescribeTags",
                "ec2:DeleteTags",
                "eks:CreateCluster",
                "eks:DescribeCluster",
                "eks:DeleteCluster",
                "eks:CreateAccessEntry",
                "eks:DescribeAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:CreateNodegroup",
                "eks:DescribeNodegroup",
                "eks:DeleteNodegroup",
                "eks:AssociateAccessPolicy",
                "eks:ListAssociatedAccessPolicies",
                "eks:DisassociateAccessPolicy",
                "eks:CreateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "eks:DeleteAddon",
                "eks:TagResource",
                "iam:CreateRole",
                "iam:GetRole",
                "iam:GetRolePolicy",
                "iam:TagRole",
                "iam:PassRole",
                "iam:DeleteRole",
                "iam:CreatePolicy",
                "iam:GetPolicy",
                "iam:TagPolicy",
                "iam:GetPolicyVersion",
                "iam:ListPolicyVersions",
                "iam:DeletePolicy",
                "iam:AttachRolePolicy",
                "iam:ListRolePolicies",
                "iam:ListAttachedRolePolicies",
                "iam:PutRolePolicy",
                "iam:DetachRolePolicy",
                "iam:DeleteRolePolicy",
                "iam:ListInstanceProfilesForRole",
                "iam:CreateOpenIDConnectProvider",
                "iam:GetOpenIDConnectProvider",
                "iam:TagOpenIDConnectProvider",
                "iam:DeleteOpenIDConnectProvider",
                "logs:CreateLogGroup",
                "logs:DescribeLogGroups",
                "logs:DeleteLogGroup",
                "logs:PutRetentionPolicy",
                "logs:TagResource",
                "logs:ListTagsForResource"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowPodIdentityActions",
            "Effect": "Allow",
            "Action": [
                "eks:CreatePodIdentityAssociation",
                "eks:DescribePodIdentityAssociation",
                "eks:DeletePodIdentityAssociation"
            ],
            "Resource": "*"
        }
    ]
}

DataRobot versions

| Release | Supported DR Versions |
|---------|-----------------------|
| ~> 1.0 | ~> 10.1 |

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.3.2 |
| aws | >= 5.61 |
| helm | >= 2.15 |

Providers

| Name | Version |
|------|---------|
| aws | >= 5.61 |

Modules

| Name | Source | Version |
|------|--------|---------|
| acm | terraform-aws-modules/acm/aws | ~> 4.0 |
| app_identity | terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc | ~> 5.0 |
| aws_load_balancer_controller | ./modules/aws-load-balancer-controller | n/a |
| aws_vpc_cni_ipv4_pod_identity | terraform-aws-modules/eks-pod-identity/aws | ~> 1.0 |
| cert_manager | ./modules/cert-manager | n/a |
| cluster_autoscaler | ./modules/cluster-autoscaler | n/a |
| container_registry | terraform-aws-modules/ecr/aws | ~> 2.0 |
| descheduler | ./modules/descheduler | n/a |
| dns | terraform-aws-modules/route53/aws//modules/zones | ~> 3.0 |
| ebs_csi_driver | ./modules/ebs-csi-driver | n/a |
| encryption_key | terraform-aws-modules/kms/aws | ~> 3.0 |
| endpoints | terraform-aws-modules/vpc/aws//modules/vpc-endpoints | ~> 5.0 |
| external_dns | ./modules/external-dns | n/a |
| ingress_nginx | ./modules/ingress-nginx | n/a |
| kubernetes | terraform-aws-modules/eks/aws | ~> 20.0 |
| metrics_server | ./modules/metrics-server | n/a |
| network | terraform-aws-modules/vpc/aws | ~> 5.0 |
| nvidia_device_plugin | ./modules/nvidia-device-plugin | n/a |
| storage | terraform-aws-modules/s3-bucket/aws | ~> 4.0 |

Resources

| Name | Type |
|------|------|
| aws_autoscaling_group_tag.gpu | resource |
| aws_autoscaling_group_tag.primary | resource |
| aws_availability_zones.available | data source |
| aws_caller_identity.current | data source |
| aws_eks_cluster.existing | data source |
| aws_eks_cluster_auth.this | data source |
| aws_route53_zone.private | data source |
| aws_route53_zone.public | data source |
Inputs

Name Description Type Default Required
aws_load_balancer_controller Install the aws-load-balancer-controller helm chart to use AWS Network Load Balancers as ingress to the EKS cluster. All other aws_load_balancer_controller variables are ignored if this variable is false. bool true no
aws_load_balancer_controller_values Path to templatefile containing custom values for the aws-load-balancer-controller helm chart string "" no
aws_load_balancer_controller_variables Variables passed to the aws_load_balancer_controller_values templatefile any {} no
cert_manager Install the cert-manager helm chart. All other cert_manager variables are ignored if this variable is false. bool true no
cert_manager_values Path to templatefile containing custom values for the cert-manager helm chart string "" no
cert_manager_variables Variables passed to the cert_manager_values templatefile any {} no
cluster_autoscaler Install the cluster-autoscaler helm chart to enable horizontal autoscaling of the EKS cluster nodes. All other cluster_autoscaler variables are ignored if this variable is false bool true no
cluster_autoscaler_values Path to templatefile containing custom values for the cluster-autoscaler helm chart string "" no
cluster_autoscaler_variables Variables passed to the cluster_autoscaler_values templatefile any {} no
create_acm_certificate Create a new ACM certificate for the ingress load balancer to use. Ignored if existing_acm_certificate_arn is specified. bool true no
create_app_identity Create an IAM role for the DataRobot application service accounts bool true no
create_container_registry Create DataRobot image builder container repositories in Amazon Elastic Container Registry bool true no
create_dns_zones Create DNS zones for domain_name. Ignored if existing_public_route53_zone_id and existing_private_route53_zone_id are specified. bool true no
create_encryption_key Create a new KMS key used for EBS volume encryption on EKS nodes. Ignored if existing_kms_key_arn is specified. bool true no
create_kubernetes_cluster Create a new Amazon Elastic Kubernetes Service (EKS) cluster. All kubernetes and helm chart variables are ignored if this variable is false. bool true no
create_network Create a new Virtual Private Cloud. Ignored if existing_vpc_id is specified. bool true no
create_storage Create a new S3 storage bucket to use for DataRobot application file storage. Ignored if existing_s3_bucket_id is specified. bool true no
datarobot_namespace Kubernetes namespace in which the DataRobot application will be installed string "dr-app" no
descheduler Install the descheduler helm chart to enable rescheduling of pods. All other descheduler variables are ignored if this variable is false. bool true no
descheduler_values Path to templatefile containing custom values for the descheduler helm chart string "" no
descheduler_variables Variables passed to the descheduler_values templatefile any {} no
dns_zones_force_destroy Force destroy the public and private Route53 zones. Ignored if an existing Route53 zone ID (existing_public_route53_zone_id or existing_private_route53_zone_id) is specified or create_dns_zones is false. bool false no
domain_name Name of the domain to use for the DataRobot application. If create_dns_zones is true then zones will be created for this domain. It is also used by ACM for DNS validation and as a domain filter by the external-dns helm chart. string "" no
ebs_csi_driver Install the aws-ebs-csi-driver helm chart to enable use of EBS for Kubernetes persistent volumes. All other ebs_csi_driver variables are ignored if this variable is false. bool true no
ebs_csi_driver_values Path to templatefile containing custom values for the aws-ebs-csi-driver helm chart string "" no
ebs_csi_driver_variables Variables passed to the ebs_csi_driver_values templatefile any {} no
ecr_repositories Repositories to create set(string) ["base-image", "ephemeral-image", "managed-image", "custom-apps-managed-image"] no
ecr_repositories_force_destroy Force destroy the ECR repositories. Ignored if create_container_registry is false. bool false no
existing_acm_certificate_arn ARN of existing ACM certificate to use with the ingress load balancer created by the ingress_nginx module. When specified, create_acm_certificate will be ignored. string "" no
existing_eks_cluster_name Name of existing EKS cluster to use. When specified, all other kubernetes variables will be ignored. string null no
existing_kms_key_arn ARN of existing KMS key used for EBS volume encryption on EKS nodes. When specified, create_encryption_key will be ignored. string "" no
existing_kubernetes_nodes_subnet_id List of existing subnet IDs to be used for the EKS cluster. Required when existing_vpc_id is specified. Ignored if create_network is true and no existing_vpc_id is specified. Subnets must adhere to the VPC requirements and considerations described at https://docs.aws.amazon.com/eks/latest/userguide/network_reqs.html. list(string) [] no
existing_private_route53_zone_id ID of existing private Route53 hosted zone to use for private DNS records created by external-dns. This is required when create_dns_zones is false and ingress_nginx is true with internet_facing_ingress_lb false. string "" no
existing_public_route53_zone_id ID of existing public Route53 hosted zone to use for public DNS records created by external-dns and ACM certificate validation. This is required when create_dns_zones is false and ingress_nginx and internet_facing_ingress_lb are true or when create_acm_certificate is true. string "" no
existing_s3_bucket_id ID of existing S3 storage bucket to use for DataRobot application file storage. When specified, all other storage variables will be ignored. string "" no
existing_vpc_id ID of an existing VPC to use. When specified, other network variables are ignored. string "" no
external_dns Install the external-dns helm chart to create DNS records for ingress resources matching the domain_name variable. All other external_dns variables are ignored if this variable is false. bool true no
external_dns_values Path to templatefile containing custom values for the external-dns helm chart string "" no
external_dns_variables Variables passed to the external_dns_values templatefile any {} no
ingress_nginx Install the ingress-nginx helm chart to use as the ingress controller for the EKS cluster. All other ingress_nginx variables are ignored if this variable is false. bool true no
ingress_nginx_values Path to templatefile containing custom values for the ingress-nginx helm chart. string "" no
ingress_nginx_variables Variables passed to the ingress_nginx_values templatefile any {} no
internet_facing_ingress_lb Determines the type of NLB created for EKS ingress. If true, an internet-facing NLB will be created. If false, an internal NLB will be created. Ignored when ingress_nginx is false. bool true no
kubernetes_cluster_access_entries Map of access entries to add to the cluster any {} no
kubernetes_cluster_endpoint_private_access_cidrs List of additional CIDR blocks allowed to access the Amazon EKS private API server endpoint. By default, only the Kubernetes nodes are allowed. Any other hosts that need to access the EKS private API endpoint, such as a provisioner, must be added here. list(string) [] no
kubernetes_cluster_endpoint_public_access Indicates whether or not the Amazon EKS public API server endpoint is enabled bool true no
kubernetes_cluster_endpoint_public_access_cidrs List of CIDR blocks which can access the Amazon EKS public API server endpoint list(string) ["0.0.0.0/0"] no
kubernetes_cluster_version EKS cluster version string null no
kubernetes_gpu_nodegroup_ami_type Type of Amazon Machine Image (AMI) associated with the EKS GPU Node Group. See the AWS documentation for valid values string "AL2_x86_64_GPU" no
kubernetes_gpu_nodegroup_desired_size Desired number of nodes in the GPU node group number 0 no
kubernetes_gpu_nodegroup_instance_types Instance types used for the GPU node group list(string) ["g4dn.2xlarge"] no
kubernetes_gpu_nodegroup_labels Key-value map of Kubernetes labels to be applied to the nodes in the GPU node group. Only labels that are applied with the EKS API are managed by this argument. Other Kubernetes labels applied to the EKS Node Group will not be managed map(string) {"datarobot.com/node-capability": "gpu"} no
kubernetes_gpu_nodegroup_max_size Maximum number of nodes in the GPU node group number 10 no
kubernetes_gpu_nodegroup_min_size Minimum number of nodes in the GPU node group number 0 no
kubernetes_gpu_nodegroup_name Name of the GPU node group string "gpu" no
kubernetes_gpu_nodegroup_taints The Kubernetes taints to be applied to the nodes in the GPU node group. Maximum of 50 taints per node group any {"nvidia_gpu": {"effect": "NO_SCHEDULE", "key": "nvidia.com/gpu", "value": "true"}} no
kubernetes_primary_nodegroup_ami_type Type of Amazon Machine Image (AMI) associated with the EKS Primary Node Group. See the AWS documentation for valid values string "AL2023_x86_64_STANDARD" no
kubernetes_primary_nodegroup_desired_size Desired number of nodes in the primary node group number 1 no
kubernetes_primary_nodegroup_instance_types Instance types used for the primary node group list(string) ["r6a.4xlarge", "r6i.4xlarge", "r5.4xlarge", "r4.4xlarge"] no
kubernetes_primary_nodegroup_labels Key-value map of Kubernetes labels to be applied to the nodes in the primary node group. Only labels that are applied with the EKS API are managed by this argument. Other Kubernetes labels applied to the EKS Node Group will not be managed. map(string) {"datarobot.com/node-capability": "cpu"} no
kubernetes_primary_nodegroup_max_size Maximum number of nodes in the primary node group number 10 no
kubernetes_primary_nodegroup_min_size Minimum number of nodes in the primary node group number 0 no
kubernetes_primary_nodegroup_name Name of the primary EKS node group string "primary" no
kubernetes_primary_nodegroup_taints The Kubernetes taints to be applied to the nodes in the primary node group. Maximum of 50 taints per node group any {} no
metrics_server Install the metrics-server helm chart to expose resource metrics for Kubernetes built-in autoscaling pipelines. All other metrics_server variables are ignored if this variable is false. bool true no
metrics_server_values Path to templatefile containing custom values for the metrics-server helm chart string "" no
metrics_server_variables Variables passed to the metrics_server_values templatefile any {} no
name Name to use as a prefix for created resources string n/a yes
network_address_space CIDR block to be used for the new VPC string "10.0.0.0/16" no
network_private_endpoints List of AWS services to create interface VPC endpoints for list(string) ["s3"] no
nvidia_device_plugin Install the nvidia-device-plugin helm chart to expose node GPU resources to the EKS cluster. All other nvidia_device_plugin variables are ignored if this variable is false. bool true no
nvidia_device_plugin_values Path to templatefile containing custom values for the nvidia-device-plugin helm chart string "" no
nvidia_device_plugin_variables Variables passed to the nvidia_device_plugin_values templatefile any {} no
s3_bucket_force_destroy Force destroy the S3 storage bucket bool false no
tags A map of tags to add to all created resources map(string) {"managed-by": "terraform"} no
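
As an illustration of how the existing_* inputs and the *_values / *_variables pairs fit together, the following is a minimal sketch rather than a recommended configuration. It assumes an existing VPC and subnets, and a custom values template for the ingress-nginx chart. The VPC and subnet IDs, the template path, and the replica_count template variable are hypothetical placeholders and are not defined by this module.

```hcl
module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/aws"

  name        = "datarobot"
  domain_name = "yourdomain.com"

  # Reuse an existing VPC instead of creating a new one.
  # The IDs below are placeholders; substitute your own.
  create_network  = false
  existing_vpc_id = "vpc-0123456789abcdef0"
  existing_kubernetes_nodes_subnet_id = [
    "subnet-0123456789abcdef0",
    "subnet-0fedcba9876543210",
  ]

  # Pass custom helm values to the ingress-nginx chart.
  # The template file and its replica_count variable are hypothetical.
  ingress_nginx        = true
  ingress_nginx_values = "${path.module}/templates/ingress_nginx.yaml.tftpl"
  ingress_nginx_variables = {
    replica_count = 2
  }
}
```

Inputs that are not set here keep the defaults listed in the table above, so this configuration would still create DNS zones, an ACM certificate, storage, and the remaining helm charts.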

Outputs

Name Description
acm_certificate_arn ARN of the ACM certificate
app_role_arn ARN of the IAM role to be assumed by the DataRobot app service accounts
ebs_encryption_key_id ARN of the EBS KMS key
ecr_repository_urls URLs of the image builder repositories
kubernetes_cluster_certificate_authority_data Base64 encoded certificate data required to communicate with the cluster
kubernetes_cluster_endpoint Endpoint for your Kubernetes API server
kubernetes_cluster_name Name of the EKS cluster
private_route53_zone_arn Zone ARN of the private Route53 zone
private_route53_zone_id Zone ID of the private Route53 zone
public_route53_zone_arn Zone ARN of the public Route53 zone
public_route53_zone_id Zone ID of the public Route53 zone
s3_bucket_id Name of the S3 bucket
vpc_id The ID of the VPC
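
The outputs above can be consumed from the calling configuration. The sketch below assumes the module is instantiated as module.datarobot_infra, as in the Usage section, and that the hashicorp/aws and hashicorp/kubernetes providers are available. It shows one possible wiring, not the only supported one; the name of the re-exported output is arbitrary.

```hcl
# Short-lived token for authenticating to the cluster created by the module.
data "aws_eks_cluster_auth" "datarobot" {
  name = module.datarobot_infra.kubernetes_cluster_name
}

# Configure the kubernetes provider from the module outputs.
provider "kubernetes" {
  host                   = module.datarobot_infra.kubernetes_cluster_endpoint
  cluster_ca_certificate = base64decode(module.datarobot_infra.kubernetes_cluster_certificate_authority_data)
  token                  = data.aws_eks_cluster_auth.datarobot.token
}

# Re-export the IAM role ARN for use when installing the DataRobot application.
output "datarobot_app_role_arn" {
  value = module.datarobot_infra.app_role_arn
}
```

Configuring a provider from outputs of a module created in the same apply can cause ordering problems on the first run. Splitting the infrastructure and the application install into separate configurations is a common way to avoid this.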

Development and Contributing

If you'd like to report an issue or bug, suggest improvements, or contribute code to this project, please refer to CONTRIBUTING.md.

Code of Conduct

This project has adopted the Contributor Covenant for its Code of Conduct. See CODE_OF_CONDUCT.md to read it in full.

License

Licensed under the Apache License 2.0. See LICENSE to read it in full.