terraform-google-dr-infra

Terraform module to create Google Cloud infrastructure resources required to run DataRobot.

Usage

module "datarobot_infra" {
  source = "datarobot-oss/dr-infra/google"

  name              = "datarobot"
  google_project_id = "your-google-project-id"
  region            = "us-west1"
  domain_name       = "yourdomain.com"

  create_network                     = true
  network_address_space              = "10.7.0.0/16"
  create_dns_zones                   = false
  existing_public_dns_zone_name      = "existing-public-dns-zone-name"
  create_storage                     = true
  create_container_registry          = false
  existing_artifact_registry_repo_id = "projects/your-google-project-id/locations/us-west1/repositories/existing-repository-name"
  create_kubernetes_cluster          = true
  create_app_identity                = true

  ingress_nginx                           = true
  internet_facing_ingress_lb              = true
  cert_manager                            = true
  cert_manager_letsencrypt_clusterissuers = true
  cert_manager_letsencrypt_email_address  = "[email protected]"
  external_dns                            = true
  nvidia_device_plugin                    = true
  descheduler                             = true

  tags = {
    application   = "datarobot"
    environment   = "dev"
    managed-by    = "terraform"
  }
}

Examples

  • Complete - Demonstrates all input variables
  • Partial - Demonstrates the use of existing resources
  • Minimal - Demonstrates the minimum set of input variables needed to deploy all infrastructure

Using an example directly from source

  1. Clone the repo
git clone https://github.com/datarobot-oss/terraform-google-dr-infra.git
  2. Change directories into the example that best suits your needs
cd terraform-google-dr-infra/examples/internal
  3. Modify main.tf as needed
  4. Run terraform commands
terraform init
terraform plan
terraform apply
terraform destroy

Module Descriptions

Network

Toggle

  • create_network to create a new Google VPC
  • existing_vpc_name, existing_kubernetes_nodes_subnet_name, existing_kubernetes_pods_range_name, and existing_kubernetes_services_range_name to use an existing VPC and subnet

Description

Create a new Google VPC with one subnet using a /20 slice of network_address_space and a NAT gateway attached.

kubernetes_pod_cidr and kubernetes_service_cidr are secondary ranges within the subnet which will be used for the Kubernetes pod and service IPs, respectively.

Only the primary and kubernetes_pod_cidr IP ranges are attached to the Cloud NAT gateway.
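
As an illustration, a minimal sketch of the network inputs when using an existing VPC instead of creating one (the resource names below are placeholders):

  create_network                          = false
  existing_vpc_name                       = "my-existing-vpc"
  existing_kubernetes_nodes_subnet_name   = "my-nodes-subnet"
  existing_kubernetes_pods_range_name     = "my-pods-secondary-range"
  existing_kubernetes_services_range_name = "my-services-secondary-range"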

Permissions

TBD

DNS

Toggle

  • create_dns_zones to create new Google Cloud DNS zones
  • existing_public_dns_zone_name / existing_private_dns_zone_name to use existing Google Cloud DNS zones

Description

Create new public and/or private DNS zones with name domain_name.

A public Cloud DNS zone is used by external_dns to create records for the DataRobot ingress resources when internet_facing_ingress_lb is true. It is also used for DNS validation when using cert_manager and cert_manager_letsencrypt_clusterissuers.

A private Cloud DNS zone is used by external_dns to create records for the DataRobot ingress resources when internet_facing_ingress_lb is false.
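
For example, a sketch of the DNS inputs for creating new zones versus reusing existing ones (the zone names are placeholders):

  # create new public and private zones for domain_name
  create_dns_zones = true
  domain_name      = "yourdomain.com"

  # or reuse existing Cloud DNS zones
  create_dns_zones               = false
  existing_public_dns_zone_name  = "existing-public-dns-zone-name"
  existing_private_dns_zone_name = "existing-private-dns-zone-name"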

Permissions

TBD

Storage

Toggle

  • create_storage to create a new Google Cloud Storage Bucket
  • existing_gcs_bucket_name to use an existing Google Cloud Storage Bucket

Description

Create a new GCS bucket using the name variable as a prefix and datarobot as the bucket name.

The DataRobot application will use this bucket for persistent file storage.
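
For example, a sketch of the storage inputs when reusing an existing bucket (the bucket name is a placeholder):

  create_storage           = false
  existing_gcs_bucket_name = "my-existing-bucket"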

Permissions

TBD

Container Registry

Toggle

  • create_container_registry to create a new Google Artifact Registry Repository
  • existing_artifact_registry_repo_id to use an existing Google Artifact Registry Repository

Description

Create a new GAR repository with name name.

The DataRobot application will use this registry to host custom images created by various services.

Permissions

TBD

Kubernetes

Toggle

  • create_kubernetes_cluster to create a new Google Kubernetes Engine Cluster
  • existing_gke_cluster_name to use an existing GKE cluster

Description

Create a new GKE cluster to host the DataRobot application and any other helm charts installed by this module.

By default, the Kubernetes cluster API endpoint is accessible both via a private endpoint created within the same VPC and publicly over the internet. GKE nodes always communicate with the control plane using the private IP address. Public endpoint access can be restricted using the kubernetes_cluster_endpoint_access_list variable or disabled completely by setting kubernetes_cluster_endpoint_public_access to false.

When kubernetes_cluster_endpoint_public_access is false, Kubernetes management operations such as kubectl and helm commands (including the Helm chart installs performed by this Terraform module) must be run from a host which can access the Kubernetes cluster API private endpoint. By default, any host within the GKE nodes subnet has access but this can be extended using the kubernetes_cluster_endpoint_access_list variable. This can be helpful when running this Terraform module from a host that resides within the same VPC as the GKE cluster but in a different subnet than the GKE nodes.

Two node groups are created:

  • A primary node group intended to host the majority of the DataRobot pods
  • A gpu node group intended to host GPU workload pods containing the label datarobot.com/node-capability: gpu and taint nvidia.com/gpu:NoSchedule

By default, slices of network_address_space will be used for the cluster nodes and control plane private endpoint IPs. It is best to use a separate address space for kubernetes_pod_cidr and kubernetes_service_cidr as these are secondary (aliased) ranges.
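
For example, a sketch of the endpoint-access inputs for a fully private control plane that is also reachable from a separate management subnet (the CIDR is a placeholder):

  kubernetes_cluster_endpoint_public_access = false
  kubernetes_cluster_endpoint_access_list   = ["10.7.250.0/24"]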

Permissions

TBD

App Identity

Toggle

  • create_app_identity to create a new Google Service account to represent the DataRobot application

Description

Create a new Google Service Account with roles/storage.admin access to the Google Cloud Storage bucket and roles/artifactregistry.writer access to the Google Artifact Registry repository.

Workload Identity bindings are created for each of the datarobot_service_accounts within the datarobot_namespace and attached to this service account. This allows pods running with those Kubernetes service accounts to access file storage and the artifact registry.
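
For example, a sketch of overriding the namespace and service accounts that receive Workload Identity bindings (the values shown are illustrative; defaults are listed under Inputs):

  create_app_identity        = true
  datarobot_namespace        = "dr-app"
  datarobot_service_accounts = ["dr", "dynamic-worker"]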

Permissions

TBD

Helm Chart - ingress-nginx

Toggle

  • ingress_nginx to install the ingress-nginx helm chart

Description

Uses the terraform-helm-release module to install the ingress-nginx helm chart from the https://kubernetes.github.io/ingress-nginx helm repo into the ingress-nginx namespace.

The ingress-nginx helm chart will trigger the deployment of a Google Network Load Balancer directing traffic to the ingress-nginx-controller Kubernetes service.

Values passed to the helm chart can be overridden by passing a custom values file via the ingress_nginx_values variable as demonstrated in the complete example.
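
For example, a sketch of passing a custom values file (the file path and template variables are hypothetical):

  ingress_nginx        = true
  ingress_nginx_values = "${path.module}/templates/custom_ingress_nginx_values.yaml"
  ingress_nginx_variables = {
    # keys are whatever your values templatefile expects (hypothetical)
    lb_source_ranges = ["10.0.0.0/8"]
  }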

Permissions

Not required

Helm Chart - cert-manager

Toggle

  • cert_manager to install the cert-manager helm chart

Description

Uses the terraform-helm-release module to install the cert-manager helm chart from the https://charts.jetstack.io helm repo into the cert-manager namespace.

A Google Service Account is created for the cert-manager Kubernetes service account running in the cert-manager namespace that allows the creation of DNS resources within the specified DNS zone.

cert-manager can be used by the DataRobot application to create and manage various certificates, including the certificate used by the application itself.

When cert_manager_letsencrypt_clusterissuers is enabled, letsencrypt-staging and letsencrypt-prod ClusterIssuers will be created which can be used by the datarobot-google umbrella chart to issue certificates used by the DataRobot application. The default values in that helm chart (as of version 10.2) enable global.ingress.tls.enabled and global.ingress.tls.certmanager and set global.ingress.tls.issuer to letsencrypt-prod, which uses the letsencrypt-prod ClusterIssuer to issue a public ACME certificate as the TLS certificate for the Kubernetes ingress resources.

Values passed to the helm chart can be overridden by passing a custom values file via the cert_manager_values variable as demonstrated in the complete example.
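
For example, a sketch of the inputs needed for the Let's Encrypt ClusterIssuers (the email address is a placeholder):

  cert_manager                            = true
  cert_manager_letsencrypt_clusterissuers = true
  cert_manager_letsencrypt_email_address  = "you@yourdomain.com"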

Permissions

TBD

Helm Chart - external-dns

Toggle

  • external_dns to install the external-dns helm chart

Description

Uses the terraform-helm-release module to install the external-dns helm chart from the https://charts.bitnami.com/bitnami helm repo into the external-dns namespace.

A Google Service Account is created for the external-dns Kubernetes service account running in the external-dns namespace that allows the creation of DNS resources within the specified DNS zone.

external-dns is used to automatically create DNS records for ingress resources in the Kubernetes cluster. When the DataRobot application is installed and the ingress resources are created, external-dns will automatically create a DNS record pointing at the ingress resource.

Values passed to the helm chart can be overridden by passing a custom values file via the external_dns_values variable as demonstrated in the complete example.
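
For example, a sketch of running external-dns against a private zone for an internal-only deployment (the zone name is a placeholder):

  external_dns                   = true
  internet_facing_ingress_lb     = false
  create_dns_zones               = false
  existing_private_dns_zone_name = "existing-private-dns-zone-name"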

Permissions

TBD

Helm Chart - nvidia-device-plugin

Toggle

  • nvidia_device_plugin to install the nvidia-device-plugin helm chart

Description

Uses the terraform-helm-release module to install the nvidia-device-plugin helm chart from the https://nvidia.github.io/k8s-device-plugin helm repo into the nvidia-device-plugin namespace.

Values passed to the helm chart can be overridden by passing a custom values file via the nvidia_device_plugin_values variable as demonstrated in the complete example.

Permissions

Not required

Helm Chart - descheduler

Toggle

  • descheduler to install the descheduler helm chart

Description

Uses the terraform-helm-release module to install the descheduler helm chart from the https://kubernetes-sigs.github.io/descheduler/ helm repo into the kube-system namespace.

This helm chart allows for automatic rescheduling of pods for optimizing resource consumption.

Permissions

Not required

Comprehensive Required Permissions

TBD

DataRobot versions

Release Supported DR Versions
>= 1.0 >= 10.0

Requirements

Name Version
terraform >= 1.3.5
google >= 6.6.0
helm >= 2.15.0

Providers

Name Version
google >= 6.6.0

Modules

Name Source Version
app_identity terraform-google-modules/service-accounts/google ~> 4.0
cert_manager ./modules/cert-manager n/a
cloud_router terraform-google-modules/cloud-router/google ~> 6.1
descheduler ./modules/descheduler n/a
external_dns ./modules/external-dns n/a
ingress_nginx ./modules/ingress-nginx n/a
kubernetes terraform-google-modules/kubernetes-engine/google//modules/private-cluster ~> 33.0
network terraform-google-modules/network/google ~> 9.0
nvidia_device_plugin ./modules/nvidia-device-plugin n/a
private_dns terraform-google-modules/cloud-dns/google ~> 5.0
public_dns terraform-google-modules/cloud-dns/google ~> 5.0
storage terraform-google-modules/cloud-storage/google ~> 8.0

Resources

Name Type
google_artifact_registry_repository.this resource
google_artifact_registry_repository_iam_member.datarobot resource
google_service_account_iam_member.datarobot resource
google_storage_bucket_iam_member.datarobot resource
google_client_config.default data source
google_compute_network.existing data source
google_container_cluster.existing data source

Inputs

Name Description Type Default Required
cert_manager Install the cert-manager helm chart. All other cert_manager variables are ignored if this variable is false. bool true no
cert_manager_letsencrypt_clusterissuers Whether to create letsencrypt-prod and letsencrypt-staging ClusterIssuers bool true no
cert_manager_letsencrypt_email_address Email address for the certificate owner. Let's Encrypt will use this to contact you about expiring certificates, and issues related to your account. Only required if cert_manager_letsencrypt_clusterissuers is true. string "[email protected]" no
cert_manager_namespace Namespace to install the helm chart into string "cert-manager" no
cert_manager_values Path to templatefile containing custom values for the cert-manager helm chart string "" no
cert_manager_variables Variables passed to the cert_manager_values templatefile any {} no
create_app_identity Create a new user assigned identity for the DataRobot application bool true no
create_container_registry Create a new Google Artifact Registry repository. Ignored if an existing_artifact_registry_repo_id is specified. bool true no
create_dns_zones Create DNS zones for domain_name. Ignored if existing_public_dns_zone_name and existing_private_dns_zone_name are specified. bool true no
create_kubernetes_cluster Create a new Google Kubernetes Engine cluster. All kubernetes and helm chart variables are ignored if this variable is false. bool true no
create_network Create a new Google VPC. Ignored if an existing_vpc_name is specified. bool true no
create_storage Create a new Google Storage Bucket to use for DataRobot file storage. Ignored if an existing_gcs_bucket_name is specified. bool true no
datarobot_namespace Kubernetes namespace in which the DataRobot application will be installed string "dr-app" no
datarobot_service_accounts Names of the Kubernetes service accounts used by the DataRobot application set(string)
[
"dr",
"build-service",
"build-service-image-builder",
"buzok-account",
"dr-lrs-operator",
"dynamic-worker",
"internal-api-sa",
"nbx-notebook-revisions-account",
"prediction-server-sa",
"tileservergl-sa"
]
no
descheduler Install the descheduler helm chart to enable rescheduling of pods. All other descheduler variables are ignored if this variable is false bool true no
descheduler_namespace Namespace to install the helm chart into string "kube-system" no
descheduler_values Path to templatefile containing custom values for the descheduler helm chart string "" no
descheduler_variables Variables passed to the descheduler templatefile any {} no
dns_zones_force_destroy Force destroy for the public and private Cloud DNS zones when terminating bool false no
domain_name Name of the domain to use for the DataRobot application. If create_dns_zones is true then zones will be created for this domain. It is also used by the cert-manager helm chart for DNS validation and as a domain filter by the external-dns helm chart. string "" no
existing_artifact_registry_repo_id ID of existing artifact registry repository to use string null no
existing_gcs_bucket_name Name of an existing Google Storage Bucket to use for DataRobot file storage. When specified, all other storage variables will be ignored. string null no
existing_gke_cluster_name Name of existing GKE cluster to use. When specified, all other kubernetes variables will be ignored. string null no
existing_kubernetes_nodes_subnet_name Name of an existing subnet to use for the GKE node pools and control plane private endpoint. Required when an existing_vpc_name is specified. Ignored if no existing_vpc_name is specified. string null no
existing_kubernetes_pods_range_name Name of a secondary IP range within the subnet defined by existing_kubernetes_nodes_subnet_name to use for the Kubernetes pods. Required when an existing_vpc_name is specified. Ignored if no existing_vpc_name is specified. string null no
existing_kubernetes_services_range_name Name of a secondary IP range within the subnet defined by existing_kubernetes_nodes_subnet_name to use for the Kubernetes services. Required when an existing_vpc_name is specified. Ignored if no existing_vpc_name is specified. string null no
existing_private_dns_zone_name Name of an existing private Cloud DNS zone to use for private DNS records created by external-dns. This is required when create_dns_zones is false and ingress_nginx is true with internet_facing_ingress_lb false. string null no
existing_public_dns_zone_name Name of an existing public Cloud DNS zone to use for public DNS records created by external-dns and public LetsEncrypt certificate validation by cert-manager. This is required when create_dns_zones is false and ingress_nginx and internet_facing_ingress_lb are true or when cert_manager and cert_manager_letsencrypt_clusterissuers are true. string null no
existing_vpc_name Name of an existing Google VPC to use. When specified, other network variables are ignored. string null no
external_dns Install the external_dns helm chart to create DNS records for ingress resources matching the domain_name variable. All other external_dns variables are ignored if this variable is false. bool true no
external_dns_namespace Namespace to install the helm chart into string "external-dns" no
external_dns_values Path to templatefile containing custom values for the external-dns helm chart string "" no
external_dns_variables Variables passed to the external_dns_values templatefile any {} no
google_project_id The ID of the Google Project where these resources will be created string n/a yes
ingress_nginx Install the ingress-nginx helm chart to use as the ingress controller for the GKE cluster. All other ingress_nginx variables are ignored if this variable is false. bool true no
ingress_nginx_namespace Namespace to install the helm chart into string "ingress-nginx" no
ingress_nginx_values Path to templatefile containing custom values for the ingress-nginx helm chart string "" no
ingress_nginx_variables Variables passed to the ingress_nginx_values templatefile any {} no
internet_facing_ingress_lb Determines the type of Load Balancer created for GKE ingress. If true, an external Load Balancer will be created. If false, an internal Load Balancer will be created. bool true no
kubernetes_cluster_deletion_protection Enable deletion protection on the GKE cluster bool true no
kubernetes_cluster_endpoint_access_list List of CIDRs allowed to access the Kubernetes cluster API endpoint. When kubernetes_cluster_endpoint_public_access is true, these CIDRs specify which public IP addresses are allowed to access the Kubernetes cluster API external endpoint. When kubernetes_cluster_endpoint_public_access is false, these CIDRs specify which private IP addresses are allowed to access the Kubernetes cluster API internal endpoint. By default, only hosts within the kubernetes nodes subnet are allowed to access the Kubernetes cluster API internal endpoint. list(string) [] no
kubernetes_cluster_endpoint_public_access Whether the Kubernetes cluster API endpoint can be accessed via an external IP address bool true no
kubernetes_cluster_grant_registry_access Grants created cluster-specific service account storage.objectViewer and artifactregistry.reader roles bool true no
kubernetes_cluster_version GKE cluster version string "latest" no
kubernetes_gpu_nodegroup_taints The Kubernetes taints to be applied to the nodes in the GPU node group. any
[
{
"effect": "NO_SCHEDULE",
"key": "nvidia.com/gpu",
"value": "true"
}
]
no
kubernetes_gpu_nodepool_labels A map of Kubernetes labels to apply to the GPU node pool map(string)
{
"datarobot.com/node-capability": "gpu"
}
no
kubernetes_gpu_nodepool_max_count Maximum number of nodes in the GPU node pool number 10 no
kubernetes_gpu_nodepool_min_count Minimum number of nodes in the GPU node pool number 0 no
kubernetes_gpu_nodepool_name Name of the GPU node pool string "gpu" no
kubernetes_gpu_nodepool_node_count Node count of the GPU node pool number 0 no
kubernetes_gpu_nodepool_vm_size VM size used for the GPU node pool string "n1-highmem-4" no
kubernetes_master_ipv4_cidr_block The IP range in CIDR notation to use for the hosted master network including the Kubernetes control plane. If you use this flag, GKE creates a new subnet that uses the values you defined in master-ipv4-cidr and uses the new subnet to provision the internal IP address for the control plane. string null no
kubernetes_pod_cidr The CIDR to use for Kubernetes pod IP addresses. This is used as a secondary IP range within the Kubernetes nodes subnet. string "192.168.0.0/18" no
kubernetes_primary_nodepool_labels A map of Kubernetes labels to apply to the primary node pool map(string)
{
"datarobot.com/node-capability": "cpu"
}
no
kubernetes_primary_nodepool_max_count Maximum number of nodes in the primary node pool number 10 no
kubernetes_primary_nodepool_min_count Minimum number of nodes in the primary node pool number 1 no
kubernetes_primary_nodepool_name Name of the primary node pool string "primary" no
kubernetes_primary_nodepool_node_count Node count of the primary node pool number 1 no
kubernetes_primary_nodepool_taints A list of Kubernetes taints to apply to the primary node pool any [] no
kubernetes_primary_nodepool_vm_size VM size used for the primary node pool string "e2-standard-32" no
kubernetes_service_cidr The CIDR to use for Kubernetes service IP addresses. This is used as a secondary IP range within the Kubernetes nodes subnet. string "192.168.64.0/18" no
name Name to use as a prefix for created resources string n/a yes
network_address_space The CIDR to use for the Kubernetes nodes and control plane. string "10.0.0.0/16" no
nvidia_device_plugin Install the nvidia-device-plugin helm chart to expose node GPU resources to the GKE cluster. All other nvidia_device_plugin variables are ignored if this variable is false. bool true no
nvidia_device_plugin_namespace Namespace to install the helm chart into string "nvidia-device-plugin" no
nvidia_device_plugin_values Path to templatefile containing custom values for the nvidia-device-plugin helm chart string "" no
nvidia_device_plugin_variables Variables passed to the nvidia_device_plugin_values templatefile any {} no
region Google region to create the resources in string n/a yes
release_channel The release channel of this cluster. Accepted values are UNSPECIFIED, RAPID, REGULAR and STABLE. Defaults to STABLE. string "STABLE" no
storage_force_destroy Force destroy for the Google Storage Bucket when terminating bool false no
tags A map of tags to add to all created resources map(string)
{
"managed-by": "terraform"
}
no

Outputs

Name Description
artifact_registry_repo_id ID of the Artifact Registry repository
artifact_registry_repo_path Path to the Artifact Registry repository
datarobot_service_account_email Email of the DataRobot service account
datarobot_service_account_key DataRobot service account key
gke_cluster_name Name of the GKE cluster
private_dns_zone_name Name of the private DNS zone
public_dns_zone_name Name of the public DNS zone
storage_bucket_name Name of the storage bucket
vpc_name Name of the VPC
