Terraform module to create Google Cloud infrastructure resources required to run DataRobot.
module "datarobot_infra" {
source = "datarobot-oss/dr-infra/google"
name = "datarobot"
google_project_id = "your-google-project-id"
region = "us-west1"
domain_name = "yourdomain.com"
create_network = true
network_address_space = "10.7.0.0/16"
create_dns_zones = false
existing_public_dns_zone_name = "existing-public-dns-zone-name"
create_storage = true
create_container_registry = false
existing_artifact_registry_repo_id = "projects/your-google-project-id/locations/us-west1/repositories/existing-repository-name"
create_kubernetes_cluster = true
create_app_identity = true
ingress_nginx = true
internet_facing_ingress_lb = true
cert_manager = true
cert_manager_letsencrypt_clusterissuers = true
cert_manager_letsencrypt_email_address = [email protected]
external_dns = true
nvidia_device_plugin = true
descheduler = true
tags = {
application = "datarobot"
environment = "dev"
managed-by = "terraform"
}
}
Example configurations using this module can be found in the examples directory:

- Complete - Demonstrates all input variables
- Partial - Demonstrates the use of existing resources
- Minimal - Demonstrates the minimum set of input variables needed to deploy all infrastructure
- Clone the repo

  ```bash
  git clone https://github.com/datarobot-oss/terraform-google-dr-infra.git
  ```

- Change directories into the example that best suits your needs

  ```bash
  cd terraform-google-dr-infra/examples/internal
  ```

- Modify `main.tf` as needed

- Run terraform commands

  ```bash
  terraform init
  terraform plan
  terraform apply
  terraform destroy
  ```
Use `create_network` to create a new Google VPC, or use `existing_vpc_name`, `existing_kubernetes_nodes_subnet_name`, `existing_kubernetes_pods_range_name`, and `existing_kubernetes_services_range_name` to use an existing VPC and subnet.
Create a new Google VPC with one subnet using a /20 slice of `network_address_space` and a NAT gateway attached. `kubernetes_pod_cidr` and `kubernetes_service_cidr` are secondary ranges within the subnet which will be used for the Kubernetes pod and service IPs, respectively. Only the primary range and the `kubernetes_pod_cidr` IPs are attached to the Cloud NAT gateway.
TBD
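As a sketch, using an existing VPC instead of creating one might look like the following module arguments; the VPC, subnet, and secondary range names below are placeholders:

```hcl
# Module arguments inside the module "datarobot_infra" block; names are placeholders.
create_network                          = false
existing_vpc_name                       = "existing-vpc-name"
existing_kubernetes_nodes_subnet_name   = "existing-nodes-subnet-name"
existing_kubernetes_pods_range_name     = "existing-pods-range-name"
existing_kubernetes_services_range_name = "existing-services-range-name"
```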
Use `create_dns_zones` to create new Google Cloud DNS zones, or use `existing_public_dns_zone_name` / `existing_private_dns_zone_name` to use existing Google Cloud DNS zones.
Create new public and/or private DNS zones with name `domain_name`.

A public Cloud DNS zone is used by `external_dns` to create records for the DataRobot ingress resources when `internet_facing_ingress_lb` is `true`. It is also used for DNS validation when using `cert_manager` and `cert_manager_letsencrypt_clusterissuers`.

A private Cloud DNS zone is used by `external_dns` to create records for the DataRobot ingress resources when `internet_facing_ingress_lb` is `false`.
TBD
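As a sketch, pointing the module at existing Cloud DNS zones might look like this; the zone names are placeholders:

```hcl
# Module arguments inside the module "datarobot_infra" block; zone names are placeholders.
create_dns_zones               = false
existing_public_dns_zone_name  = "existing-public-dns-zone-name"
existing_private_dns_zone_name = "existing-private-dns-zone-name"
```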
Use `create_storage` to create a new Google Cloud Storage bucket, or use `existing_gcs_bucket_name` to use an existing Google Cloud Storage bucket.
Create a new GCS bucket with prefix `name` and name `datarobot`.

The DataRobot application will use this bucket for persistent file storage.
TBD
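As a sketch, reusing an existing bucket instead; the bucket name is a placeholder:

```hcl
# Module arguments inside the module "datarobot_infra" block; the bucket name is a placeholder.
create_storage           = false
existing_gcs_bucket_name = "existing-bucket-name"
```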
Use `create_container_registry` to create a new Google Artifact Registry repository, or use `existing_artifact_registry_repo_id` to use an existing Google Artifact Registry repository.
Create a new GAR repository with name `name`.

The DataRobot application will use this registry to host custom images created by various services.
TBD
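As a sketch, reusing an existing repository instead; the repository ID follows the format shown in the usage example above and is a placeholder:

```hcl
# Module arguments inside the module "datarobot_infra" block; the repository ID is a placeholder.
create_container_registry          = false
existing_artifact_registry_repo_id = "projects/your-google-project-id/locations/us-west1/repositories/existing-repository-name"
```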
Use `create_kubernetes_cluster` to create a new Google Kubernetes Engine cluster, or use `existing_gke_cluster_name` to use an existing GKE cluster.
Create a new GKE cluster to host the DataRobot application and any other helm charts installed by this module.

By default, the Kubernetes cluster API endpoint is accessible both via a private endpoint created within the same VPC as well as publicly over the internet. GKE nodes always communicate with the control plane using the private IP address. Public endpoint access can be restricted using the `kubernetes_cluster_endpoint_access_list` variable or disabled completely by setting `kubernetes_cluster_endpoint_public_access` to `false`.

When `kubernetes_cluster_endpoint_public_access` is `false`, Kubernetes management operations such as `kubectl` and `helm` commands (including the helm chart installs performed by this Terraform module) must be run from a host which can access the Kubernetes cluster API private endpoint. By default, any host within the GKE nodes subnet has access, but this can be extended using the `kubernetes_cluster_endpoint_access_list` variable. This can be helpful when running this Terraform module from a host that resides within the same VPC as the GKE cluster but in a different subnet than the GKE nodes.
Two node groups are created:

- A `primary` node group intended to host the majority of the DataRobot pods
- A `gpu` node group intended to host GPU workload pods containing the label `datarobot.com/node-capability: gpu` and taint `nvidia.com/gpu:NoSchedule`
By default, slices of `network_address_space` will be used for the cluster nodes and control plane private endpoint IPs. It is best to use a separate address space for `kubernetes_pod_cidr` and `kubernetes_service_cidr` as these are secondary (aliased) ranges.
TBD
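As a sketch, the secondary ranges and node pool sizes can be tuned with module arguments like these; the values shown are illustrative (the CIDRs match the defaults listed in the inputs table below):

```hcl
# Module arguments inside the module "datarobot_infra" block; values are illustrative.
kubernetes_pod_cidr     = "192.168.0.0/18"   # secondary (aliased) range for pod IPs
kubernetes_service_cidr = "192.168.64.0/18"  # secondary (aliased) range for service IPs

kubernetes_primary_nodepool_node_count = 1
kubernetes_primary_nodepool_max_count  = 10
kubernetes_gpu_nodepool_node_count     = 0
kubernetes_gpu_nodepool_max_count      = 10
```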
Use `create_app_identity` to create a new Google Service Account to represent the DataRobot application.
Create a new Google Service Account with `roles/storage.admin` access to the Google Cloud Storage bucket and `roles/artifactregistry.writer` access to the Google Artifact Registry repository.

Workload identities are created for each of the `datarobot_service_accounts` within the `datarobot_namespace` and attached to this Service Account. This allows pods running with those service accounts to access file storage and the artifact registry.
TBD
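As a sketch, the namespace and service accounts that receive workload identity bindings can be set explicitly; the service account names below are placeholders:

```hcl
# Module arguments inside the module "datarobot_infra" block; service account names are placeholders.
create_app_identity        = true
datarobot_namespace        = "dr-app"
datarobot_service_accounts = ["dr-service-account-1", "dr-service-account-2"]
```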
Use `ingress_nginx` to install the `ingress-nginx` helm chart.
Uses the `terraform-helm-release` module to install the [ingress-nginx](https://kubernetes.github.io/ingress-nginx) helm chart into the `ingress-nginx` namespace.

The `ingress-nginx` helm chart will trigger the deployment of a Google Network Load Balancer directing traffic to the `ingress-nginx-controller` Kubernetes services.

Values passed to the helm chart can be overridden by passing a custom values file via the `ingress_nginx_values` variable as demonstrated in the complete example.
Not required
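As a sketch, overriding chart values via a templatefile might look like this; the values file path and the variables passed to it are hypothetical and depend on your own template:

```hcl
# Module arguments inside the module "datarobot_infra" block; the path and variables are hypothetical.
ingress_nginx              = true
internet_facing_ingress_lb = true
ingress_nginx_values       = "${path.module}/templates/ingress_nginx_values.tftpl"
ingress_nginx_variables = {
  lb_source_ranges = ["203.0.113.0/24"]
}
```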
Use `cert_manager` to install the `cert-manager` helm chart.
Uses the `terraform-helm-release` module to install the [cert-manager](https://charts.jetstack.io) helm chart into the `cert-manager` namespace.

A Google Service Account is created for the `cert-manager` Kubernetes service account running in the `cert-manager` namespace that allows the creation of DNS resources within the specified DNS zone.

`cert-manager` can be used by the DataRobot application to create and manage various certificates, including the application TLS certificate.

When `cert_manager_letsencrypt_clusterissuers` is enabled, `letsencrypt-staging` and `letsencrypt-prod` ClusterIssuers will be created which can be used by the `datarobot-google` umbrella chart to issue certificates used by the DataRobot application. The default values in that helm chart (as of version 10.2) have `global.ingress.tls.enabled` and `global.ingress.tls.certmanager` enabled and `global.ingress.tls.issuer` set to `letsencrypt-prod`, which will use the `letsencrypt-prod` ClusterIssuer to issue a public ACME certificate as the TLS certificate used by the Kubernetes ingress resources.

Values passed to the helm chart can be overridden by passing a custom values file via the `cert_manager_values` variable as demonstrated in the complete example.
TBD
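As a sketch, enabling the Let's Encrypt ClusterIssuers with a contact address; the email address is a placeholder:

```hcl
# Module arguments inside the module "datarobot_infra" block; the email address is a placeholder.
cert_manager                            = true
cert_manager_letsencrypt_clusterissuers = true
cert_manager_letsencrypt_email_address  = "youremail@yourdomain.com"
```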
Use `external_dns` to install the `external-dns` helm chart.
Uses the `terraform-helm-release` module to install the [external-dns](https://charts.bitnami.com/bitnami) helm chart into the `external-dns` namespace.

A Google Service Account is created for the `external-dns` Kubernetes service account running in the `external-dns` namespace that allows the creation of DNS resources within the specified DNS zone.

`external-dns` is used to automatically create DNS records for ingress resources in the Kubernetes cluster. When the DataRobot application is installed and the ingress resources are created, `external-dns` will automatically create a DNS record pointing at the ingress resource.

Values passed to the helm chart can be overridden by passing a custom values file via the `external_dns_values` variable as demonstrated in the complete example.
TBD
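As a sketch, the domain filter and load balancer exposure that drive record creation are controlled with arguments like these; the domain is a placeholder:

```hcl
# Module arguments inside the module "datarobot_infra" block; the domain is a placeholder.
external_dns               = true
domain_name                = "yourdomain.com"  # used as the external-dns domain filter
internet_facing_ingress_lb = true              # true: records in the public zone; false: private zone
```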
Use `nvidia_device_plugin` to install the `nvidia-device-plugin` helm chart.
Uses the `terraform-helm-release` module to install the [nvidia-device-plugin](https://nvidia.github.io/k8s-device-plugin) helm chart into the `nvidia-device-plugin` namespace.

Values passed to the helm chart can be overridden by passing a custom values file via the `nvidia_device_plugin_values` variable as demonstrated in the complete example.
Not required
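As a sketch, if no GPU workloads are planned the plugin can simply be skipped; illustrative only:

```hcl
# Module arguments inside the module "datarobot_infra" block; illustrative only.
nvidia_device_plugin               = false
kubernetes_gpu_nodepool_node_count = 0
kubernetes_gpu_nodepool_min_count  = 0
```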
Use `descheduler` to install the `descheduler` helm chart.
Uses the `terraform-helm-release` module to install the `descheduler` helm chart from the https://kubernetes-sigs.github.io/descheduler/ helm repo into the `kube-system` namespace.

This helm chart allows for automatic rescheduling of pods to optimize resource consumption.
Not required
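As a sketch, the chart is optional and can be turned off with a single argument:

```hcl
# Module argument inside the module "datarobot_infra" block.
descheduler = false
```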
TBD
| Release | Supported DR Versions |
|---------|-----------------------|
| >= 1.0  | >= 10.0               |
| Name | Version |
|------|---------|
| terraform | >= 1.3.5 |
| google | >= 6.6.0 |
| helm | >= 2.15.0 |
| Name | Version |
|------|---------|
| google | >= 6.6.0 |
| Name | Source | Version |
|------|--------|---------|
| app_identity | terraform-google-modules/service-accounts/google | ~> 4.0 |
| cert_manager | ./modules/cert-manager | n/a |
| cloud_router | terraform-google-modules/cloud-router/google | ~> 6.1 |
| descheduler | ./modules/descheduler | n/a |
| external_dns | ./modules/external-dns | n/a |
| ingress_nginx | ./modules/ingress-nginx | n/a |
| kubernetes | terraform-google-modules/kubernetes-engine/google//modules/private-cluster | ~> 33.0 |
| network | terraform-google-modules/network/google | ~> 9.0 |
| nvidia_device_plugin | ./modules/nvidia-device-plugin | n/a |
| private_dns | terraform-google-modules/cloud-dns/google | ~> 5.0 |
| public_dns | terraform-google-modules/cloud-dns/google | ~> 5.0 |
| storage | terraform-google-modules/cloud-storage/google | ~> 8.0 |
| Name | Type |
|------|------|
| google_artifact_registry_repository.this | resource |
| google_artifact_registry_repository_iam_member.datarobot | resource |
| google_service_account_iam_member.datarobot | resource |
| google_storage_bucket_iam_member.datarobot | resource |
| google_client_config.default | data source |
| google_compute_network.existing | data source |
| google_container_cluster.existing | data source |
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| cert_manager | Install the cert-manager helm chart. All other cert_manager variables are ignored if this variable is false. | `bool` | `true` | no |
| cert_manager_letsencrypt_clusterissuers | Whether to create letsencrypt-prod and letsencrypt-staging ClusterIssuers | `bool` | `true` | no |
| cert_manager_letsencrypt_email_address | Email address for the certificate owner. Let's Encrypt will use this to contact you about expiring certificates, and issues related to your account. Only required if cert_manager_letsencrypt_clusterissuers is true. | `string` | `"[email protected]"` | no |
| cert_manager_namespace | Namespace to install the helm chart into | `string` | `"cert-manager"` | no |
| cert_manager_values | Path to templatefile containing custom values for the cert-manager helm chart | `string` | `""` | no |
| cert_manager_variables | Variables passed to the cert_manager_values templatefile | `any` | `{}` | no |
| create_app_identity | Create a new user assigned identity for the DataRobot application | `bool` | `true` | no |
| create_container_registry | Create a new Google Container Registry. Ignored if an existing_artifact_registry_repo_id is specified. | `bool` | `true` | no |
| create_dns_zones | Create DNS zones for domain_name. Ignored if existing_public_dns_zone_name and existing_private_dns_zone_name are specified. | `bool` | `true` | no |
| create_kubernetes_cluster | Create a new Google Kubernetes Engine cluster. All kubernetes and helm chart variables are ignored if this variable is false. | `bool` | `true` | no |
| create_network | Create a new Google VPC. Ignored if an existing_vpc_name is specified. | `bool` | `true` | no |
| create_storage | Create a new Google Storage Bucket to use for DataRobot file storage. Ignored if an existing_gcs_bucket_name is specified. | `bool` | `true` | no |
| datarobot_namespace | Kubernetes namespace in which the DataRobot application will be installed | `string` | `"dr-app"` | no |
| datarobot_service_accounts | Names of the Kubernetes service accounts used by the DataRobot application | `set(string)` | `[` | no |
| descheduler | Install the descheduler helm chart to enable rescheduling of pods. All other descheduler variables are ignored if this variable is false | `bool` | `true` | no |
| descheduler_namespace | Namespace to install the helm chart into | `string` | `"kube-system"` | no |
| descheduler_values | Path to templatefile containing custom values for the descheduler helm chart | `string` | `""` | no |
| descheduler_variables | Variables passed to the descheduler templatefile | `any` | `{}` | no |
| dns_zones_force_destroy | Force destroy for the public and private Cloud DNS zones when terminating | `bool` | `false` | no |
| domain_name | Name of the domain to use for the DataRobot application. If create_dns_zones is true then zones will be created for this domain. It is also used by the cert-manager helm chart for DNS validation and as a domain filter by the external-dns helm chart. | `string` | `""` | no |
| existing_artifact_registry_repo_id | ID of existing artifact registry repository to use | `string` | `null` | no |
| existing_gcs_bucket_name | Name of existing Google Storage Bucket to use for DataRobot file storage. When specified, all other storage variables will be ignored. | `string` | `null` | no |
| existing_gke_cluster_name | Name of existing GKE cluster to use. When specified, all other kubernetes variables will be ignored. | `string` | `null` | no |
| existing_kubernetes_nodes_subnet_name | Name of an existing subnet to use for the GKE node pools and control plane private endpoint. Required when an existing_vpc_name is specified. Ignored if no existing_vpc_name is specified. | `string` | `null` | no |
| existing_kubernetes_pods_range_name | Name of a secondary IP range within the subnet defined by existing_kubernetes_nodes_subnet_name to use for the Kubernetes pods. Required when an existing_vpc_name is specified. Ignored if no existing_vpc_name is specified. | `string` | `null` | no |
| existing_kubernetes_services_range_name | Name of a secondary IP range within the subnet defined by existing_kubernetes_nodes_subnet_name to use for the Kubernetes services. Required when an existing_vpc_name is specified. Ignored if no existing_vpc_name is specified. | `string` | `null` | no |
| existing_private_dns_zone_name | Name of existing private hosted zone to use for private DNS records created by external-dns. This is required when create_dns_zones is false and ingress_nginx is true with internet_facing_ingress_lb false. | `string` | `null` | no |
| existing_public_dns_zone_name | Name of existing public hosted zone to use for public DNS records created by external-dns and public LetsEncrypt certificate validation by cert-manager. This is required when create_dns_zones is false and ingress_nginx and internet_facing_ingress_lb are true or when cert_manager and cert_manager_letsencrypt_clusterissuers are true. | `string` | `null` | no |
| existing_vpc_name | Name of an existing Google VPC to use. When specified, other network variables are ignored. | `string` | `null` | no |
| external_dns | Install the external_dns helm chart to create DNS records for ingress resources matching the domain_name variable. All other external_dns variables are ignored if this variable is false. | `bool` | `true` | no |
| external_dns_namespace | Namespace to install the helm chart into | `string` | `"external-dns"` | no |
| external_dns_values | Path to templatefile containing custom values for the external-dns helm chart | `string` | `""` | no |
| external_dns_variables | Variables passed to the external_dns_values templatefile | `any` | `{}` | no |
| google_project_id | The ID of the Google Project where these resources will be created | `string` | n/a | yes |
| ingress_nginx | Install the ingress-nginx helm chart to use as the ingress controller for the GKE cluster. All other ingress_nginx variables are ignored if this variable is false. | `bool` | `true` | no |
| ingress_nginx_namespace | Namespace to install the helm chart into | `string` | `"ingress-nginx"` | no |
| ingress_nginx_values | Path to templatefile containing custom values for the ingress-nginx helm chart | `string` | `""` | no |
| ingress_nginx_variables | Variables passed to the ingress_nginx_values templatefile | `any` | `{}` | no |
| internet_facing_ingress_lb | Determines the type of Load Balancer created for GKE ingress. If true, an external Load Balancer will be created. If false, an internal Load Balancer will be created. | `bool` | `true` | no |
| kubernetes_cluster_deletion_protection | Enable deletion protection on the GKE cluster | `bool` | `true` | no |
| kubernetes_cluster_endpoint_access_list | List of CIDRs allowed to access the Kubernetes cluster API endpoint. When kubernetes_cluster_endpoint_public_access is true, these CIDRs specify which public IP addresses are allowed to access the Kubernetes cluster API external endpoint. When kubernetes_cluster_endpoint_public_access is false, these CIDRs specify which private IP addresses are allowed to access the Kubernetes cluster API internal endpoint. By default, only hosts within the kubernetes nodes subnet are allowed to access the Kubernetes cluster API internal endpoint. | `list(string)` | `[]` | no |
| kubernetes_cluster_endpoint_public_access | Whether the Kubernetes cluster API endpoint can be accessed via an external IP address | `bool` | `true` | no |
| kubernetes_cluster_grant_registry_access | Grants created cluster-specific service account storage.objectViewer and artifactregistry.reader roles | `bool` | `true` | no |
| kubernetes_cluster_version | GKE cluster version | `string` | `"latest"` | no |
| kubernetes_gpu_nodegroup_taints | The Kubernetes taints to be applied to the nodes in the GPU node group. | `any` | `[` | no |
| kubernetes_gpu_nodepool_labels | A map of Kubernetes labels to apply to the GPU node pool | `map(string)` | `{` | no |
| kubernetes_gpu_nodepool_max_count | Maximum number of nodes in the GPU node pool | `number` | `10` | no |
| kubernetes_gpu_nodepool_min_count | Minimum number of nodes in the GPU node pool | `number` | `0` | no |
| kubernetes_gpu_nodepool_name | Name of the GPU node pool | `string` | `"gpu"` | no |
| kubernetes_gpu_nodepool_node_count | Node count of the GPU node pool | `number` | `0` | no |
| kubernetes_gpu_nodepool_vm_size | VM size used for the GPU node pool | `string` | `"n1-highmem-4"` | no |
| kubernetes_master_ipv4_cidr_block | The IP range in CIDR notation to use for the hosted master network including the Kubernetes control plane. If you use this flag, GKE creates a new subnet that uses the values you defined in master-ipv4-cidr and uses the new subnet to provision the internal IP address for the control plane. | `string` | `null` | no |
| kubernetes_pod_cidr | The CIDR to use for Kubernetes pod IP addresses. This is used as a secondary IP range within the Kubernetes nodes subnet. | `string` | `"192.168.0.0/18"` | no |
| kubernetes_primary_nodepool_labels | A map of Kubernetes labels to apply to the primary node pool | `map(string)` | `{` | no |
| kubernetes_primary_nodepool_max_count | Maximum number of nodes in the primary node pool | `number` | `10` | no |
| kubernetes_primary_nodepool_min_count | Minimum number of nodes in the primary node pool | `number` | `1` | no |
| kubernetes_primary_nodepool_name | Name of the primary node pool | `string` | `"primary"` | no |
| kubernetes_primary_nodepool_node_count | Node count of the primary node pool | `number` | `1` | no |
| kubernetes_primary_nodepool_taints | A list of Kubernetes taints to apply to the primary node pool | `any` | `[]` | no |
| kubernetes_primary_nodepool_vm_size | VM size used for the primary node pool | `string` | `"e2-standard-32"` | no |
| kubernetes_service_cidr | The CIDR to use for Kubernetes service IP addresses. This is used as a secondary IP range within the Kubernetes nodes subnet. | `string` | `"192.168.64.0/18"` | no |
| name | Name to use as a prefix for created resources | `string` | n/a | yes |
| network_address_space | The CIDR to use for the Kubernetes nodes and control plane. | `string` | `"10.0.0.0/16"` | no |
| nvidia_device_plugin | Install the nvidia-device-plugin helm chart to expose node GPU resources to the GKE cluster. All other nvidia_device_plugin variables are ignored if this variable is false. | `bool` | `true` | no |
| nvidia_device_plugin_namespace | Namespace to install the helm chart into | `string` | `"nvidia-device-plugin"` | no |
| nvidia_device_plugin_values | Path to templatefile containing custom values for the nvidia-device-plugin helm chart | `string` | `""` | no |
| nvidia_device_plugin_variables | Variables passed to the nvidia_device_plugin_values templatefile | `any` | `{}` | no |
| region | Google region to create the resources in | `string` | n/a | yes |
| release_channel | The release channel of this cluster. Accepted values are UNSPECIFIED, RAPID, REGULAR and STABLE. Defaults to STABLE. | `string` | `"STABLE"` | no |
| storage_force_destroy | Force destroy for the Google Storage Bucket when terminating | `bool` | `false` | no |
| tags | A map of tags to add to all created resources | `map(string)` | `{` | no |
| Name | Description |
|------|-------------|
| artifact_registry_repo_id | ID of the Artifact Registry repository |
| artifact_registry_repo_path | Path to the Artifact Registry repository |
| datarobot_service_account_email | Email of the DataRobot service account |
| datarobot_service_account_key | DataRobot service account key |
| gke_cluster_name | Name of the GKE cluster |
| private_dns_zone_name | Name of the private DNS zone |
| public_dns_zone_name | Name of the public DNS zone |
| storage_bucket_name | Name of the storage bucket |
| vpc_name | Name of the VPC |