Spin up GPU instances for temporary PyTorch training jobs on Google Cloud Platform
This guide helps you quickly set up and run GPU-accelerated PyTorch training jobs on Google Cloud. It is aimed at experiments that need a few hours of GPU time without maintaining permanent infrastructure.
TODO: the goal of this repo is to have the Vertex AI Docker container pull down a specified Git repo / feature branch plus training data, start training, log to Weights & Biases (WANDB) to validate new features, and then shut the machine down (see start.sh). This depends on the job / Docker image ID, e.g. imageUri: 'gcr.io/kommunityproject/pytorch-train:v1.0.18'.
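That flow corresponds roughly to the sketch below. It is illustrative only; start.sh in the repo is the source of truth, and the file names used here (requirements.txt, train.py, WANDB_API_KEY) are assumptions rather than confirmed repo contents.

# Illustrative sketch of the intended entrypoint flow -- see start.sh for the real logic
git clone --branch "$BRANCH_NAME" "$GITHUB_REPO" /workspace/repo   # repo/branch come from the containerSpec env vars
cd /workspace/repo
pip install -r requirements.txt      # assumes the repo ships a requirements.txt
wandb login "$WANDB_API_KEY"         # assumes the key is injected into the job environment
python train.py                      # placeholder training entrypoint
# When the container exits, Vertex AI tears the worker down, so no explicit shutdown is needed here.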
Quick start:

1. Set Environment Variables

   export GCP_PROJECT=your-project-id
   export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name

2. Submit Training Job

   ./push-job.sh

3. Monitor Progress

   # Your job URL will appear after submission
   🔗 View job at: https://console.cloud.google.com/vertex-ai/training/custom-jobs?project=$GCP_PROJECT
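You can also follow the job from the CLI; the commands below assume the job runs in us-central1 (adjust the region to match push-job.sh):

# List recent custom jobs and grab the job ID
gcloud ai custom-jobs list --region=us-central1 --project=$GCP_PROJECT
# Stream the logs of a specific job
gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1 --project=$GCP_PROJECT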
🔧 Prerequisites
- Google Cloud Platform (GCP) account
- Google Cloud SDK installed
- Docker installed locally
- Weights & Biases account (optional)
⚙️ Configuration
The job_config_gpu.yaml file controls your GPU and environment settings:
workerPoolSpecs:
  machineSpec:
    machineType: a2-highgpu-1g          # 40GB GPU
    acceleratorType: NVIDIA_TESLA_A100
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: 'us-docker.pkg.dev/deeplearning-platform-release/gcr.io/pytorch-cu121.2-2.py310'
    env:
      - name: GCS_BUCKET_NAME
        value: gs://your-bucket
      - name: BRANCH_NAME
        value: your-branch
      - name: GITHUB_REPO
        value: your-repo
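push-job.sh handles submission, but for reference the equivalent raw gcloud call looks roughly like this (the region and display name are assumptions, not taken from the script):

gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=pytorch-training \
  --config=job_config_gpu.yaml \
  --project=$GCP_PROJECT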
📦 Storage Setup
1. Create a GCS bucket:

   ./create_bucket.sh

2. Upload training data:

   gsutil -m cp -r ./training_data gs://your-bucket/
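If you want to create the bucket by hand or verify the upload, the underlying gsutil commands are (the us-central1 location is an assumption; create_bucket.sh is the source of truth):

# Manual equivalent of create_bucket.sh (location is assumed)
gsutil mb -l us-central1 -p $GCP_PROJECT gs://your-bucket
# Confirm the training data landed in the bucket
gsutil ls gs://your-bucket/training_data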
Map Google Cloud Storage to a local drive (optional):
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
# Create mount point
mkdir ~/cloud-storage
# Mount bucket (replace with your bucket name)
gcsfuse $GOOGLE_CLOUD_BUCKET_NAME ~/cloud-storage
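To sanity-check the mount and unmount it when you're done:

ls ~/cloud-storage              # should list the bucket contents
fusermount -u ~/cloud-storage   # unmount when finished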
🔑 Required Permissions
Minimum IAM roles needed:
- AI Platform Admin (roles/ml.admin)
- Storage Object Admin (roles/storage.objectAdmin)
- Container Registry Service Agent
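To grant the first two roles to the service account the job runs as, something like the following works (the service account email is a placeholder):

gcloud projects add-iam-policy-binding $GCP_PROJECT \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT@$GCP_PROJECT.iam.gserviceaccount.com" \
  --role=roles/ml.admin
gcloud projects add-iam-policy-binding $GCP_PROJECT \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT@$GCP_PROJECT.iam.gserviceaccount.com" \
  --role=roles/storage.objectAdmin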
🐛 Troubleshooting
- Job Won't Start
  - Check IAM permissions
  - Verify GPU quota in your region
- Storage Access Issues
  - Test bucket access:
    gsutil ls gs://your-bucket
  - Verify service account permissions
- Local Testing
  # Mount cloud storage locally
  gcsfuse --anonymous-access your-bucket /mount/point
📚 Detailed Setup Guide
PRO TIP: in the Cloud console navigation menu, pin just the services you use here (Vertex AI, Cloud Storage, Container Registry) so they are easier to find.
# Install oh-my-zsh for better CLI experience
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
# Add to .zshrc
plugins=(git)
export GCP_PROJECT=your-project-id
export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name
Your builds will appear in the container registry with the version bumped.
Monitor your training jobs in the console, where you can view detailed logs and metrics.
For public buckets, consider granting access to allUsers:
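For example (the bucket name is a placeholder; this makes every object publicly readable, so use it only for genuinely public data):

gsutil iam ch allUsers:objectViewer gs://your-bucket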
Available machine types:
workerPoolSpecs:
  machineSpec:
    # Choose one:
    machineType: n1-standard-8
    # machineType: n1-standard-32
    # machineType: a2-ultragpu-1g        # For A100 80GB
    # GPU options:
    # acceleratorType: NVIDIA_TESLA_V100
    # acceleratorType: NVIDIA_A100_80GB
    # acceleratorCount: 1
For local testing, set these environment variables:
export GCP_PROJECT=kommunityproject
export IMAGE_NAME="pytorch-training"
export GCS_BUCKET_NAME="gs://jp-ai-experiments"
export BRANCH_NAME="feat/ada-fixed4"
export GITHUB_REPO="https://github.com/johndpope/imf.git"
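With those set you can exercise the image locally before submitting a job. A minimal sketch, assuming build.sh tags the image as $IMAGE_NAME and an NVIDIA container runtime is available:

docker build -t $IMAGE_NAME .
docker run --rm --gpus all \
  -e GCS_BUCKET_NAME=$GCS_BUCKET_NAME \
  -e BRANCH_NAME=$BRANCH_NAME \
  -e GITHUB_REPO=$GITHUB_REPO \
  $IMAGE_NAME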
Key files:
- Dockerfile: Defines the training environment
- build.sh: Builds and pushes the Docker image
- job_config.yaml: Training job configuration
- push-job.sh: Submits the training job
Track your training job:
- Real-time logs
- GPU utilization
- Training metrics
- Google Cloud AI Platform Documentation
- PyTorch Documentation
- Submit an issue for specific questions