
Quick GPU Training on Google Cloud

Spin up GPU instances for temporary PyTorch training jobs on Google Cloud Platform

This guide helps you quickly set up and run GPU-accelerated PyTorch training jobs on Google Cloud. It is aimed at experiments that need a few hours of GPU time without maintaining permanent infrastructure.

TODO: the goal of this repo is to have the Vertex AI container pull down a specified git repo / feature branch plus training data, start training, log to WANDB for validating new features, and then shut the machine down; see start.sh. This depends on the job / Docker image ID, e.g. imageUri: 'gcr.io/kommunityproject/pytorch-train:v1.0.18'.
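A minimal sketch of what such an entrypoint (start.sh) could do; the real script may differ, and the data path, requirements install, and WANDB_API_KEY handling here are assumptions:

#!/bin/bash
# Hypothetical start.sh sketch: clone the feature branch, pull training data,
# train, then exit so Vertex AI tears the instance down.
set -euo pipefail

git clone --branch "$BRANCH_NAME" "$GITHUB_REPO" /workspace/repo
cd /workspace/repo
pip install -r requirements.txt   # assumed dependency file

# Pull the training data staged in the bucket (path is illustrative)
gsutil -m cp -r "$GCS_BUCKET_NAME/training_data" ./data

# Train; assumed to log metrics to WANDB via a WANDB_API_KEY env var
python train.py --data-dir ./data

# No explicit shutdown is needed: the custom job releases the VM once the container exits.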

🎥 Demo: Submitting a GPU Training Job

(asciinema recording)

🚀 Quick Start

  1. Set Environment Variables

    export GCP_PROJECT=your-project-id
    export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name
  2. Submit Training Job

    ./push-job.sh
  3. Monitor Progress

    # Your job URL will appear after submission
    🔗 View job at: https://console.cloud.google.com/vertex-ai/training/custom-jobs?project=$GCP_PROJECT
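Under the hood, push-job.sh submits a Vertex AI custom job. A minimal equivalent, assuming us-central1 as the region and job_config_gpu.yaml as the config file, is:

gcloud ai custom-jobs create \
  --project="$GCP_PROJECT" \
  --region=us-central1 \
  --display-name="pytorch-training-$(date +%Y%m%d-%H%M%S)" \
  --config=job_config_gpu.yaml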
🔧 Prerequisites
  • Google Cloud Platform (GCP) account
  • Google Cloud SDK installed
  • Docker installed locally
  • Weights & Biases account (optional)

See the Detailed Setup Guide below for installation steps.

⚙️ Configuration

The job_config_gpu.yaml file controls your GPU and environment settings:

workerPoolSpecs:
  machineSpec:
    machineType: a2-highgpu-1g  # 40GB GPU
    acceleratorType: NVIDIA_TESLA_A100
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: 'us-docker.pkg.dev/deeplearning-platform-release/gcr.io/pytorch-cu121.2-2.py310'
    env:
      - name: GCS_BUCKET_NAME
        value: gs://your-bucket
      - name: BRANCH_NAME
        value: your-branch
      - name: GITHUB_REPO
        value: your-repo
📦 Storage Setup
  1. Create a GCS bucket:

    ./create_bucket.sh
  2. Upload training data:

    gsutil -m cp -r ./training_data gs://your-bucket/

Ubuntu/Debian

Map Google Cloud Storage to a local directory with gcsfuse:

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse

# Create mount point
mkdir ~/cloud-storage

# Mount bucket (replace with your bucket name)
gcsfuse $GOOGLE_CLOUD_BUCKET_NAME ~/cloud-storage
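To unmount later on Linux:

fusermount -u ~/cloud-storage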
🔑 Required Permissions

Minimum IAM roles needed:

  • AI Platform Admin (roles/ml.admin)
  • Storage Object Admin (roles/storage.objectAdmin)
  • Container Registry Service Agent
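These can be granted from the CLI; the service account name below is a hypothetical example (Vertex AI custom jobs run as the Compute Engine default service account unless you specify one):

# Example: grant Storage Object Admin to the service account that runs the job
SA="my-training-sa@${GCP_PROJECT}.iam.gserviceaccount.com"   # hypothetical account
gcloud projects add-iam-policy-binding "$GCP_PROJECT" \
  --member="serviceAccount:${SA}" \
  --role="roles/storage.objectAdmin"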
🐛 Troubleshooting
  1. Job Won't Start

    • Check IAM permissions
    • Verify GPU quota in your region
  2. Storage Access Issues

    • Test bucket access: gsutil ls gs://your-bucket
    • Verify service account permissions
  3. Local Testing

    # Mount cloud storage locally
    gcsfuse --anonymous-access your-bucket /mount/point
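For the GPU quota check in item 1, one option is the sketch below; it lists Compute Engine GPU quotas for a region, while Vertex AI custom training quotas are tracked separately under IAM & Admin > Quotas in the console:

gcloud compute regions describe us-central1 --format=yaml | grep -B1 -A1 NVIDIA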
📚 Detailed Setup Guide

1. Enable Required APIs


PRO TIP: toggle on just these services so they are easier to find in the console.
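The required APIs can also be enabled from the CLI; the exact set below is an assumption, but these are the services this workflow touches:

gcloud services enable \
  aiplatform.googleapis.com \
  compute.googleapis.com \
  storage.googleapis.com \
  artifactregistry.googleapis.com \
  containerregistry.googleapis.com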

2. Shell Configuration

# Install oh-my-zsh for better CLI experience
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

# Add to .zshrc
plugins=(git)
export GCP_PROJECT=your-project-id
export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name

3. Build Process

After running build.sh, your builds will appear in Artifact Registry / Container Registry with the version bumped.
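A minimal sketch of what build.sh could do; the tag and version-bump scheme shown here are assumptions:

#!/bin/bash
# Hypothetical build.sh sketch: build the training image and push it to Container Registry.
# Assumes docker is already authenticated to gcr.io (gcloud auth configure-docker).
set -euo pipefail

VERSION="v1.0.19"   # illustrative; the real script bumps this automatically
IMAGE="gcr.io/${GCP_PROJECT}/${IMAGE_NAME}:${VERSION}"

docker build -t "$IMAGE" .
docker push "$IMAGE"
echo "Pushed $IMAGE; update imageUri in job_config_gpu.yaml to match."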

4. Job Management

Monitor your training jobs in the Vertex AI console.
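Jobs can also be listed and inspected from the CLI (region assumed to be us-central1):

# List recent custom jobs, then inspect one by its numeric ID
gcloud ai custom-jobs list --region=us-central1 --project="$GCP_PROJECT"
gcloud ai custom-jobs describe JOB_ID --region=us-central1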

5. Job Logs

View detailed logs and metrics for each job in the console.
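Logs can also be streamed from the CLI while the job runs (region assumed to be us-central1):

gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1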

6. Storage Access

For public buckets, consider granting read access to allUsers.
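One way to do that with gsutil; note this makes every object in the bucket publicly readable, so only use it for non-sensitive data:

gsutil iam ch allUsers:objectViewer gs://your-bucket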

7. Resource Configuration

Available machine types:

workerPoolSpecs:
  machineSpec:
    # Choose one:
    machineType: n1-standard-8
    # machineType: n1-standard-32
    # machineType: a2-ultragpu-1g # For A100 80GB
    
    # GPU options:
    # acceleratorType: NVIDIA_TESLA_V100
    # acceleratorType: NVIDIA_A100_80GB
    # acceleratorCount: 1

8. Docker Configuration

For local testing, set these environment variables:

export GCP_PROJECT=kommunityproject
export IMAGE_NAME="pytorch-training"
export GCS_BUCKET_NAME="gs://jp-ai-experiments"
export BRANCH_NAME="feat/ada-fixed4"
export GITHUB_REPO="https://github.com/johndpope/imf.git"
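With those set, a local smoke test of the image could look like this; the :latest tag is an assumption, and you can drop --gpus all if there is no local GPU:

docker run --rm --gpus all \
  -e GCS_BUCKET_NAME -e BRANCH_NAME -e GITHUB_REPO \
  "gcr.io/${GCP_PROJECT}/${IMAGE_NAME}:latest"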

9. File Structure

  • Dockerfile: Defines training environment
  • build.sh: Builds and pushes Docker image
  • job_config.yaml: Training job configuration
  • push-job.sh: Submits training job

📈 Monitoring

Track your training job:

  • Real-time logs
  • GPU utilization
  • Training metrics

🛟 Need Help?
