A detailed guide for MLOps and Data teams that want to stop burning time and money on click-ops.


1. Introduction

Vertex AI Workbench is a managed Jupyter environment in Google Cloud. It offers native integration with BigQuery and Cloud Storage, supports GPUs and TPUs, and leverages GCP’s unified IAM model and networking policies.

Creating Jupyter instances in Vertex AI Workbench via the UI is the simplest way to get started. But as your team grows, this process becomes increasingly labor-intensive, audits get harder, and config mistakes and security issues become more likely.

A few data points: according to Flexera, average “cloud waste” reaches 32% of spend — nearly a third of bills pay for nothing; 84% of organizations name cost control as their top cloud challenge (Flexera); and up to 21% of infra budgets burn on underutilized resources, including idle GPUs (PR Newswire).

When every engineer manually creates their own Workbench instance, errors scale with headcount: random regions, forgotten auto-shutdowns, and service accounts with excessive permissions.

The goal of this article is to show how to minimize manual work, make configurations reproducible, and reduce idle waste using a Terraform module and a CI pipeline for Vertex AI Workbench.

2. Why Vertex AI Workbench

If your infrastructure already lives in the Google Cloud ecosystem, Workbench provides practical advantages over Databricks, SageMaker, JupyterHub, and others:

  - native integration with BigQuery and Cloud Storage, with no extra connectors or credential juggling;
  - GCP’s unified IAM model and VPC networking policies apply out of the box;
  - GPU/TPU support on demand;
  - first-class Terraform support via the google provider.

Bottom line: if your infra and data are already in GCP, Vertex AI Workbench minimizes integration and operational overhead, enforces consistent security policies, and gives you a fast path to automation with Terraform.

3. UI Drawbacks In Practice

Let’s look at typical issues you hit when creating Workbench instances through the UI:

  - no review step and no change history: any engineer can create anything, and mistakes surface only in the bill or during an audit;
  - no enforced standards: naming, labels, regions, auto-shutdown, and SA roles vary from instance to instance;
  - configuration drift: two “identical” notebooks end up subtly different, and reproducing an environment means re-clicking it.

Terraform addresses these problems: configs are code-reviewed and CI-checked; terraform plan gives a clear diff; a module encodes standards (naming, labels, regions, auto-shutdown, SA roles) and applies them uniformly to every instance.

4. A Notebook Management Maturity Model

Level 0 — Manual UI

Instances and their parameters are set by hand. The (only) advantage is fast initial setup. Downsides: no standards, config drift, messy cost attribution, and heavy audits. Works tolerably with <5 users.

Level 1 — Terraform (local apply)

Configs live as code; terraform plan/apply is run locally. Easy to scale and reproduce environments, do code reviews, and standardize creation. But apply is still manual, leaving room for human error. Suits 5–20 users with infrequent onboarding.

Level 2 — Terraform + CI/CD

plan/apply runs in a pipeline (GitLab CI/CD or similar) with automated policy/security/cost checks. Requires basic DevOps practice (remote state, OIDC/WIF, env isolation). With >20 users and regular onboarding, this approach becomes essential to avoid manual toil and audit/compliance debt.

5. Terraform Approach

You’ll need:

  - a GCP project with the Notebooks API enabled;
  - a VPC network and subnet for the instances;
  - a service account for the notebooks to run as;
  - Terraform 1.3+ (for optional() object attributes) with the google provider.

We could create instances directly with google_workbench_instance, but that quickly leads to duplication (VPC/network, service account, labels, auto-shutdown policy, region/zone, etc.). Any change becomes a mass update of similar blocks, which complicates review and audit.

Instead, we’ll use a module to encapsulate common parameters and expose only the minimal inputs engineers actually need.

For convenience, here’s a link to the repository with an example project structure: https://github.com/timonovid/vertexai-workbench-terraform-with-ci.

Module implementation

vertexai-workbench-module/main.tf:

resource "google_workbench_instance" "instance" {
  for_each = var.notebook_instances

  project         = var.project_id
  location        = coalesce(each.value.zone, var.default_zone)
  name            = each.key
  instance_owners = each.value.instance_owners
  labels          = var.labels

  gce_setup {
    machine_type = coalesce(each.value.machine_type, var.default_machine_type)
    dynamic "accelerator_configs" {
      for_each = each.value.accelerator_configs != null ? [each.value.accelerator_configs] : []

      content {
        type       = accelerator_configs.value.type
        core_count = accelerator_configs.value.core_count
      }
    }

    disable_public_ip = true

    shielded_instance_config {
      enable_secure_boot          = true
      enable_vtpm                 = true
      enable_integrity_monitoring = true
    }

    service_accounts {
      email = var.service_account_email
    }

    boot_disk {
      disk_size_gb = var.default_boot_disk_size_gb
      disk_type    = "PD_SSD"
    }

    data_disks {
      disk_size_gb = coalesce(each.value.data_disk_size_gb, var.default_data_disk_size_gb)
      disk_type    = "PD_SSD"
    }

    metadata = {
      terraform             = "true",
      idle-timeout-seconds  = var.idle_timeout_seconds,
      post-startup-script   = var.post_startup_script,
      report-event-health   = "true",
      report-dns-resolution = "true"
    }

    network_interfaces {
      network = var.network_name
      subnet  = var.subnet_name
    }
  }
}

Now declare the required variables in vertexai-workbench-module/variables.tf:

variable "notebook_instances" {
  description = "Configuration for each notebook instance"
  type = map(object({
    zone              = optional(string)
    machine_type      = optional(string)
    instance_owners   = list(string)
    data_disk_size_gb = optional(number)
    accelerator_configs = optional(object({
      type       = string
      core_count = number
    }))
  }))
}

variable "labels" {
  description = "instance labels"
  type        = map(string)
}

variable "default_zone" {
  type        = string
  description = "Zone like us-central1-a"
  validation {
    condition     = can(regex("^[a-z0-9-]+-[a-z0-9]+[0-9]-[a-z]$", var.default_zone))
    error_message = "Use a zone format, e.g., us-central1-a."
  }
}

variable "service_account_email" {
  description = "Email of the service account"
  type        = string
}

variable "default_boot_disk_size_gb" {
  description = "Default size in GB for boot disks if not specified."
  type        = number
  default     = 150
}

variable "default_data_disk_size_gb" {
  description = "Default size in GB for data disks if not specified."
  type        = number
  default     = 150
}

variable "default_machine_type" {
  description = "Default machine type if not specified."
  type        = string
  default     = "e2-standard-2"
}

variable "project_id" {
  description = "The project ID"
  type        = string
}

variable "network_name" {
  description = "The name of the network"
  type        = string
}

variable "subnet_name" {
  description = "The name of the subnet"
  type        = string
}

variable "idle_timeout_seconds" {
  type        = number
  description = "Idle timeout in seconds"
  validation {
    condition     = var.idle_timeout_seconds >= 0
    error_message = "idle_timeout_seconds must be >= 0."
  }
}

variable "post_startup_script" {
  description = "The post startup script"
  type        = string
}
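
The module files above don’t pin tool versions. Since variables.tf relies on optional() object attributes, it’s worth adding a versions.tf along these lines (the exact constraints are assumptions; adjust to your environment):

# vertexai-workbench-module/versions.tf
terraform {
  required_version = ">= 1.3" # optional() object attributes require Terraform 1.3+

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 5.0" # assumed; any release that ships google_workbench_instance will do
    }
  }
}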

Usage example

Create several instances:

module "vertex_instances" {
  source = "./vertexai-workbench-module"

  project_id            = var.project_id
  network_name          = google_compute_network.my_network.name
  subnet_name           = google_compute_subnetwork.my_subnetwork.name
  service_account_email = google_service_account.vertexai-workbench-sa.email
  post_startup_script   = "" # Optional GCS path to a script that runs on instance startup, e.g. gs://your-bucket/init.sh
  idle_timeout_seconds  = 7200
  labels = {
    instance_type = "vertexai_workbench"
  }

  notebook_instances = {
    "workbench-instance-analytics-team-user1" = {
      instance_owners = ["[email protected]"]
    },

    "workbench-instance-analytics-team-user2" = {
      instance_owners = ["[email protected]"]
      machine_type    = "n1-standard-8"
    },

    "workbench-instance-ml-team1-user3" = {
      zone              = "us-central1-a"
      machine_type      = "n1-highmem-8"
      instance_owners   = ["[email protected]"]
      data_disk_size_gb = 500
      accelerator_configs = {
        type       = "NVIDIA_TESLA_T4"
        core_count = 1
      }
    },
  }
}
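
The module as written exposes no outputs. If you want to hand each engineer their notebook URL after apply, a small outputs.tf in the module is enough; here’s a sketch using the proxy_uri attribute exported by google_workbench_instance:

# vertexai-workbench-module/outputs.tf
output "proxy_uris" {
  description = "JupyterLab proxy URI per instance, keyed by instance name"
  value = {
    for name, inst in google_workbench_instance.instance : name => inst.proxy_uri
  }
}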

Module parameter tips

Keep the defaults (default_machine_type, default_boot_disk_size_gb, default_data_disk_size_gb, idle_timeout_seconds) conservative, so the cheap, safe configuration is also the zero-effort one; engineers override per instance only when they actually need more. Apply labels consistently, since they are what make per-team cost attribution possible.

Thus you get several advantages over UI-based creation: uniform security settings (no public IP, Shielded VM), enforced auto-shutdown, standardized naming and labels, and configs that go through review instead of console clicks.

6. Creating Notebooks via Terraform + CI/CD

The next improvement is a pipeline that makes every notebook MR go through plan → review → apply without manual deployment. Below is a GitLab example you can stand up in a sprint and then harden with policies and checks.

Prerequisites: a remote Terraform state and GCP access. The fastest starting point is a JSON key for a service account stored as a GitLab File variable SA_KEY, with roles such as Notebooks Admin for Workbench and Storage Object User on the specific GCS bucket holding the state.
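
For the remote state, a GCS backend is the natural fit here; a minimal sketch (the bucket name is a placeholder, and it should be the bucket the service account has Storage Object User on):

terraform {
  backend "gcs" {
    bucket = "your-terraform-state-bucket"
    prefix = "vertexai-workbench"
  }
}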

Runner image

Build a lightweight image with gcloud and Terraform, push it to your registry, and reference it in the image: field:

FROM google/cloud-sdk:slim

RUN apt-get update && apt-get install -y --no-install-recommends \
    wget gnupg lsb-release \
    && rm -rf /var/lib/apt/lists/*

RUN wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg \
    && echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/hashicorp.list \
    && apt-get update && apt-get install -y terraform \
    && rm -rf /var/lib/apt/lists/*

Minimal GitLab CI/CD config

Two stages: plan writes a plan artifact; apply applies that exact plan after merging to the default branch. plan also runs for MRs to validate config and syntax.

default:
  image: "your-image:latest" # your image with installed gcloud cli and terraform 
  before_script:
  - gcloud auth activate-service-account --key-file="$SA_KEY"
  - export GOOGLE_APPLICATION_CREDENTIALS="$SA_KEY"

variables:
  TF_INPUT: "false"

stages:
  - plan
  - apply

plan-job:
  stage: plan
  script:
    - terraform init -input=false
    - set -o pipefail # without this, grep would mask a failed plan's exit code
    - terraform plan -out "$CI_PROJECT_DIR/planfile" -target=module.vertex_instances -compact-warnings | grep -v -e "Acquiring state lock" -e "Refreshing state"
  artifacts:
    paths:
      - planfile
    expire_in: 1 week
  rules:
    - if: $CI_PIPELINE_SOURCE == 'merge_request_event' || $CI_COMMIT_REF_NAME == "$CI_DEFAULT_BRANCH"
      changes:
        - "vertexai-workbench-instances.tf"

apply-job:
  stage: apply
  script:
    - terraform init -input=false
    - terraform validate
    - terraform apply -input=false "$CI_PROJECT_DIR/planfile" # a saved plan never prompts, so -auto-approve is unnecessary
  rules:
    - if: $CI_COMMIT_REF_NAME == "$CI_DEFAULT_BRANCH"
      changes:
        - "vertexai-workbench-instances.tf"
  dependencies:
    - plan-job

What you get

  - every change goes through an MR with a reviewable plan diff;
  - apply runs exactly the plan that was reviewed, not a fresh one;
  - git history doubles as an audit log of who created which instance and when;
  - engineers don’t need console permissions to spin up notebooks.

How it looks for an engineer

  1. In a feature branch, add an entry to the notebook_instances map (name, instance_owners, machine_type, data_disk_size_gb, …) in vertexai-workbench-instances.tf, as in the sketch after this list.
  2. Opening an MR triggers the plan job, which validates the config and produces the plan.
  3. After approval and merge, CI runs apply, and the instance appears in GCP with the right labels, network, and auto-shutdown.
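
For example, onboarding another analytics user is a one-entry MR; the name and sizing below are hypothetical:

"workbench-instance-analytics-team-user4" = {
  instance_owners = ["[email protected]"]
  machine_type    = "n1-standard-4"
}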

Next, as your team matures, consider moving to OIDC or Workload Identity Federation instead of SA_KEY; add policy checks and cost gates; split projects/states/workspaces for dev/stage/prod; and introduce Code Owners.
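
The federation side can also live in Terraform. A rough sketch of a Workload Identity pool and provider for gitlab.com (the IDs and attribute mapping are assumptions; you would still need to bind the pool to a service account and swap the CI auth step):

resource "google_iam_workload_identity_pool" "gitlab" {
  project                   = var.project_id
  workload_identity_pool_id = "gitlab-ci-pool" # hypothetical ID
}

resource "google_iam_workload_identity_pool_provider" "gitlab" {
  project                            = var.project_id
  workload_identity_pool_id          = google_iam_workload_identity_pool.gitlab.workload_identity_pool_id
  workload_identity_pool_provider_id = "gitlab-ci"

  # Map the GitLab OIDC token's subject to the Google principal.
  attribute_mapping = {
    "google.subject" = "assertion.sub"
  }

  oidc {
    issuer_uri = "https://gitlab.com" # or your self-hosted GitLab URL
  }
}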

7. Conclusion

Moving from manual UI to a Terraform module and CI/CD solves three core problems for Vertex AI Workbench: reproducible configs, transparent cost control, and audit readiness. The module hides network/IAM/idle complexity and exposes only the parameters engineers need. The pipeline standardizes changes and captures history.

The result: fewer manual steps, less config drift, more predictable costs, and an operating model that scales with your team.