
Introduction

GitHub Actions is the go-to CI/CD tool for many teams. But when your organization runs thousands of pipelines daily, the default setup breaks down. You hit limits on scale, security, and governance — plus skyrocketing costs.

GitHub-hosted runners are easy to adopt but expensive at scale, and they often fall short of strict compliance needs. Existing self-hosted solutions like Actions Runner Controller (ARC) or Terraform EC2 modules don’t fully solve multi-tenant isolation, automation, or centralized control.

ForgeMT, built inside Cisco’s Security Business Group, fills that gap. It’s an open-source AWS-native platform that manages ephemeral runners with strong tenant isolation, full automation, and enterprise-grade governance.

This article explains why ForgeMT matters and how it works — providing a practical look at building scalable, secure GitHub Actions runner platforms.

Why Enterprise CI/CD Runners Fail at Scale

At large organizations, scaling GitHub Actions runners encounters four key bottlenecks:

Fragmented Infrastructure: Teams independently choose their CI/CD tools (Jenkins, Travis, CircleCI, or self-hosted runners). This accelerates local delivery but creates duplicated effort, configuration drift, and fragmented monitoring. Without a unified platform, scalability, security, and reliability degrade.
Weak Tenant Isolation: Runners run untrusted code across teams. Without strong isolation, one compromised job can leak credentials or escalate attacks across tenants. Poor audit trails slow breach detection and hinder compliance.
Scalability Limits: Static IP pools cause IPv4 exhaustion, and manual provisioning delays runner startup. Without elastic scaling, resources are wasted or pipelines queue up, killing developer velocity.
Maintenance and Governance Overhead: Uneven patching weakens security, infrastructure drift complicates troubleshooting, and audits become expensive and error-prone. Secure scaling demands centralized governance, consistent policy enforcement, and automation.

In short, enterprises fail to scale GitHub Actions runners without a platform that:

Centralizes multi-tenancy
Automates lifecycle management
Provides enterprise-grade observability and governance

But beware—over-centralization can kill flexibility and introduce new challenges.

Why GitHub Actions — And Why It’s Not Enough at Enterprise Scale

GitHub Actions is popular because it offers:

Deep GitHub integration: triggers on PRs, branches, and tags with no extra logins, plus automatic secret and artifact handling.
Extensible ecosystem: thousands of marketplace actions simplify workflow creation.
Flexible runners: GitHub-hosted runners for convenience, or self-hosted for control, cost savings, and compliance.
Granular security: native GitHub Apps, OIDC tokens, and fine-grained permissions enforce least privilege.
Rapid scale: pipelines at repo or org level enable smooth CI/CD growth.

However, GitHub Actions alone can’t meet enterprise-scale demands. Enterprises require:

Strong tenant isolation and centralized governance across thousands of pipelines.
A unified platform to avoid fragmented infrastructure and scaling bottlenecks.
Fine-grained identity, network controls, and compliance enforcement.
Automation for onboarding, patching, and auditing to reduce operational overhead.

Cloud providers like AWS supply the identity, networking, and automation building blocks (IAM/OIDC, VPC segmentation, EC2, EKS) needed to build secure, scalable, multi-tenant CI/CD platforms.

Existing Solutions and Why They Fall Short

Actions Runner Controller (ARC) runs ephemeral Kubernetes pods as GitHub runners, scaling dynamically with declarative config and Kubernetes-native integration. But:

Kubernetes namespaces alone don’t provide strong security isolation.
No native AWS IAM/OIDC integration.
Lacks onboarding, governance, and audit automation.
Network policy management is manual, increasing operational overhead.

Terraform AWS GitHub Runner Module provisions EC2 self-hosted runners with customizable AMIs, integrating well with IaC pipelines. However:

Typically deployed per team, causing fragmentation.
No native multi-tenant isolation.
Requires manual IAM and account setup.
No onboarding or patching automation.

Commercial Runner-as-a-Service options offer simple UX, automatic scaling, and vendor-managed maintenance with SLAs, but:

High costs at scale.
Vendor lock-in risks.
Limited multi-tenant isolation.
Often don’t meet strict compliance requirements.

Where ForgeMT Fits In

ForgeMT combines the best of these approaches to deliver an enterprise-ready platform:

Orchestrates ephemeral runners seamlessly.
Uses AWS-native identity (IAM/OIDC) and network isolation.
Built-in governance with full lifecycle automation.
Designed for large, security-focused organizations.

ForgeMT doesn’t reinvent ARC or EC2 modules but extends them with:

Strict multi-tenant isolation: Each team runs in a separate AWS account to contain blast radius. IAM/OIDC enforces least privilege. Calico CNI manages Kubernetes network segmentation.
Full automation: Tenant onboarding, runner patching, centralized monitoring, and drift remediation happen automatically, cutting manual toil and errors.
Centralized control plane: One dashboard securely manages all tenants with governance, audit logs, and compliance-ready traceability.
Cost optimization: Spot instances, warm pools, and autoscaling based on real-time metrics and spot prices reduce costs without sacrificing availability.
Open-source transparency: 100% open source—no vendor lock-in, no license fees, full customization freedom.
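
To make the cost-optimization point concrete, here is a minimal sketch of a price-aware capacity choice. It is not ForgeMT's implementation: the `Offer` type, the `choose_capacity` helper, the `max_spot_ratio` threshold, and the prices in the usage note are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """One candidate instance type with its current prices (hypothetical shape)."""
    instance_type: str
    spot_price: Optional[float]  # None means no spot capacity is available right now
    on_demand_price: float


def choose_capacity(offers: list[Offer], max_spot_ratio: float = 0.7) -> tuple[str, str]:
    """Return (instance_type, market) for the cheapest acceptable offer.

    Spot is considered for a type only while its price is at most
    max_spot_ratio of that type's on-demand price; otherwise the
    on-demand price is used for that type.
    """
    candidates = []
    for offer in offers:
        spot_ok = (
            offer.spot_price is not None
            and offer.spot_price <= max_spot_ratio * offer.on_demand_price
        )
        if spot_ok:
            candidates.append((offer.spot_price, offer.instance_type, "spot"))
        else:
            candidates.append((offer.on_demand_price, offer.instance_type, "on-demand"))
    if not candidates:
        raise ValueError("no offers to choose from")
    # Tuples sort by price first, so min() picks the cheapest candidate.
    _, instance_type, market = min(candidates)
    return instance_type, market
```

With made-up prices, `choose_capacity([Offer("t3.medium", 0.0125, 0.0416), Offer("t3.large", None, 0.0832)])` picks the `t3.medium` spot offer, and the function falls back to on-demand when spot is missing or too close to the on-demand price.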

Architecture Overview

At its core, ForgeMT is a centralized control plane that orchestrates ephemeral runner provisioning and lifecycle management across multiple tenants running on both EC2 and Kubernetes.

Key Components

Terraform module for EC2 runners — provisions ephemeral EC2 runners with autoscaling, spot/on-demand, and ephemeral lifecycle.
Actions Runner Controller (ARC) — manages EKS-based runners as Kubernetes pods with tenant namespace isolation.
OpenTofu + Terragrunt — Infrastructure as Code managing tenant/account/region deployments declaratively.
IAM Trust Policies — secure runner access with ephemeral credentials via role assumption.
Splunk & Observability — centralized logs and metrics per tenant.
Teleport — secure SSH access to ephemeral runners for auditing and debugging.
EKS + Calico CNI — scalable pod networking with strong tenant segmentation and minimal IP usage.
EKS + Karpenter — demand-driven node autoscaling with spot and on-demand instances, plus warm pools.

ForgeMT Control Plane

The control plane is the platform’s brain — managing runner provisioning, lifecycle, security, scaling, and observability.

Centralized Orchestration: Decides when and where to spin up ephemeral runners (EC2 or Kubernetes pods).
Multi-Tenant Isolation: Isolates each tenant via dedicated AWS accounts or Kubernetes namespaces, IAM roles, and network policies.
Security Enforcement: Applies hardened runner configurations, automates ephemeral credential rotation, and enforces least privilege.
Scaling & Optimization: Integrates with Karpenter and EC2 autoscaling to scale runners up/down with demand and cost awareness.
Observability & Governance: Streams logs and metrics to Splunk; provides audit trails and compliance dashboards.
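
The orchestration decision (EC2 or Kubernetes pod) can be pictured as a lookup from a job's labels to a runner spec. The sketch below assumes that a job's `runs-on` labels carry the spec name, which the source does not confirm. `route_job` is a hypothetical helper, and the section names are borrowed from the tenant config shown later in this article.

```python
def route_job(labels: set[str], config: dict) -> tuple[str, str]:
    """Pick a backend ("ec2" or "arc") and a spec name from a job's labels.

    EC2 specs are checked first, then ARC specs; the first spec name that
    appears among the labels wins. The ordering is an arbitrary choice
    made for this sketch.
    """
    sections = (("ec2", "ec2_runner_specs"), ("arc", "arc_runner_specs"))
    for backend, section in sections:
        for spec_name in config.get(section, {}):
            if spec_name in labels:
                return backend, spec_name
    raise LookupError(f"no runner spec matches labels {sorted(labels)}")
```

A job labelled `{"self-hosted", "large"}` would land on the EC2 `large` spec, while `{"k8s"}` would go to the ARC `k8s` scale set. Jobs with no matching label fail loudly instead of silently queuing.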

Runner Types and Usage

Tenant Isolation

Each ForgeMT deployment is single-tenant and region-specific. IAM roles, policies, VPCs, and services are scoped exclusively to that tenant-region pair. This hard boundary prevents cross-tenant access, simplifies compliance, and minimizes blast radius.

EC2 Runners

Ephemeral VMs booted from Forge-provided or tenant-custom AMIs.
Jobs run directly on VMs or inside containers.
IAM role assumption replaces static credentials.
Terminated after each job to avoid drift or leaks.

EKS Runners

Managed by ARC as Kubernetes pods in tenant namespaces.
Images pulled from Forge or tenant ECR repositories.
Scales dynamically for burst workloads.

Warm Pools and Limits

ForgeMT supports warm pools of pre-initialized runners to minimize cold start latency—especially beneficial for EC2 runners with slower boot times.

Per-tenant limits enforce:

Max concurrent runners
Warm pool size
Runner lifetime (auto-termination after jobs)

These controls prevent resource abuse and keep costs predictable.
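
A minimal sketch of how such limits might interact, assuming a simple model in which the platform tries to cover queued jobs and then refill the warm pool without exceeding the concurrency cap. `TenantLimits`, `runners_to_launch`, and `is_expired` are hypothetical names, not ForgeMT APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantLimits:
    max_concurrent_runners: int
    warm_pool_size: int
    max_runner_lifetime_s: int


def runners_to_launch(limits: TenantLimits, active: int, idle: int, queued_jobs: int) -> int:
    """Number of new runners to start now.

    active counts every live runner (busy and idle); idle is the warm subset.
    """
    wanted_idle = queued_jobs + limits.warm_pool_size  # cover the queue, then refill the pool
    shortfall = max(0, wanted_idle - idle)
    headroom = max(0, limits.max_concurrent_runners - active)
    return min(shortfall, headroom)  # never exceed the concurrency cap


def is_expired(limits: TenantLimits, started_at: float, now: float) -> bool:
    """True once a runner has lived for the maximum allowed lifetime."""
    return now - started_at >= limits.max_runner_lifetime_s
```

The concurrency cap always wins in this model: a tenant at its maximum gets no new runners even with a long queue, which is what keeps costs bounded.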

Tenant Onboarding

Deploying a new tenant is straightforward and fully automated via a single declarative config file, for example:

gh_config:
  ghes_url: ''
  ghes_org: cisco-open
tenant:
  iam_roles_to_assume:
    - arn:aws:iam::123456789012:role/role_for_forge_runners
  ecr_registries:
    - 123456789012.dkr.ecr.eu-west-1.amazonaws.com
ec2_runner_specs:
  small:
    ami_name: forge-gh-runner-v*
    ami_owner: '123456789012'
    ami_kms_key_arn: ''
    max_instances: 1
    instance_types:
      - t2.small
      - t2.medium
      - t2.large
      - t3.small
      - t3.medium
      - t3.large
    pool_config: []
    volume:
      size: 200
      iops: 3000
      throughput: 125
      type: gp3
  large:
    ami_name: forge-gh-runner-v*
    ami_owner: '123456789012'
    ami_kms_key_arn: ''
    max_instances: 1
    instance_types:
      - c6i.8xlarge
      - c5.9xlarge
      - c5.12xlarge
      - c6i.12xlarge
      - c6i.16xlarge
    pool_config: []
    volume:
      size: 200
      iops: 3000
      throughput: 125
      type: gp3
arc_runner_specs:
  dind:
    runner_size:
      max_runners: 100
      min_runners: 1
    scale_set_name: dependabot
    scale_set_type: dind
    container_actions_runner: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/actions-runner:latest
    container_requests_cpu: 500m
    container_requests_memory: 1Gi
    container_limits_cpu: '1'
    container_limits_memory: 2Gi
    volume_requests_storage_type: gp2
    volume_requests_storage_size: 10Gi
  k8s:
    runner_size:
      max_runners: 100
      min_runners: 1
    scale_set_name: k8s
    scale_set_type: k8s
    container_actions_runner: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/actions-runner:latest
    container_requests_cpu: 500m
    container_requests_memory: 1Gi
    container_limits_cpu: '1'
    container_limits_memory: 2Gi
    volume_requests_storage_type: gp2
    volume_requests_storage_size: 10Gi


The ForgeMT platform uses this config to:

Provision tenant-specific AWS accounts and resources.
Set IAM roles with least privilege trust policies.
Configure GitHub integration and runner specs.
Enforce tenant limits and runner types.

This automation enables zero-touch onboarding with no manual AWS or GitHub setup required by the tenant.
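
Because onboarding hinges on one config file, a pre-flight check before applying it can catch mistakes early. The sketch below is an assumption-laden illustration: it takes the config as an already-parsed dictionary, and the specific rules (required `ghes_org`, non-empty `instance_types`, `min_runners` not above `max_runners`) are guesses at sensible constraints, not ForgeMT's documented validation.

```python
def validate_tenant_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config passed."""
    problems = []

    # Assumed rule: a GitHub organization must be named.
    if not config.get("gh_config", {}).get("ghes_org"):
        problems.append("gh_config.ghes_org is required")

    # Assumed rules for EC2 specs: at least one instance and one instance type.
    for name, spec in config.get("ec2_runner_specs", {}).items():
        if spec.get("max_instances", 0) < 1:
            problems.append(f"ec2_runner_specs.{name}.max_instances must be >= 1")
        if not spec.get("instance_types"):
            problems.append(f"ec2_runner_specs.{name}.instance_types must not be empty")

    # Assumed rule for ARC specs: the scale-set bounds must be ordered.
    for name, spec in config.get("arc_runner_specs", {}).items():
        size = spec.get("runner_size", {})
        if size.get("min_runners", 0) > size.get("max_runners", 0):
            problems.append(f"arc_runner_specs.{name}: min_runners exceeds max_runners")

    return problems
```

Returning a list of problems instead of raising on the first one lets a tenant fix every issue in a single pass.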

Extensibility

ForgeMT lets tenants customize their environments and control runner access:

Custom AMIs for EC2 runners with tenant-specific tooling.
Private ECR repositories to host container images for VMs or Kubernetes.
Tenant IAM roles with trust policies so ForgeMT runners assume them securely without static keys.
Advanced access patterns like chained role assumptions or resource-based policies for complex needs.

This lets each team tune cost, security, and performance independently without affecting core platform stability.
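
For the tenant IAM roles mentioned above, the trust policy is what allows a runner role to assume a tenant role without static keys. The sketch below builds a standard IAM trust policy document in Python. The runner role ARN in the usage note is a placeholder, and the helper name `tenant_trust_policy` is hypothetical; the exact policy ForgeMT expects may differ.

```python
import json


def tenant_trust_policy(runner_role_arn: str) -> dict:
    """Trust policy a tenant could attach to the role that runners assume.

    It allows exactly one principal (the runner role) to call sts:AssumeRole.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": runner_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def render_policy(policy: dict) -> str:
    """Serialize the policy as indented JSON, ready to paste into IaC or the console."""
    return json.dumps(policy, indent=2)
```

For example, `render_policy(tenant_trust_policy("arn:aws:iam::111122223333:role/forge-runner"))` yields the JSON document for a role such as `role_for_forge_runners`. Naming a single principal, not a whole account, keeps the grant as narrow as possible.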

Security Model

ForgeMT’s foundation is strong isolation and ephemeral execution to reduce risk:

Dedicated IAM roles, namespaces, and AWS accounts per tenant.
No cross-tenant visibility or access.
Ephemeral runners destroyed immediately after job completion to prevent credential or data leakage.
Temporary credentials via IAM role assumption replace static AWS keys.
Fine-grained access control configurable by tenants for resource permissions.
Full audit trail of provisioning, execution, and shutdown logged via CloudWatch → Splunk.
Meets CIS Benchmarks and internal security policies.
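
An audit trail covering provisioning, execution, and shutdown is easiest to search when every lifecycle step emits one structured record. The following sketch shows one possible record shape as a JSON line. The event names and field names are assumptions made for illustration, not ForgeMT's actual log schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical lifecycle stages, mirroring provisioning, execution, and shutdown.
LIFECYCLE_EVENTS = ("provisioned", "job_started", "job_finished", "terminated")


def audit_record(tenant: str, runner_id: str, event: str, when: datetime) -> str:
    """Return one JSON line describing a runner lifecycle event.

    `when` should be timezone-aware; it is normalized to UTC so that
    records from different regions sort consistently.
    """
    if event not in LIFECYCLE_EVENTS:
        raise ValueError(f"unknown lifecycle event: {event}")
    record = {
        "timestamp": when.astimezone(timezone.utc).isoformat(),
        "tenant": tenant,
        "runner_id": runner_id,
        "event": event,
    }
    return json.dumps(record, sort_keys=True)
```

Tagging every record with the tenant makes per-tenant filtering in a log platform a simple field match.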

Debugging in a Secure, Ephemeral World

Because runners are ephemeral, there is no long-lived host to inspect after a job ends. This is by design, but ForgeMT still offers:

Live debugging with Teleport: Keep runners alive temporarily via workflow tweaks to enable SSH into running jobs.
Reproducible reruns: Failed jobs can be rerun identically from GitHub UI.
Log-based troubleshooting: Access runner telemetry, syslogs, and job logs centrally without infrastructure exposure.
Kubernetes support: Same debugging mechanisms apply to EKS runners, preserving isolation and auditability.
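
One way to picture the "keep runners alive temporarily" tweak is a hold step that a workflow runs after a failure, blocking until an engineer finishes the SSH session or a timeout expires. The sketch below is an assumption about how such a step could work, not ForgeMT's mechanism. The sentinel-file convention and the `hold_for_debug` helper are hypothetical.

```python
import os
import time


def hold_for_debug(sentinel_path: str, timeout_s: float, poll_s: float = 5.0,
                   sleep=time.sleep, clock=time.monotonic) -> bool:
    """Block while sentinel_path exists, for at most timeout_s seconds.

    Returns True if the hold was released because the sentinel was removed,
    and False if the timeout expired first. sleep and clock are injectable
    so the function can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while os.path.exists(sentinel_path):
        if clock() >= deadline:
            return False  # give up so the ephemeral runner can still be reclaimed
        sleep(poll_s)
    return True
```

The hard timeout matters as much as the hold itself: it keeps a forgotten debug session from turning an ephemeral runner into a long-lived one.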

Conclusion

ForgeMT is likely overkill for small teams. Start simple with ephemeral runners (EC2 or ARC), GitHub Actions, and Terraform automation, and scale up only when you hit real pain points. ForgeMT shines in multi-team environments where tenant isolation, governance, and platform automation are mission-critical. For a single team, it just adds unnecessary complexity.

ForgeMT addresses the major enterprise challenges of running GitHub Actions runners at scale by delivering:

Strong multi-tenant isolation
Fully automated lifecycle management and governance
Flexible runner types with cost-aware autoscaling and warm pools
Secure, ephemeral environments that meet compliance needs
An open-source, extensible platform for customization

For organizations struggling to scale self-hosted runners securely and efficiently on AWS, ForgeMT provides a battle-tested, transparent platform that combines AWS best practices with developer-friendly automation.

Dive Into the ForgeMT Project

Ideas are cheap — execution is what counts. ForgeMT’s source code is public — check it out:

👉 https://github.com/cisco-open/forge/

⭐️ If you find it useful, don’t forget to drop a star!


Scaling GitHub Actions on AWS with ForgeMT’s Security and Multi-Tenancy