I work with the engineering team of a global live-streaming platform, and we recently built a planet-scale WebRTC Selective Forwarding Unit (SFU) on AWS with Kubernetes that keeps end-to-end latency below 150 ms (P95 = 140 ms) even across continents.


Key features include:

- Autoscaling driven by real-time SFU metrics (outbound video track count), with KEDA stepping in for bursts. In testing, even under 2% loss and 200 ms of injected network "chaos" delay, CPU usage only rose from ~20% to ~60% and we still met our latency target.
- Geo-sharding, which cut our global data egress costs by ~70%.
- End-to-end DTLS/SRTP encryption, mutual TLS between nodes, private ALBs, and regular certificate rotation for compliance.


Problem Statement: Why a Single-Region SFU Fails

When a WebRTC SFU is hosted in a single region, long-distance calls suffer high round-trip time (RTT) and jitter, degrading quality. For example, a call from Australia to a US-based SFU easily incurs 200+ ms of RTT before any media processing. High RTT combined with even small packet loss drastically reduces achievable video throughput, because congestion-control algorithms back off aggressively when loss is detected. Our lab tests showed packet-drop rates rising sharply on congested, high-RTT paths. In practice, this means user-perceived latency well above 150 ms and frequent frame freezes for remote participants. A single-region SFU also concentrates egress traffic in one data center, inflating bandwidth costs (cross-region transit and NAT gateway fees apply). These issues make a single-region SFU architecture unsustainable at global scale.
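As a rough illustration of the effect (this uses the classic Mathis bound for TCP-style congestion control and an assumed 1200-byte packet size; WebRTC's congestion controllers differ in detail, but the qualitative behavior is similar), a 200 ms path with 2% loss caps a single flow well below a typical 2 Mbps video stream:

throughput ≲ MSS / (RTT × √p) = (1200 bytes × 8) / (0.2 s × √0.02) ≈ 340 kbps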





Architecture Overview

We deployed a multi-region Kubernetes architecture. Each major geographic region runs its own Amazon EKS cluster (for example, us-east-1, eu-central-1, etc.) with identical SFU deployments. We put Amazon Route 53 in front with latency-based DNS routing, so a client's DNS query is answered with the IP of the closest healthy region. To keep signaling sticky, our authentication gateway issues a JSON Web Token (JWT) that embeds the selected region (a "geo-sticky JWT"), so subsequent reconnects use the same cluster. This ensures a client's WebRTC PeerConnections always terminate on one regional SFU.
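A geo-sticky token's payload might look roughly like this (the claim names and values are illustrative; the article does not spell out the exact schema):

{
  "sub": "user-1234",
  "region": "eu-central-1",
  "sfu_host": "sfu.eu-central-1.meet.example.com",
  "exp": 1767225600
}

On reconnect, the signaling gateway validates the token and routes the client back to the cluster named in the region claim instead of re-running region selection.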


sequenceDiagram
    participant C as Client
    participant DNS as Route53 (Latency DNS)
    participant SFU1 as SFU Cluster (Region A)
    participant SFU2 as SFU Cluster (Region B)
    C->>DNS: Resolve "meet.example.com"
    DNS->>C: IP of SFU1 (lowest RTT)
    C->>SFU1: ICE candidate exchange (STUN/TURN)
    SFU1->>C: ICE checks succeed
    C->>SFU1: DTLS ClientHello
    SFU1->>C: DTLS ServerHello (handshake complete)
    C->>SFU1: SRTP media (encrypted)
    SFU1->>C: SRTP media (encrypted)


Components:

- Amazon Route 53 with latency-based routing, answering each client's DNS query with the closest healthy region.
- One Amazon EKS cluster per major region, each running an identical SFU deployment.
- An authentication gateway that issues geo-sticky JWTs so reconnects land on the same regional cluster.
- STUN/TURN servers used during ICE candidate exchange for NAT traversal.

Auto-Scaling in Depth

Our SFU pods autoscale based on real-time load. We built a custom metric: outboundVideoTracks, counting how many video streams a pod is forwarding. The Kubernetes Horizontal Pod Autoscaler (HPA) is configured to scale up/down by monitoring this metric (via Prometheus or CloudWatch). The HPA controller queries our custom metric API and adjusts replica count to meet demand. For example, if each peer sends a 2 Mbps stream and an SFU CPU core can handle ~50 Mbps, then capacity per pod is roughly 50/2=25 peers. A simple capacity formula is:

N = (bitrate × peers) / (SFU_CPU_capacity × regions)

(e.g. N = (2 Mbps × 100 peers) / (50 Mbps × 2 regions) = 2 pods per region).
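For reference, a minimal HPA sketch against that custom metric could look like the following (the metric name, target value, and deployment name here are assumptions; the metric must be exposed through the custom metrics API, e.g. via the Prometheus Adapter):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sfu-track-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sfu-deployment
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: outboundVideoTracks        # custom per-pod metric described above
        target:
          type: AverageValue
          averageValue: "25"               # ~25 forwarded 2 Mbps tracks ≈ 50 Mbps per pod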


To handle sudden spikes (e.g. a surge of new conference participants), we use KEDA (Kubernetes Event-driven Autoscaler) for step scaling. KEDA watches an event stream or burst counter (for instance, the rate of new WebSocket connections) and can instantly trigger multiple pods. An example KEDA ScaledObject YAML might look like:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sfu-burst-scaler
spec:
  scaleTargetRef:
    name: sfu-deployment            # the SFU Deployment to scale
  triggers:
    - type: kafka
      metadata:
        topic: "sfu-signaling-events"
        bootstrapServers: "kafka:9092"
        lagThreshold: "100"         # scale out once consumer lag exceeds 100 events
        consumerGroup: "sfu-group"

This configuration tells KEDA to read Kafka topic lag; if new-join events pile up (lag >100), it will scale the SFU deployment up aggressively. In summary, normal load is handled by HPA on our custom metrics, and KEDA provides on-demand extra capacity for bursts.


Chaos Benchmarks

We rigorously tested SFU resilience under adverse conditions. Using Linux tc (traffic control) netem, we injected 200 ms delay and 2% packet loss on SFU pods’ network interfaces. For example:

$ tc qdisc add dev eth0 root netem delay 200ms loss 2%
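
Two companion commands (standard tc usage) are handy for inspecting the impairment during the test and removing it afterwards:

$ tc qdisc show dev eth0
$ tc qdisc del dev eth0 root netem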


This "chaos" emulated a poor network path. We then ran a standard 10-party video call and measured the same metrics. CPU usage on each SFU pod rose from ~20% to ~60%, and P95 latency increased but stayed under 150 ms thanks to built-in buffering and Forward Error Correction (FEC). A software MCU under the same conditions would likely have hit timeouts or frozen.


Below is a summary of before/after metrics:

Metric                  Baseline    Under 2% Loss / 200 ms Delay
CPU (avg per pod)       20%         60%
Observed Packet Loss    0%          2%
Jitter (P90)            10 ms       50 ms


Even with the induced latency and loss, the SFU gracefully handled retransmissions, and jitter increased (P90 from 10 ms to 50 ms) but stayed within acceptable bounds. (By contrast, direct peer-to-peer calls would likely have frozen under 2% loss at 200 ms RTT.) We logged the netem stats to CloudWatch and confirmed the pods did not crash or OOM.


Security & Compliance

We enforce strict security for all media and control planes. DTLS-SRTP encryption is mandatory for WebRTC media per RFC 8827: every media channel must be secured via DTLS keying to SRTP. We also establish mutual TLS (mTLS) for all inter-node and control traffic between SFU pods and microservices, ensuring that even internal endpoints verify each other’s certificates. All Kubernetes ingress points use private Application Load Balancers (ALBs) with HTTPS; we do not expose any SFU pod directly to the public internet.
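As an illustration of the private-ALB pattern, an Ingress for the signaling endpoints might look like the sketch below (hostnames, service names, and the certificate ARN are placeholders; this assumes the AWS Load Balancer Controller is installed):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sfu-signaling
  annotations:
    alb.ingress.kubernetes.io/scheme: internal                   # private ALB, not internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...   # placeholder ACM certificate
spec:
  ingressClassName: alb
  rules:
    - host: signaling.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sfu-signaling
                port:
                  number: 443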


We manage TLS certificates centrally (using AWS Certificate Manager or cert-manager) with automated renewal: certificates rotate before expiry, and we have alarms if any certificate is near expiration. For compliance (e.g. GDPR, HIPAA), we log access and metrics to CloudWatch Logs (encrypted at rest), and we implement network ACLs and Security Groups to isolate clusters. In summary, WebRTC media is always end-to-end encrypted (DTLS/SRTP), node-to-node control paths use mTLS, and we follow AWS best practices for secrets and cert rotation.
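On the cert-manager side, the rotation policy lives on the Certificate resource itself; a minimal sketch (issuer, secret, and DNS names are illustrative) might be:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: sfu-internal-mtls
spec:
  secretName: sfu-internal-mtls-tls
  duration: 2160h        # 90-day certificates
  renewBefore: 360h      # re-issue 15 days before expiry
  dnsNames:
    - sfu.internal.example.com
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer

cert-manager re-issues the certificate and updates the backing Secret automatically, so rotation happens without manual intervention.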


Cost Analysis – Single-Region vs Multi-Region

Using multiple regions does add overhead (extra EKS clusters, control-plane instances, and NAT gateways per region), but it dramatically cuts network egress costs. In a single-region setup, all intercontinental streams incur cross-region data-transfer charges (roughly $0.01–0.02/GB) plus public internet egress (typically $0.09/GB), and NAT Gateways in each VPC add about $0.045/GB (US pricing) for processing. After geo-sharding, roughly 70% of media traffic stays local, and our egress spending dropped by ~70%. For example, moving EU users to an EU region saved roughly a third of transit costs and eliminated the corresponding NAT charges.
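As a rough, hypothetical worked example (the 50 TB/month traffic volume is assumed; the per-GB rates are the ones quoted above), the cross-region plus NAT component scales directly with how much traffic leaves its home region:

single region:  50,000 GB × ($0.02 + $0.045) ≈ $3,250/month
geo-sharded:    15,000 GB × ($0.02 + $0.045) ≈ $975/month (≈70% of traffic stays local)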


On the compute side, multi-region means roughly 2× the EC2/EKS cost (for two regions) plus some redundant idle pods. Using Spot Instances for SFU worker pods (roughly half the on-demand price) and aggressive downscaling during idle hours offsets much of this. In sum, total monthly spend (EC2 + data transfer + NAT) is much lower with multi-region at scale, because inter-region bandwidth costs far outweigh the extra VM costs for large user bases.


Lessons Learned & Future Work


Building a planet-scale SFU taught us several practical lessons. First, test under network "chaos" early: inject delay and loss to catch bottlenecks before users do. Second, keep the data plane stateless: SFU pods don't share RTP state across regions, so they fail independently. Third, design autoscaling conservatively: over-provision slightly so bursts are absorbed without cold-start latency.


For future improvements, we're experimenting with QUIC (HTTP/3) as a WebTransport backend to further cut handshake overhead and improve performance over lossy links. We also plan to implement dynamic SVC layer adaptation on the SFU: today we fall back to keyframes from the lowest-quality layer on loss, but dynamic SVC would let clients subscribe to multiple spatial and temporal layers. Finally, WebTransport (still an IETF draft) may replace WebSocket for signaling and data channels to reduce overhead. We will continue monitoring performance via CloudWatch, tuning the scaling formulas, and adopting new WebRTC features as they mature.


References