I work with the engineering team of a global live-streaming platform, and we recently built a planet-scale WebRTC Selective Forwarding Unit (SFU) on AWS with Kubernetes that keeps end-to-end latency below 150 ms (P95 = 140 ms) even across continents.


Key features include:

- Autoscaling driven by real-time SFU metrics (outbound video track count), with KEDA stepping in for bursts. In testing, even under 2% loss and 200 ms of injected network "chaos" delay, CPU usage only rose from ~20% to ~60% and we still met our latency target.
- Geo-sharding, which cut our global data egress costs by ~70%.
- End-to-end DTLS/SRTP encryption, mutual TLS between nodes, private ALBs, and regular certificate rotation for compliance.


Problem Statement: Why a Single-Region SFU Fails

When a WebRTC SFU is hosted in a single region, long-distance calls suffer high round-trip time (RTT) and jitter, degrading quality. For example, a call from Australia to a US-based SFU easily incurs 200+ ms of RTT before any media processing. High RTT combined with even small packet loss drastically reduces achievable video throughput, because congestion-control algorithms back off aggressively when loss is detected. Our lab tests showed packet-drop rates rising sharply on congested, high-RTT paths. In practice, this means user-perceived latency well above 150 ms and frequent frame freezes for remote participants. A single-region SFU also concentrates egress traffic in one data center, inflating bandwidth costs (cross-region transit and NAT gateway fees apply). These issues make a single-region SFU architecture unsustainable at global scale.
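As a rough illustration of the effect (this uses the classic Mathis bound for TCP-style congestion control and an assumed 1200-byte packet size; WebRTC's congestion controllers differ in detail, but the qualitative behavior is similar), a 200 ms path with 2% loss caps a single flow well below a typical 2 Mbps video stream:

throughput ≲ MSS / (RTT × √p) = (1200 bytes × 8) / (0.2 s × √0.02) ≈ 340 kbps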





Architecture Overview

We deployed a multi-region Kubernetes architecture. Each major geographic region runs its own Amazon EKS cluster (for example, us-east-1, eu-central-1, etc.) with identical SFU deployments. We put Amazon Route 53 in front with latency-based DNS routing, so a client's DNS query is answered with the IP of the closest healthy region. To keep signaling sticky, our authentication gateway issues a JSON Web Token (JWT) that embeds the selected region (a "geo-sticky JWT"), so subsequent reconnects use the same cluster. This ensures a client's WebRTC PeerConnections always terminate on one regional SFU.
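A geo-sticky token's payload might look roughly like this (the claim names and values are illustrative; the article does not spell out the exact schema):

{
  "sub": "user-1234",
  "region": "eu-central-1",
  "sfu_host": "sfu.eu-central-1.meet.example.com",
  "exp": 1767225600
}

On reconnect, the signaling gateway validates the token and routes the client back to the cluster named in the region claim instead of re-running region selection.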


sequenceDiagram
    participant C as Client
    participant DNS as Route53 (Latency DNS)
    participant SFU1 as SFU Cluster (Region A)
    participant SFU2 as SFU Cluster (Region B)
    C->>DNS: Resolve "meet.example.com"
    DNS->>C: IP of SFU1 (lowest RTT)
    C->>SFU1: ICE candidate exchange (STUN/TURN)
    SFU1->>C: ICE checks succeed
    C->>SFU1: DTLS ClientHello
    SFU1->>C: DTLS ServerHello (handshake complete)
    C->>SFU1: SRTP media (encrypted)
    SFU1->>C: SRTP media (encrypted)


Components:

- Amazon Route 53 with latency-based routing, answering each client's DNS query with the closest healthy region.
- One Amazon EKS cluster per major region, each running an identical SFU deployment.
- An authentication gateway that issues geo-sticky JWTs so reconnects land on the same regional cluster.
- STUN/TURN servers used during ICE candidate exchange for NAT traversal.

Auto-Scaling in Depth

Our SFU pods autoscale based on real-time load. We built a custom metric: outboundVideoTracks, counting how many video streams a pod is forwarding. The Kubernetes Horizontal Pod Autoscaler (HPA) is configured to scale up/down by monitoring this metric (via Prometheus or CloudWatch). The HPA controller queries our custom metric API and adjusts replica count to meet demand. For example, if each peer sends a 2 Mbps stream and an SFU CPU core can handle ~50 Mbps, then capacity per pod is roughly 50/2=25 peers. A simple capacity formula is:

N = (bitrate × peers) / (SFU_CPU_capacity × regions)

(e.g. N = (2 Mbps × 100 peers) / (50 Mbps × 2 regions) = 2 pods per region).
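For reference, a minimal HPA sketch against that custom metric could look like the following (the metric name, target value, and deployment name here are assumptions; the metric must be exposed through the custom metrics API, e.g. via the Prometheus Adapter):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sfu-track-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sfu-deployment
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: outboundVideoTracks        # custom per-pod metric described above
        target:
          type: AverageValue
          averageValue: "25"               # ~25 forwarded 2 Mbps tracks ≈ 50 Mbps per pod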


To handle sudden spikes (e.g. a surge of new conference participants), we use KEDA (Kubernetes Event-driven Autoscaler) for step scaling. KEDA watches an event stream or burst counter (for instance, the rate of new WebSocket connections) and can instantly trigger multiple pods. An example KEDA ScaledObject YAML might look like:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sfu-burst-scaler
spec:
  scaleTargetRef:
    name: sfu-deployment            # the SFU Deployment to scale
  triggers:
    - type: kafka
      metadata:
        topic: "sfu-signaling-events"
        bootstrapServers: "kafka:9092"
        lagThreshold: "100"         # scale out once consumer lag exceeds 100 events
        consumerGroup: "sfu-group"

This configuration tells KEDA to read Kafka topic lag; if new-join events pile up (lag >100), it will scale the SFU deployment up aggressively. In summary, normal load is handled by HPA on our custom metrics, and KEDA provides on-demand extra capacity for bursts.


Chaos Benchmarks

We rigorously tested SFU resilience under adverse conditions. Using Linux tc (traffic control) netem, we injected 200 ms delay and 2% packet loss on SFU pods’ network interfaces. For example:

$ tc qdisc add dev eth0 root netem delay 200ms loss 2%
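
Two companion commands (standard tc usage) are handy for inspecting the impairment during the test and removing it afterwards:

$ tc qdisc show dev eth0
$ tc qdisc del dev eth0 root netem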


This "chaos" emulated a poor network path. We then ran a standard 10-party video call and measured the same metrics. CPU usage on each SFU pod rose from ~20% to ~60%, and P95 latency increased but stayed under 150 ms thanks to built-in buffering and Forward Error Correction (FEC). A software MCU under the same conditions would likely have hit timeouts or frozen.


Below is a summary of before/after metrics:

Metric                  Baseline    Under 2% Loss / 200 ms Delay
CPU (avg per pod)       20%         60%
Observed Packet Loss    0%          2%
Jitter (P90)            10 ms       50 ms


Even with the induced latency and loss, the SFU gracefully handled retransmissions, and jitter increased (P90 from 10 ms to 50 ms) but stayed within acceptable bounds. (By contrast, direct peer-to-peer calls would likely have frozen under 2% loss at 200 ms RTT.) We logged the netem stats to CloudWatch and confirmed the pods did not crash or OOM.


Security & Compliance

We enforce strict security for all media and control planes. DTLS-SRTP encryption is mandatory for WebRTC media per RFC 8827: every media channel must be secured via DTLS keying to SRTP. We also establish mutual TLS (mTLS) for all inter-node and control traffic between SFU pods and microservices, ensuring that even internal endpoints verify each other’s certificates. All Kubernetes ingress points use private Application Load Balancers (ALBs) with HTTPS; we do not expose any SFU pod directly to the public internet.
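As an illustration of the private-ALB pattern, an Ingress for the signaling endpoints might look like the sketch below (hostnames, service names, and the certificate ARN are placeholders; this assumes the AWS Load Balancer Controller is installed):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sfu-signaling
  annotations:
    alb.ingress.kubernetes.io/scheme: internal                   # private ALB, not internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...   # placeholder ACM certificate
spec:
  ingressClassName: alb
  rules:
    - host: signaling.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sfu-signaling
                port:
                  number: 443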


We manage TLS certificates centrally (using AWS Certificate Manager or cert-manager) with automated renewal: certificates rotate before expiry, and we have alarms if any certificate is near expiration. For compliance (e.g. GDPR, HIPAA), we log access and metrics to CloudWatch Logs (encrypted at rest), and we implement network ACLs and Security Groups to isolate clusters. In summary, WebRTC media is always end-to-end encrypted (DTLS/SRTP), node-to-node control paths use mTLS, and we follow AWS best practices for secrets and cert rotation.
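On the cert-manager side, the rotation policy lives on the Certificate resource itself; a minimal sketch (issuer, secret, and DNS names are illustrative) might be:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: sfu-internal-mtls
spec:
  secretName: sfu-internal-mtls-tls
  duration: 2160h        # 90-day certificates
  renewBefore: 360h      # re-issue 15 days before expiry
  dnsNames:
    - sfu.internal.example.com
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer

cert-manager re-issues the certificate and updates the backing Secret automatically, so rotation happens without manual intervention.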


Cost Analysis – Single-Region vs Multi-Region

Using multiple regions does add overhead (extra EKS clusters, control-plane instances, and NAT gateways per region), but it dramatically cuts network egress costs. In a single-region setup, all intercontinental streams incur cross-region data-transfer charges (roughly $0.01–0.02/GB) plus public internet egress (typically $0.09/GB), and NAT Gateways in each VPC add about $0.045/GB (US pricing) for processing. After geo-sharding, roughly 70% of media traffic stays local, and our egress spending dropped by ~70%. For example, moving EU users to an EU region saved roughly a third of transit costs and eliminated the corresponding NAT charges.
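As a rough, hypothetical worked example (the 50 TB/month traffic volume is assumed; the per-GB rates are the ones quoted above), the cross-region plus NAT component scales directly with how much traffic leaves its home region:

single region:  50,000 GB × ($0.02 + $0.045) ≈ $3,250/month
geo-sharded:    15,000 GB × ($0.02 + $0.045) ≈ $975/month (≈70% of traffic stays local)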


On the compute side, multi-region means roughly 2× the EC2/EKS cost (for two regions) plus some redundant idle pods. Using Spot Instances for SFU worker pods (roughly half the on-demand price) and aggressive downscaling during idle hours offsets much of this. In sum, total monthly spend (EC2 + data transfer + NAT) is much lower with multi-region at scale, because inter-region bandwidth costs far outweigh the extra VM costs for large user bases.


Lessons Learned & Future Work


Building a planet-scale SFU taught us several practical lessons. First, test under network "chaos" early: inject delay and loss to catch bottlenecks before users do. Second, keep the data plane stateless: SFU pods don't share RTP state across regions, so they fail independently. Third, design autoscaling conservatively: over-provision slightly so bursts are absorbed without cold-start latency.


For future improvements, we're experimenting with QUIC (HTTP/3) as a WebTransport backend to further cut handshake overhead and improve performance over lossy links. We also plan to implement dynamic SVC layer adaptation on the SFU: today we fall back to keyframes from the lowest-quality layer on loss, but dynamic SVC would let clients subscribe to multiple spatial and temporal layers. Finally, WebTransport (still an IETF draft) may replace WebSocket for signaling and data channels to reduce overhead. We will continue monitoring performance via CloudWatch, tuning the scaling formulas, and adopting new WebRTC features as they mature.


References