Two bridges, one rush hour, zero do-overs

“Cutover weekend” is a fairy tale when you’re migrating thousands of tapes (or billions of objects). Real migrations live in the messy middle: two stacks, two truths, and twice the places for ghosts to hide. The goal isn’t elegance—it’s survivability. You’re not building a bridge and blowing up the old one; you’re running both bridges during rush hour… while replacing deck boards.

TL;DR (for the exec sprinting between status meetings)

You will run both stacks for years, not a weekend. Budget for duplicate telemetry, two alert planes with separate owners, capacity headroom that covers hydration, verification, retries, and growth, a power/cooling peak during the overlap, and a rollback you have actually rehearsed. Then declare a boring, measurable finish line.

Why dual-stack is not optional

In any non-toy environment, your users don’t pause while you migrate. You must:

  1. Serve existing read/write demand on the old stack,
  2. Hydrate and validate data into the new stack, and
  3. Prove parity (fixity + findability + performance envelopes) before you demote the old system.

That’s three states of matter in one datacenter. If you don’t consciously separate them, your queueing theory will do it for you—in the form of backlogs and angry auditors.

The Five Hard Beats (and how to win them)

1) Two telemetry stacks (and why you want both)

Principle: Never collapse new-stack signals into old-stack plumbing. You’ll lose fidelity and paper over regressions.

Old stack (Legacy):

New stack (Target):

Bridging layer:

Operational rule: If a signal can be misread by someone new at 3 AM, duplicate it with a human label (e.g., ec_window_backlog → “EC pending repair chunks”).
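One lightweight way to enforce that rule is a translation table that every dashboard and alert template reads from, so raw metric names always carry a human label. A minimal Python sketch; apart from ec_window_backlog, the metric names are hypothetical placeholders, not real signals from either stack:

```python
# Minimal signal translator: raw metric name -> label a 3 AM responder can read.
# Only ec_window_backlog comes from the text above; the rest are hypothetical.
SIGNAL_LABELS = {
    "ec_window_backlog": "EC pending repair chunks",
    "hydration_queue_depth": "Objects waiting to be copied to the new stack",
    "fixity_verify_lag_s": "Seconds between write and checksum verification",
}

def human_label(raw_name: str) -> str:
    """Return the human-readable label; fall back to a loud marker so an
    unmapped signal stays visible instead of being silently dropped."""
    return SIGNAL_LABELS.get(raw_name, f"UNMAPPED SIGNAL: {raw_name}")

if __name__ == "__main__":
    for name in ("ec_window_backlog", "brand_new_metric"):
        print(f"{name} -> {human_label(name)}")
```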

2) Two alert planes (and how they fail differently)

Alerts are not purely technical; they encode who gets woken up and what’s allowed to break.

Plane A — Legacy SLA plane (keeps the current house standing):

Plane B — Migration plane (keeps the new house honest):

Golden rule: Never page the same person from both planes for the same symptom. Assign clear ownership and escalation. (Otherwise you’ll create the dreaded “two pagers, zero action” syndrome.)
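A cheap guardrail is to keep plane ownership in data and lint it, so no symptom can ever page the same rotation from both planes. A small sketch; the team and symptom names are assumptions, not your real rota:

```python
# Map (alert plane, symptom) -> owning rotation. All names are hypothetical.
ROUTING = {
    ("legacy_sla", "read_latency_breach"): "storage-oncall",
    ("migration", "read_latency_breach"): "migration-oncall",
    ("legacy_sla", "tape_drive_offline"): "storage-oncall",
    ("migration", "fixity_mismatch"): "migration-oncall",
}

def double_paging_violations(routing: dict) -> list[str]:
    """Flag any symptom that would page the same rotation from both planes."""
    problems = []
    for symptom in {sym for _, sym in routing}:
        planes = {plane for plane, sym in routing if sym == symptom}
        owners = {owner for (plane, sym), owner in routing.items() if sym == symptom}
        if len(planes) > 1 and len(owners) == 1:
            problems.append(f"{symptom} pages {owners.pop()} from both planes")
    return problems

if __name__ == "__main__":
    print(double_paging_violations(ROUTING) or "routing OK: no double paging")
```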

3) Capacity headroom math (the part calculators “forget”)

Capacity during overlap has to absorb four things simultaneously:

  1. Working set your users still hit on the legacy stack (reads + occasional writes).
  2. Hydration buffer to ingest into the new stack.
  3. Verification buffer for post-write fixity and temporary duplication.
  4. Retry + rollback cushion for when a batch fails fixity or the new stack sneezes.

Let’s model new-stack headroom with conservative factors.

Variables

  D_total: total data under migration
  d_day: daily hydration rate into the new stack
  W: post-write verification window, in days
  α_recall: fraction of D_total that must stay recall-ready during the overlap
  β_verify: temporary duplication factor while fixity verification completes
  γ_retry: cushion factor for retried and rolled-back batches
  δ_growth: annual organic growth rate
  Overlap_years: planned length of the dual-stack overlap, in years

Headroom formula (new stack required free capacity during overlap):

Headroom_new ≈ (α_recall * D_total) 
             + (β_verify * d_day * W) 
             + (γ_retry * d_day * W) 
             + (δ_growth * D_total * (Overlap_years))

Worked example

Total Headroom_new ≈ 1.6 + 0.25 + 0.17 + 4.0 = 6.02 PB (all four terms in PB)

Reality check: If that number makes you queasy, good. Most teams under-provision growth and verification windows. You can attack it by shortening W (risk trade), throttling d_day (time trade), or using a larger on-prem cache with cloud spill (cost trade). Pick your poison intentionally.
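If you want this formula as a living artifact instead of a slide, a short calculator makes the W and d_day trades visible. The inputs below are hypothetical placeholders chosen so the four terms land on the same component totals as the worked example above; they are illustrative, not the original numbers:

```python
def headroom_new(alpha_recall: float, d_total_pb: float, beta_verify: float,
                 d_day_pb: float, w_days: float, gamma_retry: float,
                 delta_growth: float, overlap_years: float) -> float:
    """Required free capacity (PB) on the new stack during overlap:
    recall working set + verification buffer + retry cushion + growth."""
    return (alpha_recall * d_total_pb
            + beta_verify * d_day_pb * w_days
            + gamma_retry * d_day_pb * w_days
            + delta_growth * d_total_pb * overlap_years)

if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    base = dict(alpha_recall=0.2, d_total_pb=8.0, beta_verify=0.5,
                d_day_pb=0.05, w_days=10, gamma_retry=0.34,
                delta_growth=0.2, overlap_years=2.5)
    print(f"baseline headroom: {headroom_new(**base):.2f} PB")
    # Shortening the verify window W (risk trade) or throttling d_day
    # (time trade) shrinks the buffer terms:
    for w in (10, 7, 5):
        print(f"W={w} days -> {headroom_new(**{**base, 'w_days': w}):.2f} PB")
```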

4) Power/cooling overlap (how hot it gets before it gets better)

During dual-stack, you often run peak historical load + new system burn-in. Nameplate lies; measure actuals.

Basics

Variables

  P_old_peak: measured peak draw of the legacy stack, in kW
  P_new_peak: measured peak draw of the new stack, in kW
  f_overlap: fraction of the new stack’s peak running concurrently during burn-in
  PUE: facility power usage effectiveness

Peak facility power during overlap:

P_facility_peak ≈ (P_old_peak + P_new_peak * f_overlap) * PUE

Cooling load:

BTU/hr ≈ P_facility_peak (kW) * 1000 * 3.412
Tons ≈ BTU/hr / 12000

Example

Implication: If your room is rated 80 tons with no redundancy, you’re courting thermal roulette. Either stage the new system ramp, or get a temporary chiller and hot-aisle containment tuned before the overlap peaks.
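To see where you land against a room rating, the same arithmetic fits in a few lines. Every input below is a hypothetical measured value, not a nameplate figure and not a number from this article:

```python
def overlap_cooling(p_old_peak_kw: float, p_new_peak_kw: float,
                    f_overlap: float, pue: float) -> tuple[float, float]:
    """Return (facility peak kW, cooling tons) during dual-stack overlap,
    following the formulas above: kW -> W -> BTU/hr -> tons."""
    p_facility_kw = (p_old_peak_kw + p_new_peak_kw * f_overlap) * pue
    btu_per_hr = p_facility_kw * 1000 * 3.412
    return p_facility_kw, btu_per_hr / 12000

if __name__ == "__main__":
    # Hypothetical actuals: legacy at 180 kW, new stack at 140 kW,
    # 70% of new-stack peak during burn-in, PUE of 1.5.
    kw, tons = overlap_cooling(180, 140, 0.7, 1.5)
    print(f"facility peak ≈ {kw:.0f} kW, cooling ≈ {tons:.1f} tons")
    # If the room is rated 80 tons, this says: stage the ramp or rent a chiller.
```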

5) Rollback strategy (because you will need it)

You need a reversible plan when the new stack fails parity or the API lies.

Rollback checklist:

Litmus test: Can you prove that any object written in the last 72 hours is readable and fixity-verified on exactly one authoritative stack? If you can’t say which stack that is, you don’t have a rollback; you have a coin toss.
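The litmus test is checkable in code if your migration catalog records, per recent write, where the object is readable and where fixity passed. A minimal sketch under that assumption; the record fields and sample data are hypothetical, not a real catalog API:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical catalog records: one entry per object written recently.
RECENT_WRITES = [
    {"key": "obj-001", "written_at": datetime.now(timezone.utc) - timedelta(hours=5),
     "readable_on": {"legacy"}, "fixity_ok_on": {"legacy"}},
    {"key": "obj-002", "written_at": datetime.now(timezone.utc) - timedelta(hours=40),
     "readable_on": {"legacy", "target"}, "fixity_ok_on": {"legacy", "target"}},
]

def rollback_litmus(records, horizon_hours: int = 72) -> list[str]:
    """Return violations: any object written inside the horizon that is not
    readable and fixity-verified on exactly one authoritative stack."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=horizon_hours)
    violations = []
    for rec in records:
        if rec["written_at"] < cutoff:
            continue  # outside the rollback horizon
        verified = rec["readable_on"] & rec["fixity_ok_on"]
        if len(verified) != 1:
            where = ", ".join(sorted(verified)) or "neither stack"
            violations.append(f"{rec['key']}: verified on {where}")
    return violations

if __name__ == "__main__":
    # obj-002 is verified on both stacks with no single authority: coin toss.
    print(rollback_litmus(RECENT_WRITES) or "litmus passed")
```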

The RACI that keeps humans from stepping on the same rake

Timeline: the 2.5-year overlap

Legend: M = monthly rollback drill; “SLA parity soak” = run the new stack at target SLOs with production traffic for 90 days minimum.
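The “90 days minimum” criterion is easy to argue about in a status meeting and easy to settle from data: track daily SLO attainment on the new stack and require an unbroken passing streak. A sketch under that assumption; the daily pass/fail feed is hypothetical:

```python
def parity_soak_met(daily_slo_passed: list[bool], required_days: int = 90) -> bool:
    """True once the most recent run of consecutive SLO-passing days reaches
    the required soak length; any miss resets the clock."""
    streak = 0
    for passed in daily_slo_passed:  # ordered oldest -> newest
        streak = streak + 1 if passed else 0
    return streak >= required_days

if __name__ == "__main__":
    # Hypothetical history: a miss on day 61 resets the soak clock.
    history = [True] * 60 + [False] + [True] * 90
    print("SLA parity soak met:", parity_soak_met(history))
```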

The Playbook (from “please work” to “boringly reliable”)

A) Telemetry: build the translator first

B) Alert planes: page the right human

C) Capacity headroom: compute, publish, revisit

D) Power & cooling: schedule the ugly

E) Rollback: rehearsed, timed, boring

F) Governance: declare the boring finish line

“But can we just… cut over?” (No.)

Let’s turn the snark dial: Cutover weekend is a cute phrase for small web apps, not for petabyte archives and tape robots named after Greek tragedies. Physics, queueing, and human sleep cycles don’t read your SOW. You’ll either:

  1. Run a dual-stack overlap by accident, discovering the costs mid-incident, or
  2. Run one intentionally, with headroom, drills, and a declared finish line.

Pick intentional.

Worked mini-scenarios (because math > vibes)

Scenario 1: Verify window pressure

Scenario 2: Power/cooling brown-zone

Scenario 3: Rollback horizon truth test

Common failure smells (and the antidotes)

Manager’s checklist (print this)

Closing

If your plan is a slide titled “Cutover Weekend,” save time and rename it to “We’ll Be Here All Fiscal Year.” It’s not pessimism; it’s project physics. Dual-stack years are how grown-ups ship migrations without turning their archives into crime scenes.