After years of searching, there is still no cure for Digital Disposophobia

This is the second article in The Migration Tax series; it was written for and originally appeared on LinkedIn on September 24th, 2025.


Everyone loves the tidy “18-month migration plan.” Slideware says: 18 months ≈ 548 days. Reality taps you on the shoulder: you planned work time, not calendar time. With a sane cadence (no heroics; 5.5 working days/week because weekends exist and people occasionally sleep), you have roughly 396 supervise-and-respond days, not 548. The math is below.

You don’t get 548 days unless your automation is 100% self-healing and somebody is always there to notice when it isn’t (spoiler: it isn’t). Real migrations are six queues running at once, not a single progress bar:

audit → stage → verify_current → in_flight → release → stage_new → verify_new

Note: “audit” is performed before the six downstream queues start.

Treat each as a first-class work center with its own SLOs, backpressure, and failure modes. If you plan like a Gantt chart but operate like a factory, you win. If you plan like a fairy tale, you’ll still be “almost done” when your manager comes asking for a date.


Ground Rules (so our math means something)

The plan’s 18 months are counted as 72 weeks, at a sane cadence of 5.5 working days per week.
All units are binary (KiB, MiB, GiB, TiB); a day is 86,400 seconds.
“Sustained rate” means the rate for YOUR workload, not the brochure number.


Time Budget: the “18 months” illusion

Let’s encode the simple reality:

WorkDays = Weeks × WorkingDaysPerWeek = 72 × 5.5 = 396

396 days is the maximum number of supervise-and-respond days across the migration. Holidays, outages, audits, parallel projects, sick days, procurement hiccups, firmware “adventures,” and “someone changed a bucket policy at 3 a.m.” all subtract time. If your plan pretends you get 548, your plan is performance art.

If your project plan assumes 7×24 humans with 0 restroom breaks and 0 vacations, congratulations—you’ve engineered the world’s first migration powered entirely by guilt.
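
If you want the time budget as something you can run rather than argue about, here it is as two lines of Python (the 72-week count is the article’s rounding, not a calendar fact):

weeks = 18 * 4              # the article counts 18 months as 72 weeks
work_days = weeks * 5.5     # 5.5 working days/week of supervise-and-respond time
print(work_days)            # 396.0, not 548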


The Six Simultaneous States (WIP with names)

Why this matters: You can’t “do staging” and then “do verification.” In practice, all six states will carry work-in-progress (WIP) simultaneously. That means six places to build queues, six places to fall behind, six places to keep SLOs honest.

WIP accounting (copy-paste friendly):

WIP_total = Σ queue_len(state_i), for i ∈ {stage, verify_current, in_flight, release, stage_new, verify_new}

Backpressure triggers when any queue exceeds its max_age_SLO or max_len.

Set SLOs (SLO == service level objective) per state; publish them; alert when breached.
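
To make that concrete, here is a minimal Python sketch of six-queue WIP accounting and backpressure. The queue names come from above; the QueueStats shape and the thresholds are illustrative, not a real API:

from dataclasses import dataclass

# Hypothetical per-state queue snapshot; field names are illustrative.
@dataclass
class QueueStats:
    name: str
    length: int           # items currently queued
    oldest_age_h: float   # age of the oldest item, in hours
    max_len: int          # backpressure threshold on depth
    max_age_slo_h: float  # backpressure threshold on age (the SLO)

def wip_total(queues):
    # WIP_total = Σ queue_len(state_i) across the six states
    return sum(q.length for q in queues)

def backpressure(queues):
    # A state breaches when it exceeds max_len or its max_age_SLO
    return [q.name for q in queues
            if q.length > q.max_len or q.oldest_age_h > q.max_age_slo_h]

snapshot = [
    QueueStats("stage",          1200,  6.0, 5000, 24),
    QueueStats("verify_current", 4100, 30.0, 5000, 24),
    QueueStats("in_flight",      9000,  2.0, 8000, 12),
    QueueStats("release",         300,  1.0, 2000, 24),
    QueueStats("stage_new",       800,  4.0, 5000, 24),
    QueueStats("verify_new",      600,  5.0, 5000, 48),
]
print("WIP_total =", wip_total(snapshot))          # 16000
print("backpressure on:", backpressure(snapshot))  # ['verify_current', 'in_flight']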


Tape Staging: Why Your “Throughput” Isn’t

You’ve got SL8500s or similar, a comforting number of LTO drives, and a tape labeled “~6 million files.” Your calculator says: “LTO-9 streams at ~300 MiB/s. Easy win.” Your calculator forgot shoe-shining: the tape head thrashes back and forth, alternating tiny reads with seeks and start/stop churn, because your application wrote a tarball of fleas (millions of 2 KiB pets) instead of a streaming elephant.

Small files kill streaming. The per-file overhead—mount/seek, open/close, file metadata, bookkeeping, verify touch—dominates. Streaming speed becomes trivia.

A tape horror vignette (yes, this happens): one cartridge, roughly six million files, almost all of them tiny.

Why staging can take 20+ days per tape: Multiply tiny per-file overhead by millions. That “300 MiB/s” dissolves into a few hundred KiB/s of effective throughput. It’s like towing a trailer with a Ferrari in rush-hour traffic: great engine, same gridlock.

“But our drives are rated for 400+ MiB/s sustained!” Sure, and my coffee mug is rated for “dishwasher safe.” Both are true. Neither helps while you seek 6,000,000 times.


The Throughput Model (equations you can paste)

Let’s define variables:

D  = target_stage_per_day (bytes/day) 
R  = sustained_tape_rate (bytes/s) for YOUR workload (not brochure) 
O  = per-file overhead (s/file) (seek + open/close + bookkeeping + verify touch) 
S  = average file size (bytes) 
F  = average files per tape (unitless) 
G  = concurrency (# active drives pulling)

Effective rate per drive (files/s):

files_per_second_per_drive ≈ 1 / ( (S / R) + O )

Bytes per drive per day:

bytes_per_drive_day ≈ files_per_second_per_drive × S × 86400

Required drives:

G_required ≈ D / bytes_per_drive_day

Intuition: If S is tiny, then (S / R) → 0 and O dominates. Your drive behaves like a seek machine, not a streamer. Required drives explode. (This is where dreams go to die.)
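
Same model, as Python you can paste and argue with. The variable names mirror D, R, O, S above; nothing here is vendor-specific:

def staging_model(D, R, O, S):
    # D: target bytes/day, R: sustained bytes/s, O: per-file overhead (s), S: avg file size (bytes)
    files_per_second_per_drive = 1.0 / ((S / R) + O)
    bytes_per_drive_day = files_per_second_per_drive * S * 86400
    return {
        "files_per_s_per_drive": files_per_second_per_drive,
        "bytes_per_drive_day": bytes_per_drive_day,
        "G_required": D / bytes_per_drive_day,
    }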


Worked Example (why tiny files = clown car)

Let’s pick conservative, round numbers (use your real ones later):

R = 300 MiB/s              # optimistic streaming rate 
S = 2 KiB                  # tiny file 
O = 6 ms = 0.006 s         # open/seek/close/verify bookkeeping per file 
D = 200 TiB/day            # target staging volume per day

Calculate the time to process one file:

S / R = (2×1024 B) / (300×1024×1024 B/s) ≈ 6.5×10^-6 s  (≈ 6.5 µs) 
Per-file time ≈ O + (S / R) ≈ 0.0060065 s 
files/s/drive ≈ 1 / 0.0060065 ≈ 166.5 
bytes/s/drive ≈ 166.5 × 2048 ≈ ~341,000 B/s ≈ 0.325 MiB/s 
bytes/day/drive ≈ 0.325 MiB/s × 86400 s ≈ 27.4 GiB/day 
G_required ≈ 200 TiB/day / 27.4 GiB/day ≈ ~7,465 drives

About 7,500 drives. That’s not a plan, that’s a cry for help. You either repackage or you accept that your migration finishes sometime after the heat-death of the universe.

If the mitigation plan is “we’ll just add more drives,” I’d like to introduce you to our CFO, who has some follow-up questions about your definition of “just.”


Repackaging (TarChonks™) Turns Gridlock into Throughput

Take the same dataset but aggregate into ~1 GiB chunks (tar, WARC, CAR, Parquet row-groups—pick your poison). Yes, there are tradeoffs (indexing, provenance, hot reads), but staging becomes sane:

S = 1 GiB 
R = 300 MiB/s 
O = 6 ms = 0.006 s 
D = 200 TiB/day

Compute:

S / R = (1 GiB) / (300 MiB/s) ≈ 3.41 s 
Per-file time ≈ 3.41 + 0.006 ≈ 3.42 s 
bytes/s/drive ≈ 1 GiB / 3.42 s ≈ ~0.292 GiB/s ≈ ~300 MiB/s (near streaming) 
bytes/day/drive ≈ 300 MiB/s × 86400 s ≈ ~24.7 TiB/day 
G_required ≈ 200 TiB/day / 24.7 TiB/day ≈ ~8 drives

From roughly 7,500 drives to 8. That’s the power of respecting physics. Aggregation isn’t optional; it’s the difference between a clown car and a freight train.
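
Here are both scenarios pushed through the same model, as a self-contained snippet (binary units, and the same 6 ms per-file overhead assumption as above):

KiB, MiB, GiB, TiB = 2**10, 2**20, 2**30, 2**40
O = 0.006                      # per-file overhead, seconds
R = 300 * MiB                  # optimistic streaming rate, bytes/s
D = 200 * TiB                  # daily staging target, bytes/day

def drives_needed(avg_file_size_bytes):
    per_file_s = avg_file_size_bytes / R + O
    bytes_per_drive_day = (avg_file_size_bytes / per_file_s) * 86400
    return D / bytes_per_drive_day

print(f"2 KiB files : {drives_needed(2 * KiB):8.0f} drives")   # ~7,465
print(f"1 GiB chunks: {drives_needed(1 * GiB):8.1f} drives")   # ~8.1

Play with O: push the per-file overhead from 6 ms to 20 ms and the tiny-file case gets roughly three times worse, while the chunked case barely moves. That is the whole argument in one knob.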


Tape Mix: Not All Reels Are Created Equal

Your library holds tapes ranging from 40+ files (fat streams) to 6+ million files (dust). Any daily plan that assumes “average tape” will produce wild day-to-day swings in achieved staging volume.

Daily staging plan must be tape-aware:

Queue policy: Don’t let a single “dust tape” monopolize a drive for days. Chunk its work across shifts or, if possible, repackage staged output before moving downstream.


Verification Windows and Debt (what keeps you honest)

Declare verification as a service with SLOs:

V2_deadline (Copy-2 verify)  ≤ 24 hours 
V3_deadline (Copy-3 verify)  ≤ 7 days

Track verification debt:

Debt(t) = Writes(t) − Verified(t)

Why Debt matters: If Debt climbs, your redundancy is theoretical. You’re stacking unverified bytes and borrowing against luck. The interest rate on luck is brutal.
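
A minimal sketch of debt accounting (the daily write/verify counters would come from your own pipeline metrics; the “grew three days running” alert is an example policy):

def verification_debt(daily_writes, daily_verified):
    # Debt(t) = Writes(t) − Verified(t), cumulative by day
    debt, writes_total, verified_total = [], 0, 0
    for w, v in zip(daily_writes, daily_verified):
        writes_total += w
        verified_total += v
        debt.append(writes_total - verified_total)
    return debt

def debt_alert(debt, n_days=3):
    # Alert when debt has grown for n_days consecutive days: you are borrowing against luck
    if len(debt) <= n_days:
        return False
    recent = debt[-(n_days + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))

writes   = [100, 100, 100, 100, 100, 100]   # objects (or TiB) written per day
verified = [100, 100,  80,  70,  60,  50]   # objects (or TiB) verified per day
debt = verification_debt(writes, verified)
print(debt)                 # [0, 0, 20, 50, 90, 140]
print(debt_alert(debt))     # True: debt has grown three days running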


The New System Is Production Now (please act like it)

Your shiny object store / file tier / cache cluster is not a lab. It’s where new data lands every day, while you’re shoving yesterday’s data through the ingest snake. Expect the production stream and the migration stream to fight for the same ingest capacity and the same operators, and plan your mitigations accordingly.


People Math (the line no calculator includes)

Your 396-day window needs people who can keep all six queues moving and step in when the automation stalls.

Staffing model (copy-paste skeleton):

Pipelines = concurrent ingest lines (e.g., 3) 
InterventionsPerPipelinePerHour = 0.2–1.5 (depends on maturity) 
ShiftLength = 8–10 hours 
ShiftsPerDay = 2–3 (to cover 16–24h ops) 
PeoplePerShift = ceil(Pipelines × InterventionsPerPipelinePerHour × 2)  # 2 = handle overlap & escalations 
OncallCoverage = 1 primary + 1 secondary (24/6) 
RunbookMaturityFactor = 0.6–1.0 (lower is better; reduces interventions)

If your spreadsheet has 0 in “interventions/hour,” it’s not “lean”—it’s fan fiction.
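
Here is the skeleton as arithmetic you can run. The ranges are the ones above; the point estimates are illustrative, and applying RunbookMaturityFactor as a multiplier on interventions is one reasonable reading of “reduces interventions”:

import math

pipelines = 3                               # concurrent ingest lines
interventions_per_pipeline_per_hour = 0.8   # pick something inside 0.2–1.5
runbook_maturity_factor = 0.8               # pick something inside 0.6–1.0; lower is better
shifts_per_day = 2                          # 16h of coverage

effective = pipelines * interventions_per_pipeline_per_hour * runbook_maturity_factor
people_per_shift = math.ceil(effective * 2)  # ×2 = overlap & escalations, per the skeleton

print(f"~{effective:.1f} interventions/hour across pipelines")
print(f"{people_per_shift} people/shift × {shifts_per_day} shifts, plus 1 primary + 1 secondary on-call")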


Release & Idempotency: Don’t Paint Yourself into a Corner

It’s not a migration if you can’t prove provenance and repeat a write without duplicating it.
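
One way to get there, sketched in Python: derive the write key from the content and its source, record provenance in a manifest, and make retries a no-op. The manifest file and the names here are placeholders, not any particular object store’s API:

import hashlib, json
from pathlib import Path

MANIFEST = Path("release_manifest.jsonl")   # placeholder provenance log

def content_key(payload: bytes, source_tape: str) -> str:
    # Deterministic key: same source + same bytes gives the same key on every retry
    return f"{source_tape}/{hashlib.sha256(payload).hexdigest()}"

def already_released(key: str) -> bool:
    if not MANIFEST.exists():
        return False
    return any(json.loads(line)["key"] == key for line in MANIFEST.read_text().splitlines())

def release(payload: bytes, source_tape: str, source_path: str) -> str:
    key = content_key(payload, source_tape)
    if already_released(key):
        return key                           # idempotent: a retry writes nothing twice
    # ... put the object into the new system here, keyed by `key` ...
    record = {"key": key, "tape": source_tape, "path": source_path}
    with MANIFEST.open("a") as f:            # provenance: where it came from, what it hashed to
        f.write(json.dumps(record) + "\n")
    return key

k1 = release(b"example bytes", "TAPE0421", "/archive/file0001.dat")
k2 = release(b"example bytes", "TAPE0421", "/archive/file0001.dat")
print(k1 == k2)   # True, and the manifest holds exactly one entry for it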


Planning With Real Numbers: A Daily Tape Plan Example

Let’s model a daily plan that fills a 200 TiB/day bucket using a mix of tape classes:

Per-drive daily: (from the formulas)

S/R ≈ (64 MiB)/(250 MiB/s) ≈ 0.256 s 
Per-file ≈ 0.256 + 0.006 = 0.262 s 
bytes/s ≈ 64 MiB / 0.262 s ≈ ~244 MiB/s 
bytes/day ≈ 244 MiB/s × 86400 s ≈ ~20.1 TiB/day/drive

Target: 200 TiB/day. At ~20 TiB/day per drive for this mid-weight class, filling the bucket takes on the order of ten busy drives, and the exact lineup depends on how many fat streamers and (repackaged) dust tapes you feed it each day.

This isn’t exact science; it’s queue dietetics. The point is to budget by tape class so your bucket fills predictably.
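
A sketch of what “budget by tape class” can look like, using the same throughput model. The mid-weight class is the 64 MiB / 250 MiB/s one above; the other two classes and the shares are illustrative, and the dust class is assumed to be repackaged into ~1 GiB chunks before it ever touches a drive:

import math

MiB, GiB, TiB = 2**20, 2**30, 2**40
O = 0.006                      # per-file overhead, seconds
TARGET_TIB = 200               # daily bucket, TiB

def tib_per_drive_day(avg_file_bytes, rate_bps):
    per_file_s = avg_file_bytes / rate_bps + O
    return (avg_file_bytes / per_file_s) * 86400 / TiB

classes = [  # (name, avg file size, sustained rate, share of the daily bucket)
    ("fat streams (40+ files/tape)",      1 * GiB,  280 * MiB, 0.50),
    ("mid-weight (64 MiB avg)",           64 * MiB, 250 * MiB, 0.40),
    ("dust, repackaged to ~1 GiB chunks", 1 * GiB,  250 * MiB, 0.10),
]

for name, size, rate, share in classes:
    per_drive = tib_per_drive_day(size, rate)
    drives = math.ceil(share * TARGET_TIB / per_drive)
    print(f"{name:36s} {per_drive:5.1f} TiB/day/drive  ->  {drives} drive(s) for {share * TARGET_TIB:.0f} TiB")

Swap in your real class mix and re-run it every morning; the bucket only fills predictably when the plan knows which classes it is feeding to which drives.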


Verification SLOs as First-Class Citizens

Declare them like APIs, not like wishes:

Copy-2 verify SLO: 95th percentile ≤ 24 hours 
Copy-3 verify SLO: 95th percentile ≤ 7 days 
Monthly spot-check: ≥ 1% random sample restored & re-verified 
Mismatch incident: >0.01% of assets checked (that’s >10 per 100k) in any 24h window → incident mode

Post these. Live by them. The only thing less fun than a mismatch is not knowing you have a mismatch.
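
Checking a percentile SLO is a one-screen job. Here is a nearest-rank p95 check against the Copy-2 target (the latencies below are made up; yours come from the verify queue’s own metrics):

import math

def p95(values):
    # Nearest-rank 95th percentile: good enough for an SLO dashboard
    ordered = sorted(values)
    return ordered[max(1, math.ceil(0.95 * len(ordered))) - 1]

copy2_latency_hours = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 18, 19, 20, 21, 22, 23, 26, 30]
observed = p95(copy2_latency_hours)
print(f"Copy-2 verify p95 = {observed}h (SLO ≤ 24h): {'OK' if observed <= 24 else 'BREACH'}")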


Cost and Risk Shape (because CFOs read, too)


Anti-Patterns (and how to be better)


A Simple, Sane Architecture (that won’t bite you later)

One SQL you should be able to run: “Show me all assets whose Copy-3 hasn’t been verified in the last 30 days.” If that query is hard, future-you is going to have a very exciting audit.
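
For flavor, here is that query against a deliberately tiny, hypothetical schema (sqlite via Python so it runs anywhere; your real catalog’s table and column names will differ):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assets (asset_id TEXT PRIMARY KEY, source_tape TEXT);
CREATE TABLE verifications (asset_id TEXT, copy_num INTEGER, verified_at TEXT);  -- UTC 'YYYY-MM-DD HH:MM:SS'
INSERT INTO assets VALUES ('a1', 'TAPE0001'), ('a2', 'TAPE0002');
INSERT INTO verifications VALUES ('a1', 3, '2025-09-01 00:00:00');
-- a2's Copy-3 has never been verified at all
""")

# "Show me all assets whose Copy-3 hasn't been verified in the last 30 days."
stale = conn.execute("""
    SELECT a.asset_id
    FROM assets a
    LEFT JOIN verifications v
           ON v.asset_id = a.asset_id AND v.copy_num = 3
    GROUP BY a.asset_id
    HAVING MAX(v.verified_at) IS NULL
        OR MAX(v.verified_at) < datetime('now', '-30 days')
""").fetchall()

print(stale)   # 'a2' always; 'a1' too, once its last Copy-3 verify is more than 30 days old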


The Full Math Block

# Time budget 
Months = 18 
Weeks = 72 
WorkingDaysPerWeek = 5.5 
WorkDays = 396

# Six-state WIP
states = [stage, verify_current, in_flight, release, stage_new, verify_new] 
WIP_total = Σ queue_len(s) for s in states

# Tape staging throughput
D  = target_stage_per_day_bytes 
R  = sustained_tape_rate_Bps 
O  = per_file_overhead_seconds 
S  = avg_file_size_bytes 
files_per_second_per_drive = 1 / ((S / R) + O) 
bytes_per_drive_day = files_per_second_per_drive * S * 86400 
G_required = D / bytes_per_drive_day

# Verification SLOs
V2_deadline ≤ 24h 
V3_deadline ≤ 7d 
Debt(t) = Writes(t) − Verified(t) 
Alert if Debt grows for N consecutive days or mismatches > threshold

TL;DR Playbook

If your plan fits on one slide, you don’t have a migration plan — you have clip art with commitment issues.


#DataMigration #Storage #DigitalPreservation #Tape #Checksum #Fixity #PetabyteScale

#LTO #SL8500