“After years of searching, there is still no cure for Digital Disposophobia”


What it takes to move a multi-petabyte archive from legacy tape to hybrid object storage—and why planning, hashing, and real-world limitations matter more than any cloud calculator.

Introduction — The Hidden Costs of Data Migration

When people hear you’re migrating 34 petabytes of data, they expect it’ll be expensive—but not that expensive. After all, storage is cheap. Cloud providers quote pennies per gigabyte per month, and object storage vendors often pitch compelling cost-per-terabyte pricing. Tape is still considered low-cost. Object systems are marketed as plug-and-play. And the migration itself? Supposedly, just a big copy job.

In reality? The true cost of a large-scale data migration isn’t in the storage—it’s in the movement.

If you’re managing long-term digital archives at scale, you already know: every file has history, metadata, and risk. Every storage platform has bottlenecks. Every bit has to be accounted for. And every misstep—be it silent corruption, metadata loss, or bad recall logic—can cost you time, money, and trust.

This article outlines the early stages of our ongoing migration of 34 petabytes of tape-based archival data to a new on-premises hybrid object storage system—and the operational, technical, and hidden costs we’re uncovering along the way.


The Day-to-Day Life of the Current Preservation Environment

Before we examine the scale and complexity of our migration effort, it's important to understand the operational heartbeat of the current digital preservation environment. This is not a cold archive sitting idle—this is a living, actively maintained preservation system adhering to a rigorous 3-2-1 policy: at least three copies, on two distinct media types, with one copy geographically off-site.

3-2-1 in Practice

Our preservation strategy is based on three concurrent and deliberately separated storage layers:

  1. Primary Copy (Tape-Based, On-Premises). Housed in our main data center, this is the primary deep archive. It includes Oracle SL8500 robotic libraries using T10000D media and a Quantum i6000 with LTO-9 cartridges, all orchestrated by Versity ScoutAM.
  2. Secondary Copy (Tape-Based, Alternate Facility). Located in a separate data center, this second copy is maintained on a distinct tape infrastructure. It acts as both a resiliency layer and a compliance requirement, ensuring survivability in case of a catastrophic site failure at the primary location.
  3. Tertiary Copy (Cloud-Based, AWS us-east-2). Every morning, newly ingested files written to the Versity ScoutAM system are reviewed and queued for replication to Amazon S3 buckets in the us-east-2 region. This process is automated and hash-validated, ensuring the offsite copy is both complete and independently recoverable.

Importantly, this cloud-based copy is contractual in nature—subject to renewal terms, vendor viability, and pricing structures. To uphold the 3-2-1 preservation standard long-term, we treat this copy as disposable yet essential: if and when the cloud contract expires, the full cloud copy is re-propagated to a new geographically distributed storage location—potentially another cloud region, vendor, or sovereign archive environment. This design ensures that dependency on any single cloud provider is temporary, not foundational.
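For illustration, the daily hash-validated replication step can be reduced to a short script. This is a minimal sketch rather than our production tooling: it assumes boto3 and SHA-256, and the bucket name and metadata key are placeholders, not real values.

```python
import hashlib

import boto3


def sha256_of(path, chunk=8 * 1024 * 1024):
    """Stream the file so multi-gigabyte archive objects never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def replicate_to_cloud(path, key, bucket="example-preservation-copy3"):
    """Upload one file to the us-east-2 copy and confirm its fixity metadata."""
    s3 = boto3.client("s3", region_name="us-east-2")
    expected = sha256_of(path)
    # Carry the fixity value with the object so the cloud copy stays independently
    # verifiable without a callback to the on-premises catalog.
    s3.upload_file(path, bucket, key, ExtraArgs={"Metadata": {"fixity-sha256": expected}})
    stored = s3.head_object(Bucket=bucket, Key=key)["Metadata"]["fixity-sha256"]
    if stored != expected:
        raise RuntimeError(f"fixity metadata mismatch for {key}")
    return expected
```

In production this step is automated rather than run per file by hand, but the shape of the check is the same: hash before transfer, confirm after.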

Daily Lifecycle Operations

Despite the appearance of a “cold archive,” this system is active, transactional, and managed daily. Key operations include:

A Moving Target

This is the reality we are migrating from—not a static legacy tape pool, but an active, resilient, and highly instrumented preservation environment.

The migration plan outlined in the next section doesn’t replace this environment overnight—it transitions just one of the three preservation copies to a new hybrid object storage model. The second tape copy remains fully operational, continuing to receive daily writes, while cloud replication continues for all eligible content. This overlapping strategy allows us to validate new infrastructure in production without putting preservation guarantees at risk.


Upcoming Migration — From Tape to Hybrid Object Archive

We’re in the early planning stages of a migration project to move 34PB of legacy cold storage to a new on-premises hybrid object archival storage system. “Hybrid” here refers to an architecture that blends both high-capacity disk and modern tape tiers, all behind an S3-compatible interface. This design gives us the best of both worlds: faster recall and metadata access when needed, with cost-effective, long-term retention via tape.
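One practical consequence of the S3-compatible front end is that client tooling barely changes between the cloud copy and the on-prem archive; only the endpoint does. A rough sketch, with a hypothetical gateway hostname, bucket, and object key standing in for our real configuration:

```python
import boto3

# Hypothetical endpoint for the on-prem archive's S3-compatible gateway;
# real endpoints and credentials come from configuration management, not code.
archive = boto3.client("s3", endpoint_url="https://archive-gw.example.internal")

# The same call works whether the object lands on the disk tier or is later
# tiered down to tape behind the gateway; the access path never changes.
with open("item.tar", "rb") as body:
    archive.put_object(Bucket="preservation", Key="batch-0001/item.tar", Body=body)

# Metadata lookups do not require a tape recall, which is part of the appeal
# of keeping a disk tier in front of the tape tier.
head = archive.head_object(Bucket="preservation", Key="batch-0001/item.tar")
print(head["ContentLength"], head["LastModified"])
```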

Legacy Environment:

This mixed tape environment presents real-world operational challenges:

To reduce risk and improve data fidelity, we've started integrating fixity hash values directly into the user hash space within the ScoutFS file system. This ensures each file can be validated during staging, catching any corruption, truncation, or misread before it’s written to the new system.
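The exact interface to the ScoutFS user hash space is specific to that file system, so the snippet below only illustrates the pattern: compute a streaming digest at ingest and store it with the file. A POSIX extended attribute with a made-up name stands in for the real field.

```python
import hashlib
import os

FIXITY_ATTR = b"user.fixity.sha256"  # illustrative attribute name, not the actual ScoutFS field


def record_fixity(path, chunk=8 * 1024 * 1024):
    """Compute a streaming SHA-256 and attach it to the file as metadata."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            digest.update(block)
    value = digest.hexdigest().encode()
    # Store the fixity value alongside the file so staging-time validation
    # does not depend on a separate catalog lookup.
    os.setxattr(path, FIXITY_ATTR, value)
    return value.decode()
```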

Our migration target includes not just the 34PB of existing tape-based data, but enough capacity to absorb an additional ~4PB of new ingest annually, for at least the first year. The total provisioned capacity in the new system is 40PB—designed to give us a buffer without overextending infrastructure.
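The 40PB figure itself is simple headroom arithmetic; assuming the roughly 4PB-per-year ingest rate holds flat, which is itself an assumption, it works out as follows:

```python
existing_pb = 34        # legacy tape data to migrate
annual_ingest_pb = 4    # expected new ingest per year (assumed constant)
provisioned_pb = 40     # capacity of the new hybrid system

headroom_pb = provisioned_pb - existing_pb            # 6 PB of buffer
years_of_headroom = headroom_pb / annual_ingest_pb    # about 1.5 years at the current rate
print(f"{headroom_pb} PB of headroom, roughly {years_of_headroom:.1f} years of ingest")
```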


The Real Costs in Migration

Migrations of this scale aren’t just about buying space—they’re about managing risk, trust, throughput, future-proofing, and time. It’s not enough to copy data from point A to point B. At any given moment, you’re balancing three active datasets:

Most vendor proposals and cloud calculators overlook the operational cost of running all three states simultaneously. Here’s a breakdown of what truly drives cost and complexity in the real world:


System Cost

The new hybrid on-premises archive system is provisioned to support approximately 40PB, allowing us to:

The migration from the legacy tape environment is orchestrated by Versity ScoutAM, which manages a multi-stage pipeline (a simplified sketch in code follows the list):

- Volume serial number (VSN)-driven recalls from both T10000D and LTO-9 cartridges

- Staging of data into disk-based scratch/cache pools

- Controlled archival into the new S3-compatible object storage system

- Additional cache storage was provisioned to hold recalled data during staging and validation
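Stripped of ScoutAM's policy engine, the per-cartridge loop reduces to roughly the shape below. The helper functions are hypothetical stand-ins for operations ScoutAM performs, not ScoutAM APIs, and the staging path is illustrative.

```python
from pathlib import Path

STAGING_DIR = Path("/scratch/staging")  # hypothetical disk-based cache pool


def migrate_vsn(vsn, recall_vsn, passes_fixity, archive_object):
    """Recall every file on one cartridge, validate it, then archive it."""
    failures = []
    for staged in recall_vsn(vsn, STAGING_DIR):  # 1. VSN-driven recall into the disk cache
        if not passes_fixity(staged):            # 2. hash check before anything moves onward
            failures.append(staged)              #    bad reads get re-queued instead of migrated
            continue
        archive_object(staged)                   # 3. controlled archival to the S3-compatible target
    return failures
```

Working cartridge by cartridge matters because robotic mounts and seeks dominate recall time; grouping work by VSN keeps the drives streaming rather than thrashing.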

Validation Overhead

To ensure bit-level data fidelity, we’ve begun populating user hash space fields in the ScoutFS file system with cryptographic fixity checksums prior to recall.

This approach enables:

- On-the-fly validation of files as they are staged from tape

- Comparison of staged file hashes with the original stored hashes to immediately detect corruption, truncation, or tape misreads (a minimal sketch of this check appears at the end of this section)

This strategy significantly reduces:

- Redundant hashing workloads during object ingest

- Silent corruption risks introduced during mechanical tape reads

- Migration delays due to manual file triage or inconsistent validation logic
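As a concrete illustration of that staging-time comparison, assuming the stored fixity value can be read back as an extended attribute (the attribute name is illustrative, not the actual ScoutFS field):

```python
import hashlib
import os

FIXITY_ATTR = "user.fixity.sha256"  # illustrative name standing in for the ScoutFS user hash space


def verify_staged_file(path, chunk=8 * 1024 * 1024):
    """Recompute the SHA-256 of a staged file and compare it to the stored fixity value."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            digest.update(block)
    stored = os.getxattr(path, FIXITY_ATTR).decode()
    if digest.hexdigest() != stored:
        # Truncation, corruption, or a misread tape block all surface here,
        # before the file is ever written into the new archive.
        raise ValueError(f"fixity mismatch for {path}")
    return stored
```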


Hidden Taxes — Time, Energy, and Human Overhead

Some of the most significant costs in a multi-petabyte migration don’t show up on vendor quotes or capacity calculators—they’re buried in the human effort, infrastructure overlap, and round-the-clock support needed to make it all happen.

Here’s what that looks like in practice:

1. Dual-System Overhead

We expect to operate both the legacy and new archival systems in parallel for at least two full years. That means:

The dual-stack reality introduces complexity not just in capacity planning, but in operational overhead—particularly when issues affect both sides of the migration simultaneously.

2. Staffing Requirements

To meet our timeline and operational commitments, the migration team is scheduled for:

Staff must be able to respond to issues across multiple layers—tape robotics, disk cache performance, object storage health, and software automation pipelines.

3. ScoutAM Operational Load

While Versity ScoutAM serves as the backbone of the migration orchestration, it requires constant operational intervention in a complex legacy environment:

This means that even with automation in place, the system must be actively managed and routinely adjusted to avoid migration stalls.

4. Migration Timeline Pressure

The goal: complete 34PB of migration in 18 to 23 months. That requires:

Every delay has downstream consequences:

These aren’t exceptions—they’re expected parts of the workflow. And they require human expertise, resilience, and continuous iteration to manage effectively.
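One way to see why the timeline creates pressure is to translate it into a sustained transfer rate. A rough estimate, using decimal petabytes and ignoring retries, re-reads, and downtime:

```python
PB = 10**15  # decimal petabytes

total_bytes = 34 * PB
for months in (18, 23):
    seconds = months * 30.4 * 24 * 3600  # approximate month length
    rate_gb_s = total_bytes / seconds / 10**9
    print(f"{months} months: about {rate_gb_s:.2f} GB/s sustained, around the clock")
```

That works out to roughly 0.5 to 0.7 GB/s end to end, continuously, and every stage in the pipeline (recall, staging, validation, archival) has to keep up with it.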


The Vendor Blind Spot: Why Calculators Don’t Work

Storage vendors and cloud platforms love calculators. Plug in how many terabytes you have, pick a redundancy level, maybe add a retrieval rate, and out comes a tidy monthly cost or migration estimate. It all looks scientific—until you actually try to move 34 petabytes of long-term archive data.

The reality is: most calculators are built for static cost modeling, not for complex data movement and verification pipelines that span years, formats, and evolving systems.

Here’s where they fall short:

1. They Don’t Account for Legacy Media Complexity

Calculators assume all your data is neatly stored and instantly accessible. But we’re migrating from:

Vendor models don’t include the cost of slow robotic mounts, incompatible drive pools, or long recall chains. And they certainly don’t account for manual intervention required to babysit legacy systems like ACSLS.

2. They Ignore Fixity Validation Workflows

Most calculators focus on bytes moved, not bytes verified. In our case:

This adds both compute and storage demand to the migration, as data often exists in three states:

  1. Original tape format
  2. Staged file on disk
  3. Verified object in long-term archive

The calculators? They don’t factor in staging costs, hash workloads, or space for verification.

3. They Omit Human Labor

People run migrations—not spreadsheets.

Calculators ignore:

We’re running two live environments for two years, with full coverage across:

The people-hours alone are non-trivial operational costs, yet they never appear on vendor estimates.

4. They Assume Ideal Conditions

Calculators assume perfect conditions:

That’s not real life. In production:

And every hour lost to those failures is time you can’t get back—or model.

5. They Treat Migration as a Cost, Not a Capability

Most importantly, calculators treat migration as a one-time line item, not as a multi-phase operational capability that must be:

For us, migration is a platform feature—not a side task. It requires:

None of this is in the default TCO calculator.


Recommendations for Teams Planning Large Migrations

If you're planning a multi-petabyte migration—especially from legacy tape to modern hybrid storage—understand that your success depends less on how much storage you buy and more on how well you architect your operational pipeline.

Here are our key takeaways for teams facing similar challenges:

1. Map Your Environment Thoroughly

2. Build for Simultaneous Ingest, Recall, and Verification

3. Treat Hashing as Core Metadata

4. Invest in Open Monitoring and Alerting

5. Automate What You Can, Document What You Can’t

6. Design for Graceful Failure and Retry


Conclusion: Migration Is Infrastructure, Not a One-Time Task

Moving 34PB of data isn’t a project—it’s the creation of an ongoing operational platform that defines how preservation happens, how access is retained, and how risk is managed.

For many institutions, the assumption has been that data needs to be migrated from tape every 7 to 10 years, driven by:

That rhythm alone is expensive—and it multiplies with every additional tape copy you maintain “just in case.”

But what if the storage platform itself was built for permanence?

What we’re working toward is not just a migration, but a transition to an archival system that inherently supports long-term durability:

If these characteristics are fully realized, it opens the door to reducing the number of physical tape copies required to meet digital preservation standards. Instead of three physical copies to ensure survivability, you may achieve equivalent or better protection with:

It doesn’t eliminate preservation requirements—it modernizes how we meet them.

True digital stewardship means designing systems that migrate themselves, that verify without intervention, and that allow future generations to access and trust the data without redoing all the work.

Preservation is no longer about saving the bits. It’s about building platforms that do it for us—consistently, verifiably, and automatically.

As we look beyond this migration cycle, a compelling evolution of the traditional 3-2-1 preservation strategy is the integration of ultra-resilient, long-lived media for one of the three preservation copies—specifically, Copy 2. By writing this second copy to a century-class storage medium such as DNA-based storage, fused silica glass (e.g., Project Silica), ceramic media, or archival film, we can significantly reduce the operational burden of decadal migrations. These emerging storage formats offer write-once, immutable characteristics with theoretical lifespans of 100 years or more, making them ideal candidates for infrequently accessed preservation tiers. If successfully adopted, this approach would allow institutions to focus active migration and infrastructure upgrades on only a single dynamic copy, while the long-lived copy serves as a stable anchor across technology generations. It’s not a replacement for redundancy—it’s an enhancement of durability and sustainability in preservation planning.