This is the third entry in “The Migration Tax” series. You can read the previous entry here.


Most teams “intend” to do fixity. Then the backlog hits, the queue backs up, and checksums get demoted to “we’ll validate later”—which is like promising to check your parachute after you jump. Fixity isn’t a task you sprinkle on top; it’s core metadata that must travel with the object from its first breath to its centennial birthday. If you treat checksums like a sidecar, they’ll fall off the motorcycle the moment you hit a pothole (tape recall, POSIX copy, S3 upload, random “smart” tool that “helpfully” rewrites headers…).

Let’s make fixity first-class and stop role-playing “Schrödinger’s Archive.”


What “fixity first” actually means (and why you care)

Fixity = a verifiable claim that the content you have now is bit-for-bit the same as the content you had then. You prove that by calculating a checksum (hash) and carrying that value forward in a place that’s hard to lose and easy to read.

Principle: Compute once, carry always, verify cheaply, rehash only when necessary.


Hash families: “good enough” vs. cryptographic (and when to use which)

You don’t need SHA-512 to detect bit rot (silent corruption). A fast, non-cryptographic checksum like xxHash64 or CRC32C has an astronomically low chance of missing a random media flip. Use fast hashes to continuously guard living data. Use cryptographic hashes (SHA-256/512) whenever someone else has to trust the claim: provenance records, audits, handoffs to another organization, or any case where deliberate tampering (not just random corruption) is in scope.

Pattern that works:

  1. At first touch, compute xxHash64 and SHA-256 in the same read pass.
  2. Carry both values forward (xattrs, object metadata, and your manifest/DB).
  3. Verify routinely with the fast hash; reach for SHA-256 at trust boundaries and when triaging a mismatch.


Where to put the checksum so it doesn’t die

Short answer: in the object itself, as metadata—not just in a side database. Do both.

POSIX & clustered filesystems

All of these support extended attributes (xattrs) in the user.* namespace, which you can read/write with getfattr/setfattr or your language’s xattr library:

user.hash.sha256 = hex
user.hash.xx64 = hex
user.fixity.ts = ISO 8601
user.fixity.src = "tape://<barcode>/path"

Example (Linux):

# Compute once at first touch (illustrative)
sha256sum file.bin | awk '{print $1}' | xargs -I{} setfattr -n user.hash.sha256 -v {} file.bin
xxhsum -H1 file.bin | awk '{print $1}' | xargs -I{} setfattr -n user.hash.xx64 -v {} file.bin
date -Iseconds | xargs -I{} setfattr -n user.fixity.ts -v {} file.bin

Object storage (S3-compatible, including Ceph RGW, MinIO, cloud)

No xattrs here; the equivalents are user metadata (x-amz-meta-hash-*) set at upload time and object tags for mutable state like HashValid. The catch, covered below: user metadata is effectively frozen once the object is written, so get it onto the object at write time or plan a backfill.

Tape (LTFS or managed by HSM)

LTFS carries extended attributes in its index, so the same user.* keys can ride with the file; for HSM-managed tape, make sure the hash lives in the HSM catalog and in your manifest, and that it survives pool-to-pool migrations and tape refreshes.


End-to-end flow (your visual)

[Tape / Source] 
      │  stage 
      ▼ 
[Staging FS] --compute→ (xxHash64 + SHA-256)  --store→ xattrs/user.*  --log→ DB 
      │  verify-current (compare to manifest/source metadata) 
      ▼ 
[Hash-compare gate] 
      │ 
      ├─ pass ─> [Object Write] --write→ x-amz-meta-hash-* --record→ MPU partsize/count 
      │ 
      └─ fail ─> [Quarantine] (mismatch: retry/re-read/second-copy)

[Post-Write Verify] --read-sample/full→ rehash → compare(xattrs/SHA-256) 
      │ 
      ├─> [Mark HashValid=True] (tags/DB) 
      └─> [Escalate] (if mismatch; see triage below)

Pin this above your desk. If your pipeline skips boxes, it’s not a pipeline; it’s a rumor.


Eliminating redundant hashing (and saving CPUs for real work)

Redundant hashing happens when each hop distrusts the last hop but forgets the hash already exists. Don’t re-hash because you lost the value; re-hash because your policy told you to verify.

Rules of engagement:

  1. Compute at first touch, while the bytes are already streaming past you.
  2. Propagate the values at every hop (xattrs → object metadata → manifest/DB); a hop that drops the hash is a bug.
  3. Compare, don’t recompute: a carried value plus a size check beats another full pass over the data.
  4. Rehash only at defined verification points: post-write verify, recall, scheduled audits, or mismatch triage.

Result: one heavy compute at first contact, lightweight comparisons thereafter.
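
To make “one heavy compute at first contact” concrete, here’s a minimal sketch that gets both hash families out of a single read pass (it assumes the third-party xxhash package; the function name and chunk size are illustrative):

import hashlib
import xxhash  # third-party: pip install xxhash

def hash_once(path, chunk_size=8 * 1024 * 1024):
    # One pass over the bytes: SHA-256 for provenance, XXH64 for cheap routine checks
    sha256, xx64 = hashlib.sha256(), xxhash.xxh64()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            sha256.update(chunk)
            xx64.update(chunk)
    return sha256.hexdigest(), xx64.hexdigest()

Two values out of one read; every later hop compares instead of recomputing.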


Validating at recall (because “we wrote it once” is not a warranty)

Every recall is a chance to catch silent corruption early—from tape, object bit-flips, or that one node with a grudge.

Recall flow:

  1. Read → stream hash (xxHash64 for speed).
  2. Compare to stored xxHash64. If mismatch, retry from alternate path/sibling drive.
  3. If mismatch persists, compute SHA-256 to rule out hash-family artifacts.
  4. If still off, this is a data incident. Quarantine the asset, raise Mismatch with provenance.

Why stream hashing? Because hashing after you write again is two I/Os and a lie.
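
Here’s a minimal recall-verify sketch under those rules, assuming Linux xattrs via os.getxattr, the third-party xxhash package, and the key names used above; in a real recall path you’d hand each chunk to the consumer as you hash it:

import os
import xxhash  # third-party: pip install xxhash

def verify_on_recall(path, chunk_size=8 * 1024 * 1024):
    # Stream-hash the file as it is read and compare to the xxHash64 carried in xattrs
    stored = os.getxattr(path, 'user.hash.xx64').decode()
    xx64 = xxhash.xxh64()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            xx64.update(chunk)
    if xx64.hexdigest() != stored:
        raise ValueError(f'fixity mismatch on {path}: stored {stored}, computed {xx64.hexdigest()}')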


Triage for mismatches (don’t panic; don’t hand-wave)

When stored_hash != computed_hash:

  1. Retry the read (I/O can lie temporarily).
  2. Recompute with the other family (xxHash64 vs. SHA-256) to rule out bugs.
  3. Check transforms: did anything legitimately change the bytes (compression, encryption, re-chunking, a “helpful” tool rewriting headers)? If so, the stored hash describes an object that no longer exists; re-baseline deliberately.
  4. Consult provenance: last known-good SHA-256, size, and timestamp; if you have multiple independent copies (tape A, tape B, cloud), compare all three (see the sketch after this list).
  5. Decide: restore from alt copy, re-ingest, or mark unrestorable (and stop pretending otherwise).
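
Here’s the sketch referenced in step 4: compare whatever recorded SHA-256 values you have for independent copies (the copy labels and dict shape are illustrative, not from any particular catalog):

from collections import Counter

def pick_good_copy(copy_hashes):
    # copy_hashes: e.g. {'tape-A': sha_a, 'tape-B': sha_b, 'cloud': sha_c}
    # Returns (agreed_hash, [copies that agree]) or (None, []) if no two copies agree
    if not copy_hashes:
        return None, []
    best_hash, votes = Counter(copy_hashes.values()).most_common(1)[0]
    if votes < 2:
        return None, []
    return best_hash, [name for name, h in copy_hashes.items() if h == best_hash]

If no two copies agree, you’re already at step 5.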

Incident math to keep you honest:

MismatchRate = mismatches / verified_in_window 
Alarm if MismatchRate > 0.01% over 100k assets / 24h 
VerificationDebt = objects_written - objects_verified

If VerificationDebt grows for a week, you’re running on hope.
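
The same math as a small sketch you can wire into whatever emits your metrics; the parameter names are assumptions, the thresholds are copied from above:

def fixity_health(mismatches, verified_in_window, objects_written, objects_verified):
    # MismatchRate, VerificationDebt, and the alarm condition from above
    mismatch_rate = (mismatches / verified_in_window) if verified_in_window else 0.0
    alarm = verified_in_window >= 100_000 and mismatch_rate > 0.0001  # 0.01% over 100k assets / 24h
    return {
        'MismatchRate': mismatch_rate,
        'VerificationDebt': objects_written - objects_verified,
        'Alarm': alarm,
    }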


Filesystem-specific notes (so your junior can implement this by lunch)

Spectrum Scale (GPFS)

user.* xattrs work as expected, and the policy engine can read them (XATTR() in policy rules), so you can sweep the filesystem for missing or stale fixity metadata without walking the tree yourself.

Lustre

user.* xattrs are supported; keep the values compact (hex digests and short timestamps), since they live with the inode on the metadata target.

BeeGFS

xattr support exists but may need to be enabled in the metadata service configuration; do a quick setfattr/getfattr round trip before you build a pipeline on it.

CephFS

user.* xattrs are supported natively; nothing special required.

ScoutFS

built for archive workflows and supports POSIX xattrs, so the same user.* keys apply.

ZFS

scrubs and block-level checksums protect against on-disk bit rot, but they verify ZFS’s own write-time checksum, not your end-to-end claim about the object. Keep the user.* xattrs anyway, and consider xattr=sa on Linux so they don’t cost an extra hidden-directory lookup.

Everything POSIX

user.hash.sha256 = 64-hex
user.hash.xx64 = 16-hex
user.fixity.ts = ISO 8601
user.fixity.family = sha256,xx64
user.mpu.partsize = bytes (if relevant)
user.mpu.partcount = int
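
Here’s a sketch that stamps that key set onto a file with Python’s os.setxattr (Linux-only; the helper name is illustrative, and values are stored as UTF-8 strings):

import os, datetime

def stamp_fixity_xattrs(path, sha256_hex, xx64_hex, partsize=None, partcount=None):
    # Write the standard user.* fixity keys alongside the file itself
    attrs = {
        'user.hash.sha256': sha256_hex,
        'user.hash.xx64': xx64_hex,
        'user.fixity.ts': datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
        'user.fixity.family': 'sha256,xx64',
    }
    if partsize:
        attrs['user.mpu.partsize'] = str(partsize)
    if partcount:
        attrs['user.mpu.partcount'] = str(partcount)
    for name, value in attrs.items():
        os.setxattr(path, name, value.encode())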

Everything S3-ish: user metadata is (mostly) immutable post-upload

“So doing it today isn’t much help?” Exactly. Here’s the practical split:

Next time (correct-by-construction)

Compute the hashes before (or while) you upload, send them as user metadata on the PUT or on CreateMultipartUpload, record the MPU part size/count, and add the HashValid tag only after post-write verification passes. Nothing to backfill, ever.
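
A minimal sketch of that path, assuming boto3, a single-part PUT, and hashes already computed upstream; the function name, bucket, and key are placeholders:

import boto3

s3 = boto3.client('s3')

def upload_with_fixity(bucket, key, path, sha256_hex, xx64_hex, fixity_ts):
    # Single-part PUT that carries the fixity metadata from the first write
    with open(path, 'rb') as f:
        s3.put_object(
            Bucket=bucket, Key=key, Body=f,
            Metadata={                      # boto3 adds the x-amz-meta- prefix on the wire
                'hash-sha256': sha256_hex,
                'hash-xx64': xx64_hex,
                'fixity-ts': fixity_ts,
            },
        )
    # In the real pipeline, set this only after the post-write verify passes (tags stay mutable)
    s3.put_object_tagging(
        Bucket=bucket, Key=key,
        Tagging={'TagSet': [{'Key': 'HashValid', 'Value': 'True'}]},
    )

For objects big enough to need multipart, pass the same Metadata to create_multipart_upload and record the part size/count (user.mpu.*) so later checks know how the object was written.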

Today (backfill plan for existing objects)

  1. Discover gaps: use S3 Inventory or a bucket listing to find objects missing x-amz-meta-hash-sha256 (and friends).
  2. Compute hashes without re-downloading if you can: prefer your on-prem xattrs/manifests; otherwise stream from S3 with range GETs in a controlled lane.
  3. Self-copy to write metadata (creates a new version; charges requests; requires restore for Glacier/DA; respects Object Lock); see the boto3 pattern below.
  4. Write/merge tags via PutObjectTagging (safe post-upload).
  5. Versioning/Lock: if Object Lock (Compliance) is active and retention unexpired, you cannot replace metadata (copy will be blocked). Plan windows or use legal holds appropriately.
  6. KMS/SSE-C: include the source encryption headers on COPY; otherwise you’ll get 400s and a headache.
  7. Audit: after backfill, HEAD a sample to confirm headers/tags (see the sketch below); keep a “BackfillDebt = total − updated” counter until zero.
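
And for step 7, an illustrative spot check, assuming boto3 and the metadata/tag names used in this series:

import boto3

s3 = boto3.client('s3')

def audit_object(bucket, key):
    # True only if the backfilled user metadata and the HashValid tag are both present
    head = s3.head_object(Bucket=bucket, Key=key)
    meta_ok = {'hash-sha256', 'hash-xx64'} <= set(head.get('Metadata', {}))
    tags = s3.get_object_tagging(Bucket=bucket, Key=key)['TagSet']
    tag_ok = any(t['Key'] == 'HashValid' and t['Value'] == 'True' for t in tags)
    return meta_ok and tag_ok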

Minimal boto3 patterns (drop-in)

Backfill metadata via self-copy (single-part)

import boto3, datetime
s3 = boto3.client('s3')

def backfill_metadata(bucket, key, sha256_hex, xx64_hex, partsize=None, parts=None):
    meta = {
        'hash-sha256': sha256_hex,
        'hash-xx64': xx64_hex,
        'fixity-ts': datetime.datetime.utcnow().isoformat(timespec='seconds')+'Z'
    }
    if partsize: meta['mpu-partsize'] = str(partsize)
    if parts:    meta['mpu-parts']    = str(parts)

    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={'Bucket': bucket, 'Key': key},
        Metadata=meta,  # boto3 adds the x-amz-meta- prefix itself; don't double-prefix keys
        MetadataDirective='REPLACE'
    )

Tag (or re-tag) after upload

def set_tags(bucket, key, tags):
    s3.put_object_tagging(
        Bucket=bucket, Key=key,
        Tagging={'TagSet': [{'Key': k, 'Value': v} for k, v in tags.items()]}
    )

# Example:
set_tags('my-bucket','path/obj', {
  'HashValid':'True', 'SHA256Valid':'True', 'Provenance':'GPFS'
})

Multipart copy skeleton (very large objects)

def multipart_copy_replace_metadata(bucket, key, meta, part_size_bytes=128*1024*1024):
    size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']
    mpu = s3.create_multipart_upload(
        Bucket=bucket, Key=key,
        Metadata=meta  # boto3 adds the x-amz-meta- prefix; don't double-prefix keys
    )
    upload_id, parts = mpu['UploadId'], []
    # Copy the object onto itself in ranged parts, collecting ETags as we go
    for i, start in enumerate(range(0, size, part_size_bytes), start=1):
        part = s3.upload_part_copy(
            Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=i,
            CopySource={'Bucket': bucket, 'Key': key},
            CopySourceRange=f'bytes={start}-{min(start + part_size_bytes, size) - 1}')
        parts.append({'ETag': part['CopyPartResult']['ETag'], 'PartNumber': i})
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={'Parts': parts})

TL;DR you can paste into your doc

  1. Compute xxHash64 + SHA-256 once, at first touch, in a single read pass.
  2. Store the values on the object (user.* xattrs, x-amz-meta-hash-*) and in your manifest/DB.
  3. Verify on recall with a streamed fast hash; quarantine and triage mismatches.
  4. Tag HashValid only after post-write verification; watch MismatchRate and VerificationDebt.
  5. For objects already written without metadata, backfill via self-copy plus tags, and drive BackfillDebt to zero.

There you have it: the “do it right next time” and the “we didn’t, now what?” playbooks in one place.


“But isn’t hashing expensive?” (only if you do it wrong)

Hashing is expensive when you do it repeatedly, on bytes you already read, with CPUs you need elsewhere. Done right, it’s one pass at first touch while the data is already streaming, a fast non-cryptographic hash for routine guarding, SHA-256 reserved for trust boundaries and triage, and stream hashing piggybacked on reads and writes you were doing anyway. The cost to fear isn’t the hash; it’s re-reading petabytes later because nobody can prove what’s intact.


“Junior admin” summary (show this during onboarding)

  1. Every file gets user.hash.sha256, user.hash.xx64, and user.fixity.ts at first touch. No exceptions.
  2. Never move or copy an object without carrying those values forward (xattrs, x-amz-meta-*, tags, DB).
  3. On recall, stream-hash and compare to the stored xxHash64; mismatches go to quarantine, not to the user.
  4. Don’t re-hash because you lost the value; re-hash because policy says verify.
  5. If MismatchRate alarms or VerificationDebt keeps growing, escalate loudly. Silence is how archives rot.


A little snark to keep you awake

If your fixity plan is “the storage vendor said they do checksums,” that’s adorable. When the auditors come asking for proof, try replying, “Trust me, bro.” See how that plays in production.


So what — your next five moves

  1. Standardize keys (user.hash.*, x-amz-meta-hash-*, HashValid tag). Write it down.
  2. Instrument the pipeline to compute at first touch and propagate forward.
  3. Add recall-verify gates (stream hash on read; compare; quarantine on mismatch).
  4. Publish SLOs for verification windows and mismatch alarms.
  5. Ship the dashboard (MismatchRate, VerificationDebt, Verify throughput, Quarantine queue age).

Do this and your “digital preservation” stops being a poster and becomes a habit your systems can prove—without a séance.