This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: kwWL96rOZ_5VcorntL6xyrPME3dELCK2zVwHGP8owHY

The Fencing Gap: Why Your Distributed Lock Isn't Safe (and How to Fix It)

Written by @saumyatyagi | Published on 2026/4/7

TL;DR
You're using distributed locks to protect critical data—but they might be silently failing. A garbage collection pause, a network delay, or a clock skew can allow two clients into your critical section simultaneously, corrupting data without triggering any alerts. This article examines the "fencing gap"—the missing piece that prevents DynamoDB Lock Client, Redis, and ZooKeeper from providing true mutual exclusion out of the box. You'll learn how to detect whether your locks are vulnerable, implement fencing tokens for each platform, and decide when to abandon locks entirely in favor of simpler alternatives like idempotency or database transactions.

A Practitioner's Guide to DynamoDB, Redis, and ZooKeeper

1. The Problem: Locks Can Silently Fail

Last Tuesday, your order fulfillment system processed the same $499 purchase twice. Not because of a bug in your code—but because your "safe" distributed lock silently failed during a garbage collection pause. The lock expired while your service was frozen, another instance acquired it and fulfilled the order, then the original instance woke up and fulfilled it again. You didn't find out until accounting flagged the discrepancy three days later.

This isn't a rare edge case—it's inevitable if you don't close the fencing gap.

In 2016, Martin Kleppmann exposed a gap between the assumed and actual safety of distributed locks—a gap that remains unaddressed in most production systems today. His analysis correctly diagnosed the problem, but offered little guidance for engineers who still need to coordinate in the real world. This article bridges that gap: not by re-litigating theory, but by showing exactly how to implement fencing across three major platforms, when to avoid locks entirely, and how to detect silent failures before they corrupt your data.

Yet engineers continue deploying these locks to protect correctness-critical operations—financial transactions, inventory updates, distributed state machines—without understanding their limitations. Cloud-native architectures have made the problem worse, not better: containerization introduces unpredictable CPU throttling, managed services hide their consistency semantics, and the shift to microservices means more distributed coordination points than ever before.

Two Types of Locks: Know Which You Need

Efficiency locks prevent wasted work. Example: Running a nightly report twice wastes CPU but doesn't break data.

Correctness locks prevent data corruption. Example: Two services deducting from the same bank account concurrently causes negative balances.

This distinction determines everything that follows.

For efficiency locks, occasional double-execution is acceptable—any distributed lock implementation will work. For correctness locks, the lock alone is insufficient. You need fencing, and most systems don't provide it.

2. Why Locks Fail: The 60-Second Theory

If your system can't assume bounded network delays (and production systems can't), no lock algorithm can guarantee safety under all failures. Your job isn't to find a perfect lock—it's to choose where to accept risk and how to mitigate it.

Every lease-based lock shares a fundamental vulnerability. The lock holder must complete its work before the lease expires. If the holder is delayed—by garbage collection, network latency, or CPU contention—it may continue executing past lease expiration while another client has already acquired the lock.

Distributed locking isn't useless—it's incomplete. Like seatbelts in a car crash, locks reduce risk but don't eliminate it. Fencing is your airbag: it won't prevent the crash, but it can stop you from going through the windshield.

2.1 The Failure Timeline: How Two Clients End Up in the Critical Section

T0:  Client A acquires lock
T1:  Client A enters critical section, begins processing
T2:  Client A hits GC pause (heartbeat thread also paused)
T3:  Lease expires — Client A is unaware
T4:  Client B acquires lock
T5:  Client B enters critical section
T6:  Client A wakes, continues processing

→ TWO CLIENTS IN CRITICAL SECTION — DATA CORRUPTION

This isn't a bug in any implementation. It's inherent to lease-based locking. The solution is fencing: a monotonically increasing token that the storage layer uses to reject stale operations.
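The enforcement side of fencing can be sketched in a few lines of Python (names here are illustrative, not from any library): the storage layer remembers the highest token it has seen per resource and rejects anything lower.

```python
class FencedStore:
    """Toy storage layer that enforces fencing tokens (illustrative only)."""

    def __init__(self):
        self.data = {}
        self.highest_token = {}  # per-resource high-water mark

    def write(self, resource, value, token):
        # Reject any write carrying a token <= the highest one already seen.
        if token <= self.highest_token.get(resource, 0):
            return False  # stale holder: its lease expired while it was paused
        self.highest_token[resource] = token
        self.data[resource] = value
        return True

store = FencedStore()
assert store.write("order:1", "fulfilled-by-A", token=42) is True
# Client B acquired the lock later, so it holds a higher token:
assert store.write("order:1", "fulfilled-by-B", token=43) is True
# Client A wakes from its GC pause and retries with its stale token:
assert store.write("order:1", "fulfilled-again-by-A", token=42) is False
```

Note that the check lives in the storage layer, not the lock service: even a client that genuinely believes it still holds the lock cannot get a stale write through.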

3. Three Implementations: How They Work and When to Walk Away

3.1 DynamoDB Lock Client: How It Really Works

The DynamoDB Lock Client uses lease-based locking with heartbeating, leveraging DynamoDB's strongly consistent reads and conditional writes.

Acquisition flow: Client attempts a conditional PutItem that succeeds only if the lock doesn't exist or the version matches. The key insight is the recordVersionNumber (RVN)—a UUID that changes with each heartbeat. If a client finds an existing lock, it waits locally for the full lease duration, then checks if the RVN changed. If unchanged, the original holder is dead and acquisition proceeds.

Clock dependency: Low. A common misconception is that DynamoDB Lock Client depends on synchronized clocks. It doesn't. The protocol relies on locally measured elapsed time, not wall-clock timestamps. The acquiring client waits its own measured lease period—even with completely unsynchronized clocks across machines, the protocol remains correct.
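The local-elapsed-time idea can be illustrated in a few lines (a minimal sketch, not the library's code): the waiting client measures its own elapsed time with a monotonic clock, so wall-clock skew between machines cannot shorten the wait.

```python
import time

def wait_out_lease(lease_seconds):
    """Wait the full lease period as measured by this machine's monotonic
    clock; NTP steps and wall-clock skew cannot shorten the wait."""
    start = time.monotonic()
    while time.monotonic() - start < lease_seconds:
        time.sleep(0.01)
    return time.monotonic() - start

elapsed = wait_out_lease(0.1)
assert elapsed >= 0.1
```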

The failure mode: The library detects lost locks on the next heartbeat attempt and throws LockNotGrantedException. But by then, your code may have already performed operations assuming exclusive access. The lock detects but does not prevent the violation.

When to Walk Away from DynamoDB Lock Client: Without custom fencing implementation, DynamoDB Lock Client is only safe for efficiency locks. If you need correctness guarantees, you must implement fencing tokens yourself (see Section 4) or choose ZooKeeper.

3.2 Redlock's Fatal Flaw: Clock Assumptions

Single-instance Redis locking is simple and fast:

SET resource_name unique_value NX PX 30000

The single point of failure is obvious, making it appropriate for efficiency locks where you accept that risk.

Redlock attempts to solve this by distributing locks across N independent Redis instances (typically 5), requiring a quorum for acquisition. The algorithm calculates "validity time" based on elapsed time across nodes—and this is where it breaks.

The problem: Redlock's correctness depends on clocks across Redis instances advancing at roughly the same rate. Clock drift—from NTP adjustments, VM migration, or containerization—can cause one instance to expire locks early while others consider them valid.

Clock Skew Attack on Redlock

Setup: 5 Redis nodes (R1–R5), R3's clock is 5 seconds ahead

T0:  Client A acquires lock on R1, R2, R3 (quorum) with TTL=10s
T5:  R3 expires the lock (its clock shows T10)
T6:  Client B acquires lock on R3, R4, R5 (quorum)

→ BOTH CLIENTS HOLD THE LOCK

When to Walk Away from Redlock: Avoid Redlock for correctness-critical systems. Kleppmann's critique is fatal: the algorithm's safety depends on operational assumptions (bounded clock drift) that cannot be enforced algorithmically. Use single-instance Redis for efficiency locks, or ZooKeeper for correctness.

3.3 ZooKeeper: The Natural Fencing Token

ZooKeeper takes a fundamentally different approach using sequential ephemeral nodes and session-based expiration (heartbeat-based, not TTL-based).

The lock recipe: Create a sequential ephemeral node (/locks/resource/lock-000000001). Get children, check your position. If you have the lowest sequence number, you hold the lock. If not, set a watch on the node immediately before yours and wait. When that node is deleted, re-check your position.

Why watches matter: The watch mechanism is crucial for avoiding the "herd effect." If all waiting clients watched the lock holder directly, they would all wake up simultaneously when the lock is released, creating a thundering herd. By watching only the node immediately before yours, only one client wakes up on each lock release—the next in line.
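The "watch the node just before yours" rule reduces to finding your predecessor in the sorted children list. A sketch of that logic in Python (real clients would use Curator or kazoo rather than implement this by hand):

```python
def lock_status(children, my_node):
    """Given the children of the lock znode, decide whether my_node holds
    the lock and, if not, which node to watch (its immediate predecessor)."""
    ordered = sorted(children)            # zero-padded sequence numbers sort lexicographically
    idx = ordered.index(my_node)
    if idx == 0:
        return ("HELD", None)             # lowest sequence number holds the lock
    return ("WAITING", ordered[idx - 1])  # watch only the node before mine

children = ["lock-000000041", "lock-000000043", "lock-000000042"]
assert lock_status(children, "lock-000000041") == ("HELD", None)
assert lock_status(children, "lock-000000043") == ("WAITING", "lock-000000042")
```

When `lock-000000042` is deleted, only the client holding `lock-000000043` wakes up and re-checks its position; everyone else keeps sleeping.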

Why this matters: The sequential node number (or zxid) is a natural fencing token—monotonically increasing across all clients, globally ordered by the ZooKeeper cluster. When Client A creates lock-000000042 and later Client B creates lock-000000043, the storage layer can reject writes from token 42 after seeing token 43.

Session-based expiration: Unlike DynamoDB and Redis, which use time-based TTLs, ZooKeeper ties the lock to the client's session. If the client disconnects (crashes, network failure), ZooKeeper detects the missed heartbeats and deletes the ephemeral node automatically. No clock dependency.

Critical: "Built-in Fencing" Is a Myth. ZooKeeper is the only system that generates safe, monotonic fencing tokens out of the box—but your storage layer must still enforce them. If your database doesn't check incoming tokens and reject stale ones, you're no safer than with Redis.

3.4 Summary: What Each System Actually Provides

Capability                 DynamoDB Lock         Redis Single          Redlock   ZooKeeper
Clock Dependency           Low                   Low                   High      Low
Provides Fencing Tokens    No                    No                    No        Yes
Typical Latency            10–25ms               <1ms                  2–5ms     2–10ms
Operational Cost           Low (managed)         Low                   Medium    High
Correctness-Safe?          With custom fencing   With custom fencing   Avoid     With storage enforcement

4. Closing the Gap: Implementing Fencing

A complete fencing implementation requires three components working together:

  1. Token generation: The lock service provides a monotonically increasing token with each acquisition
  2. Token propagation: The client includes the token with every operation in the critical section
  3. Token enforcement: The storage layer tracks the highest token seen and rejects operations with lower tokens

Here's how to implement this for each platform.

4.1 Fencing with DynamoDB Lock Client

The DynamoDB Lock Client library doesn't provide fencing tokens natively, so you need to augment it:

// Step 1: Acquire lock using DynamoDB Lock Client library
AmazonDynamoDBLockClient lockClient = new AmazonDynamoDBLockClient(
    AmazonDynamoDBLockClientOptions.builder(dynamoDB, "LocksTable")
        .withLeaseDuration(60L)
        .withHeartbeatPeriod(5L)
        .withTimeUnit(TimeUnit.SECONDS)
        .build());

LockItem lock = lockClient.acquireLock(
    AcquireLockOptions.builder("order:" + orderId)
        .withAcquireReleasedLocksConsistently(true)
        .build());

// Step 2: Generate fencing token using separate atomic counter
// (Lock Client doesn't provide this — you must add it)
UpdateItemRequest tokenRequest = UpdateItemRequest.builder()
    .tableName("FencingTokens")
    .key(Map.of("resourceId", AttributeValue.fromS("order:" + orderId)))
    .updateExpression("SET #token = if_not_exists(#token, :zero) + :one")
    .expressionAttributeNames(Map.of("#token", "fencingToken"))
    .expressionAttributeValues(Map.of(
        ":zero", AttributeValue.fromN("0"),
        ":one", AttributeValue.fromN("1")))
    .returnValues(ReturnValue.UPDATED_NEW)
    .build();

long fencingToken = Long.parseLong(
    dynamoDB.updateItem(tokenRequest).attributes().get("fencingToken").n());

try {
    // Step 3: Use fencing token in all writes within critical section
    UpdateItemRequest dataRequest = UpdateItemRequest.builder()
        .tableName("Orders")
        .key(Map.of("orderId", AttributeValue.fromS(orderId)))
        .updateExpression("SET #status = :status, #token = :token")
        // CRITICAL: Rejects write if a higher token has already written
        .conditionExpression("attribute_not_exists(#token) OR #token < :token")
        .expressionAttributeNames(Map.of(
            "#status", "status", "#token", "fencingToken"))
        .expressionAttributeValues(Map.of(
            ":status", AttributeValue.fromS("FULFILLED"),
            ":token", AttributeValue.fromN(String.valueOf(fencingToken))))
        .build();

    dynamoDB.updateItem(dataRequest);
    // ConditionalCheckFailedException → stale token;
    // treat as terminal failure (do not retry)

} finally {
    // Always release the lock
    lockClient.releaseLock(lock);
}

Important: Fencing tokens must be scoped per protected resource (e.g., order_id, account_id)—a global counter won't prevent cross-resource interference.

4.2 Fencing with Redis + PostgreSQL

-- Generate token when acquiring lock (Redis)
-- Use resource-specific key: INCR fencing:order:12345
INCR fencing:resource_name
-- Returns monotonically increasing integer

-- Schema change: Add fencing column (PostgreSQL)
ALTER TABLE protected_resources ADD COLUMN fencing_token BIGINT DEFAULT 0;

-- Every write checks the token
UPDATE protected_resources
SET value = $1,
    fencing_token = $2,    -- New token from Redis INCR
    updated_at = NOW()
WHERE id = $3
AND fencing_token < $2;    -- CRITICAL: Rejects stale tokens

-- Affected rows = 0 means token was stale, abort operation

4.3 The Cost of Fencing

  • Latency: One extra conditional check per write (~1–5ms)
  • Schema changes: Every protected table needs a fencing_token column
  • No retries possible: A stale token error means your operation is obsolete—retrying will never succeed. Handle it as a terminal failure.
  • Cross-system coordination: If your critical section spans multiple storage systems, each must enforce fencing

5. Escape Hatches: When Not to Use Locks

Before implementing fencing, ask whether you need a distributed lock at all. Often, simpler alternatives provide better guarantees with less complexity.

5.1 Decision Flow

Work through these questions in order:

  1. Can you make the operation idempotent? → Use idempotency keys. No lock needed.
  2. Does it touch only one database? → Use SELECT FOR UPDATE with database transactions.
  3. Is contention rare (<1% of operations conflict)? → Use optimistic concurrency control.
  4. Is this correctness-critical? → Use distributed lock WITH fencing.
  5. Otherwise? → You probably don't need coordination at all.

5.2 Idempotency: The Simplest Solution

If duplicate execution is harmless, you don't need mutual exclusion. Stripe uses idempotency keys for payment processing—every API request includes a unique key, and the system deduplicates at the storage layer. This approach shifts the problem from "prevent duplicates" to "handle duplicates gracefully," which is often much easier:

-- Client generates unique request_id (UUID or hash of request parameters)
INSERT INTO processed_requests (request_id, result, created_at)
VALUES ($1, $2, NOW())
ON CONFLICT (request_id) DO NOTHING;

-- Affected rows: 1 = first execution, 0 = duplicate (safe to skip)

The beauty of idempotency is that it's failure-agnostic. Network timeouts, process crashes, retries—none of them cause duplicates because the storage layer handles deduplication. No distributed coordination required, no lease durations to tune, no fencing tokens to propagate.
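The same dedup-at-the-storage-layer idea, simulated in Python (a dict stands in for the processed_requests table; names are illustrative):

```python
import uuid

processed = {}  # stands in for the processed_requests table

def handle(request_id, work):
    """Execute work at most once per request_id; replays return the cached result."""
    if request_id in processed:            # analogue of ON CONFLICT DO NOTHING
        return processed[request_id], False
    result = work()
    processed[request_id] = result
    return result, True

rid = str(uuid.uuid4())
calls = []
charge = lambda: calls.append("charged") or "ok"
assert handle(rid, charge) == ("ok", True)
assert handle(rid, charge) == ("ok", False)  # retry after a timeout: no double charge
assert calls == ["charged"]                  # the side effect ran exactly once
```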

5.3 Optimistic Concurrency Control

For read-modify-write patterns with low contention, version-based optimistic locking is often simpler than distributed locks. The approach trades coordination overhead for retry overhead—you don't acquire any lock, but you may need to retry if another client modified the data:

-- Read with version
SELECT value, version FROM resources WHERE id = ?;

-- Modify locally, then write with version check
UPDATE resources SET value = ?, version = version + 1
WHERE id = ? AND version = ?;

-- Affected rows = 0 means conflict, retry from read

This pattern works well when conflicts are rare (<1% of operations). Under high contention, retries can cascade and performance degrades rapidly. But for many workloads—user profile updates, configuration changes, low-frequency inventory adjustments—OCC provides correctness guarantees without any external coordination.
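The retry-from-read loop looks like this against an in-memory record (a sketch; a real implementation would issue the SQL above, and the lock inside `Record` stands in for the database's atomic version check):

```python
import threading

class Record:
    """In-memory stand-in for a row with a version column."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self._lock = threading.Lock()  # models the DB's atomic compare-and-set

    def read(self):
        return self.value, self.version

    def cas_write(self, new_value, expected_version):
        with self._lock:
            if self.version != expected_version:
                return False          # someone else wrote first: conflict
            self.value, self.version = new_value, self.version + 1
            return True

def occ_update(record, transform, max_retries=10):
    for _ in range(max_retries):
        value, version = record.read()                   # read with version
        if record.cas_write(transform(value), version):  # write with version check
            return True
    return False                                         # contention too high: give up

rec = Record(10)
assert occ_update(rec, lambda v: v + 5)
assert rec.read() == (15, 1)
```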

5.4 Database Transactions

If your critical section operates on a single database, use what the database gives you:

BEGIN;
SELECT * FROM accounts WHERE id = ? FOR UPDATE;  -- Row locked until COMMIT
UPDATE accounts SET balance = balance - ? WHERE id = ?;
UPDATE accounts SET balance = balance + ? WHERE id = ?;
COMMIT;

Database transactions provide ACID guarantees no distributed lock can match.

5.5 CRDTs: Coordination-Free State

For eventually consistent systems with mergeable state (counters, sets, registers), consider Conflict-Free Replicated Data Types. CRDTs provide strong eventual consistency without any coordination—updates can happen concurrently on any replica and are guaranteed to converge. If your use case fits the CRDT model, you eliminate the distributed locking problem entirely.
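The grow-only counter (G-Counter) is the classic example: each replica increments only its own slot, and merge takes the elementwise max, so concurrent updates converge regardless of delivery order. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT: per-replica slots, merge = elementwise max."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.slots = {}

    def increment(self, n=1):
        # Each replica only ever increments its own slot.
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.slots.values())

    def merge(self, other):
        # Merge is commutative, associative, and idempotent.
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(3)       # concurrent updates on different replicas...
b.increment(2)
a.merge(b)
b.merge(a)           # ...converge regardless of merge order
assert a.value() == b.value() == 5
```

No lock, no lease, no fencing token: the data structure itself guarantees convergence.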

6. Production War Stories

6.1 How a 30-Second GC Pause Caused Duplicate Orders

An order processing service used DynamoDB Lock Client to prevent duplicate fulfillment. During a traffic spike, JVM heap pressure triggered a 30-second GC pause—far exceeding the 10-second lease duration. The lock expired, another instance acquired it and fulfilled the order, then the original instance woke up and fulfilled it again. The team only discovered the issue when accounting flagged duplicate shipments three days later.

How we fixed it: We increased lease duration to 60s (3× our p99 GC pause) and added alerts when hold time exceeded 80% of lease. We also implemented fencing tokens as a backstop—now, duplicate orders trigger a PagerDuty alert before fulfillment completes, and the fencing check prevents the duplicate write.

What to monitor: Lock hold duration (alert if p99 > 50% of lease), heartbeat latency (alert if p99 > heartbeat interval / 2).

6.2 The Metrics That Would Have Saved Us

Essential monitoring for any distributed lock deployment:

  • Lock acquisition latency (p50, p95, p99): Sudden increases indicate contention or lock service issues
  • Lock hold duration: Alert if p99 approaches 50% of lease duration
  • Heartbeat failures: Even one failure is a warning sign worth investigating
  • Fencing token rejections: Alert if >0.1% of writes are rejected—this means locks are failing silently

6.3 The Stale Read Trap

Fencing prevents stale writes but not stale reads. If Client A reads data, pauses, loses the lock, and Client B modifies the data, Client A may wake up and make decisions based on outdated information—even if its final write is correctly rejected.

The fix: Check the fencing token before any irreversible operation (API calls, emails, external service updates), not just before writes to the fenced storage. If you've sent a confirmation email based on stale data, the damage is done regardless of whether your database write succeeds. So before every irreversible action, read the latest fencing token from storage, even if you haven't written anything yet; if the stored token is greater than yours, abort immediately.
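That pre-flight check can be sketched as follows (names are illustrative): before any irreversible side effect, re-read the high-water token and abort if a newer holder has already written.

```python
class StaleTokenError(Exception):
    pass

def guard_irreversible(stored_tokens, resource, my_token):
    """Abort before a side effect if a newer lock holder has already written."""
    if stored_tokens.get(resource, 0) > my_token:
        raise StaleTokenError(f"token {my_token} is stale for {resource}")

stored = {"order:1": 43}   # Client B (token 43) has already written
try:
    guard_irreversible(stored, "order:1", my_token=42)
    sent = True            # would send the confirmation email here
except StaleTokenError:
    sent = False           # abort: our view of the order is outdated
assert sent is False
```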

6.4 Chaos Testing Your Locks

Your lock will fail in production. The question is whether you've seen the failure mode before. Essential chaos tests:

  • Kill the lock holder mid-critical-section: Does the system recover correctly? Does another client acquire the lock within expected time?
  • SIGSTOP the process, resume after lease expiry: Does fencing prevent corruption? Does the client detect its lost lock?
  • Network partition between client and lock service: Does the client detect loss of lock before the lease expires?
  • Exhaust connection pools: Does heartbeating fail gracefully? Does the client release the lock?

7. Conclusion: Audit Your Locks Today

Distributed locking is harder than it appears because the failure modes are invisible until they corrupt your data. No lock implementation—not DynamoDB Lock Client, not Redis, not even ZooKeeper—can guarantee mutual exclusion in the presence of arbitrary process pauses without fencing.

The key insight from Kleppmann's analysis, which remains valid nearly a decade later, is that the lock is only half the solution. Without fencing—token generation, propagation, and storage-layer enforcement—you have a best-effort mechanism that will eventually fail in ways you won't detect until the damage is done.

For many systems, the right answer isn't to implement fencing—it's to eliminate the need for distributed locks entirely. Idempotency, optimistic concurrency, database transactions, and CRDTs each solve coordination problems more elegantly for specific use cases. The decision framework in Section 5 should be your first stop, not an afterthought.

Action Items

  1. Label every distributed lock in your system as "efficiency" or "correctness."
  2. For correctness locks: either implement end-to-end fencing or replace the lock with idempotency/transactions.
  3. Add monitoring for hold duration and fencing token rejections.

Implementing these practices ensures your coordination logic stays robust under real-world failures—so your future self (and your data) stay safe.


References

  1. Kleppmann, M. (2016). "How to do distributed locking." martin.kleppmann.com
  2. Sanfilippo, S. (2016). "Is Redlock safe?" antirez.com
  3. Amazon Web Services. "DynamoDB Lock Client." GitHub
  4. Apache ZooKeeper. "Recipes and Solutions." zookeeper.apache.org
  5. Shapiro, M. et al. (2011). "Conflict-Free Replicated Data Types." HAL INRIA


