In this article, I will walk you through one of the most interesting engineering challenges I have worked on: building the settlement engine for a national Instant Payment System. We will start with what an IPS is and why settlement is the hardest part of it; then we will look at why a state machine is the natural model for this problem; finally, we will go deep into how we serialized a Spring State Machine to Postgres and restored it on a standby replica, so the system could survive node failures without losing a single in-flight transaction.

What Is an Instant Payment System?

If you have ever transferred money and it arrived in seconds — that is an IPS at work. An Instant Payment System is a national or regional infrastructure that enables real-time fund transfers between banks, 24/7/365. Think SEPA Instant in Europe, FedNow in the US, or Pix in Brazil. These are not your regular wire transfers that take hours or days. The money moves in seconds, and once it moves, it is final and irrevocable.

But here is the thing most people do not think about — when Alice at Bank A sends €100 to Bob at Bank B, there is not just one step happening. There is a whole chain of operations that must be completed in under 10 seconds.

Let me walk you through it.

The Payment Flow

In simplified form, the chain looks like this:

  1. Alice initiates the payment in her banking app.
  2. Bank A debits Alice's account and builds a payment message.
  3. Bank A submits the message to the central IPS.
  4. The IPS validates the message: format, participant status, limits.
  5. The IPS routes the message to Bank B.
  6. Bank B validates the payment and accepts (or rejects) it.
  7. The IPS settles the transfer between the banks' accounts.
  8. Both banks receive confirmation, and Bob sees the money.

Why Settlement Is the Hardest Part

You might think that routing and validation (steps 3–6) are the hard parts. They are complex, sure, but conceptually, they are message validation and routing. Settlement is where the real engineering challenge lives, and here is why.

Settlement has properties that most software does not need to worry about: finality (you cannot reverse it like an HTTP retry), atomicity across institutions (debiting Bank A and crediting Bank B must be a single atomic operation), 24/7 uptime (no maintenance windows while transactions are in flight), and full auditability (every state change must be traceable for regulators who will ask you about a specific transaction six months later).

This is why the settlement engine must be a state machine. Not as an architectural preference, but because the problem itself is a finite automaton.

The Settlement State Machine

A payment in an IPS is not a thing — it is a process with a lifecycle. It has a defined set of states, a defined set of events that trigger transitions, and strict rules about which transitions are legal from which states.

Here is a simplified but faithful representation of what we built:

RECEIVED → VALIDATING → CLEARING → AWAITING_RESPONSE → SETTLING → SETTLED
                                         ↓                  ↓
                                      REJECTED           FAILED

Each transition has guards — conditions that must be true for the transition to fire. For example, you cannot go from VALIDATING to CLEARING if the sender's bank does not have sufficient liquidity. You cannot go from AWAITING_RESPONSE to SETTLING if the receiver's bank sent a rejection. The state machine enforces these invariants at the framework level, so a bug in business logic literally cannot produce an illegal transition.
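Stripped of any framework, the invariant is just a transition table: a map from (source state, event) to target state, where anything absent is illegal. A minimal plain-Java sketch (state and event names follow the article; the table itself is illustrative, not our production config):

```java
import java.util.EnumMap;
import java.util.Map;

enum State { RECEIVED, VALIDATING, CLEARING, AWAITING_RESPONSE, SETTLING, SETTLED, REJECTED, FAILED }
enum Event { VALIDATE, VALIDATION_PASSED, CLEARED, ACCEPTED, REJECT, SETTLEMENT_CONFIRMED, SETTLEMENT_FAILED }

final class TransitionTable {
    // legal (source, event) -> target; any pair not in the map is an illegal transition
    private static final Map<State, Map<Event, State>> LEGAL = new EnumMap<>(State.class);
    static {
        LEGAL.put(State.RECEIVED, Map.of(Event.VALIDATE, State.VALIDATING));
        LEGAL.put(State.VALIDATING, Map.of(Event.VALIDATION_PASSED, State.CLEARING));
        LEGAL.put(State.CLEARING, Map.of(Event.CLEARED, State.AWAITING_RESPONSE));
        LEGAL.put(State.AWAITING_RESPONSE, Map.of(
            Event.ACCEPTED, State.SETTLING,
            Event.REJECT, State.REJECTED));
        LEGAL.put(State.SETTLING, Map.of(
            Event.SETTLEMENT_CONFIRMED, State.SETTLED,
            Event.SETTLEMENT_FAILED, State.FAILED));
    }

    // returns the target state, or throws if the transition is not in the table
    static State next(State source, Event event) {
        State target = LEGAL.getOrDefault(source, Map.of()).get(event);
        if (target == null) {
            throw new IllegalStateException("Illegal transition: " + source + " on " + event);
        }
        return target;
    }
}
```

The framework version below adds guards and actions on top of exactly this structure.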

We used Spring State Machine because it fits naturally into a Spring Boot ecosystem. The configuration looks like this:

@Override
public void configure(
        StateMachineTransitionConfigurer<SettlementState, SettlementEvent> transitions)
        throws Exception {
    transitions
        .withExternal()
            .source(RECEIVED).target(VALIDATING)
            .event(VALIDATE)
            .and()
        .withExternal()
            .source(VALIDATING).target(CLEARING)
            .event(VALIDATION_PASSED)
            .guard(liquidityGuard()) // can the sender's bank cover this?
            .and()
        .withExternal()
            .source(AWAITING_RESPONSE).target(SETTLING)
            .event(ACCEPTED)
            .action(executeSettlementAction()) // the actual ledger debit/credit
            .and()
        .withExternal()
            .source(SETTLING).target(SETTLED)
            .event(SETTLEMENT_CONFIRMED);
}

Clean, declarative, and readable. The guards make sure no illegal transition happens. The actions execute the side effects (like the actual ledger operations). So far so good.
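The body of liquidityGuard() is not shown above; as a hypothetical sketch, the check it wraps is essentially a predicate over the sender bank's available liquidity. Modeled here without the Spring Guard interface (the class and its in-memory liquidity map are illustrative assumptions):

```java
import java.math.BigDecimal;
import java.util.Map;

final class LiquidityGuard {
    private final Map<String, BigDecimal> availableLiquidity; // bankId -> available funds

    LiquidityGuard(Map<String, BigDecimal> availableLiquidity) {
        this.availableLiquidity = availableLiquidity;
    }

    // the predicate behind the guard: the VALIDATING -> CLEARING transition
    // may fire only if the sender's bank can cover the settlement amount
    boolean evaluate(String senderBankId, BigDecimal amount) {
        BigDecimal available = availableLiquidity.getOrDefault(senderBankId, BigDecimal.ZERO);
        return available.compareTo(amount) >= 0;
    }
}
```

In the real configuration the same predicate returns false instead of throwing, and the framework simply refuses to fire the transition.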

But here comes the main question: what happens when the node running this state machine dies while a payment is in the SETTLING state?

The Problem: In-Memory State Does Not Survive

By default, Spring State Machine lives entirely in memory. The current state, the extended state variables (settlement amount, participant bank IDs, reconciliation tokens) — everything sits in the JVM heap. If the node crashes, all of it is gone.

Now, Spring does offer a persistence mechanism — StateMachinePersist and StateMachinePersister. These interfaces let you serialize a StateMachineContext to an external store and restore it onto a new machine instance. The concept is straightforward, and you can find it in the Spring State Machine documentation.

But here is the thing — this mechanism was designed for cases like resuming a shopping cart workflow when a user returns to a website. It was not designed for national payment infrastructure.

We hit four serious problems very quickly:

  1. No transition history. It persists a snapshot, not a log. If the machine transitions through three states in quick succession, you only get the last one. Regulators need every transition.

  2. Sub-machine restore is broken. We used hierarchical states — a parent machine for the settlement cycle with child machines for individual payment legs in a batch. On restore, child regions sometimes reset to their initial state instead of resuming. For us, that meant a payment leg could re-execute, which means duplicate fund movements.

  3. The extended state does not propagate cleanly. Inactive sub-machines sometimes lost their variables on restore. A missing reconciliation token is a compliance failure.

  4. No concurrency protection. Two events arriving simultaneously for the same settlement could race through the persist-on-state-change interceptor and leave the database in a state the machine was never actually in.

So what did we do? We built our own persistence layer on top of Spring State Machine — using Postgres.

The Solution: Serialize to Postgres, Restore on Replica

The design principle was simple:

Every state transition must be persisted to Postgres before the machine moves forward. A standby replica must be able to restore any in-flight settlement from the database and continue processing where the primary left off.

The Schema

We created two tables with two different jobs:

CREATE TABLE settlement_machine_snapshot (
    settlement_id    VARCHAR(64) PRIMARY KEY,
    machine_context  BYTEA NOT NULL,       -- Kryo-serialized StateMachineContext
    current_state    VARCHAR(32) NOT NULL,  -- denormalized for ops queries
    extended_state   JSONB NOT NULL,        -- denormalized for audit/debug
    sequence_number  BIGINT NOT NULL,
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE settlement_transition_log (
    id               BIGSERIAL PRIMARY KEY,
    settlement_id    VARCHAR(64) NOT NULL,
    sequence_number  BIGINT NOT NULL,
    source_state     VARCHAR(32) NOT NULL,
    target_state     VARCHAR(32) NOT NULL,
    triggering_event VARCHAR(64) NOT NULL,
    extended_state   JSONB NOT NULL,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (settlement_id, sequence_number)
);

settlement_machine_snapshot holds the latest serialized context for each active settlement — this is the fast-restore path. Load the bytes, deserialize, restore onto a fresh machine instance, and continue.

settlement_transition_log is the append-only audit trail. It is never updated, never deleted (only archived after the retention period passes). This is what you hand to regulators when they ask "what happened to payment X on Tuesday at 03:47 AM?"
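That regulator question translates directly into a query over the log (the settlement ID here is hypothetical):

```sql
-- full lifecycle of one payment, in transition order
SELECT sequence_number, source_state, target_state,
       triggering_event, created_at
FROM settlement_transition_log
WHERE settlement_id = 'PAY-000123'
ORDER BY sequence_number;
```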

You might ask, "Why store the same data three ways — a binary blob, denormalized columns, and a transition log?" Because each serves a different audience. The blob is for the machine. The columns are for the ops team running queries at 3 AM. The log is for compliance. This is not redundancy; it is by design.

Serialization: StateMachineContext to Bytes

You cannot just use standard Java serialization for a StateMachineContext. The object graph has deep references into Spring framework internals — bean factories, proxies, context objects — and they do not deserialize cleanly across JVM instances. Spring's own Redis-based persistence uses Kryo for this reason, and so did we:

// Kryo is NOT thread safe, hence ThreadLocal
private final ThreadLocal<Kryo> kryoPool = ThreadLocal.withInitial(() -> {
    Kryo kryo = new Kryo();
    kryo.setRegistrationRequired(false);
    kryo.register(DefaultStateMachineContext.class);
    kryo.register(SettlementState.class);
    kryo.register(SettlementEvent.class);
    kryo.register(BigDecimal.class); // settlement amounts
    kryo.register(Instant.class);    // timestamps
    return kryo;
});

We stored the result as BYTEA in Postgres. You might think, "Why not JSONB for the whole context?" Because JSON serialization of a StateMachineContext loses type information on nested objects and does not round-trip cleanly. Binary is ugly, but it is correct.
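The type-fidelity point can be illustrated on the extended state alone. This sketch uses plain java.io serialization rather than Kryo (a map of value types has none of the framework-internal references that rule java.io out for the full context); the class name is illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

final class ExtendedStateRoundTrip {

    // serialize the extended-state variables to bytes (the BYTEA-shaped blob)
    static byte[] toBytes(HashMap<String, Object> vars) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(vars);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // restore the variables; nested value types (BigDecimal, Instant) come back intact,
    // whereas a JSON round-trip would hand you strings and doubles
    @SuppressWarnings("unchecked")
    static HashMap<String, Object> fromBytes(byte[] blob) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (HashMap<String, Object>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```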

The Persist Layer

Our custom StateMachinePersist writes both tables in a single call:

@Transactional // both writes must commit or roll back together
@Override
public void write(StateMachineContext<SettlementState, SettlementEvent> context,
                  String settlementId) throws Exception {

    byte[] blob = serializer.serialize(context);
    String stateJson = toJson(context.getExtendedState().getVariables());
    long seq = getNextSequenceNumber(settlementId);

    // upsert the snapshot — always reflects the latest state
    jdbc.update("""
        INSERT INTO settlement_machine_snapshot
            (settlement_id, machine_context, current_state,
             extended_state, sequence_number)
        VALUES (?, ?, ?, ?::jsonb, ?)
        ON CONFLICT (settlement_id) DO UPDATE SET
            machine_context = EXCLUDED.machine_context,
            current_state   = EXCLUDED.current_state,
            extended_state  = EXCLUDED.extended_state,
            sequence_number = EXCLUDED.sequence_number,
            updated_at      = now()
        """, settlementId, blob, context.getState().name(), stateJson, seq);

    // append to the transition log — immutable history
    jdbc.update("""
        INSERT INTO settlement_transition_log
            (settlement_id, sequence_number, source_state,
             target_state, triggering_event, extended_state)
        VALUES (?, ?, ?, ?, ?, ?::jsonb)
        """, settlementId, seq, getPreviousState(settlementId),
             context.getState().name(), context.getEvent().name(), stateJson);
}

The Interceptor: No Transition Without Persistence

This part is the most important. We wired the persistence into Spring's StateMachineInterceptor, which fires during the transition, not after. Why does this matter? Because if the Postgres write fails, the transition itself rolls back. The state machine and the database can never disagree:

@Override
public void postStateChange(
        State<SettlementState, SettlementEvent> state,
        Message<SettlementEvent> message,
        Transition<SettlementState, SettlementEvent> transition,
        StateMachine<SettlementState, SettlementEvent> stateMachine,
        StateMachine<SettlementState, SettlementEvent> rootStateMachine) {

    String id = (String) stateMachine.getExtendedState()
        .getVariables().get("settlementId");
    try {
        persist.write(buildContext(state, message, stateMachine), id);
    } catch (Exception e) {
        // if persistence fails, the settlement must NOT proceed
        throw new SettlementPersistenceException(
            "Persist failed for " + id, e);
    }
}

If you treat persistence as something that happens after the fact, then recovery becomes a hope, not a guarantee. We made persistence a precondition for every transition.
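Reduced to its skeleton, the precondition pattern looks like this (a hypothetical sketch, not the Spring wiring): the in-memory state is mutated only after the write succeeds, so a failed write leaves the machine exactly where it was.

```java
import java.util.Objects;

final class PersistFirstMachine {

    interface TransitionWriter {
        // must durably record the transition, or throw
        void write(String source, String target, String event);
    }

    private final TransitionWriter writer;
    private String state;

    PersistFirstMachine(String initialState, TransitionWriter writer) {
        this.state = Objects.requireNonNull(initialState);
        this.writer = writer;
    }

    String state() { return state; }

    void fire(String event, String target) {
        // persistence is a precondition: if write() throws, state is untouched
        writer.write(state, target, event);
        state = target;
    }
}
```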

Restore on Failover

And here is the payoff. When the standby replica starts — whether after a crash, a deployment, or a planned failover — it queries Postgres for every settlement not in a terminal state and brings them back to life:

public void recoverInFlightSettlements() {
    List<String> active = jdbc.queryForList("""
        SELECT settlement_id FROM settlement_machine_snapshot
        WHERE current_state NOT IN ('SETTLED','REJECTED','FAILED','RETURNED')
        """, String.class);

    for (String id : active) {
        // 1. create a fresh state machine from the factory
        StateMachine<SettlementState, SettlementEvent> machine =
            factory.getStateMachine(id);

        // 2. restore the serialized context from Postgres
        persister.restore(machine, id);

        // 3. verify — do NOT trust the deserialized blob blindly
        String dbState = getExpectedState(id);
        String machineState = machine.getState().getId().name();
        if (!machineState.equals(dbState)) {
            throw new SettlementConsistencyException(
                "Mismatch for " + id + ": DB=" + dbState
                + " machine=" + machineState);
        }

        // 4. verify that extended state variables survived serialization
        Map<String, Object> vars = machine.getExtendedState().getVariables();
        requireNonNull(vars.get("settlementAmount"), "missing amount");
        requireNonNull(vars.get("senderBankId"), "missing sender");
        requireNonNull(vars.get("receiverBankId"), "missing receiver");
        requireNonNull(vars.get("reconciliationToken"), "missing token");

        // 5. resume — the machine is now live on this replica
        settlementProcessor.resume(machine, id);
    }
}

The verification step is non-negotiable. We check the restored machine state against the denormalized current_state column, and we confirm every required extended state variable is present. If anything does not match, the settlement goes to manual review instead of proceeding with corrupt state.

Concurrency: Let Postgres Do the Work

One last problem — what happens when a timeout event and a bank response event arrive at the same time for the same settlement? Two threads read the same state, both try to transition, both try to persist.

The fix was the UNIQUE (settlement_id, sequence_number) constraint on the transition log. Both threads try to write sequence N+1. One wins. The other gets a unique constraint violation, and our interceptor translates that into a transition rollback.

Optimistic locking via Postgres let us avoid a distributed lock manager and Redis entirely. Just a unique constraint doing exactly what it was designed for.
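The first-writer-wins behavior can be modeled in memory, with putIfAbsent standing in for the unique constraint (a sketch of the idea, not the Postgres path):

```java
import java.util.concurrent.ConcurrentHashMap;

final class SequenceClaim {
    // (settlementId, sequenceNumber) -> winning event; putIfAbsent plays the
    // role of UNIQUE (settlement_id, sequence_number)
    private final ConcurrentHashMap<String, String> claims = new ConcurrentHashMap<>();

    // true if this writer claimed the sequence number, false if it lost the race
    boolean tryClaim(String settlementId, long sequenceNumber, String event) {
        return claims.putIfAbsent(settlementId + ":" + sequenceNumber, event) == null;
    }
}
```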

What I Took Away From This Project

This system ran in production. It survived node failures, rolling deployments, and traffic spikes when payroll landed on public holidays. Every interrupted settlement was recovered from Postgres and completed on a standby replica without anyone picking up a phone.

Here is what I carry with me from this experience:

Design the persistence schema before the state machine config. Your persistence model is your recovery model. If you think of persistence as plumbing, you have already decided that recovery is an afterthought.

Store the same state for different audiences. The binary blob for the machine, the denormalized columns for ops, the transition log for regulators. Each view serves a purpose.

Wrap the framework, do not fight it. Spring State Machine does states and transitions beautifully. It does not do financial-grade durability. We built around it instead of trying to patch it.

Optimistic locking is enough. A Postgres unique constraint solved our concurrency problem with zero operational overhead. I have used this pattern on every project since.

Use state machines more often. Any process with lifecycle stages and strict transition rules should be modeled as a state machine. The explicitness forces you to enumerate failure modes. You cannot have an "impossible" state transition if your model says it is impossible.

Payment infrastructure is transforming globally — the EU's Instant Payments Regulation mandates sub-10-second settlement 24/7, Canada is launching its Real-Time Rail, and Brazil's Pix already handles billions of transactions. Every one of these systems needs state transitions that persist reliably, restore deterministically, and audit completely.

If you are building anything where a state transition has real-world consequences, start with the persistence. The business logic is the part any decent engineer can figure out. Making it survive the real world — that is the architecture.