Introduction

If you’ve wired up PostgreSQL → Debezium → Kafka and started consuming change events, you may have seen something weird in your payloads:

{
  "after": {
    "id": 123,
    "title": "__debezium_unavailable_value",
    "body": "__debezium_unavailable_value"
  }
}

You know those title and body columns have data in the database, but Debezium is emitting __debezium_unavailable_value instead.

If you’re trying to:

  - mirror the table into another database or warehouse,
  - keep a search index or cache in sync, or
  - build any derived state from full row images,

this placeholder can silently corrupt your downstream state.

This post explains:

  - what __debezium_unavailable_value actually means,
  - why PostgreSQL’s TOAST storage and REPLICA IDENTITY settings cause it, and
  - two ways to deal with it: at the database level, and in your consumers.

The Problem: CDC Events Missing Previous Values

Consider a table:

CREATE TABLE articles (
  id       BIGSERIAL PRIMARY KEY,
  title    TEXT,
  body     TEXT,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

You configure Debezium’s PostgreSQL connector to stream changes from articles into Kafka. For an UPDATE, you expect Debezium to send the new row state, including title and body.
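
Concretely, the connector registration might look something like this. This is a sketch assuming Debezium 2.x with the pgoutput plugin; the connector name, hostnames, credentials, database name, and slot name are all illustrative, and topic.prefix is set to cdc so that events land on the cdc.public.articles topic consumed later in this post.

{
  "name": "articles-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "app",
    "topic.prefix": "cdc",
    "table.include.list": "public.articles",
    "slot.name": "articles_slot"
  }
}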

Instead, you see events like:

{
  "op": "u",
  "after": {
    "id": 123,
    "title": "__debezium_unavailable_value",
    "body": "__debezium_unavailable_value",
    "updated_at":  "2025-01-01T10:00:00Z"
  },
  "before": {
    "id": 123,
    "title": "__debezium_unavailable_value",
    "body": "__debezium_unavailable_value",
    "updated_at": "2024-12-31T15:00:00Z"
  }
}
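
An event like this typically comes from an UPDATE that changed some columns but left the large ones untouched, for example (illustrative statement, consistent with the event above):

UPDATE articles
SET updated_at = now()
WHERE id = 123;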

Downstream, your consumer:

  - receives the literal string __debezium_unavailable_value instead of the real text,
  - has no reliable way to tell “unchanged” apart from “actually set to this string”, and
  - if it writes events to a sink naively, overwrites good data with the placeholder.

What’s going on?

What’s Really Happening: TOAST and Replica Identity

Two PostgreSQL features are colliding here:

1. TOAST (The Oversized-Attribute Storage Technique)

PostgreSQL compresses large column values (e.g., big TEXT or JSONB) and, if they are still too large, stores them out of line in a separate “TOAST” table so that rows fit within the 8 kB heap pages.

Key behavior:

  - Column values are TOASTed once a row exceeds roughly 2 kB (the default threshold).
  - On UPDATE, an unmodified TOASTed column is not rewritten; the new row version just points at the existing out-of-line value, so the content never enters the WAL record.
  - Logical decoding therefore sees the updated row without the contents of unchanged TOASTed columns.

Debezium represents those missing values as __debezium_unavailable_value, because it cannot reconstruct the actual content from the WAL alone.
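
If you want to verify that TOAST is in play, two quick diagnostics (plain PostgreSQL, nothing Debezium-specific):

-- pg_column_size() reports the stored (possibly compressed) size of a value
SELECT id, pg_column_size(body) AS body_bytes
FROM articles
ORDER BY body_bytes DESC NULLS LAST
LIMIT 5;

-- The TOAST table backing "articles", if one exists
SELECT t.relname AS toast_table
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relname = 'articles';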

2. Replica Identity

PostgreSQL controls how much “OLD” data is logged for replication via REPLICA IDENTITY.

By default:

ALTER TABLE articles REPLICA IDENTITY DEFAULT;

means:

  - only the primary key columns are recorded in the old (“before”) row image for UPDATEs and DELETEs, and
  - non-key columns, TOASTed or not, are simply absent from that image.

So with the default setting, Debezium never receives the previous values of title and body, no matter how the connector is configured.

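You can check a table’s current setting directly in pg_class:

-- relreplident: d = default (primary key), f = full, i = using index, n = nothing
SELECT relname, relreplident
FROM pg_class
WHERE relname = 'articles';
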
Solution 1: Adjust REPLICA IDENTITY (Full or Index-Based)

If your use case really needs the previous values of specific columns, you can tell PostgreSQL to log more information.

Option A: REPLICA IDENTITY FULL

This logs the “before” image for all columns, including TOASTed ones.

ALTER TABLE public.articles
  REPLICA IDENTITY FULL;

Pros:

  - The before image contains every column, unchanged TOASTed values included, so Debezium can surface real values instead of the placeholder.
  - Deletes carry the full prior row, which helps auditing and downstream reconciliation.

Cons:

  - Noticeably more WAL volume and write amplification, especially for wide rows with large TOASTed columns.
  - More replication traffic, and potentially more replication lag, on update-heavy tables.

Option B: REPLICA IDENTITY USING INDEX

If you only care about some columns (e.g., title and body but not other big JSON fields), create a dedicated index and use it as replica identity:

-- 0) A replica identity index must be unique, non-partial, non-deferrable,
--    and cover only NOT NULL columns, so tighten the columns first
ALTER TABLE public.articles
  ALTER COLUMN title SET NOT NULL,
  ALTER COLUMN body  SET NOT NULL;

-- 1) Create an index on the columns you need in the "before" image
CREATE UNIQUE INDEX articles_replica_identity_idx
  ON public.articles (id, title, body);

-- 2) Use that index for replica identity
ALTER TABLE public.articles
  REPLICA IDENTITY USING INDEX articles_replica_identity_idx;
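
To confirm which index is now serving as the replica identity:

SELECT indexrelid::regclass AS replica_identity_index
FROM pg_index
WHERE indrelid = 'public.articles'::regclass
  AND indisreplident;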

Now, for updates and deletes:

  - PostgreSQL logs the old values of the indexed columns (id, title, body), and
  - all other columns stay out of the before image, so WAL overhead stays well below FULL.

This is a good compromise: you pay the logging cost only for the columns you actually need. One caveat: B-tree index entries are capped at roughly 2.7 kB (about a third of a page), so this approach only works while the indexed values stay below that limit; for genuinely huge TOASTed columns, fall back to FULL or to consumer-side handling.

Important: changing REPLICA IDENTITY has production DB impact. Test on a staging cluster and monitor WAL size / replication lag.
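
A simple way to watch how far the Debezium slot is falling behind after the change (one of several options; works on PostgreSQL 10+):

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots;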


Solution 2: Handle __debezium_unavailable_value in Consumers

Sometimes you don’t actually need Debezium to re-deliver the unchanged TOASTed values; you just need to make sure the placeholder never overwrites good data.

In those cases, you can handle this in your CDC consumer.

Example: Python Kafka Consumer (Confluent Kafka)

Let’s say your sink DB already has the correct previous values for title and body, and you:

  - treat the placeholder as “this column did not change”,
  - read the current row from the sink, and
  - merge the incoming event into it instead of blindly overwriting.

Simplified Python consumer:

from confluent_kafka import Consumer
import json

# Debezium's default placeholder for unchanged TOASTed columns
UNAVAILABLE = "__debezium_unavailable_value"

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "articles-sync",
    "auto.offset.reset": "earliest",
})

consumer.subscribe(["cdc.public.articles"])

def merge_with_existing(existing_row, after_payload):
    """
    Merge the CDC 'after' payload into the existing row,
    keeping the sink's value wherever Debezium sent the placeholder.
    """
    merged = dict(existing_row)
    for col, val in after_payload.items():
        if val == UNAVAILABLE:
            # Unchanged TOASTed column: keep the existing value
            continue
        merged[col] = val
    return merged

# upsert_into_sink / read_from_sink / delete_from_sink stand in for
# your sink-specific persistence helpers
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print("Consumer error:", msg.error())
        continue
    if msg.value() is None:
        # Tombstone record emitted after a delete -- nothing to parse
        continue

    # Assumes plain JSON values without the schema envelope
    # (value.converter.schemas.enable=false)
    event = json.loads(msg.value())
    op = event.get("op")
    after = event.get("after")
    before = event.get("before")
    # 'after' is null for deletes, so fall back to 'before' for the key
    key = (after or before or {}).get("id")

    if op in ("c", "r"):  # insert, or snapshot read
        upsert_into_sink(after)
    elif op == "u":       # update
        existing = read_from_sink(key)
        merged = merge_with_existing(existing, after)
        upsert_into_sink(merged)
    elif op == "d":       # delete
        delete_from_sink(key)
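
To make the merge concrete, here is what merge_with_existing does with an event shaped like the earlier example (values illustrative):

existing = {"id": 123, "title": "Hello", "body": "Long text...",
            "updated_at": "2024-12-31T15:00:00Z"}
after = {"id": 123, "title": UNAVAILABLE, "body": UNAVAILABLE,
         "updated_at": "2025-01-01T10:00:00Z"}

merge_with_existing(existing, after)
# -> {"id": 123, "title": "Hello", "body": "Long text...",
#     "updated_at": "2025-01-01T10:00:00Z"}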

Key idea: treat __debezium_unavailable_value as “value unchanged”, never as data. Merge events into existing state instead of blindly overwriting it.

You can do similar logic in Java, Go, or wherever your consumer runs.

This approach is safe only if you know your sink always has the last good value. If you have consumers that start from an empty state or may miss events, you’ll need a reliable backfill/snapshot mechanism as well.
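
Debezium’s incremental snapshots are one such mechanism: you can re-snapshot a table on demand by inserting a row into a signaling table. A sketch, assuming a signal table named debezium_signal has been configured per the Debezium documentation:

INSERT INTO debezium_signal (id, type, data)
VALUES (
  'resnapshot-articles-1',
  'execute-snapshot',
  '{"data-collections": ["public.articles"]}'
);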

When to Use Which Approach?

Use REPLICA IDENTITY tweaks when:

  - downstream consumers genuinely need previous values (or full row images) for specific tables,
  - you control the source database and can absorb the extra WAL and replication overhead, and
  - you’d rather not push merge logic into every consumer.

Use consumer‑side handling when:

  - you can’t (or don’t want to) change the source database,
  - your sinks already hold the last known-good values, or
  - the write amplification of REPLICA IDENTITY FULL is unacceptable.

In many real systems, you end up using both: FULL or index-based identity on the handful of tables where history really matters, plus placeholder-aware merge logic everywhere as a safety net.

Takeaways

If you see __debezium_unavailable_value in your Debezium CDC stream, it’s not a bug; it’s PostgreSQL and Debezium being honest about what they don’t know.

To fix it:

  1. Understand TOAST and REPLICA IDENTITY.
  2. For tables where previous values matter, change REPLICA IDENTITY (FULL or USING INDEX) so Debezium can see what you need.
  3. For other tables, make your consumers ignore placeholders instead of overwriting valid data.

Do that, and your CDC pipelines become a lot more trustworthy and your downstream systems won’t be haunted by __debezium_unavailable_value ever again.