Drawing on recent production experience using SeaTunnel CDC (Change Data Capture) to synchronize data from Oracle, MySQL, and SQL Server, and on feedback from many users, I wrote this article to explain how SeaTunnel implements CDC. It covers the three stages of CDC: Snapshot, Backfill, and Incremental.

The Three Stages of CDC

The overall CDC data reading process can be broken down into three major stages:

  1. Snapshot (Full Load)
  2. Backfill
  3. Incremental

1. Snapshot Stage

The Snapshot stage is intuitive: it takes a snapshot of the current table data by running a full table scan over JDBC.

Taking MySQL as an example, the current binlog position is recorded during the snapshot:

SHOW MASTER STATUS;

File           Position    Binlog_Do_DB  Binlog_Ignore_DB  Executed_Gtid_Set
binlog.000011  1001373553

SeaTunnel records the File and Position as the low watermark.
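As a rough illustration of what "recording the File and Position" can look like, here is a minimal Python sketch. The names BinlogOffset and offset_from_status_row are hypothetical and are not SeaTunnel classes. It assumes binlog file names share a zero-padded numeric suffix, so that comparing them as strings matches their chronological order.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class BinlogOffset:
    """A binlog position (hypothetical helper, not a SeaTunnel class).

    order=True compares (file, position) in that order, which is only
    correct while file names have the same zero-padded width.
    """
    file: str      # e.g. "binlog.000011"
    position: int  # byte offset inside that file


def offset_from_status_row(row: dict) -> BinlogOffset:
    """Build an offset from one row of SHOW MASTER STATUS output."""
    return BinlogOffset(file=row["File"], position=int(row["Position"]))


# The row shown above becomes the low watermark.
low_watermark = offset_from_status_row(
    {"File": "binlog.000011", "Position": "1001373553"}
)

# A position in a later file sorts after the low watermark.
later = BinlogOffset("binlog.000012", 4)
is_later = low_watermark < later
```

Making the offset comparable is the point of the sketch: the later stages all reason about whether one position comes before or after another.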

Note: This statement is executed more than once, because SeaTunnel implements its own split logic to accelerate snapshots.

MySQL Snapshot Splitting Mechanism (Split)

Assuming the global parallelism is 10:

Table-level sequential processing (schematic):

// Processing sequence:
// 1. Table1 -> Generate [Table1-Split0, Table1-Split1, Table1-Split2]
// 2. Table2 -> Generate [Table2-Split0, Table2-Split1]
// 3. Table3 -> Generate [Table3-Split0, Table3-Split1, Table3-Split2, Table3-Split3]

Split-level parallel allocation:

// Allocation to different subtasks:
// Subtask 0: [Table1-Split0, Table2-Split1, Table3-Split2]
// Subtask 1: [Table1-Split1, Table3-Split0, Table3-Split3]
// Subtask 2: [Table1-Split2, Table2-Split0, Table3-Split1]

Each Split is actually a query with a range condition, for example:

SELECT * FROM user_orders WHERE order_id >= 1 AND order_id < 10001;
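To make the idea of "one Split = one range query" concrete, here is a minimal Python sketch that cuts an integer primary-key range into evenly sized chunks and renders each chunk as a query. It assumes a dense integer key with a known minimum and maximum. The function names are hypothetical, and the splitting strategy SeaTunnel actually uses may differ (for example for non-numeric or unevenly distributed keys).

```python
from typing import List, Tuple


def split_ranges(min_id: int, max_id: int, split_size: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) ranges that cover min_id..max_id inclusive."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    ranges = []
    start = min_id
    while start <= max_id:
        # The last range is clipped so it ends just past max_id.
        end = min(start + split_size, max_id + 1)
        ranges.append((start, end))
        start = end
    return ranges


def split_query(table: str, key: str, start: int, end: int) -> str:
    """Render one Split as a range query (identifiers are trusted here)."""
    return f"SELECT * FROM {table} WHERE {key} >= {start} AND {key} < {end};"


# 25,000 orders with a split size of 10,000 produce three Splits.
ranges = split_ranges(1, 25000, 10000)
queries = [split_query("user_orders", "order_id", s, e) for s, e in ranges]
# queries[0] is the example query shown above.
```

The half-open ranges guarantee that every key belongs to exactly one Split, which is what lets each Split be read and backfilled independently.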

Crucial: Each Split separately records its own low watermark/high watermark.

Practical Advice: Do not make the split_size too small. More Splits are not necessarily faster, and the scheduling and memory overhead can become very large.

2. Backfill Stage

Why is Backfill needed? Imagine you are performing a full snapshot of a table that is being frequently written to. When you read the 100th row, the data in the 1st row may have already been modified. If you only read the snapshot, the data you hold when you finish reading is actually "inconsistent" (part is old, part is new).

The role of Backfill is to compensate for the "data changes that occurred during the snapshot" so that the data is eventually consistent.

The behavior of this stage mainly depends on the configuration of the exactly_once parameter.

2.1 Simple Mode (exactly_once = false)

This is the default mode; the logic is relatively simple and direct, and it does not require memory caching:

2.2 Exactly-Once Mode (exactly_once = true)

This is the most impressive part of SeaTunnel CDC, and it is the secret to guaranteeing that data is "never lost, never repeated." It introduces an in-memory buffer for deduplication.

Simple Explanation: Imagine the teacher asks you to count how many people are in the class right now (Snapshot stage). However, the students in the class are very mischievous; while you are counting, people are running in and out (data changes). If you just count with your head down, the result will definitely be inaccurate when you finish.

SeaTunnel does it like this:

  1. Take a Photo First (Snapshot): Count the number of people in the class first and record it in a small notebook (memory buffer); don't tell the principal (downstream) yet.
  2. Watch the Surveillance (Backfill): Retrieve the surveillance video (Binlog log) for the period you were counting.
  3. Correct the Records (Merge):
  4. Submit Homework (Send): After correction, the small notebook in your hand is a perfectly accurate list; now hand it to the principal.

Summary for Beginners: exactly_once = true means "hold it in and don't send it until it's clearly verified."
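The notebook analogy can be sketched in a few lines of Python. This is an illustrative model only, not SeaTunnel's implementation: the function name, the event tuple shape, and the operation names INSERT/UPDATE/DELETE are assumptions made for the example. Snapshot rows are held in a buffer keyed by primary key, the changes read between the Split's low and high watermark are applied on top, and only then is the result sent downstream.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]
# (operation, primary key, row after the change; None for deletes)
ChangeEvent = Tuple[str, int, Optional[Row]]


def merge_backfill(
    snapshot_rows: Iterable[Row],
    changes: Iterable[ChangeEvent],
    key: str = "order_id",
) -> List[Row]:
    """Apply changes captured during the snapshot on top of the snapshot rows."""
    # Step 1: write the snapshot into the "notebook" instead of emitting it.
    buffer: Dict[object, Row] = {row[key]: row for row in snapshot_rows}

    # Steps 2 and 3: replay the log for this Split and correct the records.
    for op, pk, after in changes:
        if op in ("INSERT", "UPDATE"):
            buffer[pk] = after          # upsert keeps exactly one row per key
        elif op == "DELETE":
            buffer.pop(pk, None)        # the row must not reach downstream
        else:
            # A log replay should only contain changes, never snapshot reads.
            raise ValueError(f"unexpected operation in backfill: {op}")

    # Step 4: the buffer is now consistent as of the high watermark.
    return list(buffer.values())


snapshot = [
    {"order_id": 1, "status": "new"},
    {"order_id": 2, "status": "new"},
]
changes = [
    ("UPDATE", 1, {"order_id": 1, "status": "paid"}),
    ("DELETE", 2, None),
    ("INSERT", 3, {"order_id": 3, "status": "new"}),
]
merged = merge_backfill(snapshot, changes)
```

Because the buffer holds one Split at a time, its size is bounded by the split size rather than by the table size, which is one reason the split size matters for memory.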

2.3 Two Key Questions and Answers

Q1: Why is case READ: throw Exception written in the code? Why aren't there READ events during the Backfill stage?

Q2: If it's placed in memory, can the memory hold it? Will it OOM?

2.4 Key Detail: Watermark Alignment Between Multiple Splits

This is an easily overlooked but extremely important issue. If it is handled poorly, data will be either lost or duplicated.

Plain Language Explanation: The Fast/Slow Runner Problem: Imagine two students (Split A and Split B) are copying homework (Backfill data).

Now, the teacher (Incremental task) needs to continue teaching a new lesson (reading Binlog) from where they finished copying. Where should the teacher start?

SeaTunnel's Solution: Start from the earliest and cover your ears for what you've already heard: SeaTunnel adopts a "Minimum Watermark Starting Point + Dynamic Filtering" strategy:

  1. Determine the Start (care for the slow one): The teacher decides to start from page 100 (the minimum watermark among all splits).
  2. Dynamic Filtering (don't listen to what's been heard): While the teacher is lecturing (reading Binlog), they hold a list: { A: 100, B: 200 }.
  3. Full Speed Mode (everyone has finished hearing): When the teacher reaches page 201 and finds everyone has already heard it, they no longer need the list.

Summary: With exactly_once, the incremental stage strictly filters events according to the combination of "starting offset + split range + high watermark."

Without exactly_once, the incremental stage becomes a simple "sequential consumption from a certain starting offset."
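The "minimum watermark + dynamic filtering" strategy can be modeled with a short Python sketch. It is a simplification under stated assumptions: log positions are plain integers (the "page numbers" from the analogy), each completed Split owns a half-open primary-key range, and the names CompletedSplit, start_position, and should_emit are hypothetical rather than SeaTunnel APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CompletedSplit:
    name: str
    key_start: int       # inclusive lower bound of the Split's key range
    key_end: int         # exclusive upper bound of the Split's key range
    high_watermark: int  # log position up to which this Split is already consistent


def start_position(splits: List[CompletedSplit]) -> int:
    """The incremental reader starts from the smallest high watermark."""
    return min(s.high_watermark for s in splits)


def should_emit(splits: List[CompletedSplit], key: int, position: int) -> bool:
    """Drop events a Split has already absorbed during its own backfill."""
    for s in splits:
        if s.key_start <= key < s.key_end:
            return position > s.high_watermark
    # Assumption: a key outside every range was never snapshotted, so emit it.
    return True


splits = [
    CompletedSplit("A", 1, 10001, high_watermark=100),
    CompletedSplit("B", 10001, 20001, high_watermark=200),
]

start = start_position(splits)                  # 100, the slower Split decides
emit_a = should_emit(splits, 5, 150)            # True: A has not seen page 150
emit_b_old = should_emit(splits, 10005, 150)    # False: B already covered it
emit_b_new = should_emit(splits, 10005, 201)    # True: past B's high watermark
```

Once the reader's position passes the largest high watermark (200 here), every event passes the check, so the list of Splits can be discarded and the reader can run without filtering.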

3. Incremental Stage

After the Backfill stage (for exactly_once = true) or the Snapshot stage ends, the job enters the pure incremental stage:

SeaTunnel's behavior in the incremental stage is very close to native Debezium:

4. Summary

The core design philosophy of SeaTunnel CDC is to find the perfect balance between "Fast" (parallel snapshots) and "Stable" (data consistency).

Let's review the key points of the entire process:

Understanding the sequence "Snapshot -> Backfill -> Incremental" and the coordinating role that watermarks play within it is the key to mastering SeaTunnel CDC.