Intro

Hi everyone! My name is Max, and I lead the Data Platform and ML teams at Tabby.

My teams and I would like to share the experience we’ve accumulated over the past few years, and this article continues our series on building a modern Data Platform from scratch.

I’m confident that our experience combining Google Cloud Platform services, open-source industry-standard solutions, and our own technological approaches will find an audience and be useful.

Synchronisation from PostgreSQL to a DWH based on Google BigQuery

Why synchronise or deliver data from operational databases to an analytical DWH in the first place?


As I mentioned in the first article, operational databases — those backing product business logic — are often not designed to efficiently handle large-scale analytical tasks. However, they remain a critical source of valuable data for analytics, reporting, insights, and machine learning.


Therefore, when building a Data Platform or DWH as part of it, having an effective and secure mechanism for synchronising data from operational databases is essential.

Design & Architecture

As with most software projects, we started by exploring existing solutions on the market. Our main candidates were:

I won’t dive deeply into the technical details of these solid tools but will explain why we decided to build our own solution.

Strategically, we believed it was more beneficial to develop a custom service that captures data from PostgreSQL and delivers it, at minimum, to our DWH based on Cloud Storage and BigQuery. More importantly, we wanted no restrictions on target systems where we could deliver data in the future.

Another key reason was that Tabby is primarily a fintech company, subject to diverse regional regulations regarding data processing and storage. For some regions, third-party solutions might be acceptable, while for others, local regulations demand that data land and stay within specific jurisdictions. Choosing a third-party tool could unpredictably limit us or become a ticking time bomb.

Ultimately, we concluded that this technology is foundational for the future of our DWH and other Data Platform solutions. We decided to invest time and effort in building it ourselves to retain full control over synchronising critical data sources into our Data Platform.

How it works

Here, I’ll walk you through the challenges, compromises, and iterations we faced — starting from the broad picture and drilling down into details.

1st iteration

In our initial system, we chose a batch synchronisation approach: extracting data from PostgreSQL and loading it into the primary DWH layer in BigQuery using the MERGE statement.
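To make the idea concrete, here is a minimal sketch of such a batch upsert using the google-cloud-bigquery Python client. The dataset, table, and column names (analytics.stg_orders, analytics.dwh_orders, the id key) are illustrative assumptions rather than our actual schema, and the sketch assumes the extract step has already landed an increment in the staging table.

```python
# Minimal sketch of the batch upsert step, assuming an extract job has
# already landed an increment in a staging table. Dataset, table, and
# column names are illustrative, not our actual schema.
from google.cloud import bigquery

client = bigquery.Client()

MERGE_SQL = """
MERGE `analytics.dwh_orders` AS target
USING `analytics.stg_orders` AS source
ON target.id = source.id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
  UPDATE SET status = source.status,
             amount = source.amount,
             updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, status, amount, updated_at)
  VALUES (source.id, source.status, source.amount, source.updated_at)
"""

def merge_increment() -> None:
    """Apply the staged increment to the target DWH table."""
    job = client.query(MERGE_SQL)
    job.result()  # wait for the MERGE to finish
    print(f"MERGE affected {job.num_dml_affected_rows} rows")
```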

If you want to learn about the Tabby DWH architecture at the time, check out Part 1 of this series.

This was a stable, straightforward solution for us, as we had extensive experience with Airflow and BigQuery. It also allowed us to quickly pivot away from the external Fivetran service.

However, the batch method had some limitations, though none were critical:

2nd iteration

After launching the first iteration and discontinuing Fivetran, we began designing and developing streaming synchronisation from PostgreSQL to BigQuery.

At first glance, the batch method solved most problems with minor drawbacks. But we knew medium-term needs would push us towards streaming:

CDC (Change Data Capture) streaming based on Debezium was the obvious choice.


We quickly launched CDC synchronisation using Debezium and Google Pub/Sub. But we soon learned hard lessons during the testing phase…
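For a sense of what the stream looks like on the consumer side, here is a hypothetical sketch of a Pub/Sub subscriber unpacking Debezium-style change events (the op / before / after envelope). The project and subscription names, and the assumption that events arrive as plain JSON without the schema envelope, are illustrative.

```python
# Hypothetical Pub/Sub subscriber unpacking Debezium-style change events.
# Assumes events are serialised as plain JSON without the schema envelope;
# project and subscription names are placeholders.
import json
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "orders-cdc-dwh")

def handle(message: pubsub_v1.subscriber.message.Message) -> None:
    event = json.loads(message.data)
    op = event.get("op")          # "c" insert, "u" update, "d" delete, "r" snapshot read
    before = event.get("before")  # row state before the change (None for inserts)
    after = event.get("after")    # row state after the change (None for deletes)
    # Here the change would be appended to the staging layer in the DWH.
    print(op, before, after)
    message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=handle)
streaming_pull.result()  # block the main thread and keep consuming
```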

3rd iteration

During CDC testing with Debezium, we faced several issues:

We temporarily paused CDC use, reverted to batch sync, and continued R&D to safely re-enable CDC. Our final solution involved:

Replacing self-managed Debezium instances with the Flink Debezium Connector, which gives us clear visibility into pipeline performance and makes scaling and fault tolerance straightforward to manage.
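As an illustration of what such a connector-based pipeline can look like, below is a minimal PyFlink Table API sketch declaring a postgres-cdc source (the Flink CDC connector embeds Debezium under the hood). The host, credentials, replication slot, and table names are placeholders, the connector jar is assumed to be on the classpath, and a print sink stands in for the actual Pub/Sub delivery stage.

```python
# Minimal PyFlink sketch of a Debezium-based CDC source via the
# postgres-cdc connector. Connection details and table names are
# placeholders; the print sink stands in for the Pub/Sub delivery stage.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        id BIGINT,
        status STRING,
        amount DECIMAL(18, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'postgres-cdc',
        'hostname' = 'pg-master.internal',
        'port' = '5432',
        'username' = 'cdc_reader',
        'password' = '***',
        'database-name' = 'shop',
        'schema-name' = 'public',
        'table-name' = 'orders',
        'slot.name' = 'flink_orders_slot',
        'decoding.plugin.name' = 'pgoutput'
    )
""")

t_env.execute_sql("""
    CREATE TABLE changes_sink (
        id BIGINT,
        status STRING,
        amount DECIMAL(18, 2),
        updated_at TIMESTAMP(3)
    ) WITH ('connector' = 'print')
""")

t_env.execute_sql("INSERT INTO changes_sink SELECT * FROM orders_cdc").wait()
```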

The integration architecture became:

This setup allowed us to reliably capture data from PostgreSQL masters and deliver it to Pub/Sub topics, where each subscriber receives its own message copy — enabling delivery to the DWH or other Data Platform solutions.
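This fan-out is standard Pub/Sub behaviour: every subscription attached to a topic receives its own copy of each published message. A hypothetical sketch of attaching independent subscriptions for two downstream consumers to a single CDC topic:

```python
# Hypothetical sketch: one CDC topic, one subscription per downstream
# consumer, so each consumer receives its own copy of every change event.
# Project, topic, and subscription names are illustrative.
from google.cloud import pubsub_v1

PROJECT = "my-project"
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic = publisher.topic_path(PROJECT, "orders-cdc")

for consumer in ("dwh-loader", "analytics-sandbox"):
    name = subscriber.subscription_path(PROJECT, f"orders-cdc-{consumer}")
    subscriber.create_subscription(request={"name": name, "topic": topic})
```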

We created a dedicated Bronze staging layer in the DWH to:

We didn’t abandon batch synchronisation but significantly modified it:

Quick results

Thank you!