Intro

Hi everyone! My name is Max, and I lead the Data Platform and ML teams at Tabby.

My teams and I would like to share the experience we’ve accumulated over the past few years, and this article continues our series on building a modern Data Platform from scratch.

I’m confident that our experience combining Google Cloud Platform services, open-source industry-standard solutions, and our own technological approaches will find an audience and be useful.

Synchronisation from PostgreSQL to a DWH based on Google BigQuery

Why synchronise or deliver data from operational databases to an analytical DWH in the first place?


As I mentioned in the first article, operational databases — those backing product business logic — are often not designed to efficiently handle large-scale analytical tasks. However, they remain a critical source of valuable data for analytics, reporting, insights, and machine learning.


Therefore, when building a Data Platform or DWH as part of it, having an effective and secure mechanism for synchronising data from operational databases is essential.

Design & Architecture

As with most software projects, we started by exploring existing solutions on the market. Our main candidates were:

I won’t dive deeply into the technical details of these solid tools but will explain why we decided to build our own solution.

Strategically, we believed it was more beneficial to develop a custom service that captures data from PostgreSQL and delivers it, at minimum, to our DWH based on Cloud Storage and BigQuery. More importantly, we wanted no restrictions on target systems where we could deliver data in the future.

Another key reason was that Tabby is primarily a fintech company, subject to diverse regional regulations regarding data processing and storage. For some regions, third-party solutions might be acceptable, while for others, local regulations demand that data land and stay within specific jurisdictions. Choosing a third-party tool could unpredictably limit us or become a ticking time bomb.

Ultimately, we concluded that this technology is foundational for the future of our DWH and other Data Platform solutions. We decided to invest time and effort in building it ourselves to retain full control over synchronising critical data sources into our Data Platform.

How it works

Here, I’ll walk you through the challenges, compromises, and iterations we faced — starting from the broad picture and drilling down into details.

1st iteration

In our initial system, we chose a batch synchronisation approach: extracting data from PostgreSQL and loading it into the primary DWH layer in BigQuery using the MERGE statement.
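To make the idea concrete, here is a minimal sketch of such a batch upsert using the google-cloud-bigquery Python client. The dataset, table, and column names (analytics.stg_orders, analytics.dwh_orders, the id key) are illustrative assumptions rather than our actual schema, and the sketch assumes the extract step has already landed an increment in the staging table.

```python
# Minimal sketch of the batch upsert step, assuming an extract job has
# already landed an increment in a staging table. Dataset, table, and
# column names are illustrative, not our actual schema.
from google.cloud import bigquery

client = bigquery.Client()

MERGE_SQL = """
MERGE `analytics.dwh_orders` AS target
USING `analytics.stg_orders` AS source
ON target.id = source.id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
  UPDATE SET status = source.status,
             amount = source.amount,
             updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, status, amount, updated_at)
  VALUES (source.id, source.status, source.amount, source.updated_at)
"""

def merge_increment() -> None:
    """Apply the staged increment to the target DWH table."""
    job = client.query(MERGE_SQL)
    job.result()  # wait for the MERGE to finish
    print(f"MERGE affected {job.num_dml_affected_rows} rows")
```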

If you want to learn about the Tabby DWH architecture at the time, check out Part 1 of this series.

This was a stable, straightforward solution for us, as we had extensive experience with Airflow and BigQuery. It also allowed us to quickly pivot away from the external Fivetran service.

However, the batch method had some limitations, though none were critical:

2nd iteration

After launching the first iteration and discontinuing Fivetran, we began designing and developing streaming synchronisation from PostgreSQL to BigQuery.

At first glance, the batch method solved most problems with minor drawbacks. But we knew medium-term needs would push us towards streaming:

CDC (Change Data Capture) streaming based on Debezium was the obvious choice.


We quickly launched CDC synchronisation using Debezium and Google Pub/Sub. But we soon learned hard lessons during the testing phase…
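For a sense of what the stream looks like on the consumer side, here is a hypothetical sketch of a Pub/Sub subscriber unpacking Debezium-style change events (the op / before / after envelope). The project and subscription names, and the assumption that events arrive as plain JSON without the schema envelope, are illustrative.

```python
# Hypothetical Pub/Sub subscriber unpacking Debezium-style change events.
# Assumes events are serialised as plain JSON without the schema envelope;
# project and subscription names are placeholders.
import json
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "orders-cdc-dwh")

def handle(message: pubsub_v1.subscriber.message.Message) -> None:
    event = json.loads(message.data)
    op = event.get("op")          # "c" insert, "u" update, "d" delete, "r" snapshot read
    before = event.get("before")  # row state before the change (None for inserts)
    after = event.get("after")    # row state after the change (None for deletes)
    # Here the change would be appended to the staging layer in the DWH.
    print(op, before, after)
    message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=handle)
streaming_pull.result()  # block the main thread and keep consuming
```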

3rd iteration

During CDC testing with Debezium, we faced several issues:

We temporarily paused CDC use, reverted to batch sync, and continued R&D to safely re-enable CDC. Our final solution involved:

Replacing self-managed Debezium instances with the Flink Debezium Connector, which gives us clear visibility into pipeline performance and makes scaling and fault tolerance straightforward to manage.
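As an illustration of what such a connector-based pipeline can look like, below is a minimal PyFlink Table API sketch declaring a postgres-cdc source (the Flink CDC connector embeds Debezium under the hood). The host, credentials, replication slot, and table names are placeholders, the connector jar is assumed to be on the classpath, and a print sink stands in for the actual Pub/Sub delivery stage.

```python
# Minimal PyFlink sketch of a Debezium-based CDC source via the
# postgres-cdc connector. Connection details and table names are
# placeholders; the print sink stands in for the Pub/Sub delivery stage.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        id BIGINT,
        status STRING,
        amount DECIMAL(18, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'postgres-cdc',
        'hostname' = 'pg-master.internal',
        'port' = '5432',
        'username' = 'cdc_reader',
        'password' = '***',
        'database-name' = 'shop',
        'schema-name' = 'public',
        'table-name' = 'orders',
        'slot.name' = 'flink_orders_slot',
        'decoding.plugin.name' = 'pgoutput'
    )
""")

t_env.execute_sql("""
    CREATE TABLE changes_sink (
        id BIGINT,
        status STRING,
        amount DECIMAL(18, 2),
        updated_at TIMESTAMP(3)
    ) WITH ('connector' = 'print')
""")

t_env.execute_sql("INSERT INTO changes_sink SELECT * FROM orders_cdc").wait()
```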

The integration architecture became:

This setup allowed us to reliably capture data from PostgreSQL masters and deliver it to Pub/Sub topics, where each subscriber receives its own message copy — enabling delivery to the DWH or other Data Platform solutions.
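This fan-out is standard Pub/Sub behaviour: every subscription attached to a topic receives its own copy of each published message. A hypothetical sketch of attaching independent subscriptions for two downstream consumers to a single CDC topic:

```python
# Hypothetical sketch: one CDC topic, one subscription per downstream
# consumer, so each consumer receives its own copy of every change event.
# Project, topic, and subscription names are illustrative.
from google.cloud import pubsub_v1

PROJECT = "my-project"
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic = publisher.topic_path(PROJECT, "orders-cdc")

for consumer in ("dwh-loader", "analytics-sandbox"):
    name = subscriber.subscription_path(PROJECT, f"orders-cdc-{consumer}")
    subscriber.create_subscription(request={"name": name, "topic": topic})
```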

We created a dedicated Bronze staging layer in the DWH to:

We didn’t abandon batch synchronisation but significantly modified it:

Quick results

Thank you!