As the leading fulfilment platform for digital commerce in India, Delhivery fulfils a million packages a day, 365 days a year. Its 24 automated sort centres, 101 hubs, 3,100+ direct delivery centres, 1000+ partner centres, 11,000+ fleet, and 60,000+ team members run smoothly thanks to a vast network of IoT devices. Thousands of data events and messages are coming in and going out of our pipelines each second. This amounts to a massive daily data volume in terabytes, which makes operational visibility crucial for us and our stakeholders.

Recognising the requirements, we decided to build data marts—centralised, eventually consistent databases that offer users quick access to pre-aggregated business data. This allows our stakeholders to quickly access business insights without searching through an entire data warehouse.

However, with this daunting scale, one of the major challenges was to maintain data integrity and low latency while providing the capacity for analytical workloads.

In this blog, I am going to unpack all of my leanings while migrating our data marts from Amazon Aurora to TiDB, a Hybrid Transactional/Analytical Processing (HTAP), distributed SQL database. Hopefully, this post can provide insights to data engineering leaders, database administrators, or data architects who are considering a similar migration to TiDB or any other HTAP database.

OLTP, OLAP, and HTAP

To better understand the real-time data marts case at Delhivery, let’s first get familiar with three concepts that are at the core of our use case: OLTP, OLAP & HTAP:

Real-Time Data Marts Use Case at Delhivery

Real-time data marts differ from traditional data marts in that they ingest data in real time, not at specific intervals. These data marts are critical for ground operational decision-making at Delhivery because we can't afford any delay in synchronising these events.

Our real-time data mart journey began in 2020 when we identified a need for centralised dashboards—specifically the EYE dashboard. This dashboard's purpose was to provide real-time operational visibility to ground operations, enabling decision-making based on up-to-the-minute data. Examples of usages include:

Initial implementation and the challenges

We thought of solving our use cases using data warehouse tools like Redshift, and Snowflake but none of these solutions worked for us considering the design pattern and requirement for real-time data ingestion along with merge.

Thus, we initially chose Aurora (PostgreSQL) to serve our data mart use case.

The data ingestion process around Aurora

We architected our real-time data marts using Spark Streaming and Aurora. Our steaming pipeline was very simple--reading data from Kafka, processing data in Spark micro batches, and performing upsert operations in Aurora.

Our database was modelled using a multi-layered architecture, which consists of a raw layer, a partitioned layer, and a data marts layer. Users did not have access to view or modify data in the raw layer. The partitioned layer is kept to maintain all partitioned tables (dimension tables generally). Below is a simple schema design of our database:

Challenges we faced with Aurora

The system initially performed well, until it had to handle a throughput beyond 3K messages per second. This marked the onset of several challenges:

Searching for alternatives

To resolve Aurora’s limitations, we set out to find a better alternative that met the following requirements:

Considering all the above requirements, we initially explored many PostgreSQL alternatives including Spanner and Yugabyte because we wanted to keep our change management minimal.

Spanner

Spanner is a distributed SQL database management and storage service offered by Google. It is fully managed on the Google Cloud Platform (GCP). However, we found that Spanner might not be a good use case for our architecture due to the following reasons:

Yugabyte

YugabyteDB is a high-performance transactional distributed SQL database for cloud-native applications, developed by Yugabyte. This database is very close to our use case because it was fully PostgreSQL compliant, horizontally scalable, and fully distributed. Unfortunately, it didn’t work as well because of its limitation with scalability, Our success criteria demanded 7k+ transactions per second but Yugabyte was only able to scale up to 5k.

We also looked into other possible candidates like BigQuery, but none of them served our requirements well.

Landing with TiDB

After the above PostgreSQL alternatives, we decided to add HTAP to our requirements, which led us to TiDB. It supports out-of-the-box scalability, consistency, availability, multi-site deployment topology, and many more features. As a distributed database, TiDB has multiple components that communicate with each other and form a complete TiDB system.

The following features of TiDB addressed our key challenges and met our operational requirements:

Our TiDB Infrastructure

Our TiDB infrastructure is deployed over the VMs of leading cloud service providers. We use TiUP, TiDB’s package manager, to manage the cluster and all the administrative operations. Our cluster is deployed over 3 available zones (AZs).

Our cluster configurations are as follows:

By deploying our TiDB cluster across multiple AZs and carefully selecting node types to meet our processing and memory needs, we've created a robust, highly available infrastructure capable of handling our high data throughput requirements.

Tuning TiDB for Our Case

To make it work for our use case, we worked closely with the PingCAP team to tune the database. Here are some of the critical adjustments we made:

Index Optimisation

Set the following parameters before starting the index.

SET @@global.tidb_ddl_reorg_worker_cnt = 16;
SET @@global.tidb_ddl_reorg_batch_size = 4096;

Reset to default values after index creation.

SET @@global.tidb_ddl_reorg_worker_cnt = 4;
SET @@global.tidb_ddl_reorg_batch_size = 256;

Partition Pruning

This is mainly important for partitioned tables. It analyses the filter conditions in query statements and eliminates (prunes) partitions when they do not contain any required data.

SET @@session.tidb_partition_prune_mode = 'dynamic';

Tuning analyse

Sometimes the auto analyser in TiDB fails if a high volume of data is ingested. In that case, all the queries might use the wrong execution plan and end up scanning the full table. To avoid such a situation we made the following changes in TiDB configurations:

set global tidb_max_auto_analyze_time = 86400;
set global tidb_enable_pseudo_for_outdated_stats = off;
set global tidb_sysproc_scan_concurrency = 15;

If you are working with partitioned tables, we suggest you run analyze table operations manually for one partition at a time to avoid analyzing failures.

Through adjustments like these, we were able to effectively streamline our use of TiDB, so that we can achieve an optimal performance for our real-time data mart.

Our Experience With TiDB

Conclusion

In this post, we have explored the challenges of using Aurora with our use case of real-time data marts and the migration journey to TiDB. We also discussed how Delhivery is using TiDB at scale.

Despite our success with TiDB, we acknowledge that no solution is perfect, and effectiveness can vary depending on the use case. In TiDB, we noted a couple of areas for improvement, including the lack of out-of-the-box support for materialized views and native quota management. However, with appropriate workarounds and adjustments, we have managed to address these limitations effectively.

So far, we have deployed TiDB in our production environment. Based on our benchmarks, TiDB enables us to handle over thousands of requests per second with less than 100ms latency. Moving forward, we will continue to explore more use cases that require a robust, consistently distributed database.

References

https://docs.pingcap.com/tidb/stable/tidb-lightning-overview

https://reorg.github.io/pg_repack/

https://github.com/dataegret/pgcompacttable

https://cloud.google.com/spanner

https://www.yugabyte.com/yugabytedb/

https://cloud.google.com/bigquery/

https://docs.pingcap.com/tidb/dev/transaction-overview

https://proxysql.com/

Author:

Hari Kishan (Senior Engineering Manager @ Delhivery)

Akash Deep Verma (Director of Technology @ Delhivery)