sia.hackernoon.com

As the leading fulfilment platform for digital commerce in India, Delhivery fulfils a million packages a day, 365 days a year. Its 24 automated sort centres, 101 hubs, 3,100+ direct delivery centres, 1000+ partner centres, 11,000+ fleet, and 60,000+ team members run smoothly thanks to a vast network of IoT devices. Thousands of data events and messages are coming in and going out of our pipelines each second. This amounts to a massive daily data volume in terabytes, which makes operational visibility crucial for us and our stakeholders.

Recognising the requirements, we decided to build data marts—centralised, eventually consistent databases that offer users quick access to pre-aggregated business data. This allows our stakeholders to quickly access business insights without searching through an entire data warehouse.

However, with this daunting scale, one of the major challenges was to maintain data integrity and low latency while providing the capacity for analytical workloads.

In this blog, I am going to unpack all of my leanings while migrating our data marts from Amazon Aurora to TiDB, a Hybrid Transactional/Analytical Processing (HTAP), distributed SQL database. Hopefully, this post can provide insights to data engineering leaders, database administrators, or data architects who are considering a similar migration to TiDB or any other HTAP database.

OLTP, OLAP, and HTAP

To better understand the real-time data marts case at Delhivery, let’s first get familiar with three concepts that are at the core of our use case: OLTP, OLAP & HTAP:

OLTP: Online Transaction Processing (OLTP) systems are designed for transaction-oriented applications, ensuring data integrity through ACID (Atomicity, Consistency, Isolation, Durability) properties.
OLAP: Online Analytical Processing (OLAP) systems enable high-speed, multidimensional analysis of large data volumes, aiding in data-driven decision-making.
HTAP: Hybrid Transaction/Analytical Processing (HTAP) combines OLTP and OLAP functionalities, permitting real-time analytics on transactional data.

Real-Time Data Marts Use Case at Delhivery

Real-time data marts differ from traditional data marts in that they ingest data in real time, not at specific intervals. These data marts are critical for ground operational decision-making at Delhivery because we can't afford any delay in synchronising these events.

Our real-time data mart journey began in 2020 when we identified a need for centralised dashboards—specifically the EYE dashboard. This dashboard's purpose was to provide real-time operational visibility to ground operations, enabling decision-making based on up-to-the-minute data. Examples of usages include:

Vehicle planning & visibility: Real-time monitoring of incoming and outgoing connection schedules for Delhivery hubs.
Performance tracking: Continuous performance tracking of Delhivery's facilities.
Centralised control visibility: Providing the central team with precise information on ground blockers to take appropriate actions. These could be due to various factors such as a dip in center performance, shipment aging, or congestion in incoming and outgoing connections.
Compliances: Tracking of put and pick compliance metrics

Initial implementation and the challenges

We thought of solving our use cases using data warehouse tools like Redshift, and Snowflake but none of these solutions worked for us considering the design pattern and requirement for real-time data ingestion along with merge.

Thus, we initially chose Aurora (PostgreSQL) to serve our data mart use case.

The data ingestion process around Aurora

We architected our real-time data marts using Spark Streaming and Aurora. Our steaming pipeline was very simple--reading data from Kafka, processing data in Spark micro batches, and performing upsert operations in Aurora.

Our database was modelled using a multi-layered architecture, which consists of a raw layer, a partitioned layer, and a data marts layer. Users did not have access to view or modify data in the raw layer. The partitioned layer is kept to maintain all partitioned tables (dimension tables generally). Below is a simple schema design of our database:

Challenges we faced with Aurora

The system initially performed well, until it had to handle a throughput beyond 3K messages per second. This marked the onset of several challenges:

Scalability limitation: As we exceeded a throughput of 3K messages per second, Aurora's Input/Output Operations Per Second (IOPS) limitations became a bottleneck. The scalability constraint had started impacting our operations.
Bloating issue: Every record update led to the creation of a new record and a dead tuple (previous version of the record). When the production rate of these dead tuples outpaced the cleanup process, bloating occurred. Since VACUUM FULL was not able to claim the storage, the disk usage continuously increased. For roughly 5 TB of data, Aurora was using 30+ TB of storage.
Maintenance burden: The bloating issue is directly tied to our maintenance challenges. With over 70 pipelines and a total write QPS surpassing 5k messages/second, we found PostgreSQL's auto cleanup process, Auto Vacuum, failed to keep pace with the rate of dead tuple generation. Therefore, manually running VACUUM or VACUUM FULL is required to recover the database. Our attempts with PostgreSQL tools like pg_repack and pgcompacttable also proved unsuccessful. Consequently, maintenance became increasingly complex and time-consuming.

Cost: To accommodate the read & write workload, we had to scale to the highest available nodes (24XLarge). This led to an expenditure of approximately $100,000 per month for a three-node Aurora cluster. With this scale, Aurora turned out to be expensive because of IOPS auto-scaling.

Searching for alternatives

To resolve Aurora’s limitations, we set out to find a better alternative that met the following requirements:

Scalable with high write QPS: The database should support at least 10k+ write QPS and is horizontally scalable.
Real-time analytics: The database should be able to provide high-speed or real-time OLAP capabilities
Fully distributed: The database should be distributed across multiple sites to provide high availability and fault tolerance.
Strong consistency: The database should maintain strong consistency, ensuring that all users see the same data.

Considering all the above requirements, we initially explored many PostgreSQL alternatives including Spanner and Yugabyte because we wanted to keep our change management minimal.

Spanner

Spanner is a distributed SQL database management and storage service offered by Google. It is fully managed on the Google Cloud Platform (GCP). However, we found that Spanner might not be a good use case for our architecture due to the following reasons:

Spanner does not support schemas.
We did not find the proper tools to load historical data. We explored Harbourbridge, an open-source tool for Spanner evaluation and migration. However, it had limitations around 100 GB of data loading.

Yugabyte

YugabyteDB is a high-performance transactional distributed SQL database for cloud-native applications, developed by Yugabyte. This database is very close to our use case because it was fully PostgreSQL compliant, horizontally scalable, and fully distributed. Unfortunately, it didn’t work as well because of its limitation with scalability, Our success criteria demanded 7k+ transactions per second but Yugabyte was only able to scale up to 5k.

We also looked into other possible candidates like BigQuery, but none of them served our requirements well.

Landing with TiDB

After the above PostgreSQL alternatives, we decided to add HTAP to our requirements, which led us to TiDB. It supports out-of-the-box scalability, consistency, availability, multi-site deployment topology, and many more features. As a distributed database, TiDB has multiple components that communicate with each other and form a complete TiDB system.

TiDB: It is the stateless SQL processing component that provides the client-facing endpoint to the user. It locates the correct TiKV node to connect from PD to get the data.
TiKV: It is a distributed transactional key-value data store that keeps the data in the left-closed-right-open range. Data is kept in shards with multiple replicas. TiKV uses the Raft protocol for replication.
PD: The placement driver (PD) keeps the metadata of the cluster such as shard replica locations, and it’s also responsible for scheduling the shards across TiKV nodes. PD leader handles such tasks while other nodes maintain high availability.
TiFlash: The columnar storage extension that uses the Multi-Raft Learner protocol to replicate data from TiKV in real-time, ensuring consistent data between the TiKV row-based storage engine.

The following features of TiDB addressed our key challenges and met our operational requirements:

Easy scaling

The TiDB architecture design separates computing from storage, letting you scale out or scale in the computing or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.
ACID compliant

TiDB is MySQL-compliant and supports transactions out of the box. It supports both optimistic & pessimistic types of transactions. This makes it unique from other databases.
Highly available

TiKV stores data in multiple replicas and uses the Multi-Raft protocol to obtain the transaction log. A transaction can only be committed when the data has been successfully written into the majority of replicas. This guarantees strong consistency and high availability when a minority of replicas go down.
Real-time HTAP

TiDB combines both row storage (TiKV) and columnar storage (TiFlash) in the same architecture, forming a streamlined tech stack that makes it easier to produce real-time analytics on operational data.

Our TiDB Infrastructure

Our TiDB infrastructure is deployed over the VMs of leading cloud service providers. We use TiUP, TiDB’s package manager, to manage the cluster and all the administrative operations. Our cluster is deployed over 3 available zones (AZs).

Our cluster configurations are as follows:

PD: The PD layer has 3 nodes split across Multi-AZs. PD leader handles such tasks while other nodes maintain high availability.
TiDB: The TiDB layer has 9 nodes of the n2-highmem-8 family. These nodes were chosen based on the memory requirements, with 64 GB RAM & 8 Core CPUs allocated for each TiDB node.
TiKV: The TiKV layer has 15 nodes of the n2-highmem-16 family which has 128 GB RAM & 16 vCORE CPUs.

By deploying our TiDB cluster across multiple AZs and carefully selecting node types to meet our processing and memory needs, we've created a robust, highly available infrastructure capable of handling our high data throughput requirements.

Tuning TiDB for Our Case

To make it work for our use case, we worked closely with the PingCAP team to tune the database. Here are some of the critical adjustments we made:

Index Optimisation

Set the following parameters before starting the index.

SET @@global.tidb_ddl_reorg_worker_cnt = 16;
SET @@global.tidb_ddl_reorg_batch_size = 4096;

Reset to default values after index creation.

SET @@global.tidb_ddl_reorg_worker_cnt = 4;
SET @@global.tidb_ddl_reorg_batch_size = 256;

Partition Pruning

This is mainly important for partitioned tables. It analyses the filter conditions in query statements and eliminates (prunes) partitions when they do not contain any required data.

SET @@session.tidb_partition_prune_mode = 'dynamic';

Tuning analyse

Sometimes the auto analyser in TiDB fails if a high volume of data is ingested. In that case, all the queries might use the wrong execution plan and end up scanning the full table. To avoid such a situation we made the following changes in TiDB configurations:

set global tidb_max_auto_analyze_time = 86400;
set global tidb_enable_pseudo_for_outdated_stats = off;
set global tidb_sysproc_scan_concurrency = 15;

If you are working with partitioned tables, we suggest you run analyze table operations manually for one partition at a time to avoid analyzing failures.

Through adjustments like these, we were able to effectively streamline our use of TiDB, so that we can achieve an optimal performance for our real-time data mart.

Our Experience With TiDB

Improved queries performance

We have bench-marked 400+ queries and found that all the queries are running within SLA. We have even seen a 15-20% performance gain of P95 queries.
Easy migration

We used the TiDB Lighting tool to migrate all of our table's historical data from Postgres to TiDB. This tool is very easy to use and very fast. We were able to load terabytes of data within roughly 2-3 hours. However, it’s worth noting that there is a lot of tuning required before loading such huge data.
Strong support

We went through a couple of hiccups during the production infrastructure setup but the PingCAP support team played a very crucial role and helped us tune the cluster for the nature of the workload.

Conclusion

In this post, we have explored the challenges of using Aurora with our use case of real-time data marts and the migration journey to TiDB. We also discussed how Delhivery is using TiDB at scale.

Despite our success with TiDB, we acknowledge that no solution is perfect, and effectiveness can vary depending on the use case. In TiDB, we noted a couple of areas for improvement, including the lack of out-of-the-box support for materialized views and native quota management. However, with appropriate workarounds and adjustments, we have managed to address these limitations effectively.

So far, we have deployed TiDB in our production environment. Based on our benchmarks, TiDB enables us to handle over thousands of requests per second with less than 100ms latency. Moving forward, we will continue to explore more use cases that require a robust, consistently distributed database.

References

https://docs.pingcap.com/tidb/stable/tidb-lightning-overview

https://reorg.github.io/pg_repack/

https://github.com/dataegret/pgcompacttable

https://cloud.google.com/spanner

https://www.yugabyte.com/yugabytedb/

https://cloud.google.com/bigquery/

https://docs.pingcap.com/tidb/dev/transaction-overview

https://proxysql.com/

Author:

Hari Kishan (Senior Engineering Manager @ Delhivery)

Akash Deep Verma (Director of Technology @ Delhivery)

Delhivery’s Data Marts - Migration Journey from OLTP to HTAP