Migrating from one OLAP database to another is a huge undertaking. Even if you're unhappy with your current data tool and have found some promising candidates, you might still hesitate to perform major surgery on your data architecture because you're uncertain about how things will work out. What you need is experience shared by someone who has walked the path.

Luckily, a user of Apache Doris has written down their migration process from ClickHouse to Doris, including why they needed the change, what needed to be taken care of, and how they compared the performance of the two databases in their environment.

To decide whether you want to continue reading, check if you tick one of the following boxes:

If you do, this post might be of some help to you.

Replacing Kylin, ClickHouse, and Druid with Apache Doris

The user undergoing this change is an e-commerce SaaS provider. Its data system serves real-time and offline reporting, customer segmentation, and log analysis. Initially, they used different OLAP engines (Apache Kylin, ClickHouse, and Apache Druid) for these purposes.

The three components have their own sore spots.

Working with all three of them together makes this architecture demanding to navigate, because it requires knowledge of each component for development, monitoring, and maintenance. Also, every time the user scales a cluster, they must stop the current cluster and migrate all databases and tables, which is not only a big undertaking but also a major interruption to business.

Apache Doris fills these gaps.

So they planned the migration.

The Replacement Surgery

ClickHouse was the main performance bottleneck in the old data architecture and the reason the user wanted the change in the first place, so they started the migration with ClickHouse.

Changes in SQL statements

Table creation statements

The user built their own SQL rewriting tool that converts a ClickHouse table creation statement into the corresponding Doris table creation statement, automating the required changes.
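
As a rough illustration of the kind of rewriting involved (a generic sketch, not the actual output of the user's tool; table and column names are made up), here is a ClickHouse MergeTree table and an approximately equivalent Doris table:

```sql
-- ClickHouse: MergeTree table with ClickHouse-specific types and clauses
CREATE TABLE example_db.orders
(
    order_id   UInt64,
    seller_id  UInt64,
    order_date Date,
    amount     Decimal(18, 2)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_id, seller_id);

-- Doris: the ENGINE clause goes away; the table gets a key model,
-- a hash distribution, and replication properties instead
CREATE TABLE example_db.orders
(
    order_id   BIGINT,
    seller_id  BIGINT,
    order_date DATE,
    amount     DECIMAL(18, 2)
)
DUPLICATE KEY(order_id, seller_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");
```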

Query statements

Similarly, they built their own tool to transform ClickHouse query statements into Doris query statements in preparation for the comparison test between ClickHouse and Doris, handling the key differences between the two SQL dialects.
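
For instance (an illustrative rewrite, not taken from the user's workload), a query written with ClickHouse-specific functions has to be re-expressed with Doris's MySQL-compatible functions:

```sql
-- ClickHouse dialect
SELECT toYYYYMM(order_date) AS order_month, count() AS order_cnt
FROM example_db.orders
WHERE order_date >= toDate('2023-01-01')
GROUP BY order_month;

-- Doris (MySQL-compatible) dialect
SELECT DATE_FORMAT(order_date, '%Y%m') AS order_month, COUNT(*) AS order_cnt
FROM example_db.orders
WHERE order_date >= '2023-01-01'
GROUP BY DATE_FORMAT(order_date, '%Y%m');
```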

Changes in data ingestion methods

Apache Doris provides a broad range of data writing methods. For the real-time link, the user adopted Stream Load to ingest data from NSQ and Kafka.

For the sizable offline data, the user tested different methods, and here are the takeaways:

  1. Insert Into

Using Multi-Catalog to read external data sources and ingesting with Insert Into can serve most needs in this use case.
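
A minimal sketch of this pattern, assuming a Hive Metastore as the external source (catalog name, metastore address, and table names are placeholders): register an external catalog in Doris, then pull data into an internal table with INSERT INTO ... SELECT.

```sql
-- Register an external Hive Metastore catalog in Doris (Multi-Catalog)
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Ingest from the external table into an internal Doris table
INSERT INTO internal.example_db.orders
SELECT order_id, seller_id, order_date, amount
FROM hive_catalog.ods.orders_snapshot;
```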

  2. Stream Load

The Spark-Doris-Connector is a more general method. It can handle large data volumes and ensure writing stability. The key is to find the right writing pace and parallelism.

The Spark-Doris-Connector also supports Bitmap. It allows you to move the computation workload of Bitmap data into Spark clusters.

Both the Spark-Doris-Connector and the Flink-Doris-Connector rely on Stream Load. CSV is the recommended format choice. Tests on the user's billions of rows showed that CSV was 40% faster than JSON.
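
For reference, this is roughly how the connector is used from Spark SQL (a sketch with placeholder FE address, credentials, and table names, not the user's actual job). Each Spark task writes its partition to Doris via Stream Load under the hood, so the writing pace and parallelism are largely controlled by how the source data is partitioned in Spark.

```sql
-- Register a Doris table as a temporary view through the Spark-Doris-Connector
CREATE TEMPORARY VIEW doris_orders
USING doris
OPTIONS (
    "table.identifier" = "example_db.orders",
    "fenodes"          = "doris-fe-host:8030",
    "user"             = "doris_user",
    "password"         = "doris_password"
);

-- Write offline data from a Spark/Hive table into Doris
INSERT INTO doris_orders
SELECT order_id, seller_id, order_date, amount
FROM ods.orders_offline;
```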

  3. Spark Load

The Spark Load method utilizes Spark resources for data shuffling and sorting. The computation results are put on HDFS, and then Doris reads the files from HDFS directly (via Broker Load). This approach is ideal for huge data ingestion: the more data there is, the faster and more resource-efficient the ingestion becomes.
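
A sketch of the Spark Load flow, with placeholder resource, broker, and path names (cluster-specific Spark and Hadoop settings omitted): register the Spark cluster as a Doris resource once, then submit load jobs against it.

```sql
-- One-off: register the Spark cluster as a Doris resource
CREATE EXTERNAL RESOURCE "spark_etl"
PROPERTIES (
    "type"                    = "spark",
    "spark.master"            = "yarn",
    "spark.submit.deployMode" = "cluster",
    "working_dir"             = "hdfs://namenode:8020/tmp/doris_spark_load",
    "broker"                  = "hdfs_broker"
);

-- Per batch: Spark preprocesses (shuffles/sorts) the data and writes it to HDFS,
-- then Doris pulls the prepared files
LOAD LABEL example_db.orders_20230101
(
    DATA INFILE("hdfs://namenode:8020/warehouse/ods.db/orders/dt=2023-01-01/*")
    INTO TABLE orders
    COLUMNS TERMINATED BY ","
    (order_id, seller_id, order_date, amount)
)
WITH RESOURCE "spark_etl"
PROPERTIES ("timeout" = "7200");
```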

Pressure Test

The user compared the performance of the two databases in their SQL and join query scenarios, and measured the CPU and memory consumption of Apache Doris.

SQL query performance

Apache Doris outperformed ClickHouse in 10 of the 16 SQL queries, with the biggest performance gap reaching a ratio of almost 30. Overall, Apache Doris was 2 to 3 times faster than ClickHouse.

Join query performance

For the join query tests, the user tested different sizes of primary tables and dimension tables.

The tests included full join queries and filtering join queries: full join queries join all rows of the primary table with the dimension tables, while filtering join queries retrieve data for a certain seller ID with a WHERE filter.
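
In sketch form, with illustrative table and column names rather than the user's actual schema, the two query shapes look like this:

```sql
-- Full join: every row of the primary table is joined with the dimension table
SELECT s.seller_name, SUM(o.amount) AS total_amount
FROM orders o
JOIN sellers s ON o.seller_id = s.seller_id
GROUP BY s.seller_name;

-- Filtering join: a WHERE clause restricts the query to a single seller ID
SELECT s.seller_name, SUM(o.amount) AS total_amount
FROM orders o
JOIN sellers s ON o.seller_id = s.seller_id
WHERE o.seller_id = 10001
GROUP BY s.seller_name;
```

The results are summarized as follows: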

Primary table (4 billion rows):

Primary table (25 billion rows):

Primary table (96 billion rows):

Doris delivered relatively fast performance in all queries, while ClickHouse could not complete all of them.

In terms of CPU and memory consumption, Apache Doris maintained stable cluster loads across join queries of all sizes.

Future Directions

As the migration goes on, the user works closely with the Doris community, and their feedback has contributed to the making of Apache Doris 2.0.0. We will continue assisting them in their migration from Kylin and Druid to Doris, and we look forward to seeing their Doris-based unified data platform come into being.

