This data warehousing use case is about scale. The user is China Unicom, one of the world's largest telecommunications service providers. Using Apache Doris, they run multiple petabyte-scale clusters on dozens of machines to handle the 15 billion new logs generated every day by more than 30 business lines. This log analysis system is part of their cybersecurity management. To support real-time monitoring, threat tracing, and alerting, they need a log analytics system that can automatically collect, store, analyze, and visualize logs and event records.

From an architectural perspective, the system should be able to analyze logs of various formats in real time and, of course, scale to support a huge and ever-growing data volume. The rest of this post describes what their log processing architecture looks like and how they achieve stable data ingestion, low-cost storage, and quick queries with it.

System Architecture

This is an overview of their data pipeline. The logs are collected in the data warehouse and go through several layers of processing.

Architecture 2.0 evolved from Architecture 1.0, which was built on ClickHouse and Apache Hive. The transition was driven by the user's need for real-time data processing and multi-table join queries. In their experience with ClickHouse, they found inadequate support for concurrency and multi-table joins, which showed up as frequent timeouts in dashboarding and OOM errors in distributed joins.

Now, let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0.

Real-Case Practice

Stable ingestion of 15 billion logs per day

In the user's case, their business churns out 15 billion logs every day. Ingesting that volume of data quickly and stably is a real challenge. With Apache Doris, the recommended way is to use the Flink-Doris-Connector, which was developed by the Apache Doris community for large-scale data writing. The component requires only simple configuration. It writes data through Stream Load and can reach a writing speed of 200,000 to 300,000 logs per second without interrupting the data analytic workloads.
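To put those numbers side by side, here is a small back-of-the-envelope calculation. It only uses the figures quoted above (15 billion logs per day, 200,000 to 300,000 logs per second); the function names are made up for illustration, and it assumes logs arrive evenly over the day, which real traffic does not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_rate(logs_per_day: int) -> float:
    """Average write rate (logs/s) needed to keep up with the daily volume."""
    return logs_per_day / SECONDS_PER_DAY


def headroom(write_rate: float, logs_per_day: int) -> float:
    """How many times the average required rate a given write rate covers."""
    return write_rate / required_rate(logs_per_day)


daily_logs = 15_000_000_000  # figure quoted in the post
print(f"average required rate: {required_rate(daily_logs):,.0f} logs/s")
for rate in (200_000, 300_000):  # writing speeds quoted in the post
    ratio = headroom(rate, daily_logs)
    print(f"{rate:,} logs/s covers {ratio:.2f}x the daily average")
```

The daily average works out to roughly 174,000 logs per second, so the quoted writing speed sits above the average, with the upper end leaving more room for daytime peaks.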

A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations:

These measures together ensure daily ingestion stability. The user has witnessed stable performance and low compaction scores in the Doris backend. In addition, the combination of data pre-processing in Flink and the Unique Key model in Doris can ensure quicker data updates.
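As a rough illustration of why pre-processing plus the Unique Key model helps with updates, the sketch below simulates both steps in plain Python. It is not Doris or Flink code: the row schema, the function names, and the choice of in-batch deduplication as the pre-processing step are all assumptions made for this example. The only behavior taken from Doris is that, in a Unique Key table, a newly loaded row replaces an existing row with the same key.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical log row: (log_id, event_time, status)
LogRow = Tuple[str, int, str]


def dedupe_batch(rows: Iterable[LogRow]) -> List[LogRow]:
    """Assumed pre-processing step: keep only the latest row per key in a batch."""
    latest: Dict[str, LogRow] = {}
    for row in rows:
        key, event_time, _ = row
        if key not in latest or event_time >= latest[key][1]:
            latest[key] = row
    return list(latest.values())


def apply_unique_key(table: Dict[str, LogRow], batch: Iterable[LogRow]) -> Dict[str, LogRow]:
    """Simulated Unique Key semantics: a loaded row replaces the row with the same key."""
    for row in batch:
        table[row[0]] = row
    return table


batch = [("a", 1, "open"), ("b", 1, "open"), ("a", 2, "closed")]
deduped = dedupe_batch(batch)  # fewer rows to write than the raw batch
table = apply_unique_key({"a": ("a", 0, "new")}, deduped)
print(deduped)
print(table)
```

Deduplicating upstream means fewer rows reach the storage layer, and the key-based replacement means the table always reflects the most recent state of each log record.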

Storage strategies to reduce costs by 50%

The size and generation rate of logs also put pressure on storage. Only part of this immense log data has high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs.

With these three strategies, the user has reduced their storage costs by 50%.

Differentiated query strategies based on data size

Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user has different query strategies for different data sizes:

These strategies have shortened query response times. For example, a query for a specific data item used to take minutes, but now it finishes in milliseconds. In addition, for big tables that contain 10 billion records, queries on different dimensions can all be completed in a few seconds.

Ongoing Plans

The user is now testing the newly added inverted index in Apache Doris. It is designed to speed up full-text search on strings as well as equality and range queries on numeric and datetime fields. They have also provided valuable feedback about the auto-bucketing logic in Doris: currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is that most of their new data comes in during the daytime and little at night. So, in their case, Doris creates too many buckets for nighttime data but too few for daytime data, which is the opposite of what they need. They hope for a new auto-bucketing logic in which Doris decides the number of buckets based on the data size and distribution of the previous day. They've brought this to the Apache Doris community, and we are now working on this optimization.
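The mismatch is easier to see with numbers. The following is a simplified sketch, not the actual Doris implementation: the target bucket size, the half-day partitions, the partition sizes, and all function names are assumptions chosen for illustration. It compares sizing each partition from the one immediately before it with sizing it from the same time slot on the previous day.

```python
import math
from typing import List

GIB = 1024 ** 3
TARGET_BUCKET_BYTES = 1 * GIB  # hypothetical target size per bucket


def buckets_for(size_bytes: int) -> int:
    """Number of buckets needed so each holds about TARGET_BUCKET_BYTES."""
    return max(1, math.ceil(size_bytes / TARGET_BUCKET_BYTES))


def plan_from_previous_partition(sizes: List[int]) -> List[int]:
    """Size partition i from partition i-1 (the behavior described in the post)."""
    return [buckets_for(sizes[i - 1]) for i in range(1, len(sizes))]


def plan_from_previous_day(sizes: List[int], partitions_per_day: int) -> List[int]:
    """Size partition i from the same time slot one day earlier (the requested behavior)."""
    return [
        buckets_for(sizes[i - partitions_per_day])
        for i in range(partitions_per_day, len(sizes))
    ]


# Assumed workload: two partitions per day, busy daytime (40 GiB), quiet night (4 GiB).
sizes = [40 * GIB, 4 * GIB, 40 * GIB, 4 * GIB]  # day, night, day, night

# Plans for night 1, day 2, night 2: nights are over-bucketed, the day is under-bucketed.
print(plan_from_previous_partition(sizes))
# Plans for day 2, night 2: bucket counts follow the actual daily pattern.
print(plan_from_previous_day(sizes, partitions_per_day=2))
```

Under these assumed sizes, the previous-partition rule gives the quiet night partitions 40 buckets and the busy day partition only 4, while the previous-day rule gives the day partition 40 and the night partition 4.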

