Hi, everyone! Today I'm going to discuss the engineering of a scalable price pipeline. Every trading system consists of two major subsystems:

- price distribution, which delivers market data to strategies and traders;
- order execution, which places orders and reports fills.

Both of them are equally important: prices are needed before orders can be placed, and after execution the fill data is needed to calculate risk metrics. In this article, I will focus on the price distribution subsystem.

High-Level Architecture

Let's first decide which functionality is expected from the price distribution subsystem:

- receive raw ticks from external data sources;
- aggregate ticks into bars for different timeframes;
- stream live prices to consumers;
- store historical data and serve queries against it.

Historical Data Storage

To store the data, we have two general options:

- row-based storage, where each record is kept as a contiguous row;
- columnar storage, where each field is kept as a contiguous column.

The decision on storage must be based on the queries that will be executed against the data. If we need strong transactional guarantees and frequent updates, row-based storage is better. This type of storage is also good for querying by ID: the whole row is stored contiguously, so it is efficient to retrieve when all columns are needed.

But for analytical purposes we will not fetch data by ID. Instead, we will query by time range. Our data will also be multidimensional: for each bar we will have at least open, high, low, close and volume fields. Different types of analysis may need only some of these fields; for example, to calculate a moving average we need only close prices. A columnar database reads a whole column efficiently, because all of its values are stored together on disk, and such sequential disk operations are very fast.
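The row-vs-column trade-off above can be illustrated with a minimal sketch (pure Python, no real database): the same bars stored row-wise and column-wise, and a moving average that only needs the close column.

```python
# Row-oriented layout: each bar is one record. Reading only the close
# price still drags every other field through memory.
rows = [
    {"ts": 0, "open": 10.0, "high": 10.5, "low": 9.8, "close": 10.2, "volume": 100},
    {"ts": 1, "open": 10.2, "high": 10.6, "low": 10.1, "close": 10.4, "volume": 80},
    {"ts": 2, "open": 10.4, "high": 10.8, "low": 10.3, "close": 10.6, "volume": 120},
    {"ts": 3, "open": 10.6, "high": 10.9, "low": 10.2, "close": 10.3, "volume": 90},
]

# Column-oriented layout: one contiguous array per field. A scan over
# "close" touches exactly the values it needs.
cols = {key: [bar[key] for bar in rows] for key in rows[0]}

def sma(values, window):
    """Simple moving average over a sliding window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

print(sma(cols["close"], 2))  # averages of consecutive close pairs
```

A real columnar engine adds compression and vectorized execution on top of this layout, but the access pattern is the same.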

There are multiple open-source columnar databases available, for example ClickHouse, Apache Druid and DuckDB.

Overview

Here are the key components our system will consist of:

- data source connectors that receive raw ticks (e.g. from Binance);
- a streaming service that aggregates ticks and distributes prices;
- a columnar database for historical data (e.g. ClickHouse);
- a query service for historical requests;
- a web UI.

It's very important to segregate the query functionality from the streaming service, for several reasons:

- the two have different load profiles and can be scaled independently;
- heavy historical queries must not add latency to the real-time stream;
- a failure in one path does not bring down the other.

At a high level, the architecture looks like:

Dual-buffer Design

The streaming service is the core of the system, so let's dive deeper into its design. The main problem is that we have a many-to-many relationship between data producers and consumers.

The service must be designed in such a way that all these relationships are handled efficiently and don't impact each other. The most straightforward way is to use standard queues, but they have several drawbacks:

- producers and consumers contend for the same lock on every single item;
- items are enqueued and dequeued one at a time, so there is no batching;
- a slow consumer can back-pressure unrelated producers.

A better approach is a dual ring-buffer design: each producer and consumer has its own pair of buffers, and at any point in time one buffer is being filled while the other is being drained. This design has the following advantages:

- the shared critical section shrinks to a brief buffer swap;
- consumers always process data in batches, which amortizes overhead;
- producers and consumers never block each other while working on their own buffer.
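The swap mechanics can be sketched in a few lines (a simplified illustration, not the actual implementation from the repository; the lock is held only for an append or the swap, never while a batch is being processed):

```python
import threading

class DualBuffer:
    """Two buffers: the producer appends to the active one while the
    consumer drains the other. Draining swaps the two under a short
    lock and then processes the batch outside any critical section."""

    def __init__(self):
        self._active = []    # being filled by the producer
        self._draining = []  # owned by the consumer between swaps
        self._lock = threading.Lock()

    def publish(self, item):
        with self._lock:  # held only for a single append
            self._active.append(item)

    def drain(self):
        """Swap buffers and return everything produced since the last drain."""
        with self._lock:  # held only for the pointer swap
            self._active, self._draining = self._draining, self._active
        batch = self._draining
        self._draining = []  # consumer-owned; safe to replace outside the lock
        return batch

buf = DualBuffer()
buf.publish(1)
buf.publish(2)
print(buf.drain())  # the whole batch since the last swap
```

Processing the returned batch never holds the lock, which is the key difference from a standard queue where every dequeue is a synchronized operation.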

Here is a diagram illustrating this design:

Tick Aggregation

The key piece of the streaming service is tick aggregation. Raw ticks arrive from data sources, and for each tick we need to:

- find the aggregator for the tick's instrument and timeframe;
- set the bar's open price if this is the first tick of the period;
- update the bar's high, low and close prices;
- add the tick's quantity to the bar's volume.

One problem is how to handle the end of a period: a finished bar must be flushed even if no new tick arrives to trigger it.

A timer will publish events to the instrument buffers. Each aggregator listens to these events and flushes the current bar to the output buffers when its timeframe has ended.
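A per-instrument aggregator with timer-driven flushing can be sketched as follows (the names and structure here are my own illustration, not the actual code from the repository):

```python
class BarAggregator:
    """Folds raw ticks into one OHLCV bar and flushes it when a timer
    event signals that the timeframe has ended."""

    def __init__(self, timeframe_sec):
        self.timeframe_sec = timeframe_sec
        self.bar = None  # current open bar, or None

    def on_tick(self, ts, price, qty):
        start = ts - ts % self.timeframe_sec  # align to the bar boundary
        if self.bar is None:
            # First tick of the period sets the open price.
            self.bar = {"start": start, "open": price, "high": price,
                        "low": price, "close": price, "volume": qty}
            return
        b = self.bar
        b["high"] = max(b["high"], price)
        b["low"] = min(b["low"], price)
        b["close"] = price
        b["volume"] += qty

    def on_timer(self, ts):
        """Timer event: return the finished bar once the period is over."""
        if self.bar is not None and ts >= self.bar["start"] + self.timeframe_sec:
            bar, self.bar = self.bar, None
            return bar
        return None

agg = BarAggregator(timeframe_sec=60)
agg.on_tick(1, 10.0, 5)
agg.on_tick(30, 11.0, 3)
agg.on_tick(45, 9.5, 2)
assert agg.on_timer(59) is None  # period not over yet
bar = agg.on_timer(60)           # boundary reached: bar is flushed
```

Note that the timer, not the tick stream, closes the bar: this is exactly what guarantees a flush for illiquid instruments where no tick may arrive after the boundary.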

Implementation

I implemented this architecture; the code is available in the Price Server repository.

The key components are:

Modules

The solution is implemented with pluggable modules of two kinds: data-source modules and storage modules.

Implementations for Binance and ClickHouse live in the modules/ folder. New ones can be added without changing the core logic by implementing the appropriate interfaces.
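The interface-based extension point can be sketched like this (the names `DataSource`, `Storage` and `InMemoryStorage` are illustrative, not the actual interfaces from the repository):

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Produces raw ticks, e.g. a Binance connector."""
    @abstractmethod
    def subscribe(self, instrument, on_tick):
        """Register a callback invoked for every tick of `instrument`."""

class Storage(ABC):
    """Persists finished bars, e.g. a ClickHouse writer."""
    @abstractmethod
    def write_bar(self, instrument, bar):
        """Persist one finished bar for `instrument`."""

class InMemoryStorage(Storage):
    """A toy Storage implementation, showing that the core logic only
    ever talks to the interface, never to a concrete backend."""
    def __init__(self):
        self.bars = []
    def write_bar(self, instrument, bar):
        self.bars.append((instrument, bar))

storage = InMemoryStorage()
storage.write_bar("BTCUSDT", {"open": 1.0, "close": 2.0})
```

A new exchange or database is then just another class implementing one of these interfaces, wired in by configuration.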

Starting the server

All runnable components are packaged into Docker images and can be started with a single command:

docker compose up --build

This command will start:

- the streaming service;
- the query service;
- a ClickHouse instance;
- the web UI.

Then just open http://localhost in a browser.

Conclusion

In this article, we explored the architecture of a scalable price aggregation pipeline. Key takeaways:

- columnar storage fits time-range analytical queries over OHLCV data far better than row-based storage;
- the query path should be segregated from the real-time streaming path;
- a dual ring-buffer design handles many-to-many producer/consumer relationships with minimal contention;
- a timer is needed to flush bars when a timeframe ends without a closing tick;
- a modular design lets new data sources and storage backends be plugged in without touching the core.

Price distribution is a foundational component of any trading system. The patterns discussed here (separation of concerns, dual-buffer design, and modular architecture) give a solid starting point for building market data infrastructure.

Source code: https://github.com/zharkomi/price_server