We design and benchmark a cross‑platform echo & chat server that scales from laptops to low‑latency Linux boxes. Starting with a Boost.Asio baseline, we add UDP and finally an io_uring implementation that closes the gap with DPDK‑style kernel‑bypass—all while preserving a single, readable codebase.

Full code is available here: https://github.com/hariharanragothaman/nimbus-echo

Motivation

Real‑time collaboration tools, multiplayer games, and HFT gateways all live or die by tail latency. Traditional blocking sockets waste cycles on context switches; bespoke bypass stacks (XDP, DPDK) buy raw speed at the cost of portability.

NimbusNet shows you can split the difference: one readable codebase, standard sockets everywhere, and io_uring on the Linux hosts that support it.

Build Environment:

| Host | Toolchain | Runtime Variant(s) |
|---|---|---|
| macOS 14.5 (M2 Pro) | Apple clang 15, Homebrew Boost 1.85 | Boost.Asio / TCP & UDP |
| Ubuntu 24.04 (x86‑64) | GCC 13, liburing 2.6 | Boost.Asio / TCP & UDP, io_uring / TCP |
| GitHub Actions | macos‑14, ubuntu‑24.04 | CI build + tests |

Phase 1 – Establishing the Baseline (Boost.Asio, TCP)

We begin with a minimal asynchronous echo service that compiles natively on macOS.

Boost.Asio’s Proactor‑style async_read_some / async_write gives us a platform‑agnostic way to experiment before introducing kernel‑bypass techniques.

#include <boost/asio.hpp>
#include <array>
#include <iostream>

using boost::asio::ip::tcp;

class EchoSession : public std::enable_shared_from_this<EchoSession> {
    tcp::socket socket_;
    std::array<char, 4096> buf_{};

public:
    explicit EchoSession(tcp::socket s) : socket_(std::move(s)) {}
    void start() { read(); }

private:
    void read() {
        auto self = shared_from_this();
        socket_.async_read_some(boost::asio::buffer(buf_),
                                [this, self](auto ec, std::size_t n) { if (!ec) write(n); });
    }
    void write(std::size_t n) {
        auto self = shared_from_this();
        boost::asio::async_write(socket_, boost::asio::buffer(buf_, n),
                                 [this, self](auto ec, std::size_t) { if (!ec) read(); });
    }
};

int main() {
    boost::asio::io_context io;
    tcp::acceptor acc(io, {tcp::v4(), 9000});

    std::function<void()> do_accept = [&]() {
        acc.async_accept([&](auto ec, tcp::socket s) {
            if (!ec) std::make_shared<EchoSession>(std::move(s))->start();
            do_accept();
        });
    };
    do_accept();

    std::cout << "⚡  NimbusNet echo listening on 0.0.0.0:9000\n";
    io.run();
}

Phase 2 – UDP vs. TCP: When Reliability Becomes a Tax

TCP’s 3‑way handshake, retransmit queues, and head‑of‑line blocking are lifesavers for file transfers, but millstones for chats that can drop an occasional emoji. TCP bakes in ordering, retransmission, and congestion avoidance; those guarantees cost extra context switches and kernel bookkeeping. Swapping tcp::socket for udp::socket sheds all of it: for chat or market‑data fan‑out, “best‑effort but immediate” often wins.

#include <boost/asio.hpp>
#include <array>
#include <iostream>

using boost::asio::ip::udp;

class UdpEchoServer {
    udp::socket socket_;
    std::array<char, 4096> buf_{};
    udp::endpoint remote_;
public:
    explicit UdpEchoServer(boost::asio::io_context& io, unsigned short port)
            : socket_(io, udp::endpoint{udp::v4(), port}) { receive(); }

private:
    void receive() {
        socket_.async_receive_from(
                boost::asio::buffer(buf_), remote_,
                [this](auto ec, std::size_t n) {
                    if (!ec) send(n);
                });
    }
    void send(std::size_t n) {
        socket_.async_send_to(
                boost::asio::buffer(buf_, n), remote_,
                [this](auto /*ec*/, std::size_t /*n*/) { receive(); });
    }
};

int main() {
    try {
        boost::asio::io_context io;
        UdpEchoServer srv(io, 9001);
        std::cout << "⚡  UDP echo on 0.0.0.0:9001\n";
        io.run();
    } catch (const std::exception& ex) {
        std::cerr << ex.what() << '\n';
        return 1;
    }
}

Latency table (localhost, 64‑byte payload):

| Layer | TCP | UDP |
|---|---|---|
| Conn setup | 3‑way handshake | 0 |
| HOL blocking | Yes | No |
| Kernel buffer | per‑socket | shared |
| RTT (median) | ≈ 85 µs | ≈ 45 µs |

Here we replaced tcp::socket with udp::socket and removed the per‑session heap allocation; the code path is ~40 % shorter in perf traces.

If your application can tolerate an occasional drop (or do its own acks), UDP is the gateway to sub‑50 µs median latencies, even before kernel bypass.

Takeaway: if you can tolerate packet loss (or roll your own ACK/NACK), UDP buys you ~40 µs on the spot.

Phase 3 – io_uring: The Lowest‑Friction Doorway to Zero‑Copy

Linux 5.1 introduced io_uring; by 5.19 (which added features such as multishot accept) it narrows the gap with DPDK‑style bypass while staying entirely in‑kernel.

// Minimal io_uring TCP echo server (completion-based, single-threaded)
#include <liburing.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>
#include <iostream>

constexpr uint16_t PORT        = 9002;
constexpr unsigned QUEUE_DEPTH = 256;
constexpr unsigned BUF_SZ      = 4096;

// Every SQE carries a ConnData* as user_data; the op tag tells the completion
// loop which kind of request just finished.
enum class Op : uint8_t { Accept, Recv, Send };

struct ConnData {
    Op   op;
    int  fd;
    char buf[BUF_SZ];
};

int main() {
    // 1. Classic BSD socket setup
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(PORT);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listen_fd, SOMAXCONN);

    // 2. uring setup
    io_uring ring{};
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

    // helper: queue an accept SQE (client address is not needed for an echo)
    auto prep_accept = [&]() {
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_accept(sqe, listen_fd, nullptr, nullptr, 0);
        io_uring_sqe_set_data(sqe, new ConnData{Op::Accept, listen_fd, {}});
    };
    // helper: queue a recv SQE for an existing connection
    auto prep_recv = [&](ConnData* cd) {
        cd->op = Op::Recv;
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv(sqe, cd->fd, cd->buf, BUF_SZ, 0);
        io_uring_sqe_set_data(sqe, cd);
    };

    prep_accept();
    io_uring_submit(&ring);

    std::cout << "⚡  io_uring TCP echo on 0.0.0.0:" << PORT << '\n';

    // 3. Main completion loop
    while (true) {
        io_uring_cqe* cqe;
        int ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret < 0) { std::cerr << "wait_cqe: " << strerror(-ret) << '\n'; break; }

        auto* cd  = static_cast<ConnData*>(io_uring_cqe_get_data(cqe));
        int   res = cqe->res;               // result (or -errno) lives on the CQE
        io_uring_cqe_seen(&ring, cqe);

        switch (cd->op) {
        case Op::Accept:
            delete cd;                      // accept carries no buffer state
            prep_accept();                  // keep an accept in flight at all times
            if (res >= 0)                   // res is the new client fd
                prep_recv(new ConnData{Op::Recv, res, {}});
            break;
        case Op::Recv:
            if (res > 0) {                  // echo the bytes back
                cd->op = Op::Send;
                io_uring_sqe* sqe = io_uring_get_sqe(&ring);
                io_uring_prep_send(sqe, cd->fd, cd->buf, res, 0);
                io_uring_sqe_set_data(sqe, cd);
            } else {                        // 0 = client closed, <0 = error
                close(cd->fd);
                delete cd;
            }
            break;
        case Op::Send:
            prep_recv(cd);                  // write done → wait for the next read
            break;
        }
        io_uring_submit(&ring);
    }
    close(listen_fd);
    io_uring_queue_exit(&ring);
    return 0;
}

Even without privileged NIC drivers, io_uring brings sub‑50 µs round trips to laptop‑class hardware: ideal for prototyping HFT engines before moving to SO_REUSEPORT + XDP in production.

Phase 4 – Running Benchmarks: Quantifying the Wins

We wrap each variant in a Google Benchmark harness:

#include <benchmark/benchmark.h>
#include <boost/asio.hpp>
#include <thread>
#include <array>

using boost::asio::ip::tcp;
using boost::asio::ip::udp;

/* ---------- Helpers ------------------------------------------------------ */

// blocking Boost.Asio TCP echo client (loop‑back)
static void tcp_roundtrip(size_t payload) {
    boost::asio::io_context io;
    tcp::socket c(io);
    c.connect({boost::asio::ip::make_address("127.0.0.1"), 9000});
    std::string msg(payload, 'x');
    c.write_some(boost::asio::buffer(msg));
    std::array<char, 8192> buf{};
    boost::asio::read(c, boost::asio::buffer(buf, payload)); // read_some may return short reads
}

// blocking Boost.Asio UDP echo client
static void udp_roundtrip(size_t payload) {
    boost::asio::io_context io;
    udp::socket s(io, udp::v4());
    udp::endpoint server(boost::asio::ip::make_address("127.0.0.1"), 9001);
    std::string msg(payload, 'x');
    s.send_to(boost::asio::buffer(msg), server);
    std::array<char, 8192> buf{};
    s.receive_from(boost::asio::buffer(buf, payload), server);
}

#if defined(__linux__)
// tiny wrapper for the io_uring server (assumes it’s already running on 9002)
static void uring_tcp_roundtrip(size_t payload) {
    boost::asio::io_context io;
    tcp::socket c(io);
    c.connect({boost::asio::ip::make_address("127.0.0.1"), 9002});
    std::string msg(payload, 'x');
    c.write_some(boost::asio::buffer(msg));
    std::array<char, 8192> buf{};
    boost::asio::read(c, boost::asio::buffer(buf, payload)); // read_some may return short reads
}
#endif

/* ---------- Benchmarks --------------------------------------------------- */

static void BM_AsioTCP_64B(benchmark::State& s) {
    for (auto _ : s) tcp_roundtrip(64);
}
BENCHMARK(BM_AsioTCP_64B)->Unit(benchmark::kMicrosecond);

static void BM_AsioUDP_64B(benchmark::State& s) {
    for (auto _ : s) udp_roundtrip(64);
}
BENCHMARK(BM_AsioUDP_64B)->Unit(benchmark::kMicrosecond);

#if defined(__linux__)
static void BM_IouringTCP_64B(benchmark::State& s) {
    for (auto _ : s) uring_tcp_roundtrip(64);
}
BENCHMARK(BM_IouringTCP_64B)->Unit(benchmark::kMicrosecond);
#endif

BENCHMARK_MAIN();

With Google Benchmark we measured 10 K in‑process round trips per transport on an M2‑Pro MBP (macOS 14.5; the io_uring variant ran inside Docker Desktop 4.30’s Linux VM):

Table 1 – Median RTT (64 B payload, 10 K iterations)

| Transport | Median RTT (µs) |
|---|---|
| Boost.Asio / TCP | 82 |
| Boost.Asio / UDP | 38 |
| io_uring / TCP | 21 |

Even on consumer hardware, io_uring nearly halves UDP’s latency and beats traditional TCP by almost 4×. This validates the architectural decision to build NimbusNet’s high‑fan‑out chat tier on io_uring while retaining a single pure‑userspace codebase.

Takeaways & Future Work

Next steps

  1. SO_REUSEPORT + sharded accept rings → horizontal scale on 64‑core EPYC processors.
  2. TLS offloading via kTLS combined with io_uring’s splice support.
  3. eBPF tracing to correlate queue depth with tail latency.