I first heard of Apache Arrow in late 2021, when I was researching query engines for a blog post and was told to look into it. There was almost no information floating around at the time, and it was hard to get a real sense of what it was.
A couple of months later, Voltron Data announced its existence and funding and said it would monetize Arrow. I didn’t see how that would work, and Voltron now seems to be shutting its doors, but it did lead to some brilliant and talented people getting out there, spreading the word, and explaining Arrow. It also led to some interesting side projects, namely ADBC, which is relevant when we ask the question: What the heck is dbc? But first…
What are Arrow and ADBC?
Apache Arrow is a cross-language in-memory analytics platform that provides a standardized columnar memory format. It eliminates the need for serialization and deserialization when moving data between different systems and programming languages, enabling zero-copy reads and efficient data sharing. Arrow's language-agnostic format enables interoperability with languages such as Python, R, Java, C++, and JavaScript. The format is optimized for modern CPU and GPU architectures, allowing for vectorized operations and improved analytical query performance.
Arrow Database Connectivity (ADBC) is a subproject that provides a vendor-neutral API standard for database access using Arrow's columnar format. Rather than converting database results into row-based formats like traditional APIs (ODBC, JDBC), ADBC fetches data directly in Arrow format, reducing overhead and improving performance for analytical workloads. The API provides a consistent interface across different database systems, with drivers available for PostgreSQL, SQLite, Snowflake, and Flight SQL servers. By maintaining data in Arrow format throughout the pipeline, ADBC enables more efficient integration between databases and Arrow-based analytical tools, reducing the need for memory copies and data transformations that typically slow down data analysis workflows.
What is dbc?
Well, first understand that the main Arrow developers and evangelists at Voltron Data started their own company called Columnar. They are building on Apache Arrow and ADBC to keep shipping updates that improve speed, simplicity, and security, and to continue being the standard-bearers of the projects. What’s their revenue model? No idea 🙂, but dbc is part of the “simplicity” directive in their mandate, so let’s take a look.
The point of dbc is to make ADBC easy to install and, therefore, easy to use. When we go to columnar.tech/dbc/ we see this page:
What you see is a nice set of options to pick an operating system, language, and system to connect to. I’m going to select DuckDB now, and show what it presents:
I tried it in WSL2 (Ubuntu), followed the instructions, and saw this:
Ignore my insecure setup; just appreciate how simple this was to do. Next, I’ll show a screenshot of dbc for a Go install:
All you have to do is copy/paste these instructions for any combination that you’ve selected. This is highly convenient and really reduces the friction and frustration people can experience when adopting new technologies. The GitHub page for dbc is here, where you can explore further.
dbc Commands
Once installed, dbc offers a variety of options for fully managing drivers and getting information about them. The dbc --help command presents the following:
If you issue a dbc search followed by a dbc search sql, you can see the flexibility of the interface. Given the relatively small number of drivers, it may not seem like a big deal, but it shows that they are planning ahead to future-proof it all. The --verbose option is nice. As you can see at the bottom, I searched for duck since I know I have that driver installed. I get a list of available versions, the version I have installed, and its location. The thoroughness here is just a nice touch, in my opinion.
Following on from that, I wanted some information about the duckdb driver, so I issued dbc info duckdb, and there we have it. Listing all the supported platforms and architectures is again a nice touch for completeness.
Summary
Apache Arrow and ADBC are exciting technologies. Columnar formats like Parquet are particularly popular in the data lake space, and having a format and drivers that let you read and process the data while keeping it columnar, rather than converting back and forth through row-based drivers like JDBC/ODBC, is an incredible performance boost. What Columnar has done with dbc is lower the barrier to entry and remove the fear of the unknown. I know for myself, when I first started looking at these technologies in 2021, it was daunting, it was new, I didn’t “get it”, and it took a mental shift to get there. The docs that Columnar is writing are excellent; one of their founders, Matt Topol, literally wrote the book on Arrow.
dbc is free, it’s open source, and it’s easy to use. If you are dealing with columnar data or want to get involved with it, this is definitely the place to start.
Want to read more in my “What the Heck is???” series? A handy list is below:
- What The Heck Is DuckDB?
- What the Heck Is Malloy?
- What the Heck is PRQL?
- What the Heck is GlareDB?
- What the Heck is SeaTunnel?
- What the Heck is LanceDB?
- What the heck is SDF?
- What the Heck is Paimon?
- What the Heck is Proton?
- What the Heck is PuppyGraph?
- What the Heck is GPTScript?
- What the Heck is WarpStream?
- What the Heck is DeltaStream?
- What the Heck is OpenMetadata?