Why does the field of data engineering even exist? It started with a problem, one that plays out every year on Black Friday.
It’s the single most important day of the year for a major e-commerce website. The company’s production database, likely a PostgreSQL or MySQL system, is humming along, doing exactly what it was designed for: handling thousands of small, fast transactions every second. This work, known as Online Transaction Processing (OLTP), includes essential actions like:
🛒 Add to cart
📝 Update inventory
💵 Process payment
This workload requires high-speed data writes and precise, row-level database locking to ensure two people don’t buy the last product in inventory at the same time.
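To make this concrete, here is a toy version of a single checkout transaction. It uses Python's built-in sqlite3 purely for illustration; the table and column names are invented, and a real store would be talking to PostgreSQL or MySQL.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, stock INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product_id INTEGER,
                         price REAL, region TEXT);
    INSERT INTO inventory VALUES (42, 1);
""")

# One OLTP "checkout": a tiny transaction that touches a single row
# and either fully commits or fully rolls back.
with conn:
    (stock,) = conn.execute(
        "SELECT stock FROM inventory WHERE product_id = ?", (42,)
    ).fetchone()
    if stock > 0:
        conn.execute(
            "UPDATE inventory SET stock = stock - 1 WHERE product_id = ?", (42,)
        )
        conn.execute(
            "INSERT INTO orders (product_id, price, region) VALUES (?, ?, ?)",
            (42, 100.0, "EU"),
        )

Thousands of these tiny, independent transactions per second is exactly the workload an OLTP database is tuned for.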
Then comes the call 📞. It's midday, and the CEO demands a real-time report on "Total Revenue by Region" to inform a critical marketing decision. A Data Analyst, tasked with the request, connects directly to the live production database and runs a massive query to sum up millions of historical sales records:
SELECT region, SUM(price) FROM orders GROUP BY region;
What seems like a simple request triggers a catastrophe. The database CPU spikes to 100% as it tries to read the entire history of the table. Because the database has finite resources for Input/Output, Memory, and CPU, it can no longer process incoming checkout requests. Customer checkout pages freeze, and the entire site effectively goes down 😱. The business loses millions of dollars in a matter of minutes. This exact scenario reveals the fundamental conflict that the entire field of data engineering was born to solve.
Business Databases Are Sprinters, Not Marathon Runners
At the heart of the Black Friday crash is a fundamental mismatch between two different types of work.
“Operational databases”, the kind that run businesses, are optimized for one thing: speed on small tasks. They are “sprinters”.
These databases excel at “Index Seeks”—the ability to find one specific record almost instantly, like locating a single customer’s order out of millions. They are built to handle thousands of these quick, targeted operations every minute.
"Analytical queries", however, are marathon runners. To calculate a result like "Total Revenue by Region," the query must perform a "full table scan," meaning it has to read every single row in the database table. This single, long-running task can consume 100% of the hard drive's read/write bandwidth and spike the CPU 📈. When the marathon runner is on the track, there is no room left for the sprinters. The database becomes completely starved of the resources it needs to process customer checkouts, grinding the business to a halt. 🛑
A Single “Read” Can Freeze Your Entire Business
It seems counter-intuitive that a query designed only to read data could stop a business from writing new data. The culprit is a core database mechanism called locking. Ordinary transactions lock only the individual rows they touch, which is why thousands of checkouts can run side by side without colliding.
The analyst's query, however, did something far more drastic. To guarantee an accurate count, it placed a table-level “Read Lock” on the entire orders table. (Exactly when a database escalates to a lock like this depends on the engine and its isolation settings, but the effect is the same.) The lock ensures the data doesn't change while it's being counted: other users can still read the table, but no one can write to it.
This is precisely what caused the Black Friday site to crash 💥. The analyst's report locked the table, blocking all incoming write transactions—including every customer trying to "process payment." As long as the report was running, no new sales could be completed. 🛑
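A toy analogy in plain Python shows why one long lock-holder stalls many short writers. There is no real database here, just a threading.Lock standing in for the table-level lock and a sleep standing in for the full-table scan:

import threading
import time

table_lock = threading.Lock()  # stand-in for the table-level lock

def run_report():
    with table_lock:   # the report grabs the whole "table"...
        time.sleep(2)  # ...and holds it for the entire scan

def checkout(n):
    started = time.time()
    with table_lock:  # every checkout needs the same lock to write
        pass          # the actual write would happen here
    print(f"checkout {n} waited {time.time() - started:.1f}s")

threading.Thread(target=run_report).start()
time.sleep(0.1)  # let the report acquire the lock first
for i in range(3):
    threading.Thread(target=checkout, args=(i,)).start()

Every checkout waits roughly the full two seconds. Scale that sleep up to a multi-minute report and the timeouts on the checkout page write themselves.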
So, do you see the problem here?
We need a way to separate these two workloads while ensuring data integrity and consistency. This is why we need Data Engineers.
Data Isn’t Stored in a Spreadsheet (And That Matters)
To understand why that analytical query was so inefficient, we have to challenge our mental model of how data is stored. We often visualize a database table as a 2D grid, but a physical hard drive is a “linear sequence” of bytes. Hard drives don’t read “rows” or “columns”; they read “blocks” of data, usually in 4KB or 8KB chunks.
The operational database in our story uses “Row-Oriented Storage.” This means that all the data for a single record is stored together in one continuous block. For a table of customer orders with ID, Item, and Sales columns, the physical data on the drive would look something like this:
[1, Apple, $100], [2, Banana, $50], [3, Cherry, $20]
To sum just the sales figures, the database is forced to read through all the irrelevant data (ID, Item) for every single record. Worse, to get the $100 sales figure, it might have to load an entire 4KB block of data off the disk into memory, even though it only needed a few bytes from that block. This is incredibly slow and wasteful.
This row-oriented structure is precisely what makes the database a sprinter—all the information for a single transaction (like a customer’s specific order) is physically grouped together, making individual record retrieval incredibly fast. But for our marathon-running analyst, it’s a disaster.
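Here is the same idea as a toy Python model, with a list of tuples standing in for the blocks on disk:

# Row-oriented: each record's fields live together, just as they
# would inside a 4KB block on disk.
row_store = [
    (1, "Apple", 100),
    (2, "Banana", 50),
    (3, "Cherry", 20),
]

# Summing sales must walk every full record, dragging the unused
# ID and Item fields through memory along the way.
total_sales = sum(record[2] for record in row_store)
print(total_sales)  # 170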
The Simple Fix: Turning the Table Sideways
The solution to this problem is elegant and transformative: store the data in “columns” instead of “rows”. This method is known as “Column-Oriented Storage.”
Instead of grouping all the information for a single order together, this approach groups all the values from a single column together. All the order IDs are in one block, all the item names are in another, and—most importantly—all the sales figures are in their own consolidated block.
This completely changes the game for analytics. When the CEO asks for total sales, the database can now ignore the ID and Item data entirely. It goes straight to the single, compressed block of Sales data and adds it up. The query becomes lightning fast and uses a fraction of the resources.
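Here is the same toy model, turned sideways:

# Column-oriented: each column is one contiguous block.
column_store = {
    "id": [1, 2, 3],
    "item": ["Apple", "Banana", "Cherry"],
    "sales": [100, 50, 20],
}

# The sum touches only the sales block; IDs and items are never read.
total_sales = sum(column_store["sales"])
print(total_sales)  # 170

Real columnar engines layer compression and vectorized execution on top of this layout, but the core win is exactly what you see here: the query reads only the columns it needs.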
This is the core job of a Data Engineer. They set up an automated system (sketched in code below) to:
✅ Extract the data from the company’s row-oriented operational database every night
✅ Transform it into a column-oriented format
✅ Load it into a separate, specialized Data Warehouse (like Snowflake or BigQuery).
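Here is a deliberately simplified sketch of such a nightly job. The connection string, table, and file names are all invented, and real pipelines add scheduling, retries, and warehouse-specific loaders on top:

import pandas as pd
import sqlalchemy

# Extract: pull yesterday's orders from the row-oriented production database
engine = sqlalchemy.create_engine("postgresql://readonly@prod-db/shop")
orders = pd.read_sql(
    "SELECT order_id, item, price, region FROM orders "
    "WHERE created_at >= CURRENT_DATE - 1",
    engine,
)

# Transform: reshape for analytics and write to Parquet,
# a column-oriented file format
revenue = orders.groupby("region", as_index=False)["price"].sum()
revenue.to_parquet("revenue_by_region.parquet")

# Load: copy the Parquet file into the warehouse; the exact mechanism
# is warehouse-specific (e.g. Snowflake's COPY INTO, BigQuery load jobs)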
This creates a safe, optimized environment where analysts can run massive reports without any risk of crashing the store. It sounds simple, but trust me: this is not a copy-and-paste problem.
A Necessary Separation
The Black Friday disaster wasn’t the fault of the CEO or the analyst; it was the result of a technical conflict between two essential but incompatible workloads. The incident reveals the critical need to separate the systems that run the business from the systems that analyze the business. Protecting daily operations while enabling powerful analytics is the foundational problem that the entire field of data engineering was created to solve.
So now we know why Data Engineers exist, but how were these complex data problems handled before this role was created?
We’ll dive into that in the next post.