Database performance is serious business, but why not have a little fun exploring its challenges and complexities? 😉 Here’s a rather fanciful story we presented in Chapter 1 of Database Performance at Scale, a free Open Access book.

The technical topics covered here are expanded on throughout the book. But this is the one and only time we talk about poor Patrick. Let his struggles bring you some valuable lessons, solace in your own database performance predicaments… and maybe a few chuckles as well.

***

After losing his job at a ~~FAANG~~ MAANG (MANGA?) company, Patrick decided to strike out on his own and founded a niche online store dedicated to trading his absolute favorite among headwear: green fedoras. Noticing that a certain NoSQL database was recently trending on the front page of Hacker News, Patrick picked it for his backend stack.

After some experimentation with the offering’s free tier, Patrick decided to sign a one-year contract with a major cloud provider to get a significant discount on its NoSQL database-as-a-service offering. With provisioned throughput capable of serving up to 1,000 customers every second, the technology stack was ready, and the store opened its virtual doors to customers. To Patrick’s disappointment, fewer than ten customers visited the site daily. Meanwhile, the shiny new database cluster kept running, fueled by a steady influx of money from his credit card and waiting for its potential to be harnessed.

Patrick’s Diary of Lessons Learned, Part I

The lessons started right away:

The First Spike

March 17th seemed like an extremely lucky day. Patrick was pleased to notice lots of new orders starting from the early morning. But as the number of active customers skyrocketed around noon, Patrick’s mood started to deteriorate. This was strictly correlated with the rate of calls he received from angry customers reporting their inability to proceed with their orders.

After a short brainstorming session with himself and a web search engine, Patrick realized, to his dismay, that he lacked any observability tools on his precious (and quite expensive) database cluster. Shortly after frantically setting up Grafana and browsing the metrics, Patrick saw that although the number of incoming requests kept growing, their success rate was capped at a certain level, way below today’s expected traffic.

“Provisioned throughput strikes again,” Patrick groaned to himself, while scrolling through thousands of “throughput exceeded” error messages that started appearing around 11am.

Patrick’s Diary of Lessons Learned, Part II

This is what Patrick learned:

The First Loss

Patrick hadn’t even managed to recover from the trauma of losing most of his potential income on the only day of the year with any demand for green fedoras when the letter came. It included an angry rant from a would-be customer who had successfully placed his order and paid for it (with a receipt from the payment processor as proof), but was now unable to see any details of his order, and was still waiting for the delivery!

Without further ado, Patrick browsed the database. To his astonishment, he didn’t find any trace of the order either. For completeness, Patrick also put his wishful thinking into practice by browsing the backup snapshot directory. It remained empty, as one of Patrick’s initial executive decisions was to save time and money by not scheduling any periodic backup procedures.

How did data loss happen to him, of all people? After studying the consistency model of his database of choice, Patrick realized that there’s a trade-off to make between consistency guarantees, performance, and availability. By configuring the queries, one can either demand linearizability at the cost of decreased throughput, or relax the consistency guarantees and increase performance accordingly. Higher throughput capabilities were a no-brainer for Patrick a few days ago, but as a result, customer data landed on a single server without any replicas distributed in the system. Once this server failed (which happens to hardware surprisingly often, especially at large scale), the data was gone.
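If you’re curious what those knobs look like in practice, here’s a minimal sketch, assuming a Cassandra-compatible database and the Python cassandra-driver (the store keyspace, orders table, and values are all hypothetical, invented for illustration). The idea is simply to keep three replicas of every order and require a quorum of them to acknowledge each write: slightly lower throughput, but a single failed server no longer erases the order.

```python
# A hypothetical sketch using the Python cassandra-driver.
# Keyspace, table, and data are made up; the point is the two knobs Patrick ignored:
# how many replicas hold each order, and how many must acknowledge a write.
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect()

# Keep three copies of every order instead of one.
# (SimpleStrategy keeps the sketch short; a real cluster would use NetworkTopologyStrategy.)
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS store
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS store.orders (
        order_id uuid PRIMARY KEY, customer text, item text
    )
""")

# QUORUM: at least 2 of the 3 replicas must confirm the write before we tell the
# customer their fedora is on its way. Slower than ONE, but a single dead server
# no longer takes the order with it.
insert_order = SimpleStatement(
    "INSERT INTO store.orders (order_id, customer, item) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert_order, (uuid.uuid4(), "angry-letter-writer", "green fedora"))
```

With a replication factor of 1, as in Patrick’s original setup, no consistency level can save you: there is only one copy to lose.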

Patrick’s Diary of Lessons Learned, Part III

Further lessons include:

The Spike Strikes Again

Months went by, and Patrick’s sleep schedule was even beginning to show signs of stabilization. With regular backups, a redesigned consistency model, and a calendar reminder for March 16th to scale up the cluster for the elevated traffic, he felt moderately safe.

If only he knew that a ten-second video of a cat dressed as a leprechaun had just gone viral in Malaysia… which, taking the time zone difference into account, happened around 2am Patrick’s time, ruining the aforementioned sleep stabilization efforts.

On the one hand, the observability suite did its job and set off a warning early, allowing for a rapid response. On the other hand, even though Patrick reacted in time, databases are seldom able to scale instantaneously, and his system of choice was no exception in that regard. The spike in concurrency was very high and concentrated, as thousands of Malaysian teenagers rushed to bulk-buy green hats in pursuit of ever-changing Internet trends. Patrick got to observe a real-life instantiation of Little’s Law, which he vaguely remembered from his university days. With a beautifully concise formula, L = λW, the law boils down to the fact that concurrency equals throughput times latency.

TIP: For those having trouble remembering the formula, think units. Concurrency is just a number, latency can be measured in seconds, while throughput is usually expressed in 1/s. Then, it stands to reason that in order for the units to match, concurrency should be obtained by multiplying latency (seconds) by throughput (1/s). You’re welcome!

Throughput depends on the hardware and naturally has its limits (e.g., you can’t expect an NVMe drive purchased in 2023 to serve data for you in terabytes per second, although we are crossing our fingers for this assumption to be invalidated in the near future!). Once that limit is hit, you can treat throughput as a constant in the formula. It’s then clear that as concurrency rises, so does latency. For the end users (Malaysian teenagers in this scenario), it means that the latency will eventually cross the magic barrier of average human perception, a few seconds. Once that happens, users get too frustrated and simply give up altogether, assuming that the system is broken beyond repair. It’s easy to find online articles quoting that “Amazon found that 100ms of latency costs them 1 percent in sales”; although it sounds overly simplified, it is also true enough.
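To put some (made-up) numbers on that, here’s a tiny back-of-the-envelope sketch in Python; the 1,000 requests per second ceiling is just an assumed saturation point echoing Patrick’s provisioned capacity, not a measurement of any real database.

```python
# Little's Law, L = lambda * W, rearranged for a saturated system: W = L / lambda.
# The throughput cap below is an assumption for illustration, not a real benchmark.

MAX_THROUGHPUT = 1_000.0  # requests per second the cluster can sustain at saturation


def expected_latency_s(concurrency: int, throughput: float = MAX_THROUGHPUT) -> float:
    """Average latency in seconds once throughput is maxed out."""
    return concurrency / throughput


if __name__ == "__main__":
    for concurrent_requests in (100, 1_000, 5_000, 20_000):
        print(f"{concurrent_requests:>6} in-flight requests -> "
              f"~{expected_latency_s(concurrent_requests):.1f}s per request")
```

At 5,000 concurrent hat shoppers, the rearranged formula already predicts around five seconds per request, right at the edge of the average customer’s patience.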

Patrick’s Diary of Lessons Learned, Part IV

The lessons continue:

Backup Strikes Back

After redesigning his project yet again to take expected and unexpected concurrency fluctuations into account, Patrick happily waited for his fedora business to finally become ramen profitable.

Unfortunately, the next March 17th didn’t go as smoothly as expected either. Patrick spent most of the day enjoying steady Grafana dashboards, which kept assuring him that the traffic was under control and the cluster was handling the customer load with a healthy safety margin. But then the dashboards stopped updating, kindly mentioning that the disks had become severely overutilized. This seemed completely out of place given the observed concurrency. While looking for the possible source of this anomaly, Patrick noticed, to his horror, that the scheduled backup procedure coincided with the annual peak load…

Patrick’s Diary of Lessons Learned, Part V

Concluding thoughts:


The end.


About Piotr Sarna

Piotr is a software engineer who is keen on open-source projects and the Rust and C++ languages. He previously developed an open-source distributed file system and had a brief adventure with the Linux kernel during an apprenticeship at Samsung Electronics. He's also a long-time contributor and maintainer of ScyllaDB, as well as libSQL. Piotr graduated from the University of Warsaw with an MSc in Computer Science. He is a co-author of the books "Database Performance at Scale" and "Writing for Developers: Blogs that Get Read."