sia.hackernoon.com

It started as a simple problem: too many requests, not enough capacity. By March 2023, I stopped blaming the database and started asking different questions. The answer turned out to be distributed caches and some hashing math, but getting there burned us twice.

Growing Pressure

For years, our Hive metastore backed by Cassandra had been just good enough.

By early 2021, I started seeing about 1 M QPS at the API gateway and 2 M QPS hitting Cassandra(JanusGraph layer does some funky calls to the backend). Our Cassandra ran at 45% CPU, so I ignored the warning. Six months later, the load had tripled. Queries began timing out during the early morning rush when all daily jobs rushed to get the dataset's latest partition ready. I added nodes, bought a few months of stability, and kept moving.

Before Thanksgiving break, I observed an audit run, Cassandra hit 98% CPU and stopped responding for ~45 seconds. Most jobs timed out waiting. That's when I knew vertical scaling wasn't a solution anymore.

This pattern repeated through 2022: traffic doubled, I scaled hardware, costs climbed, and incidents became routine. Each time I thought I might have solved the scalability issue, then, out of the blue, during the holiday season, a new wave of jobs would push us back into the red. By early 2023, Cassandra was saturated, CPU over 95%, and 12% of jobs were failing even after a lot of vertical scaling. Adding more hardware only delayed the next fire.

That is when I stopped looking for bigger machines and started questioning the architecture itself and also understand the data and query pattern.

What I Found

Most of the traffic came from metadata requests, schema lookups, partition queries, and file locations, especially for tables that rarely change.

Our monitoring showed that the same ~500 tables out of more than 100k were responsible for about 90% of the queries. Tables like user_accounts, product_catalog, and device_inventory were being queried tens of thousands of times a day even though their metadata barely changed once a week.

Every Spark executor and Trino worker was asking Cassandra the same questions repeatedly. I didn’t need faster storage; I needed a smarter way to reuse answers.

Why Simple Caches Failed

Our first instinct was to put Redis in front of the metastore. It worked briefly, but Redis itself became the new bottleneck around 10 M QPS.

Sharding helped distribute the load but created its own coordination problems, deciding which Redis cluster held which tables, and handling cross-shard queries. We’d traded one point of failure for several smaller ones.

The problem wasn’t caching. It was centralized caching.

The Decentralize: Caching Inside the Metastore Layer

The breakthrough came when I moved caching into the metastore service itself.

Each metastore pod now maintains a local in-memory cache, but not for every table. Instead, rendezvous hashing determines which pods are responsible for which tables. Every pod can compute this mapping independently, with no coordination and no shared state.

When a request arrives:

The pod calculates which pods should own the cache for that table.
If it’s responsible, it serves from local memory or fetches once from Cassandra.
If not, it forwards the request to one of the responsible pods.

All pods run the same calculation and get the same answer. Math replaces orchestration.

Here's a simplified version of what that calculation looks like:

When a pod receives a metadata request for table_name:

pod_id = hash(table_name, pod_weights) % num_replicas responsible_pods = [hash_with_seed(i) for i in range(5)] # 5 replicas if my_pod_id in responsible_pods: return cache[table_name] or fetch_from_cassandra(table_name) else: forward_to_pod(responsible_pods[0]) // Every pod computes the same responsible_pods. No coordination needed.

Why It Scales

With 200 metastore pods and five replicas per table, traffic spreads evenly.
Each pod handles about 250 K QPS, with roughly 97% served from its own cache. That leaves only 3% of requests that need to hit Cassandra or another pod.
Across the cluster, Cassandra now sees about 3 K QPS instead of a million.

This change alone cut our database CPU from 95% to 8%, slashed job startup times from 45 s to 8 s, and saved us roughly $800 K a year in planned database expansion.

Keeping Caches Warm

At first, restarts were rough. New pods started with empty caches, causing a flood of misses that hammered Cassandra. I fixed this with a prewarming step:

A lightweight service tracks the top 1,000 most-queried tables.
When a pod starts, it computes which tables it’s responsible for and preloads them before taking traffic.

That two-minute prewarm brought startup hit rates to 97% and stabilized deployments.

Handling Schema Changes

Metadata isn’t static, so I built version tracking and Kafka-based invalidation. When a table changes, Cassandra increments a version and publishes an event. Each pod checks whether it owns that table’s cache and, if so, refreshes it.
Only the five pods responsible for that table refetch from Cassandra; everyone else ignores the message.
On average, caches update within two seconds of a schema change.

The Iceberg Tables Twist

Hive tables change rarely, but Iceberg tables create new snapshots constantly.
I split caching into two layers:

Schema metadata: cached for four hours or until schema changes.
Snapshot metadata: cached for ten minutes.

This kept snapshot freshness while cutting Cassandra's load from Iceberg queries by about 80%. When I add or remove pods, only a small percentage of cached tables remap to new cohorts - about 2.5% when adding a single pod.
This predictable churn means I can autoscale freely without mass invalidations or traffic spikes.

Weighting and Fairness

Later, I noticed uneven CPU usage between pods. Some handled far more traffic because our hardware wasn’t uniform some had 32 cores, others 64.
The fix was to include pod weights in the hashing function. Larger pods get proportionally more cache assignments, keeping usage balanced.

I also used weights to manage deployments: new versions start with low weight and ramp up gradually, while draining pods get weight = 0 to phase them out safely.

What I Learned Along the Way

Cold start floods: solved with staggered prewarming and rate limits.
Partition drift: partitions change often, so I give them shorter TTLs.
Memory leaks: enforcing size limits and TTLs kept caches stable.
Zombie pods: heartbeat checks remove dead nodes from routing.
Request amplification: the API gateway now routes directly to responsible pods to prevent double forwarding.

The Metric That Matters

I track plenty of metrics, but the one I watch most is Cassandra query rate.

Before: ~5 M QPS at peak.
After: ~3 K QPS.

That 1,600× reduction means our database team barely notices load anymore. Job startup times are down, failures have almost disappeared, and we’ve saved millions in compute.

What’s Next

The system is stable, but there’s room to evolve:

Sharing caches across compute clusters (Spark, Trino, Flink).
Predicting which tables a job will need and pre-fetching metadata.
Grouping related tables in the same cohorts.
Adjusting replica counts dynamically based on access patterns.

The Real Takeaway

I didn’t need new tech like some companies did, rewriting in golang or rust, I needed to stop fighting symptoms, instead understand the query pattern first.

Our real bottleneck was not Cassandra; it was treating metadata as if it changed every second. Once I recognized that, a lightweight distributed cache with HRW hashing was all it took.

The final implementation was about 800 lines of code added to our existing metastore service. No new infrastructure, no coordination layer, and no executive approval for vertical scaling.

That last line about 3 AM fires is no joke, our on-call engineers have actually debugged this system in the middle of the night, and the fact that it's understandable without a wall of infrastructure is what makes it work.

How I Scaled Our Metastore Cache to 1M+ QPS Using Rendezvous Hashing