When you're the support engineer, the analyst, and the firefighter all in one person.
The Context: Big Data, Real Clients, Real Problems
I work on a data platform team that supports external clients. Our stack includes Snowflake and BigQuery, and the product exposes millions of data points daily: customer metrics, store-level analytics, historical trends, and more.
So, when someone asked, “Why is my dashboard empty?”, it wasn’t just a UI bug. Something was broken deep in the data pipeline.
Spoiler: entire chunks of data never made it into our platform.
Step 1: Investigating the Black Hole
The first clue came from a product delivery team: some accounts had incomplete datasets. We quickly traced the flow:
- Raw data entered MySQL
- It was meant to move to Google Cloud Storage (GCS)
- But didn’t make it into Snowflake
No alarms. No alerts. Just... silent failure.
The parser in the GCS → Snowflake handoff was silently skipping records, due to a malformed schema and a lack of validation.
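I can’t reproduce the production parser here, so this is a hypothetical reconstruction of the anti-pattern (the newline-delimited JSON format and the function names are assumptions, not our real code):

import json

def parse_record(line: str) -> dict:
    # Stand-in for the real parser; assume newline-delimited JSON
    return json.loads(line)

def load_records(lines):
    """The anti-pattern: a broad except that silently drops bad records."""
    records = []
    for line in lines:
        try:
            records.append(parse_record(line))
        except Exception:
            continue  # record dropped, nothing logged, no one alerted
    return records

def load_records_validated(lines):
    """The minimal fix: count failures and refuse to pass silently."""
    records, failures = [], 0
    for line in lines:
        try:
            records.append(parse_record(line))
        except Exception:
            failures += 1
    if failures:
        raise RuntimeError(f"{failures} of {len(lines)} records failed to parse")
    return records

The difference is one decision: whether a parse failure is allowed to pass silently.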
Step 2: Build a Monitoring System (In a Week)
I decided to build a simple but effective monitoring pipeline using Python and Snowflake. There was no waiting for a long-term product fix; this had to work now.
Step 2.1: Connect to Snowflake via Python
# Connect to Snowflake (all credentials here are placeholders)
import snowflake.connector
import pandas as pd

conn = snowflake.connector.connect(
    user='user',
    password='password',
    account='account_id',
    warehouse='warehouse',
    database='db',
    schema='schema'
)
# Example freshness query: the latest event date per customer and module
query = """
SELECT customer_id, module, MAX(event_date) AS last_date
FROM analytics_table
GROUP BY customer_id, module
"""
df = pd.read_sql(query, conn)
df.columns = df.columns.str.lower()  # Snowflake returns unquoted column names in uppercase
Step 2.2: Detect Delays with Pandas
df["lag_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_date"])).dt.days
df["status"] = df["lag_days"].apply(lambda x: "valid" if x <= 1 else "delayed")
Now I had a clear view of which modules were fresh and which were weeks behind.
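A quick aggregate made that view concrete. A small sketch on top of the df built above (column names and the "delayed" label match the earlier code):

# Share of delayed modules per customer, worst first
summary = (
    df.groupby("customer_id")["status"]
      .apply(lambda s: (s == "delayed").mean())
      .rename("pct_delayed")
      .sort_values(ascending=False)
)
print(summary.head(10))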
Step 2.3: Store the Results in a New Monitoring Table
from sqlalchemy import create_engine

# Requires the snowflake-sqlalchemy package; connection values are placeholders
engine = create_engine(
    'snowflake://user:password@account/db/schema?warehouse=warehouse'
)
df.to_sql("monitoring_status", con=engine, if_exists="replace", index=False)
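To keep the table current, the script has to run on a schedule. A minimal sketch that reuses the query, connection, and engine from above (the hourly cadence is an assumption; cron or Airflow would do the same job):

import time

def refresh_monitoring():
    # Re-run the freshness query and overwrite the monitoring table
    df = pd.read_sql(query, conn)
    df.columns = df.columns.str.lower()
    df["lag_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_date"])).dt.days
    df["status"] = df["lag_days"].apply(lambda x: "valid" if x <= 1 else "delayed")
    df.to_sql("monitoring_status", con=engine, if_exists="replace", index=False)

while True:
    refresh_monitoring()
    time.sleep(3600)  # hourly; pick whatever cadence your SLAs need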
Step 3: Show the Pain Visually
Using Looker Studio, I connected directly to the monitoring table.
I built a dashboard that showed:
- % of modules delayed per customer
- Time since last successful sync
- New issues as they appeared
This made the gap visible to leadership, and resources were prioritized to close it.
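Looker Studio can sit on a custom query as well as a raw table. A sketch of the kind of aggregate that could drive the first chart, using the monitoring table's columns from above:

SELECT
    customer_id,
    100.0 * SUM(CASE WHEN status = 'delayed' THEN 1 ELSE 0 END) / COUNT(*) AS pct_delayed,
    MAX(lag_days) AS worst_lag_days
FROM monitoring_status
GROUP BY customer_id
ORDER BY pct_delayed DESC;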
Step 4: Fix the Root Cause
Meanwhile, we fixed the broken parser and restarted all affected pipelines. We also retroactively reprocessed the lost data, and the monitoring table let us validate that the recovery was complete.
Outcomes
- Recovered 100% of lost data
- Created a real-time, always-on monitoring dashboard
- Reduced future blind spots by enabling automatic validation (see the alerting sketch after this list)
- Prevented client escalations before clients even noticed
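The “automatic” part can be as small as a check bolted onto the refresh. A sketch assuming a Slack incoming webhook (the URL is a placeholder, and Slack itself is an assumption; any chat or ticketing hook works the same way):

import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # placeholder, not a real hook

def alert_on_delays(df):
    # Fire a message whenever the latest refresh finds delayed modules
    delayed = df[df["status"] == "delayed"]
    if delayed.empty:
        return
    text = (f"{len(delayed)} delayed module(s) across "
            f"{delayed['customer_id'].nunique()} customer(s)")
    requests.post(WEBHOOK_URL, json={"text": text})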
Final Thoughts
Sometimes, all you need is a smart support engineer, a Python script, and a clear view of what’s wrong.
If your data platform has millions of moving parts, assume something is broken. Then go build something to catch it.