When you're the support engineer, the analyst, and the firefighter all in one person.
The Context: Big Data, Real Clients, Real Problems
I work on a data platform team that supports external clients. Our stack includes Snowflake and BigQuery, and the product exposes millions of data points daily: customer metrics, store-level analytics, historical trends, and more.
So, when someone asked, “Why is my dashboard empty?”, it wasn’t just a UI bug. Something was broken deep in the data pipeline.
Spoiler: entire chunks of data never made it into our platform.
Step 1: Investigating the Black Hole
The first clue came from a product delivery team: some accounts had incomplete datasets. We quickly traced the flow:
- Raw data entered MySQL
- It was meant to move to Google Cloud Storage (GCS)
- But didn’t make it into Snowflake
No alarms. No alerts. Just... silent failure.
The parser in the GCS → Snowflake handoff was silently skipping records, due to a malformed schema and a lack of validation.
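I can’t reproduce the production parser here, so this is a hypothetical reconstruction of the anti-pattern (the newline-delimited JSON format and the function names are assumptions, not our real code):

import json

def parse_record(line: str) -> dict:
    # Stand-in for the real parser; assume newline-delimited JSON
    return json.loads(line)

def load_records(lines):
    """The anti-pattern: a broad except that silently drops bad records."""
    records = []
    for line in lines:
        try:
            records.append(parse_record(line))
        except Exception:
            continue  # record dropped, nothing logged, no one alerted
    return records

def load_records_validated(lines):
    """The minimal fix: count failures and refuse to pass silently."""
    records, failures = [], 0
    for line in lines:
        try:
            records.append(parse_record(line))
        except Exception:
            failures += 1
    if failures:
        raise RuntimeError(f"{failures} of {len(lines)} records failed to parse")
    return records

The difference is one decision: whether a parse failure is allowed to pass silently.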
Step 2: Build a Monitoring System (In a Week)
I decided to build a simple but effective monitoring pipeline using Python and Snowflake. There was no waiting for a long-term product fix; this had to work now.
Step 2.1: Connect to Snowflake via Python
# Connect to Snowflake (all credentials here are placeholders)
import snowflake.connector
import pandas as pd

conn = snowflake.connector.connect(
    user='user',
    password='password',
    account='account_id',
    warehouse='warehouse',
    database='db',
    schema='schema'
)
# Example freshness query: the latest event date per customer and module
query = """
SELECT customer_id, module, MAX(event_date) AS last_date
FROM analytics_table
GROUP BY customer_id, module
"""
df = pd.read_sql(query, conn)
df.columns = df.columns.str.lower()  # Snowflake returns unquoted column names in uppercase
Step 2.2: Detect Delays with Pandas
df["lag_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_date"])).dt.days
df["status"] = df["lag_days"].apply(lambda x: "valid" if x <= 1 else "delayed")
Now I had a clear view of which modules were fresh and which were weeks behind.
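A quick aggregate made that view concrete. A small sketch on top of the df built above (column names and the "delayed" label match the earlier code):

# Share of delayed modules per customer, worst first
summary = (
    df.groupby("customer_id")["status"]
      .apply(lambda s: (s == "delayed").mean())
      .rename("pct_delayed")
      .sort_values(ascending=False)
)
print(summary.head(10))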
Step 2.3: Store the Results in a New Monitoring Table
from sqlalchemy import create_engine

# Requires the snowflake-sqlalchemy package; connection values are placeholders
engine = create_engine(
    'snowflake://user:password@account/db/schema?warehouse=warehouse'
)
df.to_sql("monitoring_status", con=engine, if_exists="replace", index=False)
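To keep the table current, the script has to run on a schedule. A minimal sketch that reuses the query, connection, and engine from above (the hourly cadence is an assumption; cron or Airflow would do the same job):

import time

def refresh_monitoring():
    # Re-run the freshness query and overwrite the monitoring table
    df = pd.read_sql(query, conn)
    df.columns = df.columns.str.lower()
    df["lag_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_date"])).dt.days
    df["status"] = df["lag_days"].apply(lambda x: "valid" if x <= 1 else "delayed")
    df.to_sql("monitoring_status", con=engine, if_exists="replace", index=False)

while True:
    refresh_monitoring()
    time.sleep(3600)  # hourly; pick whatever cadence your SLAs need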
Step 3: Show the Pain Visually
Using Looker Studio, I connected directly to the monitoring table.
I built a dashboard that showed:
- % of modules delayed per customer
- Time since last successful sync
- New issues as they appeared
This made the gap visible to leadership, and resources were prioritized to close it.
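Looker Studio can sit on a custom query as well as a raw table. A sketch of the kind of aggregate that could drive the first chart, using the monitoring table's columns from above:

SELECT
    customer_id,
    100.0 * SUM(CASE WHEN status = 'delayed' THEN 1 ELSE 0 END) / COUNT(*) AS pct_delayed,
    MAX(lag_days) AS worst_lag_days
FROM monitoring_status
GROUP BY customer_id
ORDER BY pct_delayed DESC;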
Step 4: Fix the Root Cause
Meanwhile, we fixed the broken parser and restarted all affected pipelines. We also retroactively reprocessed the lost data, and the monitoring table let us validate that the recovery was complete.
Outcomes
- Recovered 100% of lost data
- Created a real-time, always-on monitoring dashboard
- Reduced future blind spots by enabling automatic validation (see the alerting sketch after this list)
- Prevented client escalations before clients even noticed
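The “automatic” part can be as small as a check bolted onto the refresh. A sketch assuming a Slack incoming webhook (the URL is a placeholder, and Slack itself is an assumption; any chat or ticketing hook works the same way):

import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # placeholder, not a real hook

def alert_on_delays(df):
    # Fire a message whenever the latest refresh finds delayed modules
    delayed = df[df["status"] == "delayed"]
    if delayed.empty:
        return
    text = (f"{len(delayed)} delayed module(s) across "
            f"{delayed['customer_id'].nunique()} customer(s)")
    requests.post(WEBHOOK_URL, json={"text": text})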
Final Thoughts
Sometimes, all you need is a smart support engineer, a Python script, and a clear view of what’s wrong.
If your data platform has millions of moving parts, assume something is broken. Then go build something to catch it.