Introduction:

If you’re piping clickstream or product events into BigQuery, you’ve probably had this experience: the pipeline hums along, the dashboards stay green, and then the monthly bill shows you scanning terabytes every day just to load a sliver of new data.

In most cases, the problem isn’t BigQuery itself; it’s how we use it: full table scans, full refreshes, and schemas that fight the storage engine.

This post walks through a concrete playbook I’ve used to bring event‑style BigQuery pipelines back under control without touching SLAs.

The Setup: Classic Event Pipeline Anti‑Pattern

You have something like:

CREATE OR REPLACE TABLE mart.fact_events AS
SELECT ...
FROM raw.events;

It works. It’s also quietly scanning months or years of history every single day.

If your raw table is a few TB, you’re re‑processing those TBs daily just to ingest a tiny slice of new data. That’s where the BigQuery invoice starts hurting.

Let’s fix that by attacking three things:

  1. Table layout (partitioning + clustering)
  2. Incremental processing (instead of full refresh)
  3. Query hygiene for analytics users

Step 1: Let BigQuery Skip Old Data

Partition by event time

Event workloads are almost always time‑based. Start there:

CREATE TABLE raw.events_partitioned
PARTITION BY DATE(event_ts) AS
SELECT * FROM raw.events;

Now, downstream queries like:

SELECT ...
FROM raw.events_partitioned
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);

only touch roughly the last seven daily partitions instead of your entire history.
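If you want a guardrail baked into the table itself, BigQuery can require a partition filter (so a query with no event_ts predicate fails instead of scanning everything) and can expire old partitions automatically. A minimal sketch; the 365‑day expiration is only an example, use your actual retention policy:

ALTER TABLE raw.events_partitioned
SET OPTIONS (
  require_partition_filter = TRUE,    -- queries must filter on the partitioning column
  partition_expiration_days = 365     -- optional: drop partitions older than ~a year
);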

Cluster by your hottest key

Within each day, most event queries will still filter or aggregate by something like user_id, session_id, or account_id.

Cluster on that:

CREATE TABLE raw.events_opt
PARTITION BY DATE(event_ts)
CLUSTER BY user_id AS
SELECT * FROM raw.events;

Together, partitioning + clustering let BigQuery prune huge portions of the table before it even starts scanning, which directly translates into lower cost and better latency.
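For example, a query that filters on both the partition column and the clustering key gives the engine the most to prune (the user ID below is just a placeholder):

SELECT
  event_type,
  COUNT(*) AS events
FROM raw.events_opt
WHERE
  event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)  -- partition pruning
  AND user_id = 'u_123'                                           -- cluster pruning (placeholder ID)
GROUP BY event_type;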

Step 2: Switch fact_events to Incremental

The real win is to stop rebuilding mart.fact_events from scratch.

Instead of:

-- Full rebuild (expensive)
CREATE OR REPLACE TABLE mart.fact_events AS
SELECT ...
FROM raw.events_partitioned;

do:

DECLARE cutoff_ts TIMESTAMP DEFAULT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);

-- Only process yesterday + today
CREATE TEMP TABLE new_data AS
SELECT ...
FROM raw.events_opt
WHERE event_ts >= cutoff_ts;

MERGE mart.fact_events t
USING new_data s
ON t.event_id = s.event_id
WHEN MATCHED THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT (...columns...) VALUES (...columns...);

Key ideas:

  1. Only read the source partitions that can contain new or late‑arriving events (the cutoff filter against the partitioned table).
  2. MERGE on a stable key (event_id) so reruns are idempotent: reprocessing an overlapping window updates existing rows instead of duplicating them.
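For concreteness, here’s the same step with a hypothetical column set of event_id, user_id, event_type, page, and event_ts; swap in your real schema, and widen the interval if events can arrive late (the MERGE keeps reruns idempotent either way):

DECLARE cutoff_ts TIMESTAMP DEFAULT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);

MERGE mart.fact_events t
USING (
  -- Only the newest partitions of the source are scanned
  SELECT event_id, user_id, event_type, page, event_ts
  FROM raw.events_opt
  WHERE event_ts >= cutoff_ts
) s
ON t.event_id = s.event_id
WHEN MATCHED THEN
  UPDATE SET
    user_id = s.user_id,
    event_type = s.event_type,
    page = s.page,
    event_ts = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (event_id, user_id, event_type, page, event_ts)
  VALUES (s.event_id, s.user_id, s.event_type, s.page, s.event_ts);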

If you’re using dbt, you can codify this pattern in a handful of lines:

-- fact_events.sql
{{ config(materialized='incremental', unique_key='event_id') }}

SELECT
  ...
FROM {{ ref('events_opt') }}
{% if is_incremental() %}
  WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
{% endif %}

Under the hood, dbt will generate the MERGE logic for you.
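If you also want dbt to create the target table partitioned and clustered to match Step 1, the config block can carry that as well; here’s a sketch using dbt‑bigquery’s partition_by and cluster_by options (the field and key choices are assumptions, mirror your own model):

-- fact_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    partition_by={'field': 'event_ts', 'data_type': 'timestamp', 'granularity': 'day'},
    cluster_by=['user_id']
) }}

And when you genuinely need to rebuild history (schema change, backfilled bug fix), dbt run --full-refresh gives you the expensive full rebuild as a deliberate one‑off rather than the default.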

In practice, this one change (full refresh → incremental) usually cuts BigQuery compute for that model by 50–80%, and often reduces end‑to‑end latency.

Step 3: Make Queries Less Wasteful By Default

Now that the heavy lifting is cheaper, keep your analytics users from undoing the savings.

Don’t SELECT * from huge tables

-- Avoid this on fact tables
SELECT * FROM mart.fact_events
WHERE event_ts >= '2025-07-01';

-- Do this instead
SELECT
  event_ts,
  user_id,
  event_type,
  page
FROM mart.fact_events
WHERE event_ts >= '2025-07-01';

BigQuery stores data by column and bills for every column you read, so SELECT * scans every column’s bytes even when the query only uses a few.
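When someone really does want “almost everything”, SELECT * EXCEPT at least lets them skip the widest columns (the excluded column names here are hypothetical):

-- Read everything except the heaviest columns (column names are examples)
SELECT * EXCEPT (raw_payload, user_agent)
FROM mart.fact_events
WHERE event_ts >= '2025-07-01';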

Filter early

Encourage patterns where filters are applied as close to the source as possible:

WITH base AS (
  SELECT *
  FROM mart.fact_events
  WHERE event_ts >= '2025-07-01'
)
SELECT ...
FROM base
JOIN dim.users USING (user_id);

The optimizer helps, but simple, explicit filtering habits go a long way.

Pre‑aggregate hot metrics

If every dashboard aggregates fact_events the same way (e.g., daily active users, events per user), materialize those metrics once:

CREATE OR REPLACE TABLE mart.daily_user_events AS
SELECT
  DATE(event_ts) AS event_date,
  user_id,
  COUNT(*) AS events
FROM mart.fact_events
GROUP BY event_date, user_id;

Dashboards and ad‑hoc queries then hit a much smaller table, which is both cheaper and faster.
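The rollup itself doesn’t need a daily full rebuild either. One approach is to replace only the most recent slice, which stays cheap as long as fact_events is partitioned by event time (the two‑day window below is an assumption, match it to how late your data arrives):

DECLARE start_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY);

-- Swap out only the recent days of the rollup
DELETE FROM mart.daily_user_events
WHERE event_date >= start_date;

INSERT INTO mart.daily_user_events (event_date, user_id, events)
SELECT
  DATE(event_ts) AS event_date,
  user_id,
  COUNT(*) AS events
FROM mart.fact_events
WHERE event_ts >= TIMESTAMP(start_date)
GROUP BY event_date, user_id;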

Step 4: Use INFORMATION_SCHEMA as Your Feedback Loop

Finally, close the loop by watching what actually happens in production:

-- Find the heaviest event-related queries in the last 7 days
SELECT
  user_email,
  query,
  total_bytes_processed,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND query LIKE '%fact_events%'
ORDER BY total_bytes_processed DESC
LIMIT 20;

When something spikes, you can see exactly who ran it, what the query looked like, and how many bytes it processed; the fix is usually a missing partition filter, an unnecessary SELECT *, or a dashboard that should be hitting a pre‑aggregated table instead.

This gives you a concrete, data‑driven way to keep your BigQuery costs in check as usage changes.
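The same view also answers “who is spending what”. Rolling up bytes billed per user gives a rough dollar figure; the $6.25/TiB below is the on‑demand list price at the time of writing, so substitute your own rate (or drop the conversion) if you’re on a different pricing model:

-- Approximate on-demand spend per user over the last 30 days
SELECT
  user_email,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
  SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_usd  -- assumes ~$6.25 per TiB on-demand
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY tib_billed DESC;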

Closing Thoughts

You don’t need to rip out BigQuery or throttle your analysts to get costs under control on an event pipeline. You need to:

  1. Lay tables out for how they’re queried (partition by event time, cluster on the hottest key).
  2. Process new data incrementally instead of rebuilding history every day.
  3. Keep everyday queries lean and watch INFORMATION_SCHEMA for regressions.

Do that, and BigQuery goes back to being what it should be: a fast, boring, predictable part of your stack, not a monthly surprise from your cloud provider.