In digital healthcare data platforms, data quality is no longer a nice-to-have; it is a hard requirement. Business decisions, regulatory reporting, machine learning models, and executive dashboards all depend on one thing: trustworthy data.

Yet, many data engineering teams still treat data quality as an afterthought, validating data only after it has already propagated downstream.

Databricks introduced a powerful shift in this mindset through Declarative Pipelines using Delta Live Tables (DLT).

Instead of writing complex validation logic manually, engineers can now declare what good data looks like and let the platform enforce, monitor, and govern it automatically.

This blog explores how declarative data quality works in Databricks, why it matters, and how to design production-grade pipelines using this approach.

The Traditional Problem with Data Quality

In traditional healthcare ETL pipelines, data quality is usually handled with manual validation code: ad-hoc checks written into individual jobs and run only after the data has already landed.

While this approach may work initially, it quickly breaks down at scale: validation logic is duplicated across jobs, hard to audit, and expensive to maintain.

Most importantly, bad data often reaches downstream systems silently, where the impact is far more expensive.

Declarative pipelines solve this problem by making data quality a first-class citizen of the pipeline itself.

What Is Declarative Data Quality?

Declarative data quality means defining rules and expectations, not procedural logic.

Instead of saying:

Check if the amount is positive and then drop the record.

You say:

The amount must always be greater than zero.

In Databricks, this is implemented using Delta Live Tables (DLT) Expectations.

Expectations allow you to attach data quality rules directly to tables, making the pipeline self-documenting, observable, and easier to govern.
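As a rough sketch of the difference (the table and DataFrame names here are illustrative, not from any specific pipeline):

import dlt

# Procedural approach: the rule is buried inside transformation code, and
# nothing records what was removed or why
clean_df = spark.table("raw_sales").filter("amount > 0")

# Declarative approach: the same rule is attached to the table definition,
# and the platform enforces it and reports pass/fail counts automatically
@dlt.table
@dlt.expect_or_drop("amount_positive", "amount > 0")
def sales_clean():
    return dlt.read("raw_sales")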

Delta Live Tables and Expectations

Delta Live Tables provide a declarative framework to build batch and streaming pipelines. Data quality is enforced using Expectations, which are evaluated automatically during pipeline execution.

DLT supports three expectation behaviors:

1. Expect (Monitor Only)

This mode tracks data quality issues but allows all records to pass.

Use cases: profiling a new source, tracking known issues on non-critical fields, or establishing a quality baseline before stricter enforcement.

@dlt.expect("valid_date", "order_date IS NOT NULL")

2. Expect or Drop

Records that violate the rule are automatically removed from the dataset.

Use cases: removing records that would corrupt downstream tables, such as negative amounts or malformed keys.

@dlt.expect_or_drop("amount_positive", "amount > 0")

3. Expect or Fail

The pipeline fails immediately if the rule is violated.

Use cases: enforcing critical business invariants, such as mandatory identifiers, where any violation signals a serious upstream problem.

@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")

This clear separation allows teams to apply the right level of strictness at the right stage.
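As a sketch, all three behaviors can also be stacked on a single table definition (orders_validated and orders_raw are illustrative names):

import dlt

@dlt.table
@dlt.expect("valid_date", "order_date IS NOT NULL")              # monitor only
@dlt.expect_or_drop("amount_positive", "amount > 0")             # drop violations
@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")  # stop the update
def orders_validated():
    return dlt.read("orders_raw")

Stacking them this way keeps the strictness of every rule visible in one place, right next to the table it protects.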

Data Quality in the Medallion Architecture

Declarative data quality works best when combined with the Bronze–Silver–Gold (Medallion) Architecture.

Bronze Layer – Raw Data

The Bronze layer focuses on ingestion reliability, not correctness.

Declarative expectations are usually avoided here, except for basic technical checks.
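As a hedged sketch of what such a technical check might look like (this assumes JSON files landing in a cloud path and ingestion with Auto Loader, neither of which is prescribed by this post; _rescued_data is the column Auto Loader uses for data that does not fit the inferred schema):

import dlt

@dlt.table
@dlt.expect("parsed_cleanly", "_rescued_data IS NULL")  # monitor only, nothing is dropped
def bronze_sales():
    # Ingest the files as-is; correctness is handled later, in the Silver layer
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/sales/")  # hypothetical landing path
    )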

Silver Layer – Validated and Cleaned Data

The Silver layer is where most data quality rules live.

Typical rules include non-null business keys, positive amounts, valid dates, and basic format checks.

Example:

import dlt

# Silver table: drop rows with a non-positive amount, and record (but keep)
# rows with a missing customer_id
@dlt.table
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect("customer_present", "customer_id IS NOT NULL")
def silver_sales():
    return dlt.read("bronze_sales")

This ensures only trusted data flows forward, while still maintaining visibility into quality issues.

Gold Layer – Business-Ready Data

The Gold layer serves analytics, reporting, and machine learning.

Here, expectations are strict: the rules that protect business correctness, such as mandatory keys and valid aggregation inputs, are enforced rather than merely monitored.

Fail-fast expectations are commonly used to protect consumers.
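As a sketch (gold_customer_spend is an illustrative name; expect_all_or_fail is the dictionary form of expect_or_fail), a Gold table might bundle its fail-fast rules like this:

import dlt
from pyspark.sql.functions import sum as sum_

# Every rule must hold, or the update stops before anything is published
gold_rules = {
    "customer_present": "customer_id IS NOT NULL",
    "spend_positive": "total_spend > 0",
}

@dlt.table
@dlt.expect_all_or_fail(gold_rules)
def gold_customer_spend():
    return (
        dlt.read("silver_sales")
            .groupBy("customer_id")
            .agg(sum_("amount").alias("total_spend"))
    )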

Built-In Observability and Metrics

One of the biggest advantages of declarative data quality in Databricks is automatic observability.

For every expectation, Databricks captures how many records passed and how many failed each rule, per expectation and per dataset.

These metrics are available through the pipeline UI and the DLT event log, which can be queried like any other dataset.

This eliminates the need for custom monitoring frameworks and significantly improves auditability.
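As a rough sketch of how these metrics can be pulled out programmatically (the storage path below is a placeholder for a pipeline's configured storage location; pipelines published to Unity Catalog expose the same data through the event_log() table-valued function instead):

from pyspark.sql.functions import col, get_json_object

# The event log is itself a Delta table; expectation results sit inside the
# JSON 'details' column of flow_progress events
events = spark.read.format("delta").load("/pipelines/<pipeline-id>/system/events")  # placeholder path

expectation_metrics = (
    events
        .filter(col("event_type") == "flow_progress")
        .select(
            "timestamp",
            get_json_object("details", "$.flow_progress.data_quality.expectations")
                .alias("expectations_json"),
        )
        .filter(col("expectations_json").isNotNull())
)
expectation_metrics.show(truncate=False)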

Quarantine Pattern: Don’t Lose Bad Data

Dropping bad records is not always enough. In regulated or enterprise environments, teams often need to retain invalid data for analysis and reprocessing.

A common pattern is to write failed records to a quarantine table:

@dlt.table
def quarantine_sales():
    return dlt.read("bronze_sales") \
        .filter("amount <= 0 OR customer_id IS NULL")

Benefits of this approach: no data is lost, invalid records remain available for root-cause analysis and reprocessing, and auditors can see exactly what was rejected and why.
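One refinement of this pattern is to drive both the clean table and the quarantine table from a single rules dictionary, so the two can never drift apart. A sketch (silver_sales_valid is an illustrative name; dlt.expect_all_or_drop is the dictionary form of expect_or_drop):

import dlt

rules = {
    "valid_amount": "amount > 0",
    "customer_present": "customer_id IS NOT NULL",
}
# True only when every rule passes (add explicit IS NOT NULL rules if NULL
# values in other columns must also land in quarantine)
all_rules_pass = " AND ".join(f"({expr})" for expr in rules.values())

@dlt.table
@dlt.expect_all_or_drop(rules)
def silver_sales_valid():
    return dlt.read("bronze_sales")

@dlt.table
def silver_sales_quarantine():
    # Keep everything the valid table dropped, for analysis and reprocessing
    return dlt.read("bronze_sales").filter(f"NOT ({all_rules_pass})")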

Why Declarative Data Quality Scales Better

| Traditional ETL | Declarative Pipelines |
| --- | --- |
| Manual validation code | Built-in expectations |
| Hard to audit | Automatic metrics |
| Complex error handling | Clear rule enforcement |
| Reactive fixes | Preventive design |

Declarative pipelines reduce code complexity while increasing reliability — a rare but valuable combination.

Common Mistakes to Avoid

  1. Applying strict rules in the bronze layer
  2. Using expect_or_fail everywhere
  3. Ignoring quarantine tables
  4. Treating data quality as a one-time setup

Declarative quality works best when rules evolve with the data and business requirements.

Sample Data and Expected Output

To make declarative data quality more concrete, let’s walk through a simple end-to-end example using sample data and see how expectations affect the output at each layer.

Sample Input Data (Bronze Layer)

Assume this is raw sales data ingested from a source system into the Bronze table.

| order_id | customer_id | amount | order_date |
| --- | --- | --- | --- |
| 101 | C001 | 250 | 2024-11-01 |
| 102 | C002 | -50 | 2024-11-01 |
| 103 | NULL | 120 | 2024-11-02 |
| 104 | C003 | 0 | 2024-11-02 |
| NULL | C004 | 300 | 2024-11-03 |

At this stage, no quality rules have been applied. The raw data already contains a negative amount (order 102), a zero amount (order 104), a missing customer_id (order 103), and a missing order_id (the last row).
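If you want to reproduce the walk-through end to end, the Bronze table can be stubbed out with exactly these five rows (a sketch; in a real pipeline this data would arrive from file or stream ingestion rather than an inline DataFrame):

import dlt

@dlt.table
def bronze_sales():
    # The five sample rows from the table above, invalid values included
    return spark.createDataFrame(
        [
            (101, "C001", 250, "2024-11-01"),
            (102, "C002", -50, "2024-11-01"),
            (103, None, 120, "2024-11-02"),
            (104, "C003", 0, "2024-11-02"),
            (None, "C004", 300, "2024-11-03"),
        ],
        ["order_id", "customer_id", "amount", "order_date"],
    )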

Data Quality Rules Applied (Silver Layer)

In the Silver layer, we apply declarative expectations:

@dlt.table
@dlt.expect_or_drop("amount_positive", "amount > 0")
@dlt.expect("customer_not_null", "customer_id IS NOT NULL")
def silver_sales():
    return dlt.read("bronze_sales")

Silver Output Table

| order_id | customer_id | amount | quality_status |
| --- | --- | --- | --- |
| 101 | C001 | 250 | PASS |
| 103 | NULL | 120 | WARN |
| NULL | C004 | 300 | WARN |

(The quality_status column is shown here for illustration only; the expect decorator does not add a column to the table, it records warnings in the pipeline's quality metrics.)

Dropped Records: orders 102 and 104 were removed because they violated the amount_positive rule (amount > 0).

DLT automatically records how many rows were dropped and which rule caused it.

Quarantine Table Output

Instead of losing dropped data, we capture it in a quarantine table.

@dlt.table
def silver_sales_quarantine():
    return dlt.read("bronze_sales") \
        .filter("amount <= 0")

Quarantine Output

| order_id | customer_id | amount | order_date | Reason |
| --- | --- | --- | --- | --- |
| 102 | C002 | -50 | 2024-11-01 | Invalid amount |
| 104 | C003 | 0 | 2024-11-02 | Invalid amount |

This table is useful for root-cause analysis, reporting issues back to the source system, and reprocessing corrected records later.

Business Rules Applied (Gold Layer)

In the Gold layer, strict business rules are enforced:

●        order_id IS NOT NULL → expect_or_fail

import dlt
from pyspark.sql.functions import sum as sum_

# order_id is validated on its own table, because an expectation is evaluated
# against that table's rows and order_id disappears after the aggregation
@dlt.table
@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")
def gold_sales_validated():
    return dlt.read("silver_sales")

@dlt.table
def gold_sales():
    return dlt.read("gold_sales_validated") \
        .groupBy("customer_id") \
        .agg(sum_("amount").alias("total_spend"))

Gold Output Table

| customer_id | total_spend |
| --- | --- |
| C001 | 250 |

Pipeline Failure Triggered: the row with a NULL order_id violates the order_id_present rule, so the update fails rather than publishing questionable aggregates.

This protects downstream consumers by preventing incorrect aggregations.

What DLT Captures Automatically

For this example, Databricks automatically tracks how many records passed or failed each expectation, how many rows the Silver table dropped, and which rule caused the Gold update to fail.

All metrics are visible in the DLT UI and event logs, with zero custom code.

Final Thoughts

This simple example demonstrates the real power of declarative data quality: invalid rows are dropped or quarantined automatically, softer violations stay visible as warnings without blocking the flow, and critical violations stop the pipeline before they can mislead anyone.

Declarative pipelines ensure that every downstream dataset is built on explicit trust guarantees, making them ideal for production-grade data platforms.