In digital healthcare data platforms, data quality is no longer a nice-to-have; it is a hard requirement. Business decisions, regulatory reporting, machine learning models, and executive dashboards all depend on one thing: trustworthy data.

Yet, many data engineering teams still treat data quality as an afterthought, validating data only after it has already propagated downstream.

Databricks introduced a powerful shift in this mindset through Declarative Pipelines using Delta Live Tables (DLT).

Instead of writing complex validation logic manually, engineers can now declare what good data looks like and let the platform enforce, monitor, and govern it automatically.

This blog explores how declarative data quality works in Databricks, why it matters, and how to design production-grade pipelines using this approach.

The Traditional Problem with Data Quality

In traditional healthcare ETL pipelines, data quality is usually handled with manual validation code: ad-hoc checks written into individual jobs and run only after the data has already landed.

While this approach may work initially, it quickly breaks down at scale: validation logic is duplicated across jobs, hard to audit, and expensive to maintain.

Most importantly, bad data often reaches downstream systems silently, where the impact is far more expensive.

Declarative pipelines solve this problem by making data quality a first-class citizen of the pipeline itself.

What Is Declarative Data Quality?

Declarative data quality means defining rules and expectations, not procedural logic.

Instead of saying:

Check if the amount is positive and then drop the record.

You say:

The amount must always be greater than zero.

In Databricks, this is implemented using Delta Live Tables (DLT) Expectations.

Expectations allow you to attach data quality rules directly to tables, making the pipeline self-documenting, observable, and easier to govern.
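As a rough sketch of the difference (the table and DataFrame names here are illustrative, not from any specific pipeline):

import dlt

# Procedural approach: the rule is buried inside transformation code, and
# nothing records what was removed or why
clean_df = spark.table("raw_sales").filter("amount > 0")

# Declarative approach: the same rule is attached to the table definition,
# and the platform enforces it and reports pass/fail counts automatically
@dlt.table
@dlt.expect_or_drop("amount_positive", "amount > 0")
def sales_clean():
    return dlt.read("raw_sales")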

Delta Live Tables and Expectations

Delta Live Tables provide a declarative framework to build batch and streaming pipelines. Data quality is enforced using Expectations, which are evaluated automatically during pipeline execution.

DLT supports three expectation behaviors:

1. Expect (Monitor Only)

This mode tracks data quality issues but allows all records to pass.

Use cases: profiling a new source, tracking known issues on non-critical fields, or establishing a quality baseline before stricter enforcement.

@dlt.expect("valid_date", "order_date IS NOT NULL")

2. Expect or Drop

Records that violate the rule are automatically removed from the dataset.

Use cases: removing records that would corrupt downstream tables, such as negative amounts or malformed keys.

@dlt.expect_or_drop("amount_positive", "amount > 0")

3. Expect or Fail

The pipeline fails immediately if the rule is violated.

Use cases: enforcing critical business invariants, such as mandatory identifiers, where any violation signals a serious upstream problem.

@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")

This clear separation allows teams to apply the right level of strictness at the right stage.
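As a sketch, all three behaviors can also be stacked on a single table definition (orders_validated and orders_raw are illustrative names):

import dlt

@dlt.table
@dlt.expect("valid_date", "order_date IS NOT NULL")              # monitor only
@dlt.expect_or_drop("amount_positive", "amount > 0")             # drop violations
@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")  # stop the update
def orders_validated():
    return dlt.read("orders_raw")

Stacking them this way keeps the strictness of every rule visible in one place, right next to the table it protects.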

Data Quality in the Medallion Architecture

Declarative data quality works best when combined with the Bronze–Silver–Gold (Medallion) Architecture.

Bronze Layer – Raw Data

The Bronze layer focuses on ingestion reliability, not correctness.

Declarative expectations are usually avoided here, except for basic technical checks.
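As a hedged sketch of what such a technical check might look like (this assumes JSON files landing in a cloud path and ingestion with Auto Loader, neither of which is prescribed by this post; _rescued_data is the column Auto Loader uses for data that does not fit the inferred schema):

import dlt

@dlt.table
@dlt.expect("parsed_cleanly", "_rescued_data IS NULL")  # monitor only, nothing is dropped
def bronze_sales():
    # Ingest the files as-is; correctness is handled later, in the Silver layer
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/sales/")  # hypothetical landing path
    )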

Silver Layer – Validated and Cleaned Data

The Silver layer is where most data quality rules live.

Typical rules include non-null business keys, positive amounts, valid dates, and basic format checks.

Example:

import dlt

# Silver table: drop rows with a non-positive amount, and record (but keep)
# rows with a missing customer_id
@dlt.table
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect("customer_present", "customer_id IS NOT NULL")
def silver_sales():
    return dlt.read("bronze_sales")

This ensures only trusted data flows forward, while still maintaining visibility into quality issues.

Gold Layer – Business-Ready Data

The Gold layer serves analytics, reporting, and machine learning.

Here, expectations are strict: the rules that protect business correctness, such as mandatory keys and valid aggregation inputs, are enforced rather than merely monitored.

Fail-fast expectations are commonly used to protect consumers.
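As a sketch (gold_customer_spend is an illustrative name; expect_all_or_fail is the dictionary form of expect_or_fail), a Gold table might bundle its fail-fast rules like this:

import dlt
from pyspark.sql.functions import sum as sum_

# Every rule must hold, or the update stops before anything is published
gold_rules = {
    "customer_present": "customer_id IS NOT NULL",
    "spend_positive": "total_spend > 0",
}

@dlt.table
@dlt.expect_all_or_fail(gold_rules)
def gold_customer_spend():
    return (
        dlt.read("silver_sales")
            .groupBy("customer_id")
            .agg(sum_("amount").alias("total_spend"))
    )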

Built-In Observability and Metrics

One of the biggest advantages of declarative data quality in Databricks is automatic observability.

For every expectation, Databricks captures how many records passed and how many failed each rule, per expectation and per dataset.

These metrics are available through the pipeline UI and the DLT event log, which can be queried like any other dataset.

This eliminates the need for custom monitoring frameworks and significantly improves auditability.
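As a rough sketch of how these metrics can be pulled out programmatically (the storage path below is a placeholder for a pipeline's configured storage location; pipelines published to Unity Catalog expose the same data through the event_log() table-valued function instead):

from pyspark.sql.functions import col, get_json_object

# The event log is itself a Delta table; expectation results sit inside the
# JSON 'details' column of flow_progress events
events = spark.read.format("delta").load("/pipelines/<pipeline-id>/system/events")  # placeholder path

expectation_metrics = (
    events
        .filter(col("event_type") == "flow_progress")
        .select(
            "timestamp",
            get_json_object("details", "$.flow_progress.data_quality.expectations")
                .alias("expectations_json"),
        )
        .filter(col("expectations_json").isNotNull())
)
expectation_metrics.show(truncate=False)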

Quarantine Pattern: Don’t Lose Bad Data

Dropping bad records is not always enough. In regulated or enterprise environments, teams often need to retain invalid data for analysis and reprocessing.

A common pattern is to write failed records to a quarantine table:

@dlt.table
def quarantine_sales():
    return dlt.read("bronze_sales") \
        .filter("amount <= 0 OR customer_id IS NULL")

Benefits of this approach: no data is lost, invalid records remain available for root-cause analysis and reprocessing, and auditors can see exactly what was rejected and why.
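One refinement of this pattern is to drive both the clean table and the quarantine table from a single rules dictionary, so the two can never drift apart. A sketch (silver_sales_valid is an illustrative name; dlt.expect_all_or_drop is the dictionary form of expect_or_drop):

import dlt

rules = {
    "valid_amount": "amount > 0",
    "customer_present": "customer_id IS NOT NULL",
}
# True only when every rule passes (add explicit IS NOT NULL rules if NULL
# values in other columns must also land in quarantine)
all_rules_pass = " AND ".join(f"({expr})" for expr in rules.values())

@dlt.table
@dlt.expect_all_or_drop(rules)
def silver_sales_valid():
    return dlt.read("bronze_sales")

@dlt.table
def silver_sales_quarantine():
    # Keep everything the valid table dropped, for analysis and reprocessing
    return dlt.read("bronze_sales").filter(f"NOT ({all_rules_pass})")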

Why Declarative Data Quality Scales Better

| Traditional ETL | Declarative Pipelines |
| --- | --- |
| Manual validation code | Built-in expectations |
| Hard to audit | Automatic metrics |
| Complex error handling | Clear rule enforcement |
| Reactive fixes | Preventive design |

Declarative pipelines reduce code complexity while increasing reliability — a rare but valuable combination.

Common Mistakes to Avoid

  1. Applying strict rules in the bronze layer
  2. Using expect_or_fail everywhere
  3. Ignoring quarantine tables
  4. Treating data quality as a one-time setup

Declarative quality works best when rules evolve with the data and business requirements.

Sample Data and Expected Output

To make declarative data quality more concrete, let’s walk through a simple end-to-end example using sample data and see how expectations affect the output at each layer.

Sample Input Data (Bronze Layer)

Assume this is raw sales data ingested from a source system into the Bronze table.

| order_id | customer_id | amount | order_date |
| --- | --- | --- | --- |
| 101 | C001 | 250 | 2024-11-01 |
| 102 | C002 | -50 | 2024-11-01 |
| 103 | NULL | 120 | 2024-11-02 |
| 104 | C003 | 0 | 2024-11-02 |
| NULL | C004 | 300 | 2024-11-03 |

At this stage, no quality rules have been applied. The raw data already contains a negative amount (order 102), a zero amount (order 104), a missing customer_id (order 103), and a missing order_id (the last row).
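If you want to reproduce the walk-through end to end, the Bronze table can be stubbed out with exactly these five rows (a sketch; in a real pipeline this data would arrive from file or stream ingestion rather than an inline DataFrame):

import dlt

@dlt.table
def bronze_sales():
    # The five sample rows from the table above, invalid values included
    return spark.createDataFrame(
        [
            (101, "C001", 250, "2024-11-01"),
            (102, "C002", -50, "2024-11-01"),
            (103, None, 120, "2024-11-02"),
            (104, "C003", 0, "2024-11-02"),
            (None, "C004", 300, "2024-11-03"),
        ],
        ["order_id", "customer_id", "amount", "order_date"],
    )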

Data Quality Rules Applied (Silver Layer)

In the Silver layer, we apply declarative expectations:

@dlt.table
@dlt.expect_or_drop("amount_positive", "amount > 0")
@dlt.expect("customer_not_null", "customer_id IS NOT NULL")
def silver_sales():
    return dlt.read("bronze_sales")

Silver Output Table

| order_id | customer_id | amount | quality_status |
| --- | --- | --- | --- |
| 101 | C001 | 250 | PASS |
| 103 | NULL | 120 | WARN |
| NULL | C004 | 300 | WARN |

(The quality_status column is shown here for illustration only; the expect decorator does not add a column to the table, it records warnings in the pipeline's quality metrics.)

Dropped Records: orders 102 and 104 were removed because they violated the amount_positive rule (amount > 0).

DLT automatically records how many rows were dropped and which rule caused it.

Quarantine Table Output

Instead of losing dropped data, we capture it in a quarantine table.

@dlt.table
def silver_sales_quarantine():
    return dlt.read("bronze_sales") \
        .filter("amount <= 0")

Quarantine Output

| order_id | customer_id | amount | order_date | Reason |
| --- | --- | --- | --- | --- |
| 102 | C002 | -50 | 2024-11-01 | Invalid amount |
| 104 | C003 | 0 | 2024-11-02 | Invalid amount |

This table is useful for root-cause analysis, reporting issues back to the source system, and reprocessing corrected records later.

Business Rules Applied (Gold Layer)

In the Gold layer, strict business rules are enforced:

●        order_id IS NOT NULL → expect_or_fail

import dlt
from pyspark.sql.functions import sum as sum_

# order_id is validated on its own table, because an expectation is evaluated
# against that table's rows and order_id disappears after the aggregation
@dlt.table
@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")
def gold_sales_validated():
    return dlt.read("silver_sales")

@dlt.table
def gold_sales():
    return dlt.read("gold_sales_validated") \
        .groupBy("customer_id") \
        .agg(sum_("amount").alias("total_spend"))

Gold Output Table

| customer_id | total_spend |
| --- | --- |
| C001 | 250 |

Pipeline Failure Triggered: the row with a NULL order_id violates the order_id_present rule, so the update fails rather than publishing questionable aggregates.

This protects downstream consumers by preventing incorrect aggregations.

What DLT Captures Automatically

For this example, Databricks automatically tracks how many records passed or failed each expectation, how many rows the Silver table dropped, and which rule caused the Gold update to fail.

All metrics are visible in the DLT UI and event logs, with zero custom code.

Final Thoughts

This simple example demonstrates the real power of declarative data quality: invalid rows are dropped or quarantined automatically, softer violations stay visible as warnings without blocking the flow, and critical violations stop the pipeline before they can mislead anyone.

Declarative pipelines ensure that every downstream dataset is built on explicit trust guarantees, making them ideal for production-grade data platforms.