Ensemble ML is everywhere - every blog post, every conference talk claims "combine multiple models for better results." But does it actually work in production?

I built a data quality monitoring system to find out. Three ML models (Isolation Forest, LSTM, Autoencoder) are working together. 332K synthetic orders processed over 25 days.

Here's what actually happened.

Why I Tested This

"Use ensemble methods" is the standard advice for ML in production. Combine multiple models, get better predictions, and reduce false positives.

Sounds great in theory. But I wanted to know whether it actually holds up: does accuracy improve, do false positives drop, and is the extra complexity worth it?

So I built it. Ran it continuously. Measured everything.

The Setup

Stack: Python, scikit-learn, SHAP, and SciPy, with LSTM and autoencoder models alongside Isolation Forest.

Data: Synthetic e-commerce orders with realistic quality issues injected.

Goal: Compare single model vs. ensemble. Which catches more real issues? Which has fewer false positives?

Baseline: Single Model (Isolation Forest)

Started with just Isolation Forest - the standard choice for anomaly detection:

from sklearn.ensemble import IsolationForest

# Train on 24 hours of quality metrics
historical_data = get_metrics(hours=24)
model = IsolationForest(contamination=0.1)  # expect ~10% of points to be anomalous
model.fit(historical_data)

# Predict: IsolationForest returns -1 for anomalies, 1 for normal points
is_anomaly = model.predict(current_metrics) == -1
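Since `get_metrics` and `current_metrics` above are helpers from the monitoring system, here's a fully self-contained sketch of the same pattern, with dummy data standing in for quality metrics:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Stand-in for 24 hours of quality metrics: completeness and validity scores
historical_data = rng.normal(loc=[95, 90], scale=2, size=(288, 2))

model = IsolationForest(contamination=0.1, random_state=0)
model.fit(historical_data)

normal_point = [[95, 90]]  # close to the training baseline
weird_point = [[60, 40]]   # far outside the training distribution

print(model.predict(normal_point))  # [1]  -> normal
print(model.predict(weird_point))   # [-1] -> anomaly
```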

Week 1 Results: 93.2% accuracy, 15% false positive rate.

The false positive rate was the killer: roughly one in every six or seven alerts was wrong. Teams started ignoring alerts.

Adding SHAP (Because Black Boxes Don't Work)

Before adding more models, I needed to understand why the single model flagged things:

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(current_metrics)

# Now each alert comes with an explanation, e.g.:
# "Primary driver: validity_score = 45.2 (30% below baseline)"

This helped debug false positives, but it didn't reduce them.

That's when I decided to try the ensemble approach everyone talks about.

The Ensemble: 3 Models

The theory: Different algorithms catch different problems.

Model 1: Isolation Forest (40% weight). Statistical outliers, like sudden spikes in missing fields.

Model 2: LSTM (30% weight). Temporal patterns, like scores degrading gradually over hours.

Model 3: Autoencoder (30% weight). Unusual feature combinations that no single metric flags.

Voting strategy:

ensemble_score = (
    isolation_score * 0.4 +
    lstm_score * 0.3 +
    autoencoder_score * 0.3
)

# Each model votes "anomalous" if its own score crosses the threshold
votes = sum(score > 0.5 for score in
            (isolation_score, lstm_score, autoencoder_score))

# Flag if:
# - Combined score is high, OR
# - At least 2 models agree
is_anomaly = ensemble_score > 0.5 or votes >= 2

The Results: Ensemble vs. Single Model

After 25 days processing 332K orders:

Metric                  Single Model    Ensemble    Improvement
─────────────────────────────────────────────────────────────
Accuracy                93.2%          93.8%       +0.6%
False Positives         15%            9.7%        -35%
Real Anomalies Caught   baseline       +30%        +30%
Inference Time          <5ms           <5ms        same
Training Time           <1 min         3 min       acceptable
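For clarity, the -35% false positive figure is a relative reduction (15% down to 9.7%), not percentage points; the arithmetic:

```python
single_fp, ensemble_fp = 0.15, 0.097

# Relative reduction, not percentage points
relative_reduction = (single_fp - ensemble_fp) / single_fp
print(f"{relative_reduction:.0%}")  # 35%
```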

The big wins: 35% fewer false positives and 30% more real anomalies caught, at identical inference latency.

The cost: 3x training time (1 minute to 3 minutes) and three models to maintain instead of one.

What Each Model Caught

Example 1: Isolation Forest Alone

Sudden spike in missing fields. Obvious outlier.

Example 2: LSTM Alone

Completeness scores dropping 8% over 6 hours.

Example 3: Autoencoder Alone

Unusual combination: low volume + high value + weekend.

Example 4: False Positive Reduction

One model flagged a weekend pattern that was unusual but valid; the other two disagreed, so no alert fired.

This is where the ensemble shines: reducing false positives through voting.

Does the Complexity Pay Off?

YES, but with conditions:

When ensemble is worth it:
- You need a low false positive rate (<10%)
- Different anomaly types exist (temporal, statistical, combinatorial)
- You can afford 3x training time
- You have enough data to train 3 models

When a single model is enough:
- False positives aren't a big problem
- Only one type of anomaly exists
- You need the fastest possible training
- You have limited data

For data quality monitoring specifically, the ensemble is worth it.

Why? False positives kill trust. Teams ignore alerts if too many are wrong. Cutting false positives from 15% to 9.7% makes the system actually usable.

The Drift Problem (Affects All Models)

Both single model and ensemble faced the same issue: concept drift.

After 2 weeks, accuracy dropped from 93% to 78%.

Solution: Statistical drift detection with auto-retraining:

from scipy.stats import ks_2samp

reference = get_baseline(feature, hours=24)  # 24-hour reference window
current = get_recent(feature, hours=1)       # most recent hour

# Kolmogorov-Smirnov test: has the feature's distribution shifted?
statistic, p_value = ks_2samp(reference, current)

if p_value < 0.01:
    retrain_all_models()

This works for both the single model and the ensemble; drift handling isn't specific to the ensemble approach.
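The same check as a self-contained sketch, with a synthetic mean shift standing in for drifted production metrics (sample sizes and thresholds here are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=100, scale=5, size=1_000)  # 24h baseline window
drifted = rng.normal(loc=110, scale=5, size=200)      # last hour, shifted mean

statistic, p_value = ks_2samp(reference, drifted)

# A tiny p-value means the two samples very likely come
# from different distributions -> trigger retraining
needs_retrain = p_value < 0.01
print(needs_retrain)  # True
```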

Performance: Does Ensemble Slow Things Down?

Inference (per order): <5ms for both single model and ensemble (4.8ms measured, with the three models run in parallel).

Training: <1 minute for the single model, 3 minutes for all three.

Training runs every 2 hours, so 3 minutes is acceptable, and inference at <5ms is fast enough for real-time.

Optimization used:


from concurrent.futures import ThreadPoolExecutor

# Run the 3 models in parallel and collect their results
with ThreadPoolExecutor(max_workers=3) as executor:
    if_future = executor.submit(isolation_forest.predict, data)
    lstm_future = executor.submit(lstm.predict, data)
    ae_future = executor.submit(autoencoder.predict, data)

    if_result = if_future.result()
    lstm_result = lstm_future.result()
    ae_result = ae_future.result()

Without parallelization: 12ms. With: 4.8ms.

Production Metrics (25 Days)

Orders Processed: 332,308

Quality Checks: 2.8M+

System Uptime: 603.7 hours (100%)

Average Latency: 4.8ms per order

Ensemble Performance: 93.8% accuracy, 9.7% false positive rate, 4.8ms average inference.

Comparison to Single Model: +0.6 points accuracy, 35% fewer false positives, 30% more real anomalies caught.

What I Learned

1. Ensemble DOES reduce false positives

Not by a little. By 35%. This matters a lot for trust.

2. Different models catch different things

Not marketing fluff. Actually true. Isolation Forest catches point outliers, the LSTM catches temporal drift, and the autoencoder catches unusual feature combinations.

3. Voting is powerful

When 2/3 models say "not anomalous," they're usually right. This reduces false alarms.

4. The complexity is manageable

Three models instead of one isn't 3x the work. Maybe 1.5x once you have infrastructure.

5. But you need enough data

Each model needs reasonable training data. If you have <100 samples, single model is better.

6. Explainability still required

Ensemble without SHAP is useless. Add SHAP to all three models.

Should You Use Ensemble?

YES if: false positives are hurting trust, you face multiple anomaly types, and you have the data and training budget for 3 models.

NO if: you have limited data, face only one anomaly type, or need the fastest possible training.

Bottom Line

Ensemble ML isn't just hype. It actually works: 35% fewer false positives, 30% more real anomalies caught, the same inference latency.

For data quality monitoring, it's worth doing.

About the Author

Pradeep Kalluri is a Data Engineer at NatWest Bank and an Apache Airflow and dbt contributor.