Hypertension risk prediction sounds straightforward until you touch real clinical-style data. Labels are often imbalanced, features can be messy, and it is easy to report great metrics that disappear the moment the model meets a new cohort.

My work sits at the intersection of healthcare AI, predictive modelling, and practical data delivery. In this post, I will focus on one pattern that consistently performs well for tabular medical risk prediction: stacked tree-based ensembles combined with SMOTE-Tomek for imbalance handling, evaluated with sensitivity-first thinking and strict leakage control.

This is the same mindset I apply when supporting high-volume healthcare operational datasets where data quality, validation checks, and documentation matter as much as the model itself.

Why class imbalance changes everything

In many hypertension datasets, the positive class is smaller than the negative class. If you optimise for accuracy, you can build a model that looks “good” while missing a large fraction of the patients you actually care about identifying.
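A toy illustration of the accuracy trap, on made-up labels with a 10% positive rate: a model that predicts "no hypertension" for everyone scores 90% accuracy while catching zero at-risk patients.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 100 + [0] * 900)   # 10% positive class
y_pred = np.zeros_like(y_true)             # always predict negative

print("accuracy:", accuracy_score(y_true, y_pred))   # 0.9
print("sensitivity:", recall_score(y_true, y_pred))  # 0.0
```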

That is why I treat the modelling goal as a decision problem, not just a score-maximisation problem. For screening-style use cases, the metrics that matter are:

- Sensitivity (recall for the positive class): how many at-risk patients are caught
- Precision (PPV): how many alerts correspond to real positives
- AUPRC: a summary measure that, unlike accuracy, stays informative under imbalance
- Specificity: how many low-risk patients are correctly left alone

Step 1: Define the label and prevent leakage

Leakage is the fastest way to get impressive results that fail in practice. In medical risk prediction, leakage can come from:

- Features recorded after the prediction point, such as follow-up measurements
- Features that are effectively consequences of the diagnosis, such as treatment flags
- The same patient appearing in both training and validation splits
- Preprocessing or resampling fitted on the full dataset before splitting

A simple rule that saves projects is this: do not let the model see information that would not exist at the time the prediction is made. If you have repeated encounters per patient, split by patient so the same person cannot appear in both training and validation.
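The patient-level split can be done with scikit-learn's group-aware splitters. A minimal sketch on synthetic data, where `patient_ids` is a hypothetical array with one patient identifier per encounter row:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_rows = 200
patient_ids = rng.integers(0, 40, size=n_rows)   # ~40 patients, repeated encounters
X = rng.normal(size=(n_rows, 5))
y = rng.integers(0, 2, size=n_rows)

# GroupShuffleSplit keeps every encounter for a given patient
# on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

overlap = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
print(len(overlap))  # 0
```

For cross-validation the analogous tool is `StratifiedGroupKFold`, which preserves class proportions while keeping patients disjoint across folds.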

Step 2: Build baselines before stacking

Before stacking, I want one or two strong baselines. For tabular healthcare risk prediction, tree-based methods are often effective because they capture nonlinear interactions and handle mixed feature types.

A typical baseline set:

- Logistic regression as a simple linear reference
- A random forest
- Extra trees, whose extra randomisation gives a useful contrast to the forest

Baselines also tell you what is hard about the dataset and whether the target is learnable.
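A minimal baseline sketch, using synthetic imbalanced data as a stand-in for a real cohort. Average precision (AUPRC) is the comparison metric because accuracy is uninformative here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: ~10% positive class
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baselines = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
}
for name, est in baselines.items():
    scores = cross_val_score(est, X, y, cv=cv, scoring="average_precision")
    print(f"{name}: AUPRC {scores.mean():.3f} +/- {scores.std():.3f}")
```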

Step 3: Handle imbalance with SMOTE-Tomek carefully

SMOTE-Tomek combines:

- SMOTE: synthetic oversampling of the minority class by interpolating between minority-class neighbours
- Tomek links: removal of borderline majority-minority pairs, which cleans up the class boundary

It can improve minority class recall, but only if done correctly. The key constraint is simple:

Resampling must happen only on the training fold inside cross-validation.

If you oversample before splitting, you risk leakage and inflated validation scores.

Step 4: Use stacking to reduce model-specific blind spots

Stacking helps when different base models make different errors. A small stack is often enough:

- Two diverse tree-based base learners, for example a random forest and extra trees
- A simple logistic regression meta-learner fitted on their predicted probabilities

The safe way to train stacking is to ensure the meta-learner is trained on out-of-fold predictions, not predictions from models that saw the same rows.
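scikit-learn's `StackingClassifier` handles this for you: its `cv` argument controls an internal split, and the `final_estimator` is fitted on out-of-fold base-model predictions. A small sketch with the same base learners as the full example below, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=1000, n_features=12, weights=[0.85, 0.15], random_state=1
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
        ("et", ExtraTreesClassifier(n_estimators=100, random_state=1)),
    ],
    final_estimator=LogisticRegression(max_iter=2000),
    stack_method="predict_proba",  # feed probabilities, not hard labels, to the meta-learner
    cv=5,  # meta-learner sees only out-of-fold base predictions
)
stack.fit(X, y)
print(stack.predict_proba(X[:3]).shape)  # (3, 2)
```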

Step 5: Choose a decision threshold on purpose

Healthcare-style prediction is rarely about "probability above 0.5." It is about aligning the model with operational reality.

I typically choose a threshold by targeting a minimum sensitivity, then checking the resulting precision and alert volume. That makes the model usable for screening and makes the evaluation honest.

Stacking + SMOTE-Tomek + sensitivity-first thresholding

The code below is still compact enough for a blog post, but long enough to be practical. It shows:

- Resampling placed inside an imblearn Pipeline, so it only ever touches training folds
- A two-model stack with a logistic regression meta-learner
- Out-of-fold probabilities used for every reported metric
- Threshold selection that targets a minimum sensitivity

import numpy as np

from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    precision_recall_curve,
    confusion_matrix,
)
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression


def choose_threshold_for_min_sensitivity(y_true, y_prob, min_sens=0.85):
    """
    Pick a threshold that achieves at least min_sens (recall for positive class),
    and among those thresholds choose the one with the best precision.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    thresholds = np.r_[thresholds, 1.0]  # align lengths

    ok = recall >= min_sens
    if not np.any(ok):
        # if target sensitivity cannot be reached, default to a low threshold
        return 0.1

    best = np.argmax(precision[ok])
    return float(thresholds[ok][best])


def report_at_threshold(y_true, y_prob, threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    return {
        "threshold": threshold,
        "tp": int(tp), "fp": int(fp), "tn": int(tn), "fn": int(fn),
        "sensitivity": float(sensitivity),
        "specificity": float(specificity),
        "precision": float(precision),
    }


# X: feature matrix (pandas DataFrame or numpy array)
# y: binary labels (0/1)
# Replace with your dataset:
# X, y = ...

rf = RandomForestClassifier(
    n_estimators=400,
    random_state=42,
    class_weight="balanced_subsample",
    n_jobs=-1,
)

et = ExtraTreesClassifier(
    n_estimators=600,
    random_state=42,
    class_weight="balanced",
    n_jobs=-1,
)

meta = LogisticRegression(max_iter=2000)

stack = StackingClassifier(
    estimators=[("rf", rf), ("et", et)],
    final_estimator=meta,
    stack_method="predict_proba",
    n_jobs=-1,
)

model = Pipeline(steps=[
    ("balance", SMOTETomek(random_state=42)),
    ("clf", stack),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out of fold predicted probabilities (honest performance estimate)
oof_prob = cross_val_predict(
    model,
    X, y,
    cv=cv,
    method="predict_proba",
    n_jobs=-1,
)[:, 1]

auroc = roc_auc_score(y, oof_prob)
auprc = average_precision_score(y, oof_prob)

threshold = choose_threshold_for_min_sensitivity(y, oof_prob, min_sens=0.90)
summary = report_at_threshold(y, oof_prob, threshold)

print("AUROC:", round(auroc, 4))
print("AUPRC:", round(auprc, 4))
print("Threshold:", summary["threshold"])
print("Sensitivity:", round(summary["sensitivity"], 4))
print("Specificity:", round(summary["specificity"], 4))
print("Precision:", round(summary["precision"], 4))
print("Confusion (tp, fp, tn, fn):", summary["tp"], summary["fp"], summary["tn"], summary["fn"])

What this code is doing correctly:

- SMOTE-Tomek sits inside the pipeline, so cross_val_predict resamples only the training portion of each fold
- The StackingClassifier meta-learner is trained on its own internal out-of-fold base predictions
- AUROC, AUPRC, and the confusion-matrix summary all come from out-of-fold probabilities, not from a model scored on rows it was fitted on
- The threshold is chosen to satisfy a sensitivity floor first, with precision used to break ties

Step 6: Validate generalisation, not just performance

A single split is rarely enough. I look for stability across folds, and I pay special attention to false negatives. In a hypertension setting, false negatives are the cases you most want to understand because they represent missed risk.

If subgroup information exists, evaluate across groups. If the dataset spans time, evaluate across time windows. If multiple cohorts exist, validate across cohorts. This is where many models fail, and it is better to find that out early.
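One cheap stability check is to look at the spread of a metric across folds rather than a single pooled number: a wide spread is a generalisation warning even when the mean looks fine. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=2000, n_features=15, weights=[0.9, 0.1], random_state=7
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Per-fold AUROC instead of one pooled score
fold_auroc = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=7, n_jobs=-1),
    X, y, cv=cv, scoring="roc_auc",
)
print("per-fold AUROC:", np.round(fold_auroc, 3))
print("spread (max - min):", round(float(fold_auroc.max() - fold_auroc.min()), 3))
```

The same pattern extends to subgroup or time-window evaluation: compute the metric per slice and inspect the spread, not just the average.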

Step 7: Treat documentation and monitoring as part of the model

In healthcare adjacent contexts, a model is not just a notebook. It is an artefact with assumptions.

I document:

- The label definition and the exact prediction time point
- Each feature, its source, and why it cannot leak future information
- The split strategy (patient-level, temporal, or cross-cohort)
- The chosen threshold and the sensitivity target behind it

If deployed, I monitor:

- Input feature distributions for drift against the training period
- Alert volume and precision at the fixed threshold
- Sensitivity on whatever labelled feedback arrives
- Subgroup performance, where subgroup data exists
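One concrete input-drift check can be sketched as a population stability index (PSI) on a single feature, comparing the deployed-period distribution against the training period. The `psi` helper, feature names, and thresholds here are illustrative, not from a real deployment; a PSI above roughly 0.2 is a common rule of thumb for meaningful drift.

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population stability index between two 1-D samples (illustrative helper)."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    o_pct = np.histogram(observed, bins=edges)[0] / len(observed)
    # floor the proportions to avoid log(0) in empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    o_pct = np.clip(o_pct, 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))

rng = np.random.default_rng(3)
train_bp = rng.normal(120, 15, size=5000)   # training-period systolic BP (synthetic)
same = rng.normal(120, 15, size=5000)       # deployed period, no drift
shifted = rng.normal(132, 15, size=5000)    # deployed period, drifted cohort

print("no-drift PSI:", round(psi(train_bp, same), 4))
print("shifted PSI:", round(psi(train_bp, shifted), 4))
```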

Conclusion

Hypertension prediction on imbalanced data is not solved by a single algorithm. It is solved by discipline: leakage control, sensitivity-aligned evaluation, proper handling of imbalances, and validation that reflects how the model will be used.

Stacked tree-based ensembles combined with SMOTE-Tomek can be a strong approach when the goal is to improve recall for high-risk patients while maintaining acceptable precision. The real value is not in the model choice alone, but in the workflow that makes the results trustworthy.