Introduction

In practice, businesses often need to objectively assess how certain events impact key performance metrics. It's a broad and complex challenge that's typically addressed through A/B testing. But what if running a clean, randomized experiment just isn’t an option?

In such cases, the Propensity Score Matching (PSM) method is useful — it compensates for the lack of randomization by matching comparable user groups, reduces the impact of hidden biases, and provides a more accurate estimate of the causal effect.

In this article, we’ll break down what Propensity Score Matching (PSM) is and why it becomes essential when running an A/B test isn’t an option. We’ll cover its mathematical foundation without unnecessary theory, walk through a step-by-step practical algorithm, and show how to check whether the results can be trusted.

What is PSM?

Propensity Score Matching (PSM) is a statistical method that allows for accurate group comparisons by accounting for systematic differences between them.

Imagine two groups of users: one received a treatment (such as access to a new feature or participation in a promotion), and the other did not (control). Suppose these groups differ in their initial characteristics—age, income, activity level, and so on. These differences make it hard to fairly assess the effect, since the outcome may be influenced not only by the treatment itself but also by those background factors.

PSM addresses this issue by finding, for each treated user, the most similar user in the control group based on key characteristics. This creates matched pairs where the influence of external factors is minimized. As a result, we can isolate and more accurately estimate the true impact of the treatment, rather than the pre-existing differences between the groups.

A Simple Example

Let’s consider a typical case: a service already offers a paid premium subscription, and the goal is to understand how it affects user engagement and satisfaction.

Running an A/B test isn’t feasible here—users decide for themselves whether to subscribe, and forcing or canceling subscriptions within an experiment would be both unethical and technically challenging. On top of that, subscribers and non-subscribers may differ from the start in terms of interest, activity level, and other factors. So, directly comparing the two groups would give a biased result.

This is where Propensity Score Matching comes in. First, you define a set of user characteristics that influence the likelihood of subscribing. Based on these, you calculate each user’s individual probability of becoming a subscriber. Then, for each subscriber, you find a “twin” among non-subscribers with a very similar propensity score.

This creates balanced pairs where key differences between users are minimized. As a result, you can compare the groups and draw more accurate conclusions about the actual impact of the subscription on user behavior and metrics.

Why Use PSM?

Propensity Score Matching (PSM) is useful whenever there is selection bias—systematic distortion that occurs when the comparison groups differ in characteristics that significantly affect the outcome being studied.

Here are common scenarios where PSM helps:

1️⃣ When A/B Testing Is Impossible or Difficult

Sometimes running a proper randomized experiment isn’t feasible due to technical limitations, business constraints, or ethical concerns, or simply because results are needed quickly and there’s no time to set up an experiment.

2️⃣ When Groups Are Formed Non-Randomly

Sometimes groups form naturally and already differ in meaningful ways, making direct comparisons unreliable.

3️⃣ When Retrospective Impact Analysis Is Required

When you need to analyze historical data to determine whether a past change had an effect, even though no experiment was conducted at the time.

4️⃣ To Improve the Accuracy of Existing A/B Tests

Even in randomized experiments, group differences may still exist by chance. PSM can be used to align groups more closely on key characteristics, reducing variance and improving the precision of effect estimates.

The Theory Behind PSM

Core idea behind PSM

When we want to measure the impact of an intervention (like a new feature, discount, or subscription) on user behavior, we’re essentially trying to answer the question: what would have happened if the user hadn’t received the treatment?

Let’s say we have two groups of users: one received the treatment (the test group), and the other did not (the control group). For the test group, we can observe the actual outcome — they were exposed to the intervention, and we can measure their behavior afterward. But to estimate the true effect, we need to know how those same users would have behaved without the treatment. That’s a counterfactual scenario — we can’t observe it directly.

If users were randomly assigned to test and control groups (i.e., there are no systematic differences between them), we can simply compare average metrics between the groups. The difference gives us an unbiased estimate of the treatment effect — this is a classic A/B test, and PSM isn’t needed.

However, if the assignment wasn’t random, meaning the groups differ in their characteristics, a simple comparison would yield a biased result. In such cases, a traditional A/B test won’t work.

This is where PSM becomes useful. It builds a “pseudo-control” group by selecting, from the broader control population, those users who are most similar in characteristics to the treated users. This allows us to approximate how the treated users might have behaved without the intervention — and to estimate the treatment effect while minimizing the bias caused by initial group differences.

PSM works in two main steps:

  1. Propensity score estimation — it calculates the probability that a user would receive the treatment based on their characteristics.

  2. Matching — for each user in the treatment group, it finds the most similar user from the control group based on that probability.

Propensity score

Let’s take a closer look at each step. Both are solved using machine learning algorithms — we’ll start with the propensity score.

A propensity score is the estimated probability that a user ends up in the treatment group based on their characteristics. In essence, this is a binary classification task: we train a model to distinguish between users who received the treatment and those who didn’t, using only their pre-treatment features.

You can use various machine learning algorithms to estimate this probability — logistic regression, gradient boosting, random forest, and others. In my experience, gradient boosting tends to work best. The key requirement is that the model provides well-calibrated probabilities and captures all meaningful features.
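As a minimal sketch of this step (synthetic data; the three features stand in for pre-treatment attributes such as age, income, and activity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 1_000
X = rng.normal(size=(n, 3))  # pre-treatment user features, standardized

# Non-random assignment: the third feature raises the chance of treatment.
p_true = 1 / (1 + np.exp(-(0.5 * X[:, 2] - 0.2)))
T = rng.binomial(1, p_true)

# Propensity score = P(T=1 | X), estimated via binary classification.
model = LogisticRegression().fit(X, T)
propensity = model.predict_proba(X)[:, 1]  # column 1 corresponds to T=1
```

A gradient boosting classifier can be swapped in the same way; whatever the model, what you feed into matching are its `predict_proba` outputs, so calibration matters more than raw accuracy.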

If you’re confident you’ve included all relevant features and trained the model correctly, its performance can reveal whether the group split was random or not.

Matching

After training the propensity score model, the next step is matching—creating a pseudo-control group that closely resembles the treatment group. There are several matching approaches, including:

  1. Nearest Neighbor Matching: for each treated user, select the control user with the closest propensity score.

  2. Caliper Matching: nearest neighbor matching with a maximum allowed distance (the caliper); pairs whose scores differ by more than the caliper are discarded.

  3. Stratification Matching: divide the propensity score range into intervals and compare treated and control users within each stratum.

It’s important to note that when using Nearest Neighbor Matching or Caliper Matching, the pseudo-control group may include duplicate observations. In other words, a single control group user can be the closest match for multiple users in the treatment group.

Matching effectively fills in the missing counterfactual data—what would have happened to treated users had they not received the treatment—without altering real control group observations. This makes it possible to estimate the treatment effect by comparing the treatment group to the resulting pseudo-control group.
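Nearest neighbor matching with replacement (plus a caliper filter) can be sketched like this, using synthetic propensity scores:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
ps_treated = rng.uniform(0.3, 0.8, size=200)   # propensity scores, treated
ps_control = rng.uniform(0.0, 1.0, size=1000)  # propensity scores, control pool

# Nearest neighbor matching with replacement: each treated user gets the
# control user with the closest propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(ps_control.reshape(-1, 1))
dist, idx = nn.kneighbors(ps_treated.reshape(-1, 1))
pseudo_control = ps_control[idx.ravel()]

# With replacement, one control user may match several treated users,
# so the number of unique matched controls can be below 200.
n_unique = len(np.unique(idx))

# Caliper variant: discard pairs farther apart than a threshold.
caliper = 0.05
kept = dist.ravel() <= caliper
```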

PSM in Five Steps

Let’s define treatment_date as the point in time after which users start receiving the treatment (T).

The algorithm includes the following steps:

1️⃣ Prepare the initial dataset with both treatment and control groups. The treatment group should have size n, and the control group should be significantly larger (N >> n, e.g., 5n) to ensure good match quality. Add a treatment indicator (T=1 for treated, T=0 for control).

2️⃣ Collect relevant features for all users based on data from the period before treatment_date.

3️⃣ Apply PSM: build a model to calculate the Propensity Score and perform the matching procedure. Extract a “pseudo-control” group from the control pool.

4️⃣ Evaluate match quality to ensure PSM has worked effectively (more on this later).

5️⃣ Calculate target metrics after treatment_date separately for the treatment group and the pseudo-control group, and estimate the treatment effect.
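The five steps might look as follows end to end (a sketch on synthetic data; the feature names `activity` and `tenure` are illustrative, not prescriptive):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# 1) Initial dataset: n treated users, N >> n control users, T indicator.
n, N = 300, 1500
df = pd.DataFrame({
    "activity": np.concatenate([rng.normal(1.0, 1.0, n),    # treated: more active
                                rng.normal(0.0, 1.0, N)]),  # control pool
    "tenure": rng.normal(12.0, 4.0, n + N),
    "T": np.concatenate([np.ones(n, int), np.zeros(N, int)]),
})

# 2) Features collected from the period before treatment_date.
features = ["activity", "tenure"]

# 3) Propensity score model + nearest neighbor matching.
df["ps"] = LogisticRegression().fit(df[features], df["T"]).predict_proba(df[features])[:, 1]
treated, control = df[df["T"] == 1], df[df["T"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
_, idx = nn.kneighbors(treated[["ps"]])
pseudo_control = control.iloc[idx.ravel()]

# 4) Quick sanity check: mean propensity scores should be close after matching.
ps_gap = abs(treated["ps"].mean() - pseudo_control["ps"].mean())

# 5) Compare target metrics after treatment_date between `treated`
#    and `pseudo_control` to estimate the effect.
```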

Evaluate match quality

You’ve conducted an analysis using Propensity Score Matching—how can you be confident in the results? To ensure validity, you need to verify that the assignment of users into treatment and pseudo-control groups was done correctly.

Here’s what to check:

1️⃣ No Pre-Treatment Differences (Pre-treatment Validation)

Confirm that the outcome metric shows no statistically significant difference between the treatment and pseudo-control groups before the treatment date.
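For a continuous metric, this check can be a simple two-sample t-test on the pre-period values (synthetic numbers below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Outcome metric measured before treatment_date for both matched groups.
pre_treated = rng.normal(5.0, 1.0, size=400)
pre_pseudo = rng.normal(5.0, 1.0, size=400)

_, p_value = stats.ttest_ind(pre_treated, pre_pseudo)
balanced = p_value > 0.05  # no significant pre-treatment difference
```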

2️⃣ Propensity Score Model Quality

The model should reasonably distinguish between treatment and control groups: look for ROC AUC ≥ 0.6 and ensure the predictions are well-calibrated. High classification accuracy isn’t the goal—what matters is that predicted probabilities meaningfully reflect the likelihood of receiving the treatment.
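Both properties can be checked directly from the model’s predicted probabilities (a sketch; the signal strength in the synthetic data is arbitrary):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(2_000, 3))
T = rng.binomial(1, 1 / (1 + np.exp(-1.5 * X[:, 0])))

model = LogisticRegression().fit(X, T)
scores = model.predict_proba(X)[:, 1]

# Discrimination: should be at least ~0.6 for a non-random split.
auc = roc_auc_score(T, scores)

# Calibration: predicted probabilities vs. observed treatment frequencies;
# the two should stay close across bins.
frac_pos, mean_pred = calibration_curve(T, scores, n_bins=10)
max_gap = np.abs(frac_pos - mean_pred).max()
```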

3️⃣ Propensity Score Balance After Matching

After matching, the distribution of propensity scores in the treatment and pseudo-control groups should be statistically indistinguishable.

4️⃣ Covariate Balance After Matching

The distributions of key features should be as similar as possible between the treatment and pseudo-control groups after matching.
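A common way to quantify this is the standardized mean difference (SMD) per covariate, with values below roughly 0.1 usually read as good balance (sketch with synthetic ages):

```python
import numpy as np

def smd(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference between two samples of one covariate."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

rng = np.random.default_rng(9)
age_treated = rng.normal(35, 8, size=500)
age_matched = rng.normal(35, 8, size=500)  # pseudo-control after matching
age_raw = rng.normal(42, 8, size=500)      # unmatched control pool

smd_after = smd(age_treated, age_matched)  # should be near zero
smd_before = smd(age_treated, age_raw)     # large imbalance before matching
```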

5️⃣ Group Size Ratio (Matching Proportion)

The number of matched users in the pseudo-control group should be close to that of the treatment group, with no significant imbalance.

General Recommendations

Here are some practical tips to keep in mind when applying Propensity Score Matching in your analysis:

🔹 Use a Larger Control Group

Always ensure the control group is significantly larger than the treatment group—ideally at least 5 times larger. This gives the matching algorithm a wider pool of candidates to find suitable matches and form a balanced pseudo-control group.

🔹 Run Multiple Variants of the Analysis

Experiment with different control groups, feature sets, matching methods, definitions of the control population, treatment dates, etc. This helps validate robustness and uncover edge cases.

🔹 Don’t Chase High ROC AUC

Your goal isn’t to perfectly separate treatment and control groups—it’s to accurately estimate the probability of receiving the treatment (the propensity score). Prioritize model calibration over raw classification performance.

🔹 Help the Model During Group Selection

If possible, pre-select a treatment group that’s more similar to the potential control group based on observed features. This will improve match quality and make the evaluation more reliable.

🔹 Interpret the Results

Always try to explain why the treatment group performed better (or worse). Adding a layer of interpretation strengthens the credibility and business relevance of your conclusions.

Conclusion

Propensity Score Matching is a strong alternative when A/B testing isn’t feasible but effect estimation is still necessary. It’s not a replacement for classic randomized experiments, but rather a valuable addition to the analyst’s toolkit in complex scenarios.

The method is actively adopted by leading tech companies and is increasingly used in cases where standard approaches fall short. Yet it remains relatively unfamiliar to the broader analytics community, making it a great opportunity to strengthen analytical practice.

From my experience, PSM proves its value in real-world use cases. It enables sound conclusions where intuition would otherwise be the only option, such as evaluating the impact of already-launched features or dealing with inherently non-random group assignment.

That said, this method does demand more effort than a traditional A/B test, both in terms of modeling and in validating that the results are trustworthy. Still, with the right expertise and implementation, PSM can be efficiently automated and become a powerful part of everyday analytics work.