Linear regression or T-test. How to choose?

We often get caught up in the buzz around fancy machine learning models and deep learning breakthroughs, but let’s not overlook the humble linear regression.

In a world of LLMs and cutting-edge architectures, linear regression quietly plays a crucial role, and it’s time we shine a light on how it can be beneficial even today.

Consider a scenario where an e-commerce company introduces a new banner, and we aim to assess its impact on the average session length. To achieve this, an experiment was run and data was gathered. Let’s analyze the results.

T-test

Let’s employ a familiar tool for this task: the t-test.
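Here is a minimal sketch of how this could look in code, assuming the experiment data lives in the experiment DataFrame with a metric column (session length in minutes) and a binary treatment column, as in the simulation code at the end of the article:

from scipy import stats

# Split the metric by group
control = experiment.loc[experiment['treatment'] == 0, 'metric']
treated = experiment.loc[experiment['treatment'] == 1, 'metric']

# Two-sample t-test and the uplift (difference in sample means)
t_stat, p_value = stats.ttest_ind(treated, control)
uplift = treated.mean() - control.mean()
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}, uplift: {uplift:.2f} min")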

The results are pretty promising.

The uplift in the metric is simply the difference between the sample averages of the treatment and control groups. In our case, the estimated uplift is 0.56 minutes, meaning that users who saw the banner spend, on average, about 34 seconds longer using our product.

Linear Regression

Now, let’s employ linear regression with the treatment vector (whether the new banner is shown or not) as the independent variable and the average session length as the output variable.

Then we print the summary of our model:
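As a sketch with statsmodels, reusing the same experiment DataFrame:

import statsmodels.api as sm

# Design matrix: an intercept plus the binary treatment indicator
X = sm.add_constant(experiment['treatment'])
model = sm.OLS(experiment['metric'], X).fit()
print(model.summary())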

The p-value for the treatment coefficient in linear regression is 0.005 (rounded), the same as the p-value obtained from the t-test.

Notably, the coefficient for the treatment variable matches our earlier uplift estimate of 0.56. The R-squared, however, is just 0.008, so this model explains very little of the variance.

Coincidence?

Is it a coincidence that the uplift we got from the t-test and the treatment coefficient are the same? Let’s delve into the connection.

Let’s think about what the treatment variable reflects. When it equals 0, the model’s prediction is just the intercept, which is the average session length of users who did not see the banner; when it equals 1, the prediction is the intercept plus the treatment coefficient, which is the average session length of users who viewed the banner. This means the treatment coefficient (the slope, in linear regression terms) is exactly the difference in means between the treatment and control groups.
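A quick way to see this in code, reusing the model, control, and treated objects from the sketches above, is to compare the fitted parameters with the group means:

# Intercept = control-group mean
print(model.params['const'], control.mean())
# Intercept + treatment coefficient = treatment-group mean
print(model.params['const'] + model.params['treatment'], treated.mean())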

What is the null hypothesis for the treatment variable in linear regression? It is that the coefficient equals zero, i.e., that showing the banner does not change the expected session length.

And what is the null hypothesis when we apply the t-test to the experiment? It is that the two group means are equal, i.e., that their difference is zero. The two hypotheses are exactly the same.

Hence, when we compute the t-statistic and p-value for identical hypotheses, the results are also identical.
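We can check this numerically: the t-statistic of the treatment coefficient in the OLS summary matches the classic two-sample (pooled-variance) t-statistic, again reusing the objects defined in the sketches above:

# t-statistic of the treatment coefficient from the regression
reg_t = model.tvalues['treatment']
# Two-sample t-test with pooled variance (scipy's default, equal_var=True)
t_stat, _ = stats.ttest_ind(treated, control, equal_var=True)
print(reg_t, t_stat)  # the two values coincide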

Why do we want to use linear regression?

But why use linear regression at all? We don’t want to overcomplicate things just for the sake of it.

First, let’s think about whether only the treatment is responsible for the change in our primary metric.

In reality, the treatment may not be the only factor, due to the presence of selection bias.

Selection bias in A/B testing is a systematic difference between the groups being compared that arises from how users end up in each group rather than from the treatment itself, for example, one group containing more highly engaged users than the other.

The random allocation that we use in A/B tests helps to mitigate it, but it’s hard to eliminate completely.

Let’s formulate how the observed uplift relates to the true effect. The difference between the sample averages of the treatment and control groups, the quantity we can actually compute, splits into two parts:

ATE: the average treatment effect that we aim to estimate. Under random assignment, it coincides with the ATT below.

ATT: the average treatment effect on the treated. We can also call it the ACE: the average causal effect.

SB: the selection bias that we aim to minimize.

The observed difference in means equals ATT plus SB, so the naive uplift estimate is only as good as the selection bias is small.
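In the standard potential-outcomes notation (where D is the treatment indicator and Y1, Y0 are a user’s outcomes with and without the banner), this decomposition reads:

E[Y | D=1] − E[Y | D=0]  =  E[Y1 − Y0 | D=1]  +  ( E[Y0 | D=1] − E[Y0 | D=0] )

The left-hand side is the observed difference in means, the first term on the right is the ATT, and the term in parentheses is the selection bias.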

How can we minimize it?

Linear regression allows us to add covariates (confounding variables). Let’s try it out by adding each user’s average session length before the experiment as a covariate.

And print the summary of the model:
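As a sketch with statsmodels, again reusing the experiment DataFrame and its covariate column (the pre-experiment session length):

# Adjusted model: intercept, treatment indicator and the pre-experiment covariate
X_adj = sm.add_constant(experiment[['treatment', 'covariate']])
model_adj = sm.OLS(experiment['metric'], X_adj).fit()
print(model_adj.summary())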

Our R-squared has skyrocketed! Now, we explain 86% of the variance.

Our estimated treatment effect is now 0.47.

Which one to choose?

So, we have two estimates of the treatment effect, 0.47 and 0.56; which one is correct?

In this case, we know the true effect for sure, because the data is simulated and the real uplift is 0.5:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

np.random.seed(45)
n = 500

# Pre-experiment covariate and the metric that depends on it
x = np.random.normal(loc=10, scale=3, size=2 * n)
y = x + np.random.normal(loc=2, scale=1, size=len(x))

# Randomly assign roughly 50% of users to the treatment group
treat = 1 * (np.random.rand(2 * n) <= 0.5)

experiment = pd.DataFrame(x, columns=["covariate"])
experiment['metric'] = y
experiment['treatment'] = treat
experiment['noise'] = np.random.normal(size=len(experiment))

# Add noise and uplift to 'metric' for the treated rows
# The real uplift is 0.5
experiment['metric'] = experiment.apply(
    lambda row: row['metric'] + 0.5 + row['noise'] if row['treatment'] == 1 else row['metric'],
    axis=1,
)
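Continuing from the block above, one way to compare the two estimates against the true uplift of 0.5 is to refit both models and look at their absolute errors:

# Unadjusted estimate: equivalent to the t-test uplift
naive = sm.OLS(experiment['metric'],
               sm.add_constant(experiment['treatment'])).fit().params['treatment']

# Covariate-adjusted estimate
adjusted = sm.OLS(experiment['metric'],
                  sm.add_constant(experiment[['treatment', 'covariate']])).fit().params['treatment']

print(abs(naive - 0.5), abs(adjusted - 0.5))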

That means the adjusted estimate of 0.47 has the smaller absolute error and is closer to the actual uplift.

Conclusion

Using linear regression has the following advantages:

  1. It provides a deeper understanding of our data and of how well the model fits it.
  2. By using covariates, we can mitigate selection bias, resulting in a more accurate estimation of the treatment effect.

Can we use linear regression for other tests, like Welch’s t-test or the chi-square test?

The simple answer is yes. However, we have to make some adjustments, which we will discuss in the next articles!