Setting the Stage: A New Feature, a Big Question

Let me narrate a data scientist's approach as if it were a real-life event. Suppose we just launched a new recommendation engine on our e-commerce site: for the first time, customers see personalized product picks on our own front end. Exciting, yes, but also a little frightening. As the data scientist on the team, I would be left with one burning question: just how much extra revenue did this shiny new feature actually generate? If I were doing it, what would I do? Let me explain it as a story.

In a perfect world we'd run an A/B test: show the feature to half the users and compare them with a control group. But reality had other plans. The product team couldn't withhold the feature from a control group, and leadership wanted answers now, not three months from now. So there I was, watching revenue metrics rise after launch while one question kept running through my mind: how much of that growth was due to our recommendation engine, and how much to seasonal or market factors? Traditional time-series forecasting wasn't enough on its own; I wanted a way to peek into the world we never built, the one without the feature, and use it to judge the impact. This is where counterfactual forecasting came into play.

Why Not Simply Use Traditional Time Series?

At first glance, we could have reached for our standard time-series toolkit: take historical revenue data and project it forward with ARIMA or a similar model. But there is a catch. Traditional forecasts simply predict the future from historical patterns; they do not tell you what changed because of an intervention. In our case, the intervention was the launch of the recommendation feature. A purely predictive model trained on all the data would bake the post-launch bump into its predictions instead of isolating it. The end result? A forecast of rising revenue, but no hint of how much of it came from the feature.

To give just one example, our leadership first looked at year-over-year growth. Revenue in the month right after launch was up about 15 percent compared with the same month a year earlier. Great! But hang on: e-commerce sales typically grow year over year thanks to trend, marketing, and overall market growth. That 15 percent includes everything but the kitchen sink. What we wanted was the share of growth attributable only to the new recommender, and that is not directly answerable with traditional time-series methods or naive YoY comparisons; they conflate the feature's effect with organic growth and seasonality.

This is, conceptually, the difference between a causal approach and a non-causal one. In a textbook setting, with a control group, we'd apply a methodology such as difference-in-differences and calculate the impact as:

Impact = (Y_post^T - Y_pre^T) – (Y_post^C - Y_pre^C).

In other words, how much the treated group's metric changed compared with how much the control group's changed. If we had a small group of users who were never exposed to the new recommendations, they would be our control group (C) and the exposed users our treated group (T). In words, the difference-in-differences formula is:

Feature’s effect = (uplift in treated group) – (uplift in control group)
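To make the formula concrete, here is a tiny numeric illustration; the figures are hypothetical, purely to show the arithmetic.

# Hypothetical average weekly revenue (in $ thousands), for illustration only
treated_pre, treated_post = 1200, 1350   # users exposed to the feature
control_pre, control_post = 1100, 1160   # users never exposed

# Difference-in-differences: treated uplift minus control uplift
did_impact = (treated_post - treated_pre) - (control_post - control_pre)
print(did_impact)  # 150 - 60 = 90, i.e., ~$90k/week attributable to the feature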

Without an explicit control group, though, we have to get creative. That's where counterfactual forecasting comes in: we effectively manufacture our own control with a forecasting model. Instead of physical control users, a synthetic control generates the counterfactual (what would have happened without the feature) for us. It closes the gap between pure prediction and causal inference by estimating the impact retrospectively.

Counterfactual Forecasting: Predicting the Unseen

So what exactly is counterfactual forecasting? Simply put, it's forecasting what X would have looked like if Y hadn't happened. Here, Y is the introduction of the recommendation engine and X is our revenue. We want to know: had the recommender not been launched, how would revenue have evolved? That alternate history is the counterfactual, something that never actually happened. The concept comes from causal inference: we are constructing an "alternate present." In practice, counterfactual forecasting means using only pre-intervention data to predict what happens after the intervention. The model is oblivious to the feature launch; it simply projects revenue forward from history as if everything had carried on as normal. That forecast for the post-launch period becomes our baseline: the revenue we would expect without the new feature (the counterfactual scenario). Any deviation of actual revenue from this forecast is attributed to the feature, provided the model captured all the other patterns correctly.

In other words, we're predicting the present under an alternate setting. This approach has a few key implications:

1. It isolates the feature's impact: we compare the actual outcomes to the counterfactual, stripping out the general growth and seasonal uplift that a naive comparison would mistake for part of the feature's effect.

2. No control group is needed: unlike a standard A/B test or diff-in-diff, our "control" is created by a model. This is particularly useful in real-world settings where controlled experiments are impossible.

3. Model assumptions matter: the counterfactual is only as accurate as the time-series model behind it. If the model is badly specified, our impact estimate will be wrong too.

With the idea of my approach in place, I set out to construct the counterfactual. Before launch, I had a full year of weekly revenue data, plus a few weeks of data after launch.

The Plan: build a time-series model on the pre-launch data, forecast the post-launch period, and compare it with the actuals.

Building the Counterfactual Scenario with Data

To demonstrate the approach, let me walk you through a simplified version of the analysis using synthetic data. Imagine two years of weekly revenue, with the recommendation feature launched at the start of Q3 of the second year, after which revenue picks up with a modest lift. Let's build a toy dataset that recreates this setup and then run the counterfactual analysis step by step.

Step 1: Generate Synthetic Data

We'll simulate a revenue series with an upward trend and yearly seasonality, then add an extra boost after week 80 (which we'll treat as the feature launch week).

import numpy as np
import pandas as pd

# Simulate 104 weeks of revenue data (2 years)
weeks = 104
time = np.arange(weeks)
baseline = 1000 + 5*time  # linear growth starting from 1000
seasonality = 100 * np.sin(2 * np.pi * time / 52)  # yearly seasonality
noise = np.random.normal(scale=50, size=weeks)
baseline = baseline + seasonality + noise
 
# Feature launch at week 80: apply 10% uplift thereafter
feature_start = 80
actual = baseline.copy()
actual[feature_start:] *= 1.10
 
# Create DataFrame
dates = pd.date_range(start="2022-01-01", periods=weeks, freq="W")
df = pd.DataFrame({"Revenue": actual}, index=dates)

In this simulated data, baseline is the revenue we would have seen without the feature, and actual is the revenue with the 10% uplift applied from week 80 onward. In reality we never get to observe the baseline after week 80, but here we keep it behind the scenes for verification.
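Because the data is simulated, we can peek behind the curtain and confirm the uplift we injected; this sanity check is only possible here, since in production the post-launch baseline is unobservable.

# Verify the injected uplift: actual vs. hidden baseline after week 80
post_ratio = actual[feature_start:].mean() / baseline[feature_start:].mean()
print(f"Average post-launch uplift: {post_ratio - 1:.1%}")  # roughly 10%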

Step 2: Train a Time Series Model (Pre-Feature Data Only)

Now, we split the data into “train” (pre-launch) and “test” (post-launch) periods. We train an ARIMA model on the pre-launch period only. The intuition is that the model will learn the regular patterns of revenue without any feature impact. We chose a simple ARIMA configuration for this example; in practice one might use auto.ARIMA or Prophet to capture trends and seasonality more robustly.

from statsmodels.tsa.arima.model import ARIMA
 
# Split into train (pre-launch) and test (post-launch) sets around the launch date
launch_date = df.index[feature_start]  
train = df[:launch_date - pd.Timedelta(days=1)]  
test = df[launch_date:]
 
# Fit ARIMA on pre-launch data
model = ARIMA(train["Revenue"], order=(2,1,0))
model_fit = model.fit()
 
# Forecast the post-launch period
forecast_steps = len(test)
forecast = model_fit.forecast(steps=forecast_steps)
forecast.index = test.index

By training on weeks 0–79, the ARIMA model has no knowledge of the week 80 jump. It will extrapolate the trend and seasonality it learned. The forecast we obtain for weeks 80–103 is our model’s expectation of revenue if nothing had changed – this is the counterfactual series.
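As a side note, statsmodels can also quantify the uncertainty around this counterfactual via get_forecast; here is a small sketch using the same fitted model.

# Optional: counterfactual with a 95% confidence interval
pred = model_fit.get_forecast(steps=forecast_steps)
counterfactual = pred.predicted_mean
conf_int = pred.conf_int(alpha=0.05)  # lower/upper bound for each forecasted week
counterfactual.index = test.index
conf_int.index = test.index

If the actual revenue sits above the upper bound, we can be more confident that the gap is not just forecast noise.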

Step 3: Compare Actual vs. Counterfactual

Finally, we compare the test actuals to the forecast. The difference between the two is the estimated effect of the feature. We can present this in a table and a chart. For brevity, here’s a summary for a few points in time: launch week, a mid-point, and the end of the year.

| Date (Week) | Actual Revenue | Counterfactual Revenue (No Feature) | Difference (Lift) |
| --- | --- | --- | --- |
| 2023-07-02 (Launch) | $1,449 | $1,378 | $71 |
| 2023-10-15 | $1,508 | $1,380 | $128 |
| 2023-12-31 | $1,707 | $1,380 | $327 |

In the launch week, actual revenue was $1,449 (in thousands, for example) versus a predicted $1,378 if we hadn’t launched the feature, so about $71 extra was earned due to the feature that week. By October 15, the weekly uplift is around $128, and by year’s end about $327 per week. These values suggest the feature’s impact not only persisted but grew over time (likely as more users engaged with recommendations or due to compounding with seasonal demand).
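For reference, here is a minimal sketch of how such a comparison table can be assembled from the objects we already have (the exact figures will vary with the simulated noise):

# Combine actuals and the counterfactual forecast, then compute the weekly lift
comparison = pd.DataFrame({
    "Actual": test["Revenue"],
    "Counterfactual": forecast,
})
comparison["Lift"] = comparison["Actual"] - comparison["Counterfactual"]

# Inspect a few points: launch week, a mid-point, and the final week
print(comparison.iloc[[0, len(comparison) // 2, -1]].round(0))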

The chart plots the actual weekly revenue as a blue line against what our model predicted would have happened without the feature. I've also marked the feature launch with a vertical dotted line; before that point the actual and counterfactual lines overlap almost exactly.
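A chart like this can be reproduced with a few lines of matplotlib; the sketch below also shades the gap between the two lines, which is the visual we leaned on with stakeholders.

import matplotlib.pyplot as plt

# Plot actual vs. counterfactual revenue and mark the feature launch
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df.index, df["Revenue"], color="blue", label="Actual revenue")
ax.plot(forecast.index, forecast, color="red", linestyle="--", label="Counterfactual (no feature)")
ax.fill_between(forecast.index, forecast, test["Revenue"], color="gray", alpha=0.3, label="Estimated lift")
ax.axvline(launch_date, color="black", linestyle=":", label="Feature launch")
ax.set_ylabel("Weekly revenue")
ax.legend()
plt.show()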

After launch, two things stand out:

1. The actual revenue (blue) rises above the counterfactual (red), indicating that the feature had a positive effect. That gap is exactly the causal effect we're after. In our simulation the lines diverge by around 10% of revenue, which is the known uplift we injected. In a real analysis, the size of this gap is our measured uplift, the incremental revenue due to the feature.

2. The gap persists, and even widens slightly, over time, which means the feature didn't deliver a one-week bump; it left a lasting mark on customer purchasing behavior. (Business interpretation: users keep engaging with the recommendations, adding more products to their carts and ordering more frequently.)

That adds up to real money. This counterfactual perspective let us make a convincing case to leadership: "Our recommender system yielded about X dollars in incremental revenue over the last quarter, a Y percent increase over the baseline scenario." That is a far more tangible statement than "revenue was up 20 percent."

From Analysis to Impact: Bringing the Numbers to Life

The last step of this process was translating these insights into a narrative for stakeholders. Here’s how we framed the results and what they meant:

Baseline vs. Actual:

A simple chart of actual vs. counterfactual revenue (the chart above) showed stakeholders what would have happened without the feature. It put the actual line and the counterfactual line side by side, with the actual line sitting dramatically higher, and the shaded area between them (the extra dollars earned) is what really drove the point home.

Cumulative Impact:

We summed the weekly differences to get the cumulative extra revenue over the whole post-launch period. In our case it came to a few million dollars, something that got the room talking. Summing the weekly lift answered the question "How much total revenue did the feature add?"

Percentage Lift:

We also reported the impact relative to baseline revenue. Saying "a 12% lift" set realistic expectations for similar initiatives and established a benchmark that future versions of the recommendation algorithm could be measured against. Both calculations are sketched below.
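Both numbers fall straight out of the weekly differences; reusing the comparison frame from the earlier sketch:

# Total incremental revenue over the post-launch period
total_lift = comparison["Lift"].sum()

# Percentage lift relative to the counterfactual baseline
pct_lift = total_lift / comparison["Counterfactual"].sum()

print(f"Cumulative lift: ${total_lift:,.0f}")
print(f"Percentage lift vs. baseline: {pct_lift:.1%}")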

Business Context:

We tied the analysis back to business decision-making. Knowing the feature's impact helped justify the resources spent on it and informed how aggressively similar features should be rolled out. It also fed into our financial projections: now that we had an estimate of the lift, the finance team could adjust next year's forecasts.

Finally, we reviewed assumptions and next steps. A counterfactual model is only as solid as the data and assumptions behind it. For example, if any other major change (a marketing campaign, a price change, etc.) had coincided with the feature launch, it would need to be accounted for, either by introducing control variables or by moving to more powerful causal models such as Bayesian structural time series. In our situation the feature launch was the only big event in that period, which was lucky. Even so, we treated the results as an estimate, not as truths carved in stone. Last, we talked about what to do next. Understanding the feature's impact helped in several ways:

Prioritization:

We justified additional investments in the recommendation engine (e.g., more R&D for improvements, scaling up infrastructure) because we had established its value.

Forecasting:

Our finance team could now fold the uplift into future projections. We essentially adjusted next year's business forecast, raising the baseline to reflect the "new normal" with the feature.

Experimentation:

This opened other conversations around how we monitor future features. The leadership appreciated the rigor of the analysis and emphasized that whenever A/B tests can’t be done, we should follow similar counterfactual methods.

Naturally, we also recognized the limitations. This was an observational approach, not a randomized one. We had to assume that no major external events coincided with the feature launch; in our case, we double-checked that no other big campaigns or price changes happened at the same time. If they had, the analysis would need to be refined, perhaps by adding those factors as controls in a multivariate time-series model, or by using methods such as synthetic controls or causal impact modeling. We also treated the model's prediction as the best available estimate of the counterfactual, not a perfect oracle; there is a margin of error. We communicated these caveats for transparency.
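If a concurrent campaign had existed, one way to refine the analysis is to add it as an exogenous regressor in the same ARIMA workflow. The sketch below is illustrative only: marketing_spend is a made-up series, and in practice you would use the real campaign data.

# Hypothetical refinement: control for a concurrent marketing campaign
marketing_spend = pd.Series(np.random.normal(200, 20, size=weeks), index=df.index)

model_x = ARIMA(
    train["Revenue"],
    exog=marketing_spend[:launch_date - pd.Timedelta(days=1)],
    order=(2, 1, 0),
)
model_x_fit = model_x.fit()

# Forecasting still requires the post-launch values of the control variable
forecast_x = model_x_fit.forecast(steps=len(test), exog=marketing_spend[launch_date:])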

Conclusion

Ultimately, shifting our mindset from pure forecasting to counterfactual forecasting let us give a much clearer answer to the original question. Instead of simply saying "revenue went up after launch," we quantified how much of that uptick was attributable to our work. That is powerful for the business: it turns analytics from passive reporting into actionable decision support.

The takeaway for data scientists is that measuring causal impact requires a perspective that goes beyond standard time-series forecasting. With the data you already have (even without a control group) and tools like time-series models, you can build compelling counterfactuals and answer the kinds of "what-if" questions that business stakeholders value. In our case, this not only demonstrated the value of a newly shipped feature but also showed how central data science can be to business thinking. And personally, it was a genuine delight to see the hard work translate into something the company could act on: real dollars and real direction.