The Fire Behind the Green

Has it ever happened to you that even though your dashboard is green, support tickets keep flowing in from a specific segment of your users? You spend hours digging and find out that Android users on a particular app_version in India are affected. Well, you are not alone!

Teams often monitor only the top-level metric and assume they have enough observability into the system, until customers start getting affected while the overall dashboard remains flat. This increases the MTTD (Mean Time To Detect), which in turn results in a higher MTTR (Mean Time To Resolve), costing businesses millions. It leads to reactive investigations, stressful debugging, and war rooms, which we can all agree is not a pretty situation. This article is a guide to multidimensional anomaly detection and how to manage it for real-time production systems at scale. Multidimensional anomaly detection essentially means monitoring not just the overall metric but multiple dimensional slices of the metric as independent time series.

Terminology and Absolute Basics

Maintaining reliable production systems is crucial. This in turn requires observability on key system metrics. Automated monitoring using anomaly detection is a central piece that ensures the system gets investigated whenever it shows outlier behavior. Since most metrics are emitted over time, this branch of analysis is commonly referred to as Time Series Anomaly Detection.

Time series metrics vary massively in behavior. For example, an ad revenue fluctuation might look very different from, say, an e-commerce metric tracking shipment delays. Therefore, the definition of an anomaly or outlier can differ significantly across domains and often requires specific context and domain knowledge to capture the right issues with high precision and high recall.

Multidimensional Anomaly Detection involves monitoring the same metric over different values of a dimension or combination of dimensions. For example, a metric like pageviews may have a dimension country with values like US, India, and Germany. The behavior of pageviews | country=US and pageviews | country=IN can be quite different.

Problem Space

Visibility into the top-level metrics alone is not enough. In production systems, you can encounter issues in specific use cases that affect specific segments of your user base. The literal opposite, monitoring the metric on every dimension slice, is not the solution either: it is impractical for even small use cases, and perhaps not necessary to ensure reliability for the users you care about. In this section, I'll spend some time defining the problem space based on the underlying business.

Metric Dimensionality

Consider a metric nSuccessResponses that captures the number of successful responses to API requests. In the figure below, the overall metric behaves very differently from its slices. This could be because larger slices dominate the aggregate, or because the slices follow different traffic patterns that partially cancel out when summed.

In this example, monitoring the slices is essential when the underlying slices behave differently or are simply at different scales.
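
A quick back-of-the-envelope example, with made-up numbers, shows how a slice can be on fire while the aggregate barely moves:

# Hypothetical steady-state success counts per slice
slices = {"US": 90_000, "IN": 8_000, "DE": 2_000}
baseline_total = sum(slices.values())   # 100,000

# The IN slice drops by 50% due to a regional regression
slices["IN"] = 4_000
incident_total = sum(slices.values())   # 96,000

print(f"aggregate drop: {1 - incident_total / baseline_total:.1%}")  # 4.0%
print(f"IN slice drop: {1 - 4_000 / 8_000:.1%}")                     # 50.0%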

Combinatorial Explosion of Dimensions

If you monitor every combination, the space explodes multiplicatively: |geo| × |device| × |version|. Even modest dimension cardinalities compound quickly into hundreds of thousands of unique time series, making this a scalability nightmare. For example, 20 geos × 5 device types × 10 app versions × 3 tiers already amounts to 3,000 dimension combinations.
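
As a sanity check on the arithmetic, here is a tiny Python sketch; the dimension names and cardinalities are purely illustrative:

from math import prod

# Hypothetical dimension cardinalities
cardinalities = {"geo": 20, "device": 5, "app_version": 10, "tier": 3}

total_slices = prod(cardinalities.values())
print(total_slices)        # 3000 candidate time series for a single metric

# One more dimension, say 35 customer plans, and the space multiplies again
print(total_slices * 35)   # 105000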

Outlier Type

Metric behavior heavily determines the type of detection required. For example, a metric with a stable baseline may only need a simple threshold detector, while a strongly seasonal metric such as daily active users is better served by a forecasting-based detector like ETS.

Note that there is also a performance aspect to detection algorithms when scaling across many time series. For example, scaling a threshold detector across thousands of time series is typically much cheaper than scaling a matrix profile detector, given the history and computation the latter requires.
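
To illustrate the cheap end of that spectrum, here is a minimal z-score style detector applied independently to each slice. The slice keys and values are made up, and this is not how ThirdEye implements its detectors.

import numpy as np

def zscore_anomalies(series: np.ndarray, threshold: float = 2.0) -> np.ndarray:
    """Flag points more than `threshold` standard deviations from the mean."""
    mean, std = series.mean(), series.std()
    if std == 0:
        return np.zeros(len(series), dtype=bool)
    return np.abs(series - mean) > threshold * std

# Hypothetical per-slice series keyed by (country, device)
slices = {
    ("US", "iOS"): np.array([120.0, 118, 125, 122, 40, 121]),
    ("IN", "Android"): np.array([300.0, 310, 295, 305, 302, 880]),
}

for key, series in slices.items():
    flags = zscore_anomalies(series)
    if flags.any():
        print(key, "anomalous at indices", np.flatnonzero(flags).tolist())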

Approach

The above problem can be broken down into key areas that can be tackled individually: the observability pipeline that produces and stores the metrics, the anomaly detection layer that monitors them, and the configuration and day-to-day operation of multidimensional alerts.

Let’s start with the data pipeline.

Observability Pipeline

Production metric systems typically use a producer-consumer mechanism (or equivalent) where applications emit events to a topic/stream, which is consumed downstream into a metrics database. An example architecture is shared below.

Disclosure: I work at Startree Inc on Apache Pinot and Startree ThirdEye, which is why I’m using it here as an example implementation. The concepts directly extend to other stacks too.

The above is an example of a metrics pipeline where the high-level components are the applications producing events, a streaming topic that buffers them, a real-time analytics store such as Apache Pinot that ingests and serves the metrics, and downstream consumers such as dashboards and the anomaly detection system.
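
To make the producer side concrete, here is a minimal sketch of an application emitting activity events to a stream with kafka-python. The broker address, topic name, and payload fields mirror the example dataset used later in the alert configs and are assumptions, not a required schema.

import json
import time

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker; in this example stack the topic would be ingested
# into an Apache Pinot table named user_activity_events.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "user_id": "u-12345",
    "country": "IN",
    "device": "Android",
    "app_version": "8.4.1",
    "event_time_ms": int(time.time() * 1000),
}
producer.send("user_activity_events", value=event)
producer.flush()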

Next, I want to talk about the anomaly detection piece since it is the primary consumer of the data for this use case.

Anomaly Detection

The detection itself is usually another data pipeline within the application. Using Startree ThirdEye as an example, the simplified architecture looks something like this.

There are a few things worth calling out in the above pipeline:

  1. The Anomaly Detector is an interface that allows both internal and external detector implementations to be plugged in easily (see the sketch after this list).
  2. The data preprocessing step ensures that the Anomaly Detector receives a clean and consistent time series to work with. This may involve transforming, cleaning, or interpolating the data.
  3. The anomaly postprocessing step is useful for identifying a continuing anomaly over a period of time so that it can be merged with a previous anomaly that was already reported. In other cases, an anomaly may be ignored based on the alert configuration.
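
For intuition, here is a minimal sketch of what such a pluggable detector contract could look like. This is an illustrative Python analogy, not ThirdEye's actual interface.

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

@dataclass
class Anomaly:
    start_ms: int
    end_ms: int
    score: float

class AnomalyDetector(ABC):
    """Illustrative contract: take one time series, return detected anomalies."""

    @abstractmethod
    def detect(self, timestamps: List[int], values: List[float]) -> List[Anomaly]:
        ...

class ThresholdDetector(AnomalyDetector):
    """Simplest possible implementation: flag values above a fixed upper bound."""

    def __init__(self, upper: float):
        self.upper = upper

    def detect(self, timestamps, values):
        return [
            Anomaly(start_ms=t, end_ms=t, score=v - self.upper)
            for t, v in zip(timestamps, values)
            if v > self.upper
        ]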

The alert configuration is determined by the time series and the detector. Here is an example of a global alert on daily active users, evaluated daily.

{
  "name": "DAU ETS Alert - Global",
  "description": "Alert if total daily active users deviate from an ETS forecast (all users, all regions, all devices)",
  "template": {
    "name": "startree-ets"
  },
  "templateProperties": {
    "dataSource": "pinot-prod",
    "dataset": "user_activity_events",

    "aggregationColumn": "user_id",

    /* in prod, prefer approx counts DISTINCTCOUNTULL/HLL over accurate DISTINCTCOUNT for perf reasons */
    "aggregationFunction": "DISTINCTCOUNTULL",

    "monitoringGranularity": "PT1D",

    /* ETS-specific tuning */
    "seasonalityPeriod": "P7D",   // weekly seasonality
    "lookback": "P30D",           // use last 30 days to train
    "sensitivity": "1"            // tighten/loosen as needed
  }
}

In this alert, we are using an ETS (error, trend, seasonality) forecast-based detector to track the overall DAU and flag deviations from the forecast.
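
For intuition on what the ETS template does under the hood, here is a rough forecast-and-compare sketch using statsmodels' Holt-Winters exponential smoothing, a close cousin of ETS. ThirdEye's actual implementation, baselining, and sensitivity handling differ, and the data here is synthetic.

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic 30 days of daily DAU with weekly seasonality (made-up numbers)
rng = np.random.default_rng(7)
days = pd.date_range("2025-01-01", periods=30, freq="D")
dau = 100_000 + 10_000 * np.sin(2 * np.pi * np.arange(30) / 7) + rng.normal(0, 1_500, 30)

history = pd.Series(dau[:-1], index=days[:-1])   # the "lookback" window
actual_today = dau[-1]

# Fit additive trend + weekly seasonality and forecast one step ahead
model = ExponentialSmoothing(history, trend="add", seasonal="add", seasonal_periods=7).fit()
forecast_today = model.forecast(1).iloc[0]

# Flag deviations larger than k * residual std; k plays the role of "sensitivity"
residual_std = (history - model.fittedvalues).std()
if abs(actual_today - forecast_today) > 3 * residual_std:
    print("anomaly: today's DAU deviates from the forecast")
else:
    print("ok: within expected range")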

Multidimensional Alerts

We can monitor slices of the global metric to better understand the performance of a segment of interest. Let's say we want to monitor daily active users (DAU) growth over time in the following segments: US users on iOS, Indian users on Android, and German users across all devices.

Each of these is essentially a WHERE clause on the time series query, assuming a SQL-based store.
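
To make that concrete, here is a rough sketch of how a per-slice query differs from the global one. The table and column names mirror the alert configs in this article and are assumptions about the schema; the query ThirdEye actually generates will look different.

# The per-slice query is just the global query with an extra predicate appended.
GLOBAL_QUERY = (
    "SELECT DISTINCTCOUNTHLL(user_id) AS dau "
    "FROM user_activity_events "
    "WHERE ts >= {start} AND ts < {end}"
)

def slice_query(start_ms: int, end_ms: int, slice_filter: str = "") -> str:
    # slice_filter matches an enumeration item, e.g. " AND country='IN' AND device='Android'"
    return GLOBAL_QUERY.format(start=start_ms, end=end_ms) + slice_filter

print(slice_query(1735689600000, 1738368000000, " AND country='IN' AND device='Android'"))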

Most modern anomaly detection platforms, including Startree ThirdEye, support multidimensional alerting. Here is a sample configuration for a multidimensional alert modeling the above scenario.

{
  "name": "DAU ETS Alert - Key Slices",
  "description": "Alert if daily active users deviate from an ETS forecast for key country/device slices",
  "template": {
    "name": "startree-ets-dx"
  },
  "templateProperties": {
    "dataSource": "pinot-prod",
    "dataset": "user_activity_events",

    "aggregationColumn": "user_id",
    "aggregationFunction": "DISTINCTCOUNTHLL",

    "monitoringGranularity": "PT1D",

    /* ETS-specific tuning */
    "seasonalityPeriod": "P7D",
    "lookback": "P30D",
    "sensitivity": "1",

    /* Dimension exploration */
    "queryFilters": "${queryFilters}",
    "enumerationItems": [
      {
        "name": "US-iOS",
        "params": {
          "queryFilters": " AND country='US' AND device='iOS'"
        }
      },
      {
        "name": "IN-Android",
        "params": {
          "queryFilters": " AND country='IN' AND device='Android'"
        }
      },
      {
        "name": "DE-AllDevices",
        "params": {
          "queryFilters": " AND country='DE'"
        }
      }
    ]
  }
}

The underlying execution engine follows a Directed Acyclic Graph (DAG) architecture. The core pipeline is still the simple alert pipeline (DataFetcher → Anomaly Detector), but it is wrapped in a fork-join driven loop that runs once per enumeration item.
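
Conceptually, the fork-join stage behaves like the sketch below: fork one DataFetcher → detector run per enumeration item, then join the results for post-processing. This is a simplified Python analogy with stubbed-out stages, not ThirdEye's engine code.

from concurrent.futures import ThreadPoolExecutor

def fetch_series(slice_filter: str):
    ...  # DataFetcher: query the metrics store with the slice's WHERE clause

def detect(series):
    ...  # Anomaly Detector: run the configured detector (e.g. ETS) on one slice

enumeration_items = [
    {"name": "US-iOS", "queryFilters": " AND country='US' AND device='iOS'"},
    {"name": "IN-Android", "queryFilters": " AND country='IN' AND device='Android'"},
    {"name": "DE-AllDevices", "queryFilters": " AND country='DE'"},
]

def run_slice(item):
    series = fetch_series(item["queryFilters"])
    return item["name"], detect(series)

# Fork: one pipeline run per slice. Join: collect all results before post-processing.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_slice, enumeration_items))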

Managing Multidimensional Alerts in Prod

Multidimensional alerting can quickly scale up and become challenging to operate: per-slice queries get more expensive as cardinality grows, and too many noisy slices lead to alert fatigue. The next steps at the end of this article point at ways to work around both.

Wrap up

We all love to show green dashboards to our executives. However, the main objective must always be the quality of the actual product or service that we ship. Observability, and especially granular observability, is key to making sure we put out the candles before they cause widespread fires.

In this article, we walked through:

  1. Why top-level metrics alone can hide issues affecting specific user segments, and how metric dimensionality and combinatorial explosion shape the problem space.
  2. An example observability pipeline that feeds events into a metrics store and an anomaly detection system.
  3. Configuring a global ETS-based alert and extending it into a multidimensional alert over key slices, executed via a fork-join pipeline.

If there’s one takeaway, it’s this:

Don’t wait for your users to tell you that your system is broken. Be proactive and granular with monitoring to make sure you have eyes and ears at the right places.

From here, there are two natural next steps:

  1. Go deeper on performance. How do you keep per-slice queries fast when you have high cardinality and tight SLAs? That’s where time bucketing, pre-aggregations, and indexing strategies in your OLAP store come in.
  2. Get smarter about which slices to watch. Instead of hand-picking specific cohorts, you can use objective functions and dimension trees to prioritize slices under compute and alert budgets.

Please give this a thumbs up or let me know if the above topics would be helpful, and I can cover them in subsequent follow-ups.

Disclosure: I work at Startree Inc on Apache Pinot and Startree ThirdEye.