Monitoring is your guard tower, continuously watching over your application and infrastructure. Without actively monitoring your full stack, it is difficult to make informed decisions about optimization and resource allocation. Want to roll out a new feature or update? Monitoring gives you the confidence to make changes without fearing unexpected hiccups and late-night Zoom calls.

Metrics, the quantitative measurements of various aspects of your system's behavior, are the lifeblood of effective monitoring. They provide real-time insights into resource usage, error rates, response times, and much more. Without metrics, monitoring becomes akin to flying blind, leaving you unaware of potential issues until they escalate. In this post, let's delve into the fundamentals of metrics.

Metrics, Events, Logs, and Traces: The Foundation of Monitoring

The acronym M.E.L.T. is often used to describe the four essential data types in monitoring: metrics, events, logs, and traces.

Together, these building blocks form the foundation of your monitoring strategy. Logs provide historical context, metrics offer quick insights, events catch your attention when something important happens, and traces provide a detailed path for in-depth investigation.

Metrics: The Cornerstone of Monitoring

As I mentioned before, metrics consist of raw measurements reflecting resource usage and system behavior, which are systematically observed and collected across your infrastructure. These measurements may include low-level usage stats provided by the operating system, as well as higher-level data related to the specific functions, services or operations of a component, such as requests processed per second or pods running in a K8s cluster.

Some metrics are presented in relation to a total capacity (disk usage as a percentage of available space, for example), while others are represented as a rate that indicates the "busyness" of a component (such as requests per second).

A practical starting point for metrics typically involves leveraging the readily available data provided by your operating system, which reflects the utilization of essential physical resources. Information regarding disk space, CPU load, swap usage, and similar metrics is easily accessible, delivers immediate insights, and can be effortlessly transmitted to a monitoring system. These are often referred to as Infrastructure Metrics.
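As a quick illustration, here is a minimal sketch that reads a few of these infrastructure metrics in Python using the third-party psutil library (the choice of psutil is my assumption; any comparable system library would do):

```python
# Minimal sketch: reading basic infrastructure metrics with psutil.
# Assumes `pip install psutil`; psutil is not in the standard library.
import psutil

# CPU utilization as a percentage, sampled over a one-second interval.
cpu_percent = psutil.cpu_percent(interval=1)

# Disk usage for the root filesystem, relative to total capacity.
disk = psutil.disk_usage("/")

# Swap usage.
swap = psutil.swap_memory()

print(f"cpu_load_percent={cpu_percent}")
print(f"disk_used_percent={disk.percent}")
print(f"swap_used_percent={swap.percent}")
```

Each of these values can then be shipped to a monitoring system on a schedule.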

Many web servers, database servers, and other software also provide their own metrics, which can be passed forward as well.

Collecting and exposing metrics is called instrumenting your service. This usually involves adding code that exposes the metrics you care about in a standard format, such as OpenTelemetry (OTel) or the Prometheus exposition format.
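To make this concrete, here is a minimal instrumentation sketch using the official Prometheus Python client (the metric name, labels, and port below are illustrative assumptions, not a prescribed convention):

```python
# Minimal instrumentation sketch with the Prometheus Python client.
# Assumes `pip install prometheus-client`.
import time

from prometheus_client import Counter, start_http_server

# A counter for requests processed, labeled by HTTP method and status.
# The metric name and labels are illustrative assumptions.
REQUESTS = Counter(
    "myapp_requests_total",
    "Total requests processed by the service.",
    ["method", "status"],
)

def handle_request(method: str) -> None:
    # ... real request handling would go here ...
    REQUESTS.labels(method=method, status="200").inc()

if __name__ == "__main__":
    # Expose metrics at http://localhost:8000/metrics in the
    # Prometheus text format, ready for a scraper to collect.
    start_http_server(8000)
    while True:
        handle_request("GET")
        time.sleep(1)
```

Once this is running, a Prometheus server can scrape the /metrics endpoint on a schedule, which is the collection stage described later in this post.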

But Why Metrics?

Metrics are useful because they provide insight into the behavior and health of your systems, especially when analyzed in aggregate.

Metrics are hence at the heart of monitoring, and for good reason.

But What If I Already Have Logs?

Logs are easy to integrate into your application, and they give you the ability to represent any type of data in the form of strings. Metrics, on the other hand, are numerical representations of data. These are often used to count or measure a value and are aggregated over a period of time. Metrics give us insights into the historical and current state of a system. Since they are just numbers, we can also use them to perform statistical analysis and predictions about the system's future behavior.
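To illustrate that last point, here is a small sketch that treats a series of response-time samples as plain numbers and summarizes them (the sample values are made up for illustration):

```python
# Sketch: because metrics are just numbers, standard statistics apply.
# The response-time samples below are made-up illustrative values (ms).
import statistics

response_times_ms = [12.0, 15.2, 11.8, 240.5, 13.1, 14.7, 12.9, 13.4]

mean = statistics.mean(response_times_ms)
p95 = statistics.quantiles(response_times_ms, n=100)[94]  # 95th percentile

print(f"mean={mean:.1f}ms p95={p95:.1f}ms")
```

Doing the same with raw log lines would first require parsing the numbers out of strings, which is exactly the overhead metrics avoid.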

Let's Take a Deeper Look at Metrics

Hopefully by now you are intrigued enough to dive deeper. A metric is usually made up of a few key building blocks: a metric name, a set of labels (key-value pairs that add dimensions such as method or host), a timestamp, and a numeric value.
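Put together, a single sample might look like the following sketch (all names and values here are illustrative assumptions):

```python
# Sketch: the building blocks of a single metric sample.
# All names and values are illustrative assumptions.
import time

sample = {
    "name": "myapp_requests_total",                # metric name
    "labels": {"method": "GET", "status": "200"},  # key-value dimensions
    "timestamp": time.time(),                      # when it was observed
    "value": 1027.0,                               # the numeric measurement
}
```

In the Prometheus text format, the same sample would be rendered on a single line, such as myapp_requests_total{method="GET",status="200"} 1027.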

The Metric Lifecycle

To understand how metrics can be used in your monitoring tool, we need to understand the different stages a metric goes through.

The lifecycle of a metric in monitoring systems typically consists of several key stages:

  1. Instrumentation: Metrics are first introduced through instrumentation. This involves adding code or configurations to the components of your system, such as applications, services, or infrastructure, to collect and expose data. This can include defining what data should be captured and how it should be labeled.

  2. Collection: Once the metrics are instrumented, monitoring agents or systems periodically collect these metrics. Metrics are collected from the instrumented components and then prepared for storage and analysis. In the case of Prometheus, for example, metrics are collected using the Prometheus server's scraping process.

  3. Storage: Collected metrics are stored in a time-series database. This database organizes metrics based on their names, labels, and timestamps. Storing metrics over time allows for historical analysis, trend detection, and the ability to answer questions about past system behavior.

  4. Analysis and Visualization: Metrics data is analyzed and visualized using tools like Grafana, Kibana, or custom dashboards. These tools enable users to create charts, graphs, and dashboards that provide a real-time view of system performance. Analysis can also involve identifying anomalies or trends that may require attention.

  5. Alerting: Metrics data is used to set up alerting rules. These rules define conditions or thresholds that, when met, trigger alerts. For example, an alert can be configured to notify operators when CPU usage exceeds a certain limit or when error rates spike beyond an acceptable level.

  6. Querying: Users or automated systems query the stored metrics to retrieve specific information or aggregated data. This retrieval may be for diagnostic purposes, troubleshooting, or for generating reports (see the query sketch after this list).

  7. Archival: Over time, older metrics data may be archived to reduce storage costs while still remaining accessible for historical analysis. Archiving helps maintain long-term records while optimizing storage resources.

  8. Maintenance and Scaling: Monitoring systems need ongoing maintenance to ensure they continue to collect, store, and analyze metrics effectively. As the system scales or the monitored environment evolves, adjustments may be needed to accommodate additional metrics or resources.

  9. Deletion: Some metrics may become irrelevant over time or may no longer serve a purpose. Pruning or deleting old, unused metrics can help keep the monitoring system lean and efficient.
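As promised above, here is a minimal sketch of the querying stage against the Prometheus HTTP API, with a simple alert-style threshold check layered on top (the server address, PromQL query, and threshold are illustrative assumptions):

```python
# Sketch: querying stored metrics from a Prometheus server (step 6),
# with a simple alert-style threshold check on the result (step 5).
# Assumes `pip install requests` and a Prometheus server on localhost:9090;
# the query and threshold are illustrative assumptions.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"
QUERY = 'rate(myapp_requests_total{status="500"}[5m])'  # per-second error rate
ERROR_RATE_THRESHOLD = 0.5

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    _, value = result["value"]  # [unix_timestamp, value_as_string]
    if float(value) > ERROR_RATE_THRESHOLD:
        print(f"ALERT: error rate {value}/s exceeds threshold for {labels}")
```

In practice, alerting is usually handled by Prometheus's own alerting rules and Alertmanager rather than a script like this; the sketch is only meant to show the shape of a query and its response.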

The lifecycle of a metric in a monitoring tool is a continuous process. It begins with the initial instrumentation and spans throughout the operation of the system, providing crucial insights into the health, performance, and behavior of the monitored components.

This data is essential for maintaining system reliability, identifying issues, and making informed decisions about system improvements and optimizations.


In the next post of this series, we will talk about getting started with Prometheus, the de facto open-source monitoring solution that is widely adopted by DevOps and SRE teams across the industry.