Monitoring is your guard tower, continuously watching over your application and infrastructure. Without actively monitoring your full stack, it is difficult to make informed decisions about optimization and resource allocation. Want to roll out a new feature or update? Monitoring gives you the confidence to make changes without fearing unexpected hiccups and late-night Zoom calls.

Metrics, the quantitative measurements of various aspects of your system's behavior, are the lifeblood of effective monitoring. They provide real-time insights into resource usage, error rates, response times, and much more. Without metrics, monitoring becomes akin to flying blind, leaving you unaware of potential issues until they escalate. In this post, let's delve into the fundamentals of metrics.

Metrics, Events, Logs, and Traces: The Foundation of Monitoring

The acronym M.E.L.T. is often used to describe the four essential data types in monitoring: metrics, events, logs, and traces.

Together, these building blocks form the foundation of your monitoring strategy. Logs provide historical context, metrics offer quick insights, events catch your attention when something important happens, and traces provide a detailed path for in-depth investigation.

Metrics: The Cornerstone of Monitoring

As I mentioned before, metrics consist of raw measurements reflecting resource usage and system behavior, which are systematically observed and collected across your infrastructure. These measurements may include low-level usage stats provided by the operating system, as well as higher-level data related to the specific functions, services or operations of a component, such as requests processed per second or pods running in a K8s cluster.

Some metrics are presented in relation to a total capacity (disk usage as a percentage of available space, for example), while others are represented as a rate that indicates the "busyness" of a component (such as requests per second).

A practical starting point for metrics typically involves leveraging the readily available data provided by your operating system, which reflects the utilization of essential physical resources. Information regarding disk space, CPU load, swap usage, and similar metrics is easily accessible, delivers immediate insights, and can be effortlessly transmitted to a monitoring system. These are often referred to as Infrastructure Metrics.
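As a quick illustration, here is a minimal sketch that reads a few of these infrastructure metrics in Python using the third-party psutil library (the choice of psutil is my assumption; any comparable system library would do):

```python
# Minimal sketch: reading basic infrastructure metrics with psutil.
# Assumes `pip install psutil`; psutil is not in the standard library.
import psutil

# CPU utilization as a percentage, sampled over a one-second interval.
cpu_percent = psutil.cpu_percent(interval=1)

# Disk usage for the root filesystem, relative to total capacity.
disk = psutil.disk_usage("/")

# Swap usage.
swap = psutil.swap_memory()

print(f"cpu_load_percent={cpu_percent}")
print(f"disk_used_percent={disk.percent}")
print(f"swap_used_percent={swap.percent}")
```

Each of these values can then be shipped to a monitoring system on a schedule.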

Many web servers, database servers, and other software also provide their own metrics, which can be passed forward as well.

Collecting and exposing metrics is called instrumenting your service. This usually involves adding code that exposes the metrics you care about in a standard format, such as OpenTelemetry (OTel) or the Prometheus exposition format.
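To make this concrete, here is a minimal instrumentation sketch using the official Prometheus Python client (the metric name, labels, and port below are illustrative assumptions, not a prescribed convention):

```python
# Minimal instrumentation sketch with the Prometheus Python client.
# Assumes `pip install prometheus-client`.
import time

from prometheus_client import Counter, start_http_server

# A counter for requests processed, labeled by HTTP method and status.
# The metric name and labels are illustrative assumptions.
REQUESTS = Counter(
    "myapp_requests_total",
    "Total requests processed by the service.",
    ["method", "status"],
)

def handle_request(method: str) -> None:
    # ... real request handling would go here ...
    REQUESTS.labels(method=method, status="200").inc()

if __name__ == "__main__":
    # Expose metrics at http://localhost:8000/metrics in the
    # Prometheus text format, ready for a scraper to collect.
    start_http_server(8000)
    while True:
        handle_request("GET")
        time.sleep(1)
```

Once this is running, a Prometheus server can scrape the /metrics endpoint on a schedule, which is the collection stage described later in this post.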

But Why Metrics?

Metrics are useful because they provide insight into the behavior and health of your systems, especially when analyzed in aggregate.

Metrics are hence at the heart of monitoring, and for good reason.

But What If I Already Have Logs?

Logs are easy to integrate into your application, and they give you the ability to represent any type of data in the form of strings. Metrics, on the other hand, are numerical representations of data. These are often used to count or measure a value and are aggregated over a period of time. Metrics give us insights into the historical and current state of a system. Since they are just numbers, we can also use them to perform statistical analysis and predictions about the system's future behavior.
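To illustrate that last point, here is a small sketch that treats a series of response-time samples as plain numbers and summarizes them (the sample values are made up for illustration):

```python
# Sketch: because metrics are just numbers, standard statistics apply.
# The response-time samples below are made-up illustrative values (ms).
import statistics

response_times_ms = [12.0, 15.2, 11.8, 240.5, 13.1, 14.7, 12.9, 13.4]

mean = statistics.mean(response_times_ms)
p95 = statistics.quantiles(response_times_ms, n=100)[94]  # 95th percentile

print(f"mean={mean:.1f}ms p95={p95:.1f}ms")
```

Doing the same with raw log lines would first require parsing the numbers out of strings, which is exactly the overhead metrics avoid.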

Let's Take a Deeper Look at Metrics

Hopefully by now you are intrigued enough to dive deeper. A metric is usually made up of a few key building blocks: a metric name, a set of labels (key-value pairs that add dimensions such as method or host), a timestamp, and a numeric value.
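Put together, a single sample might look like the following sketch (all names and values here are illustrative assumptions):

```python
# Sketch: the building blocks of a single metric sample.
# All names and values are illustrative assumptions.
import time

sample = {
    "name": "myapp_requests_total",                # metric name
    "labels": {"method": "GET", "status": "200"},  # key-value dimensions
    "timestamp": time.time(),                      # when it was observed
    "value": 1027.0,                               # the numeric measurement
}
```

In the Prometheus text format, the same sample would be rendered on a single line, such as myapp_requests_total{method="GET",status="200"} 1027.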

The Metric Lifecycle

To understand how metrics can be used in your monitoring tool, we need to understand the different stages a metric goes through.

The lifecycle of a metric in monitoring systems typically consists of several key stages:

  1. Instrumentation: Metrics are first introduced through instrumentation. This involves adding code or configurations to the components of your system, such as applications, services, or infrastructure, to collect and expose data. This can include defining what data should be captured and how it should be labeled.

  2. Collection: Once the metrics are instrumented, monitoring agents or systems periodically collect these metrics. Metrics are collected from the instrumented components and then prepared for storage and analysis. In the case of Prometheus, for example, metrics are collected using the Prometheus server's scraping process.

  3. Storage: Collected metrics are stored in a time-series database. This database organizes metrics based on their names, labels, and timestamps. Storing metrics over time allows for historical analysis, trend detection, and the ability to answer questions about past system behavior.

  4. Analysis and Visualization: Metrics data is analyzed and visualized using tools like Grafana, Kibana, or custom dashboards. These tools enable users to create charts, graphs, and dashboards that provide a real-time view of system performance. Analysis can also involve identifying anomalies or trends that may require attention.

  5. Alerting: Metrics data is used to set up alerting rules. These rules define conditions or thresholds that, when met, trigger alerts. For example, an alert can be configured to notify operators when CPU usage exceeds a certain limit or when error rates spike beyond an acceptable level.

  6. Querying: Users or automated systems query the stored metrics to retrieve specific information or aggregated data. This retrieval may be for diagnostic purposes, troubleshooting, or for generating reports (see the query sketch after this list).

  7. Archival: Over time, older metrics data may be archived to reduce storage costs while still remaining accessible for historical analysis. Archiving helps maintain long-term records while optimizing storage resources.

  8. Maintenance and Scaling: Monitoring systems need ongoing maintenance to ensure they continue to collect, store, and analyze metrics effectively. As the system scales or the monitored environment evolves, adjustments may be needed to accommodate additional metrics or resources.

  9. Deletion: Some metrics may become irrelevant over time or may no longer serve a purpose. Pruning or deleting old, unused metrics can help keep the monitoring system lean and efficient.
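As promised above, here is a minimal sketch of the querying stage against the Prometheus HTTP API, with a simple alert-style threshold check layered on top (the server address, PromQL query, and threshold are illustrative assumptions):

```python
# Sketch: querying stored metrics from a Prometheus server (step 6),
# with a simple alert-style threshold check on the result (step 5).
# Assumes `pip install requests` and a Prometheus server on localhost:9090;
# the query and threshold are illustrative assumptions.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"
QUERY = 'rate(myapp_requests_total{status="500"}[5m])'  # per-second error rate
ERROR_RATE_THRESHOLD = 0.5

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    _, value = result["value"]  # [unix_timestamp, value_as_string]
    if float(value) > ERROR_RATE_THRESHOLD:
        print(f"ALERT: error rate {value}/s exceeds threshold for {labels}")
```

In practice, alerting is usually handled by Prometheus's own alerting rules and Alertmanager rather than a script like this; the sketch is only meant to show the shape of a query and its response.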

The lifecycle of a metric in a monitoring tool is a continuous process. It begins with the initial instrumentation and spans throughout the operation of the system, providing crucial insights into the health, performance, and behavior of the monitored components.

This data is essential for maintaining system reliability, identifying issues, and making informed decisions about system improvements and optimizations.


In the next post of this series, we will talk about getting started with Prometheus, the de facto open-source monitoring solution that is widely adopted by DevOps and SRE teams across the industry.