
Logging is arguably the most important element of your observability solution. Logs provide foundational and rich information about system behavior. In an ideal world, you would make all the decisions about logging and implement a consistent approach across your entire system.

However, in the real world, you might work with legacy software or deal with different programming languages, frameworks, and open-source packages, each with its own format and structure for logging.

With such a diversity in log formats across your system, what steps can you take to extract the most value from all your logs? That’s what we’ll cover in this post.

We’ll look at how logs can be designed, the challenges and solutions to logging in large systems, and how to think about log-based metrics and long-term retention.

Let’s dive in with a look at log levels and formats.

Logging Design

Many considerations go into log design, but the two most important aspects are the use of log levels and whether to use structured or unstructured log formats.

Log Levels

Log levels categorize log messages by severity. The specific levels available vary by logging framework or system, but commonly used levels include (from most to least verbose):

TRACE: Captures every action the system takes, for reconstructing a comprehensive record and accounting for any state change.

DEBUG: Captures detailed information for debugging purposes. These messages are typically only relevant during development and should not be enabled in production environments.

INFO: Provides general information about the system's operation to convey important events or milestones in the system's execution.

WARNING: Indicates potential issues or situations that might require attention. These messages are not critical but should be noted and investigated if necessary.

ERROR: Indicates errors that occurred during the execution of the system. These messages typically highlight issues that need to be addressed and might impact the system's functionality.

Logging at the appropriate level helps with understanding the system's behavior, identifying issues, and troubleshooting problems effectively.

When it comes to system components that you build, we recommend that you devote some time to defining the set of log levels that are useful. Understand what kinds of information should be included in messages at each log level, and use the log levels consistently.
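As a rough illustration of how a level threshold works in practice, here is a minimal sketch using Python's standard `logging` module. The component name `orders.checkout` and the messages are made up for the example. Note that Python's standard library has no built-in TRACE level, so the sketch starts at DEBUG.

```python
import io
import logging

# Hypothetical component name; any dotted name works.
logger = logging.getLogger("orders.checkout")
logger.propagate = False  # keep the example isolated from the root logger

# Write to an in-memory buffer so the result is easy to inspect.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)

# Production-style threshold: INFO and above are emitted, DEBUG is dropped.
logger.setLevel(logging.INFO)

logger.debug("cart contents: %s", ["sku-1", "sku-2"])  # suppressed
logger.info("order placed")                            # emitted
logger.warning("payment retry %d of %d", 2, 3)         # emitted
logger.error("payment failed")                         # emitted

emitted = buffer.getvalue().splitlines()
```

Lowering the threshold to `logging.DEBUG` in a development environment would let the first message through without touching any of the call sites.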

Later, we’ll discuss how to deal with third-party applications, where you have no control over the log levels. We’ll also look at legacy applications that you control but that are too large or costly to migrate to your standard log levels.

Structured Versus Unstructured Logs

Entries in structured logs have a well-defined format, usually as key-value pairs or JSON objects. This allows for consistent and machine-readable log entries, making it easier to parse and analyze log data programmatically.

Structured logging enables advanced log querying and analysis, making it particularly useful in large-scale systems.

On the other hand, unstructured (free-form) logging captures messages in a more human-readable format, without a predefined structure. This approach allows developers to log messages more naturally and flexibly.

However, programmatically extracting specific information from the resulting logs can be very challenging.

Choosing between structured and unstructured logs depends on your specific needs and the requirements and constraints of your system. If you anticipate the need for advanced log analysis or integration with log analysis tools, structured logs can provide significant benefits.

However, if all you need is simplicity and readability, then unstructured logs may be sufficient.

In some cases, a hybrid approach can also be used, where you use structured logs for important events and unstructured logs for more general messages.

For large-scale systems, you should lean towards structured logging when possible but note that this adds another dimension to your planning. The expectation for structured log messages is that the same set of fields will be used consistently across system components. This will require strategic planning.
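To make the idea of a consistent field set concrete, here is one possible sketch of a JSON formatter built on Python's standard `logging` module. The field names (`timestamp`, `level`, `component`, `message`, `user_id`, `request_id`) are assumptions chosen for illustration, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("user_id", "request_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("payments")  # hypothetical component name
log.propagate = False
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info("charge declined", extra={"user_id": 42, "request_id": "req-7"})

# Each line can now be parsed back into fields instead of searched as text.
parsed = json.loads(stream.getvalue())
```

If every component shared a formatter like this, a query such as "all ERROR entries for `user_id` 42" would become a field match rather than a text search.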

Logging Challenges

With systems comprising multiple components, each component will most likely have its own model to manage its logs. Let’s review the challenges this brings.

Disparate Destinations

Components will log to different destinations—files, system logs, stdout, or stderr. In distributed systems, collecting these scattered logs for effective use is cumbersome.

For this, you’ll need a diversified approach to log collection, such as using installed collectors and hosted collectors from Sumo Logic.

Varying Formats

Some components will use unstructured, free-form logging, not following any format in particular. Meanwhile, structured logs may be more organized, but components with structured logs might employ completely different sets of fields.

Unifying the information you get from a diversity of logs and formats requires the right tools.

Inconsistent Log Levels

Components in your system might use different ranges of log levels. Even if you consolidate all log messages into a centralized logging system (as you should), you will need to deal with the union of all log levels.

One challenge that arises is when different log levels ought to be treated the same. For example, ERROR in one component might be the same as CRITICAL in another component, requiring immediate escalation.

You face the opposite challenge when the same log level in different components means different things. For example, INFO messages in one component may be essential for understanding the system state, while in another component they might be too verbose.
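One way to handle both cases is a per-component mapping onto a single canonical set of levels, applied when logs reach the central system. The sketch below is only an illustration: the component names (`billing`, `search`) and the mappings themselves are hypothetical and would come from your own review of each component.

```python
# Hypothetical per-component overrides onto one canonical set of levels.
# "billing" uses CRITICAL/FATAL where our standard says ERROR;
# "search" logs routine detail at INFO, which we treat as DEBUG.
LEVEL_MAP = {
    "billing": {"CRITICAL": "ERROR", "FATAL": "ERROR", "WARN": "WARNING"},
    "search": {"INFO": "DEBUG"},
}


def normalize_level(component, level):
    """Translate a component's own level name into the canonical one.

    Levels without an explicit override pass through unchanged.
    """
    level = level.upper()
    return LEVEL_MAP.get(component, {}).get(level, level)
```

With this in place, `normalize_level("billing", "critical")` and an ordinary `ERROR` from any other component end up under the same label, so one alerting rule can cover both.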

Log Storage Cost

Large distributed systems accumulate a lot of logs. Collecting and storing these logs isn’t cheap. Log-related costs in the cloud can make up a significant portion of the total cost of the system.

Dealing With These Challenges

While the challenges of logging in large, distributed systems are significant, solutions can be found through some of the following practices.

Aggregate Your Logs

When you run a distributed system, you should use a centralized logging solution. Log collection agents running on each machine in your system send all the logs to your central observability platform.

Sumo Logic, which has always focused on log management and analytics, is best in class when it comes to log aggregation.

Move Toward a Unified Format

Dealing with logs in different formats is a big problem if you want to correlate log data for analytics and troubleshooting across applications and components. One solution is to transform different logs into a unified format.

The level of effort for this task can be high, so consider doing this in phases, starting with your most essential components and working your way down.
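A transformation step of this kind might look like the sketch below, which maps a JSON log line and a free-form log line onto the same four fields. The legacy line layout, the input key names (`ts`, `severity`, `msg`), and the output schema are all assumptions made for the example.

```python
import json
import re

# Assumed free-form layout: "2024-05-01 12:00:00 [ERROR] message text"
LEGACY_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<message>.*)$"
)


def unify(raw_line, component):
    """Return a dict with the shared fields, whatever the input format."""
    raw_line = raw_line.strip()

    # Case 1: the component already emits JSON, possibly with other key names.
    if raw_line.startswith("{"):
        try:
            data = json.loads(raw_line)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict):
            level = data.get("severity") or data.get("level") or "UNKNOWN"
            return {
                "timestamp": data.get("ts") or data.get("timestamp"),
                "level": str(level).upper(),
                "component": component,
                "message": data.get("msg") or data.get("message", ""),
            }

    # Case 2: a free-form line that follows the assumed legacy layout.
    match = LEGACY_PATTERN.match(raw_line)
    if match:
        return {**match.groupdict(), "component": component}

    # Case 3: unknown layout. Keep the raw text rather than dropping the entry.
    return {
        "timestamp": None,
        "level": "UNKNOWN",
        "component": component,
        "message": raw_line,
    }
```

Starting with one parser per essential component and falling back to the raw text for everything else fits the phased approach: nothing is lost while coverage grows.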

Establish a Logging Standard Across Your Applications

For your own applications, work to establish a standard logging approach that adopts a uniform set of log levels, a single structured log format, and consistent semantics.

If you also have legacy applications, evaluate the level of risk and cost associated with migrating them to adhere to your standard.

If a migration is not feasible, treat your legacy applications like you would third-party applications.

Enrich Logs From Third-Party Sources

Enriching logs from third-party sources involves enhancing log data with contextual information from external systems or services. This brings a better understanding of log events, aiding in troubleshooting, analysis, and monitoring activities.

To enrich your logs, you can integrate external systems (such as APIs or message queues) to fetch supplementary data related to log events (such as user information, customer details, or system metrics).
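As a minimal sketch of enrichment, the function below attaches user context to a log entry. The in-memory `USER_DIRECTORY` is a stand-in for whatever external system you would actually query (an API, a database, a cache), and the field names are hypothetical.

```python
# Stand-in for an external source such as a user service or CRM API.
USER_DIRECTORY = {
    "u-100": {"plan": "enterprise", "region": "eu-west"},
}


def enrich(entry, lookup=USER_DIRECTORY.get):
    """Return a copy of the log entry with user context attached, if known.

    `lookup` is any callable that takes a user ID and returns a dict of
    context or None; swapping it out is how a real service would plug in.
    """
    enriched = dict(entry)  # never mutate the original entry
    context = lookup(entry.get("user_id"))
    if context:
        enriched["user"] = dict(context)
    return enriched
```

Entries without a known `user_id` pass through unchanged, so enrichment failures do not cause log data to be lost.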

Manage Log Volume, Frequency, and Retention

Carefully managing log volume, frequency, and retention is crucial for efficient log management and storage.

Volume: Monitoring generated log volume helps you control resource consumption and performance impacts.

Frequency: Determine how often to log, based on the criticality of events and desired level of monitoring.

Retention: Define a log retention policy appropriate for compliance requirements, operational needs, and available storage.

Rotation: Periodically archive or purge older log files to manage log file sizes effectively.

Compression: Compress log files to reduce storage requirements.
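For file-based logs, rotation, compression, and a simple cap on retained files can be combined in a few lines. The sketch below uses Python's standard `RotatingFileHandler` with its `rotator` and `namer` hooks. The size limit, backup count, and logger name are placeholder values, not recommendations.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source, dest):
    """Compress the rotated file, then remove the uncompressed original."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def build_rotating_logger(path, max_bytes=1_000_000, backups=5):
    """Create a logger that rotates at `max_bytes` and keeps `backups` files."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.rotator = gzip_rotator              # compress on rotation
    handler.namer = lambda name: name + ".gz"   # app.log.1 -> app.log.1.gz

    logger = logging.getLogger("rotating-example")  # hypothetical name
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger, handler
```

Here `backupCount` acts as a crude retention policy: once the limit is reached, the oldest compressed file is discarded on the next rotation.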

Log-Based Metrics

Metrics that are derived from analyzing log data can provide insights into system behavior and performance. Working with log-based metrics has its benefits and challenges.

Benefits

Granular insights: Log-based metrics provide detailed and granular insights into system events, allowing you to identify patterns, anomalies, and potential issues.

Comprehensive monitoring: By leveraging log-based metrics, you can monitor your system comprehensively, gaining visibility into critical metrics related to availability, performance, and user experience.

Historical analysis: Log-based metrics provide historical data that can be used for trend analysis, capacity planning, and performance optimization. By examining log trends over time, you can make data-driven decisions to improve efficiency and scalability.

Flexibility and customization: You can tailor your extraction of log-based metrics to suit your application or system, focusing on the events and data points that are most meaningful for your needs.

Challenges

Defining meaningful metrics: Because the set of metrics available to you across all your components is incredibly vast—and it wouldn’t make sense to capture them all—identifying which metrics to capture and extract from logs can be a complex task.

This identification requires a deep understanding of system behavior and close alignment with your business objectives.

Data extraction and parsing: Parsing logs to extract useful metrics may require specialized tools or custom parsers. This is especially true if logs are unstructured or formatted inconsistently from one component to the next.

Setting this up can be time-consuming and may require maintenance as log formats change or new log sources emerge.

Need for real-time analysis: Delays in processing log-based metrics can lead to outdated or irrelevant metrics. For most situations, you will need a platform that can perform fast, real-time processing of incoming data in order to leverage log-based metrics effectively.

Performance impact: Continuously capturing component profiling metrics places additional strain on system resources. You will need to find a good balance between capturing sufficient log-based metrics and maintaining adequate system performance.

Data noise and irrelevance: Log data often includes a lot of noise and irrelevant information that does not contribute to meaningful metrics. Careful log filtering and normalization are necessary to focus data gathering on relevant events.
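To show what extracting a metric from logs can look like, here is a small sketch that filters out everything except errors and counts them per component per minute. It assumes entries are already structured, with hypothetical `timestamp` (ISO 8601, no timezone suffix), `level`, and `component` fields. A real platform would do this continuously on incoming data rather than on a list in memory.

```python
from collections import Counter
from datetime import datetime


def error_counts_per_minute(entries):
    """Count ERROR entries, keyed by (component, minute)."""
    counts = Counter()
    for entry in entries:
        # Filtering step: ignore everything that is not an error.
        if entry.get("level") != "ERROR":
            continue
        ts = datetime.fromisoformat(entry["timestamp"])
        # Truncate to the start of the minute to form the bucket key.
        minute = ts.replace(second=0, microsecond=0).isoformat()
        counts[(entry["component"], minute)] += 1
    return counts
```

The resulting counts are the kind of series you could chart or alert on, for example when a component's error count in one minute crosses a threshold you choose.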

Long-Term Log Retention

After you’ve made the move toward log aggregation in a centralized system, you will still need to consider long-term log retention policies. Let’s cover the critical questions for this area.

How Long Should You Keep Logs Around?

How long you should keep a log around depends on several factors, including:

Log type: Some logs (such as access logs) can be deleted after a short time. Other logs (such as error logs) may need to be kept for a longer time in case they are needed for troubleshooting.

Regulatory requirements: Industries like healthcare and finance have regulations that require organizations to keep logs for a certain time, sometimes even a few years.

Company policy: Your company may have policies that dictate how long logs should be kept.

Log size: If your logs are large, you may need to rotate them or delete them more frequently.

Storage cost: Whether you store your logs on-premises or in the cloud, you will need to factor in the cost of storage.
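Once you have weighed these factors, the outcome can be written down as a per-type retention table. The sketch below is purely illustrative: the log types and the periods are hypothetical placeholders, and real values should come from your own regulatory and company requirements.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods per log type; not recommendations.
RETENTION = {
    "access": timedelta(days=30),
    "error": timedelta(days=180),
    "audit": timedelta(days=365 * 7),
}
DEFAULT_RETENTION = timedelta(days=90)


def is_expired(log_type, created_at, now):
    """Return True if a log of this type has outlived its retention period."""
    return now - created_at > RETENTION.get(log_type, DEFAULT_RETENTION)
```

A scheduled cleanup job could call `is_expired` for each stored log file or batch and delete or archive whatever has aged out.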

How Do You Reduce the Level of Detail and Cost of Older Logs?

Deleting old logs is, of course, the simplest way to reduce your storage costs. However, it may be a bit heavy-handed, and you sometimes may want to keep information from old logs around.

When you want to keep information from old logs, but also want to be cost-efficient, consider taking some of these measures:

Downsampling logs: In the case of components that generate many repetitive log statements, you might ingest only a subset of the statements (for example, 1 out of every 10).

Trimming logs: For logs with large messages, you might discard some fields. For example, if an error log has an error code and an error description, you might have all the information you need by keeping only the error code.

Compression and archiving: You can compress old logs and move them to cheaper and less accessible storage (especially in the cloud). This is a great solution for logs that you need to store for years to meet regulatory compliance requirements.
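The three measures above can be sketched as small functions. The field names in `trim` and the newline-delimited JSON layout in `archive` are assumptions for the example. Moving the compressed bytes to cheaper storage is left out because it depends on your provider.

```python
import gzip
import json


def downsample(entries, keep_every=10):
    """Keep 1 out of every `keep_every` entries, starting with the first."""
    return [e for i, e in enumerate(entries) if i % keep_every == 0]


def trim(entry, keep_fields=("timestamp", "level", "error_code")):
    """Drop every field that is not in `keep_fields`."""
    return {k: v for k, v in entry.items() if k in keep_fields}


def archive(entries):
    """Serialize entries as newline-delimited JSON and gzip the result."""
    payload = "\n".join(json.dumps(e) for e in entries).encode("utf-8")
    return gzip.compress(payload)
```

Applied in sequence (`archive([trim(e) for e in downsample(old_entries)])`), these steps shrink old logs considerably while keeping enough detail to answer basic questions later. Note that uniform downsampling can discard rare events, so it is best reserved for repetitive statements.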

Conclusion

In this article, we’ve looked at how to get the most out of logging in large-scale systems.

Although logging in these systems presents a unique set of challenges, we’ve looked at potential solutions to these challenges, such as log aggregation, transforming logs to a unified format, and enriching logs with data from third-party sources.

Logging is a critical part of observability. By following the practices outlined in this article, you can ensure that your logs are managed effectively, enabling you to troubleshoot problems, identify issues, and gain insights into the behavior of your system.

And you can do this while keeping your logging costs under control.


How to Extract the Maximum Value From Logs