Logging is arguably the most important element of your observability solution. Logs provide foundational and rich information about system behavior. In an ideal world, you would make all the decisions about logging and implement a consistent approach across your entire system.

However, in the real world, you might work with legacy software or deal with different programming languages, frameworks, and open-source packages, each with its own format and structure for logging.

With such diversity in log formats across your system, what steps can you take to extract the most value from all your logs? That’s what we’ll cover in this post.

We’ll look at how logs can be designed, the challenges and solutions to logging in large systems, and how to think about log-based metrics and long-term retention.

Let’s dive in with a look at log levels and formats.

Logging Design

Many considerations go into log design, but the two most important aspects are the use of log levels and whether to use structured or unstructured log formats.

Log Levels

Log levels categorize log messages by severity. The specific log levels available vary depending on the logging framework or system. However, commonly used log levels include (in order of verbosity, from highest to lowest):

Logging at the appropriate level helps with understanding the system's behavior, identifying issues, and troubleshooting problems effectively.

When it comes to system components that you build, we recommend that you devote some time to defining the set of log levels that are useful. Understand what kinds of information should be included in messages at each log level, and use the log levels consistently.
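As a minimal sketch of level-based filtering, the snippet below uses Python's standard `logging` module. The logger name `checkout`, the messages, and the choice of WARNING as the threshold are illustrative assumptions, not a recommendation for any particular system.

```python
import io
import logging


def build_logger(name: str, level: int, stream) -> logging.Logger:
    """Return a logger that writes records at or above `level` to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep the output confined to our handler
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("checkout", logging.WARNING, buffer)  # hypothetical component

log.debug("cart contents: %s", ["sku-1"])   # below the threshold, dropped
log.info("order placed")                    # below the threshold, dropped
log.warning("payment retry %d of 3", 2)     # kept
log.error("payment failed")                 # kept

output = buffer.getvalue()
print(output, end="")
```

Only the WARNING and ERROR messages reach the stream. Raising or lowering the threshold per environment is one common way to trade detail for volume.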

Later, we’ll discuss how to deal with third-party applications, where you have no control over the log levels. We’ll also look at legacy applications that you control but are too expansive to migrate to the standard log levels.

Structured Versus Unstructured Logs

Entries in structured logs have a well-defined format, usually as key-value pairs or JSON objects. This allows for consistent and machine-readable log entries, making it easier to parse and analyze log data programmatically.

Structured logging enables advanced log querying and analysis, making it particularly useful in large-scale systems.
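To make this concrete, here is a small sketch of a JSON formatter built on Python's standard `logging` module. The `JsonFormatter` class, the `fields` attribute, and the field names (`order_id`, `user_id`) are hypothetical choices for illustration. A real setup would likely add a timestamp and follow whatever field conventions your system agrees on.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge caller-supplied key-value pairs passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")   # hypothetical component name
logger.handlers.clear()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("order placed", extra={"fields": {"order_id": 1234, "user_id": "u-42"}})

# Because the entry is JSON, reading a field back is a plain dictionary lookup.
parsed = json.loads(stream.getvalue())
print(parsed["order_id"])
```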

On the other hand, unstructured (free-form) logging captures messages in a more human-readable format, without a predefined structure. This approach allows developers to log messages more naturally and flexibly.

However, programmatically extracting specific information from the resulting logs can be very challenging.

Choosing between structured and unstructured logs depends on your specific needs and the requirements and constraints of your system. If you anticipate the need for advanced log analysis or integration with log analysis tools, structured logs can provide significant benefits.

However, if all you need is simplicity and readability, then unstructured logs may be sufficient.

In some cases, a hybrid approach can also be used, where you use structured logs for important events and unstructured logs for more general messages.

For large-scale systems, you should lean towards structured logging when possible, but note that this adds another dimension to your planning. Structured log messages are only useful at scale if the same set of fields is used consistently across system components, and that consistency requires strategic planning.

Logging Challenges

With systems comprising multiple components, each component will most likely have its own model to manage its logs. Let’s review the challenges this brings.

Disparate Destinations

Components will log to different destinations—files, system logs, stdout, or stderr. In distributed systems, collecting these scattered logs for effective use is cumbersome.

For this, you’ll need a diversified approach to log collection, such as using installed collectors and hosted collectors from Sumo Logic.

Varying Formats

Some components will use unstructured, free-form logging, not following any format in particular. Meanwhile, structured logs may be more organized, but components with structured logs might employ completely different sets of fields.

Unifying the information you get from a diversity of logs and formats requires the right tools.

Inconsistent Log Levels

Components in your system might use different ranges of log levels. Even if you consolidate all log messages into a centralized logging system (as you should), you will need to deal with the union of all log levels.

One challenge that arises is when different log levels ought to be treated the same. For example, ERROR in one component might be the same as CRITICAL in another component, requiring immediate escalation.

You face the opposite challenge when the same log level in different components means different things. For example, INFO messages in one component may be essential for understanding the system state, while in another component they might be too verbose.
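One way to handle both cases is a per-component mapping onto a shared scale, applied when logs are ingested. The sketch below is only an illustration: the component names (`billing`, `search`) and the mapping itself are assumptions, and the right mapping depends on what each level actually means in your components.

```python
# (component, original level) -> level on the shared scale.
# Anything not listed passes through unchanged.
LEVEL_MAP = {
    ("billing", "CRITICAL"): "ERROR",  # billing's CRITICAL is treated like ERROR elsewhere
    ("search", "INFO"): "DEBUG",       # search's INFO is too verbose for the shared INFO
}


def normalize_level(component: str, level: str) -> str:
    """Translate a component-specific log level to the shared scale."""
    level = level.upper()
    return LEVEL_MAP.get((component, level), level)


print(normalize_level("billing", "critical"))  # mapped to the shared ERROR
print(normalize_level("search", "INFO"))       # demoted to DEBUG
print(normalize_level("orders", "INFO"))       # unmapped component, unchanged
```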

Log Storage Cost

Large distributed systems accumulate a lot of logs. Collecting and storing these logs isn’t cheap. Log-related costs in the cloud can make up a significant portion of the total cost of the system.

Dealing With These Challenges

While the challenges of logging in large, distributed systems are significant, solutions can be found through some of the following practices.

Aggregate Your Logs

When you run a distributed system, you should use a centralized logging solution. Run a log collection agent on each machine in your system, and have these collectors send all the logs to your central observability platform.

Sumo Logic, which has always focused on log management and analytics, is best in class when it comes to log aggregation.

Move Toward a Unified Format

Dealing with logs in different formats is a big problem if you want to correlate log data for analytics and troubleshooting across applications and components. One solution is to transform different logs into a unified format.

The level of effort for this task can be high, so consider doing this in phases, starting with your most essential components and working your way down.
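As a rough sketch of such a transformation, the code below converts two differently shaped log lines into one shared dictionary layout. Both input formats, the source names, and the target field names (`timestamp`, `level`, `message`, `source`) are assumptions made up for this example.

```python
import json
import re
from typing import Optional

# Assumed free-form layout: "<date> <time> <LEVEL> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def from_free_form(line: str, source: str) -> Optional[dict]:
    """Parse an unstructured line; return None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return {
        "timestamp": match["timestamp"],
        "level": match["level"],
        "message": match["message"],
        "source": source,
    }


def from_json(line: str, source: str) -> dict:
    """Rename the fields of a JSON log that uses ts/severity/msg."""
    raw = json.loads(line)
    return {
        "timestamp": raw["ts"],
        "level": raw["severity"].upper(),
        "message": raw["msg"],
        "source": source,
    }


unified = [
    from_free_form("2024-05-01 12:00:03 ERROR payment failed for order 1234", "legacy-billing"),
    from_json('{"ts": "2024-05-01 12:00:04", "severity": "warn", "msg": "slow query"}', "search"),
]
for entry in unified:
    print(entry)
```

Once every entry shares the same keys, a single query can filter or correlate across components. Lines that fail to parse return `None` here, and in practice you would want to count or store them somewhere rather than lose them silently.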

Establish a Logging Standard Across Your Applications

For your own applications, work to establish a standard logging approach that adopts a uniform set of log levels, a single structured log format, and consistent semantics.

If you also have legacy applications, evaluate the level of risk and cost associated with migrating them to adhere to your standard.

If a migration is not feasible, treat your legacy applications like you would third-party applications.

Enrich Logs From Third-Party Sources

Enriching logs from third-party sources involves enhancing log data with contextual information from external systems or services. This brings a better understanding of log events, aiding in troubleshooting, analysis, and monitoring activities.

To enrich your logs, you can integrate external systems (such as APIs or message queues) to fetch supplementary data related to log events (such as user information, customer details, or system metrics).
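The sketch below shows the shape of an enrichment step. An in-memory dictionary stands in for the external system, and the names (`USER_DIRECTORY`, `user_id`, `plan`, `region`) are hypothetical. In practice, the lookup might be an API call or a cache in front of one.

```python
# Stand-in for an external user service (assumption for this example).
USER_DIRECTORY = {
    "u-42": {"plan": "enterprise", "region": "eu-west"},
}


def enrich(entry: dict, directory: dict) -> dict:
    """Return a copy of the log entry with user context added, if available."""
    enriched = dict(entry)                        # never mutate the original entry
    context = directory.get(entry.get("user_id"))
    if context:
        for key, value in context.items():
            # Prefix the added fields and keep any value already on the entry.
            enriched.setdefault(f"user_{key}", value)
    return enriched


event = {"level": "ERROR", "message": "payment failed", "user_id": "u-42"}
print(enrich(event, USER_DIRECTORY))
```

Entries without a known `user_id` pass through unchanged, so a failed lookup never blocks ingestion.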

Manage Log Volume, Frequency, and Retention

Carefully managing log volume, frequency, and retention is crucial for efficient log management and storage.
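One possible technique for reducing volume is sampling low-severity messages at the source. The sketch below uses a `logging.Filter` from Python's standard library. The `SamplingFilter` class and the 1-in-10 rate are illustrative assumptions, not a recommended setting.

```python
import logging


class SamplingFilter(logging.Filter):
    """Keep every record at WARNING or above, and 1 in N of the rest."""

    def __init__(self, keep_every: int):
        super().__init__()
        self.keep_every = keep_every
        self._seen = 0

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True                      # never drop warnings or errors
        self._seen += 1
        # Keep the 1st, (N+1)th, (2N+1)th ... low-severity record.
        return (self._seen - 1) % self.keep_every == 0


sampler = SamplingFilter(keep_every=10)
info_records = [logging.makeLogRecord({"levelno": logging.INFO}) for _ in range(100)]
kept = sum(1 for record in info_records if sampler.filter(record))
print(kept)
```

In a real application, you would attach the filter with `handler.addFilter(sampler)` or `logger.addFilter(sampler)`. Sampling trades completeness for cost, so it fits best with high-frequency messages whose individual entries matter little.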

Log-Based Metrics

Metrics derived from analyzing log data can provide insights into system behavior and performance. Working with log-based metrics has its benefits and challenges.
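As a small illustration, the sketch below derives an errors-per-minute metric from structured log entries. The entry layout (ISO-style `timestamp` strings and a `level` field) is an assumption for this example.

```python
from collections import Counter


def errors_per_minute(entries: list) -> dict:
    """Count ERROR entries, bucketed by the minute of their timestamp."""
    counts = Counter()
    for entry in entries:
        if entry["level"] == "ERROR":
            minute = entry["timestamp"][:16]   # "YYYY-MM-DDTHH:MM"
            counts[minute] += 1
    return dict(counts)


sample = [
    {"timestamp": "2024-05-01T12:00:03", "level": "ERROR"},
    {"timestamp": "2024-05-01T12:00:41", "level": "ERROR"},
    {"timestamp": "2024-05-01T12:00:50", "level": "INFO"},
    {"timestamp": "2024-05-01T12:01:07", "level": "ERROR"},
]
metric = errors_per_minute(sample)
print(metric)
```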

Benefits

Challenges

Long-Term Log Retention

After you’ve made the move toward log aggregation in a centralized system, you will still need to consider long-term log retention policies. Let’s cover the critical questions for this area.

How Long Should You Keep Logs Around?

How long you should keep a log around depends on several factors, including:

How Do You Reduce the Level of Detail and Cost of Older Logs?

Deleting old logs is, of course, the simplest way to reduce your storage costs. However, it may be a bit heavy-handed, and you sometimes may want to keep information from old logs around.

When you want to keep information from old logs, but also want to be cost-efficient, consider taking some of these measures:
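As one illustration of this kind of measure, the sketch below drops low-severity entries from an old batch and compresses what remains. This is our own example, and the level ranking, the INFO cutoff, and the JSON-lines layout are all assumptions. Whether dropping detail is acceptable depends on your compliance and troubleshooting needs.

```python
import gzip
import json

# Assumed ranking of levels on the shared scale.
LEVEL_RANK = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def archive(entries: list, min_level: str = "INFO") -> bytes:
    """Drop entries below min_level and gzip the rest as JSON lines."""
    threshold = LEVEL_RANK[min_level]
    kept = [
        entry for entry in entries
        # Unknown levels are kept rather than silently discarded.
        if LEVEL_RANK.get(entry["level"], threshold) >= threshold
    ]
    payload = "\n".join(json.dumps(entry, sort_keys=True) for entry in kept)
    return gzip.compress(payload.encode("utf-8"))


old_logs = [
    {"level": "DEBUG", "message": "cache miss"},
    {"level": "INFO", "message": "order placed"},
    {"level": "ERROR", "message": "payment failed"},
]
blob = archive(old_logs)

# Reading the archive back recovers only the retained entries.
restored = [json.loads(line) for line in gzip.decompress(blob).decode("utf-8").splitlines()]
print(restored)
```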

Conclusion

In this article, we’ve looked at how to get the most out of logging in large-scale systems.

Although logging in these systems presents a unique set of challenges, we’ve looked at potential solutions to these challenges, such as log aggregation, transforming logs to a unified format, and enriching logs with data from third-party sources.

Logging is a critical part of observability. By following the practices outlined in this article, you can ensure that your logs are managed effectively, enabling you to troubleshoot problems, identify issues, and gain insights into the behavior of your system.

And you can do this while keeping your logging costs under control.

