Why?

Understanding and applying the concepts of SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement) matters for businesses, startups, and teams developing new features or launching applications, for several reasons:

Setting Goals and Expectations

SLIs and SLOs make it possible to define and measure clear performance and availability goals for a service. This matters to a business because it sets specific expectations about the quality of service provided. For example, with an SLI defined as the service availability percentage or the average response time, a business can set the SLO at a level that meets customer or user needs.

Ensuring Quality and Reliability

Using SLI and SLO helps businesses ensure the high quality and reliability of their services or products. Having well-defined SLOs enables developers and engineers to aim for specific goals when developing new features or application updates.

Monitoring and Management

SLI and SLO provide the foundation for real-time performance monitoring of a service. This allows for timely detection and response to potential issues or failures, minimizing downtime and improving overall user experience.

Assessing the Effectiveness of New Features

When developing new features or launching new products, SLI and SLO can be used to assess their effectiveness. For instance, if a new feature impacts response time or error rates, analyzing changes in SLI and SLO can help evaluate the positive or negative impact on user experience.

Aligning with Customers and Partners

An SLA, built on SLIs and SLOs, is an important tool for establishing agreements with customers and partners. It gives them confidence that the business is prepared to deliver the agreed level of service and to provide compensation if the agreed standards are not met.


What are they?

The concepts of SLA (Service Level Agreement), SLO (Service Level Objective), and SLI (Service Level Indicator) can be visualized as a pyramid, with SLA at the top, representing the overarching and general agreement, while SLO and SLI are positioned below, refining and detailing this agreement.

This pyramid reflects the hierarchical structure and interrelationship between SLA, SLO, and SLI. SLA represents the overall commitment to achieving a certain level of service, SLO specifies this level with target metrics, and SLI provides data to assess the performance against these targets.

Each level of this pyramid plays its own role in ensuring a reliable, high-quality service. The SLA sets the overall agreement, but without sufficiently specific SLOs and accurate SLIs, fulfilling and evaluating that agreement is difficult. The whole hierarchy therefore matters for businesses and for the development of new features or application launches: it provides a clear understanding of service requirements and enables monitoring against them.

SLI

Let's start our discussion with SLI, which stands for Service Level Indicators.

Imagine a fairly simple service with a single functional method: a search method. This example is sufficient for our purposes.

We have a service and metrics associated with it, that is, data that can be measured and analyzed. Our task is to identify the key indicators that will help us evaluate the service's performance, such as its response time and the speed at which it processes requests.

To apply these indicators in practice, we need to determine which parameters are critical for our service. For example, for a search engine, key indicators could be the response time to a user's query and the accuracy of the search results. Based on this, a monitoring system can be built to track these metrics in real time.

Introducing SLI allows us not only to monitor the current state of the system but also to take steps to improve the quality of the service, based on specific data. This helps increase user satisfaction and makes the service more predictable and reliable.

Most often, as Service Level Indicators (SLIs), we encounter the following metrics:

With such metrics in place, we can not only monitor the state of our service in real time but also set clear goals for the development team to maintain and improve service quality. This creates a foundation for stable operation and increases user trust, as users see that the service fulfils its quality and availability promises.

Furthermore, these indicators allow us to quickly respond to emerging problems, prioritize work on errors and improvements, and analyze how changes in code or infrastructure affect the overall performance and quality of the service.
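
As a minimal sketch of how such indicators might be derived, the Python below computes two SLIs for the search method from a handful of request records: availability (the share of successful requests) and the share of requests answered within a time threshold. The `Request` record, its fields, the 0.5 s threshold, and the sample values are all assumptions made for illustration; they do not come from a real monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One hypothetical log record for the search method."""
    latency_s: float  # time from request to response, in seconds
    ok: bool          # True if the request was served without error


def availability(requests: list[Request]) -> float:
    """Share of requests served successfully (0.0 to 1.0)."""
    return sum(r.ok for r in requests) / len(requests)


def fast_share(requests: list[Request], threshold_s: float) -> float:
    """Share of requests answered within threshold_s seconds."""
    return sum(r.latency_s <= threshold_s for r in requests) / len(requests)


# Illustrative sample, not real measurements.
sample = [
    Request(0.20, True),
    Request(0.35, True),
    Request(0.70, True),
    Request(0.45, False),
]

print(f"availability: {availability(sample):.2%}")
print(f"answered within 0.5 s: {fast_share(sample, 0.5):.2%}")
```

In practice these values would usually be computed by a monitoring system over a rolling time window rather than from an in-memory list.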

SLO

As a service provider, we refine our promise by setting specific, numerically expressed goals, known as Service Level Objectives (SLOs). SLOs describe the specific targets we aim for, answering the question, "What goal are we pursuing?".

So, we already have some key indicators, and now our task is to match our customers' expectations with each of these indicators. Let's take our service with a single search method as an example and define the following SLOs:

Response time deserves a closer look. It is the period between when a user sends a request and when they receive a response from the service, and it directly affects how users perceive the service. For a search service it is especially important, as users expect quick and relevant results. By setting a goal for response time, we commit to providing users with responses that are not only accurate but also timely, improving the overall quality of interaction with the service.

It's quite common to evaluate service performance using the average response time. However, the average is not the best metric for determining the "typical" response time because it doesn't account for how many users experience specific delays.

The difference between the average and percentiles comes from how they are calculated. Consider the following sample of 35 response times:

0.700, 0.720, 0.680, 0.660, 0.740, 0.750,
0.730, 0.670, 0.710, 0.200, 0.150, 0.300,
0.350, 0.400, 0.450, 0.500, 0.550, 0.600,
0.250, 0.320, 0.380, 0.420, 0.490, 0.530,
0.580, 0.620, 0.310, 0.370, 0.440, 0.510,
0.560, 0.610, 0.290, 0.340, 0.390

Average = 0.493

The same values sorted in ascending order:

0.150, 0.200, 0.250, 0.290, 0.300, 0.310, 0.320, 0.340,
0.350, 0.370, 0.380, 0.390, 0.400, 0.420, 0.440, 0.450,
0.490, 0.500, 0.510, 0.530, 0.550, 0.560, 0.580, 0.600,
0.610, 0.620, 0.660, 0.670, 0.680, 0.700, 0.710, 0.720,
0.730, 0.740, 0.750

50th percentile (median) = 0.500

A percentile is the value below which a given percentage of all data points fall. For instance, the 99th percentile is the value below which 99% of data points fall; only 1% of data points exceed it. This shows how often extremely long response times occur, which can be crucial for defining guaranteed performance levels.
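
The calculation can be sketched in a few lines of Python. The block below takes the 35 response times listed above and computes their mean, 50th percentile, and 99th percentile. The `percentile` helper is an illustrative implementation of the nearest-rank definition, chosen here as an assumption; it is not the only way to define a percentile.

```python
import math
import statistics

# The 35 response times from the sample above.
latencies = [
    0.700, 0.720, 0.680, 0.660, 0.740, 0.750,
    0.730, 0.670, 0.710, 0.200, 0.150, 0.300,
    0.350, 0.400, 0.450, 0.500, 0.550, 0.600,
    0.250, 0.320, 0.380, 0.420, 0.490, 0.530,
    0.580, 0.620, 0.310, 0.370, 0.440, 0.510,
    0.560, 0.610, 0.290, 0.340, 0.390,
]


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


print(f"mean: {statistics.mean(latencies):.3f}")
print(f"p50:  {percentile(latencies, 50):.3f}")
print(f"p99:  {percentile(latencies, 99):.3f}")
```

Other percentile definitions, such as linear interpolation between neighbouring values, can give slightly different results on a sample this small.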

To visually compare the average and percentiles, a data distribution plot can be constructed.

For example, when evaluating the performance of a search method, you can set the following SLO (Service Level Objective):

The 99th percentile of response time is no more than 500 ms.

This means that only 1% of requests may exceed a response time of 500 ms, which is a stricter and more informative performance indicator than simply the average.

This approach allows for consideration of various usage scenarios and ensures more stable and predictable service performance for users.
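
A check against such an objective might look like the sketch below. It treats "only 1% of requests may exceed 500 ms" as "at least 99% of requests finish within 500 ms". The function name `meets_latency_slo`, its default parameters, and both sample windows are hypothetical and serve only to illustrate the idea.

```python
def meets_latency_slo(latencies_s, threshold_s=0.5, target=0.99):
    """Return True if at least `target` share of requests finished within threshold_s seconds."""
    within = sum(1 for t in latencies_s if t <= threshold_s)
    return within / len(latencies_s) >= target


# Hypothetical window of 1000 requests: 992 fast, 8 slow (99.2% within 500 ms).
window_ok = [0.2] * 992 + [0.9] * 8
# Hypothetical window of 1000 requests: 985 fast, 15 slow (98.5% within 500 ms).
window_bad = [0.2] * 985 + [0.9] * 15

print(meets_latency_slo(window_ok))   # True
print(meets_latency_slo(window_bad))  # False
```

Counting requests against a threshold avoids sorting the data and maps directly onto the wording of the objective.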

SLA

An SLA (Service Level Agreement) is a contract between us (the service provider) and our clients or integrators. It defines our commitments and expectations regarding the performance and quality of service, and it outlines the measures taken if these commitments are not met. An SLA names the specific metrics and performance indicators that must be achieved or maintained.

Examples of inclusions in an SLA:

These conditions establish transparent and clear expectations for all parties involved and motivate the service provider to maintain a high level of performance and availability. SLA often includes monitoring measures, reporting, and regular metric updates to ensure that the terms of the agreement are upheld throughout the contract term.
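
To make the idea of compensation concrete, here is a small sketch of how an SLA clause might map measured availability to a service credit. The `CreditTier` structure, the thresholds, and the credit percentages are entirely hypothetical; real agreements define their own terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreditTier:
    """Hypothetical rule: if availability falls below `below`, credit `credit_pct` of the fee."""
    below: float     # availability threshold, e.g. 0.999
    credit_pct: int  # share of the monthly fee returned to the client


# Illustrative tiers only, not taken from any real agreement.
TIERS = [
    CreditTier(below=0.990, credit_pct=25),
    CreditTier(below=0.999, credit_pct=10),
]


def service_credit(measured_availability: float) -> int:
    """Return the credit percentage owed for the measured availability."""
    # Check the lowest threshold first so the most severe breach wins.
    for tier in sorted(TIERS, key=lambda t: t.below):
        if measured_availability < tier.below:
            return tier.credit_pct
    return 0  # commitment met, no compensation due


print(service_credit(0.9995))  # 0
print(service_credit(0.9950))  # 10
print(service_credit(0.9800))  # 25
```

The measured availability passed in here is an SLI, the thresholds play the role of the agreed service levels, and the credit is the consequence the SLA attaches to missing them.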

Conclusion

Studying the concepts of SLI, SLO, and SLA is fundamentally important for businesses, startups, and application development. These terms form a hierarchy, where SLA (Service Level Agreement) is the foundational agreement between the provider and the client, defining the terms of service provision. SLO (Service Level Objective) specifies performance and availability goals that must be achieved to fulfil the SLA. In turn, SLI (Service Level Indicator) represents specific metrics and performance indicators used to measure the level of service.

The significance of these concepts for businesses lies in the ability to establish clear expectations regarding the quality of service provided and ensure high levels of customer satisfaction. They also help optimize performance monitoring, enabling more efficient responses to potential issues. For developing new features or launching applications, SLI, SLO, and SLA are important tools for evaluating the effectiveness of changes and ensuring compliance with established quality standards. These concepts provide transparency, stability, and reliability in service delivery, fostering business growth and meeting customer needs effectively.