In this post, I want to share my understanding of burn rate monitors and the calculations behind them. I’ll cover the definitions and the link between the SLO target, the burn rate target, and the error rate target. We’ll go on a journey to understand how we arrive at the number 14.4. Hopefully this gives a different perspective or serves as a guide when reading the [chapter in the SRE book: alerting on SLO]1.

SLA, SLO, and SLI with examples

Let’s start with the big picture. Why are SLOs important?

  • Cloud services offer SLAs to attract businesses.
  • SLAs are SLOs with consequences.
  • SLOs, in turn, are conditions on SLIs.
  • An SLI is a measurement of something.

An example is:

  • SLI: Percentage of requests that succeed
  • SLO: Percentage of successful requests per month (30 days) must be ≥ 99.9%
  • SLA: If the percentage of successful requests per month (30 days) is < 99.9%, you get a day of using the product for free.

Defining Failure Conditions as SLO target

How can we make sure that the SLO is met? We can analyze the conditions under which we would fail to meet the SLO and alert on them. With a 99.9% target, the failure condition is an error rate above 0.1% over the month. This means that if the error rate is a constant 0.1%, or 1 failure every 1000 requests, we could do nothing for the whole month and still meet the SLO target.

Hence, we might consider getting alerts only when the error rate is > 0.1%. But this could be noisy because any jitter will cause an alert. Furthermore, our system might not have a constant error rate. For example, if the error rate alternates between 0% and 0.2% from one evaluation window to the next, we’ll get paged multiple times even though the average is 0.1% and we’d still meet our SLO target. Thus, it would appear that we could afford a higher error rate for a short period of time and still be OK. What we really want is to express something like “alert me if the error rate within a window is high enough”. But what should the value of “high enough” be? To answer that, we look at a new term called burn rate.

Burn Rate as a Normalized Unit

Let’s approach the failure condition from another perspective. Think about the failure condition as how many requests can fail within the reporting period while still meeting the SLO. For example, with an SLO target of 99.9% and 100 million requests made to a service per month, we can allow up to 100K failed requests (0.1% of 100 million) without breaching the SLO. This slack is known as the error budget.
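As a rough sketch of that arithmetic, the snippet below computes the error budget from a request count and an SLO target. The function name `error_budget` is mine, chosen for illustration; it is not from the SRE book or any monitoring tool.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of requests that may fail in the period without breaching the SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% target.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return total_requests * (1.0 - slo_target)


# The example from the text: 100 million requests per month at 99.9%.
print(round(error_budget(100_000_000, 0.999)))  # 100000
```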

With that perspective, let’s define a term called burn rate. This is the rate at which we consume the error budget. A constant burn rate of 1 means that we’ll finish up the budget exactly at the end of the reporting period. Similarly, a burn rate of 2 means we’ll finish up the budget halfway through the reporting period.
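To make the definition concrete, here is a small sketch that turns a constant burn rate into the time it takes to use up the whole budget. It assumes a 30-day reporting period, as in the SLA example earlier; the function name `hours_until_budget_exhausted` is hypothetical.

```python
def hours_until_budget_exhausted(burn_rate: float, period_days: int = 30) -> float:
    """Hours until the error budget is fully consumed at a constant burn rate.

    A burn rate of 1 consumes the budget over exactly the reporting period.
    """
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return period_days * 24 / burn_rate


print(hours_until_budget_exhausted(1))  # 720.0 hours, the full 30 days
print(hours_until_budget_exhausted(2))  # 360.0 hours, half the period
```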

By thinking in terms of burn rate, we have a normalized unit for error that is independent of SLO target value. The same reasoning will work for an SLO target of 99.9% or 99% or 90%.

Measuring Burn Rate Using Error Rate

You might be wondering how we can measure burn rate without knowing the total number of requests ahead of time. If we assume the errors are evenly distributed across all requests, we can use the error rate (the ratio of failed requests to total requests) as a proxy! For example, a burn rate of 1 would correspond to a constant error rate of 0.1% for an SLO target of 99.9%. Similarly, a burn rate of 2 would correspond to a constant error rate of 0.2% for an SLO target of 99.9%. Note that this varies with the SLO target. A burn rate of 1 would correspond to a constant error rate of 1% for an SLO target of 99%.
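The relationship above boils down to dividing the observed error rate by the error rate the SLO allows. A minimal sketch, with `burn_rate` as an illustrative name of my own:

```python
import math


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate implied by a constant error rate.

    Both arguments are fractions: 0.001 means 0.1%, 0.999 means 99.9%.
    The allowed error rate is (1 - slo_target).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_rate / (1.0 - slo_target)


# The examples from the text (compared with isclose because of float rounding).
assert math.isclose(burn_rate(0.001, 0.999), 1.0)  # 0.1% errors at 99.9%
assert math.isclose(burn_rate(0.002, 0.999), 2.0)  # 0.2% errors at 99.9%
assert math.isclose(burn_rate(0.01, 0.99), 1.0)    # 1% errors at 99%
```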

We now have a way to graph burn rate using error rate.

Hence, we can alert on error rate (SLI), which is measurable. It’s like saying “if we do not address this high error rate soon, we’ll burn through our error budget within X days”.

SLO Target and Error Rate

Putting what we know so far together, we can set an SLO target and create a monitor for error rate, but what threshold should we use?

Going back to the idea of budget, normal (no action needed) would mean we’ve spent 10% of the budget in 3 days. That is acceptable because 3 days is 10% of the month (30 days). The book recommends alerting when we’ve spent 5% of the budget in 6 hours or 2% of the budget in 1 hour. But the framework is generic enough to plug in other numbers.

Let’s think about how we might express “burning through 2% in 1 hour”. If we assume a constant burn rate, then over the whole month we would burn through 2 * 24 * 30 = 1440% of our budget. By the definition of burn rate above, this is a burn rate of 14.4. With this, we can set an alert for a burn rate of 14.4 over 1 hour using the graph for error rate; for an SLO target of 99.9%, that is an error rate of 14.4 * 0.1% = 1.44%. This means that when the alert fires, we’ve consumed 2% of our budget, and if we do nothing, we would exhaust our budget for the month within 50 hours of the burn starting (720 / 14.4 = 50).

For completeness, the same exercise for “burning through 5% in 6 hours” will result in 5 * (24/6) * 30 = 600% or a burn rate of 6.
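The two calculations follow the same pattern, so they can be sketched as one helper. This assumes a 30-day month and a constant burn rate, as in the text; the names `burn_rate_threshold` and `error_rate_threshold` are hypothetical and not part of any monitoring product.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_days: int = 30) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in `window_hours`.

    budget_fraction is a fraction, e.g. 0.02 for 2% of the error budget.
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return budget_fraction * (period_days * 24) / window_hours


def error_rate_threshold(burn_rate: float, slo_target: float) -> float:
    """Error rate (as a fraction) that corresponds to a burn rate for an SLO target."""
    return burn_rate * (1.0 - slo_target)


# 2% in 1 hour, 5% in 6 hours, 10% in 3 days, against a 99.9% target.
for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 72)]:
    rate = burn_rate_threshold(fraction, hours)
    errors = error_rate_threshold(rate, 0.999)
    print(f"{fraction:.0%} in {hours}h: burn rate {rate:.1f}, error rate {errors:.2%}")
```

The last case shows why 10% in 3 days counts as normal: it works out to a burn rate of 1, which spends the budget exactly over the month.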

Conclusion

We have taken a journey from defining an SLO target all the way to setting an alert on error rates. We’ve seen that 14.4 is special because it is the burn rate at which, if the error rate were left unchecked, we would consume 2% of the monthly error budget within an hour.