Managing Software Reliability Metrics: How to Build SRE Dashboards That Drive Positive Business Outcomes


Summary
SLI, SLO, SLA and error budget: These terms are essential to determining if your system is reliable, available and useful to your users. You should be able to measure these metrics and tie them to your business objectives—with the ultimate goal of providing value to your customers.

Customers expect your business application to perform consistently and reliably at all times, and for good reason: Many have built their own business systems on the reliability of your application. The reliability target you commit to is your service level objective (SLO), a measurable characteristic of the service level agreement (SLA) between a service provider and its customer.

The SLO sets target values and expectations on how your service(s) will perform over time. It includes service level indicators (SLIs)—quantitative measures of key aspects of the level of service—which may include measurements of availability, frequency, response time, quality, throughput and so on.

If your application goes down for longer than the SLO allows, fair warning: All hell may break loose, and you may get frantic pages from customers trying to figure out what’s going on. Furthermore, exhausting your SLO error budget (the rate at which service level objectives can be missed) could have serious financial implications as defined in the SLA.

Why an Error Budget?

Developers are always eager to release new features and functionality. Your SRE team should be able to do deployments and system upgrades as needed, but anytime you make changes to applications, you introduce the potential for instability. Upgrades don’t always turn out as expected, and the result can be an SLO violation.

An error budget states, as a number, how much unavailability the SLA tolerates. Without one, your customer may expect 100% reliability at all times. The benefit of an error budget is that it allows your product development and site reliability engineering (SRE) teams to strike a balance between innovation and reliability. If you frequently violate your SLO, the teams will need to decide whether it’s best to pull back on deployments and spend more time investigating the cause of the SLO breach.

For example, imagine that an SLO requires a service to successfully serve 99.999% of all queries per quarter. This means the service’s error budget is a failure rate of 0.001% for that quarter. If a problem causes a 0.0002% failure rate, it consumes 20% of the service’s quarterly error budget (0.0002 / 0.001 = 0.2).
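
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation; the function names are made up for this example and are not part of any monitoring product.

```python
def error_budget_percent(slo_percent):
    """Failure rate (in percent) that the SLO still tolerates."""
    budget = 100.0 - slo_percent
    if budget <= 0:
        # A 100% SLO leaves no budget, so consumption is undefined.
        raise ValueError("an SLO of 100% or more leaves no error budget")
    return budget


def budget_consumed(slo_percent, observed_failure_percent):
    """Fraction of the error budget used up by the observed failure rate."""
    return observed_failure_percent / error_budget_percent(slo_percent)


if __name__ == "__main__":
    print(f"{error_budget_percent(99.999):.3f}%")    # 0.001%
    print(f"{budget_consumed(99.999, 0.0002):.0%}")  # 20%
```

Both rates must cover the same window (here, one quarter) for the ratio to be meaningful.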

Don’t Aim for Perfection

Developing a workable SLO isn’t easy. You need to set realistic goals, as aiming for perfection (e.g., 100% availability) can prove very expensive and nearly impossible to achieve. Your SRE team, which is responsible for the daily operation of an application in production, must work with interested parties (e.g., product owners) to find the correct transactions to monitor for your SLO.

To begin, you must define your SLIs to determine healthy levels of service, and then use metrics that expose a negative user experience. Your engineering and application teams must decide which metric(s) to monitor, since they know the application best. A typical approach is to find a key metric that represents your SLO. For instance, Netflix uses its starts-per-second metric as an indicator of overall system health, because its baselining has led the company to expect X number of starts within any given timeframe.

Once you’ve found the right metrics, make them visible on a dashboard. Of course, not all metrics are useful. Some won’t need alerts or dashboard visibility, and you’ll want to avoid cluttering your dashboard with too many widgets. Treat this as an iterative process. Start with just a few metrics as you gain a better understanding of your system’s performance. You also can implement alerting—email, Slack, ticketing and so on—to encourage a quick response to outages and other problems.

People often ask, “What happens when SLOs aren’t met?”

Because an SLA establishes that service availability will meet certain thresholds over time, a breach can have serious consequences for your business, including harm to your reputation and, of course, financial loss once an SLO is breached and your error budget is depleted. Since the penalty for an SLA violation can be severe, your SRE team should be empowered to fix problems within the application stack. Depending on the team’s composition, they may release a fix to the feature code, make changes to the underlying platform architecture or, in a severe case, ask the feature team to halt all new development until your service returns to an acceptable level of stability as defined by the error budget.

How AppDynamics Helps You

AppDynamics enables you to track numerous metrics for your SLI.

But you may be wondering, “Which metrics should I use?”

AppD users are often excited—maybe even a bit overwhelmed—by all the data collected, and they assume everything is important. But your team shouldn’t constantly monitor every metric on a dashboard. While our core APM product provides many valuable metrics, AppDynamics includes many additional tools that deliver deep insights as well, including End User Monitoring (EUM), Business iQ and Browser Synthetic Monitoring.

Let’s break down which AppDynamics components your SRE team should use to achieve faster MTTR:

  • APM: Say your application relies heavily on APIs and automation. Start with a few APIs you want to monitor and ask, “Which one of these APIs, if it fails, will impact my application or affect revenue?” These calls usually have a very demanding SLO.

  • End User Monitoring: EUM is the best way to truly understand the customer experience because it automatically captures key metrics, including end-user response time, network requests, crashes, errors, page load details and so on.

  • Business iQ: Monitoring your application is not just about reviewing performance data. Biz iQ helps expose application performance from a business perspective, such as whether your app is generating revenue as forecasted or experiencing a high abandonment rate due to degraded performance.

  • Browser Synthetic Monitoring: While EUM shows the full user experience, sometimes it’s hard to know if an issue is caused by the application or the user. Generating synthetic traffic will allow you to differentiate between the two.

So how does AppDynamics help monitor your error budget?

After determining the SLI, SLO and error budget for your application, you can display your error budget on a dashboard. First, convert your SLA to minutes. For example, a 99.99% SLO allows a 0.01% error budget, which is only about 52.6 minutes (0.88 hours) of downtime per year. You can create a custom metric to count the duration of SLO violations and display it in a graph. Of course, you’ll need to take maintenance and planned downtime into consideration as well.
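
Here is one way to sketch the conversion to minutes in Python. It assumes a 365.25-day year, and it assumes planned maintenance does not count against the budget, which your SLA may define differently. The function names and the outage list are hypothetical and are not AppDynamics APIs. With these assumptions, a 99.9% SLO works out to about 526 minutes per year and a 99.99% SLO to about 52.6 minutes.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 (assumption: average year length)


def downtime_budget_minutes(slo_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime the SLO tolerates over the period."""
    return period_minutes * (100.0 - slo_percent) / 100.0


def remaining_budget_minutes(slo_percent, outages, period_minutes=MINUTES_PER_YEAR):
    """Budget left after subtracting unplanned outage minutes.

    `outages` is a list of (duration_minutes, planned) tuples. Planned
    downtime is skipped here, which is an assumption about the SLA.
    """
    unplanned = sum(minutes for minutes, planned in outages if not planned)
    return downtime_budget_minutes(slo_percent, period_minutes) - unplanned


if __name__ == "__main__":
    print(round(downtime_budget_minutes(99.99), 1))  # 52.6
    print(round(downtime_budget_minutes(99.9), 1))   # 526.0

    # Two unplanned outages (12 and 5.5 minutes) and one planned window.
    outages = [(12.0, False), (30.0, True), (5.5, False)]
    print(round(remaining_budget_minutes(99.99, outages), 1))  # 35.1
```

A negative result from `remaining_budget_minutes` would mean the budget for the period is already spent.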

With AppDynamics you can use key metrics such as response time, HTTP error count, and timeout errors. Try to avoid using system metrics like CPU and memory because they tell you very little about the user experience. In addition, you can configure Slow Transaction Percentile to show which transactions are healthy.
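
As a rough illustration of how user-facing metrics like these become SLIs, the Python sketch below computes an error rate, a 99th-percentile response time and a share of slow requests from raw request records. The `Request` record, the `summarize` helper and the threshold are hypothetical and are not AppDynamics APIs; in practice your monitoring tool collects and aggregates these values for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    response_ms: float
    status: int
    timed_out: bool = False


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def summarize(requests, slow_threshold_ms):
    """Return error rate, p99 response time and slow-request ratio."""
    if not requests:
        raise ValueError("need at least one request to compute SLIs")
    total = len(requests)
    # Count server errors (HTTP 5xx) and timeouts as failed requests.
    errors = sum(1 for r in requests if r.status >= 500 or r.timed_out)
    slow = sum(1 for r in requests if r.response_ms > slow_threshold_ms)
    return {
        "error_rate": errors / total,
        "p99_ms": percentile([r.response_ms for r in requests], 99),
        "slow_ratio": slow / total,
    }


if __name__ == "__main__":
    # 100 synthetic requests taking 1..100 ms; the two slowest fail.
    sample = [
        Request(response_ms=float(i), status=500 if i > 98 else 200)
        for i in range(1, 101)
    ]
    print(summarize(sample, slow_threshold_ms=90.0))
    # {'error_rate': 0.02, 'p99_ms': 99.0, 'slow_ratio': 0.1}
```

Nothing here depends on CPU or memory readings; every value is derived from what a user of the service would actually experience.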

Availability is another great metric to measure, but keep in mind that even if your application availability is 100%, that doesn’t mean it’s healthy. It’s best to start building your dashboard in the pre-prod environment, as you’ll need time to tweak thresholds and determine which metric to use with each business transaction. The sooner AppDynamics is introduced to your application SDLC, the more time your developers and engineers will have to get acclimated to it.

What does the ideal SRE dashboard look like? Make sure it has these KPIs:

  • SLO violation duration graph, response time (99th percentile) and load for your critical API calls

  • Error rate

  • Database response time

  • End-user response time (99th percentile)

  • Requests per minute

  • Availability

  • Session duration

Providing Value to Customers with Software Reliability Metric Monitoring

SLI, SLO, SLA and error budget aren’t just fancy terms. They’re critical to determining if your system is reliable, available or even useful to your users. You should be able to measure these metrics and tie them to your business objectives, as the ultimate goal of your application is to provide value to your customers.

Learn how AppDynamics can help measure your business success.