AIOps, Engineering

Successfully Deploying AIOps, Part 1: Deconstructing MTTR

By | | 6 min read

Quantifying the value of successful AIOps deployment requires tracking subsidiary metrics within the industry default of mean time to resolution (MTTR). This post breaks out the metrics that form MTTR and divides them into two categories: problem and solution.

Somewhere between waking up today and reading this blog post, AI/ML has done something for you. Maybe Netflix suggested a show, or DuckDuckGo recommended a website. Perhaps it was your photos application asking you to confirm the tag of a specific friend in your latest photo. In short, AI/ML is already embedded into our lives.

The quantity of metrics in development, operations and infrastructure makes development and operations a perfect partner for machine learning. With this general acceptance of AI/ML, it is surprising that organizations are lagging in implementing machine learning in operations automation, according to Gartner.

The level of responsibility you will assign to AIOps and automation comes from two factors:

  • The level of business risk in the automated action
  • The observed success of AI/ML matching real world experiences

The good news is this is not new territory; there is a tried-and-true path for automating operations that can easily be adjusted for AIOps.

It Feels Like Operations is the Last to Know

The primary goal of the operations team is to keep business applications functional for enterprise customers or users. They design, “rack and stack,” monitor performance, and support infrastructure, operating systems, cloud providers and more. But their ability to focus on this prime directive is undermined by application anomalies that consume time and resources, reducing team bandwidth for preemptive work.

An anomaly deviates from what is expected or normal. A crashing application is clearly an anomaly, yet so too is one that was updated and now responds poorly or inconsistently. Detecting an anomaly requires a definition of “normal,” accompanied with monitoring of live streaming metrics to spot when the environment exhibits abnormal behaviour.

The majority of enterprises are alerted to an anomaly by users or non-IT teams before IT detects the problem, according to a recent AppDynamics survey of 6,000 global IT leaders. This disappointing outcome can be traced to three trends:

  • Exponential growth of uncorrelated log and metric data triggered by DevOps and Continuous Integration and Continuous Delivery (CI/CD) in the process of automating the build and deployment of applications.
  • Exploding application architecture complexity with service architectures, multi-cloud, serverless, isolation of system logic and system state—all adding dynamic qualities defying static or human visualization.
  • Siloed IT operations and operational data within infrastructure teams.

Complexity and data growth overload development, operations and SRE professionals with data rather than insight, while siloed data prevents each team from seeing the full application anomaly picture.

Enterprises adopted agile development methods in the early 2000s to wash away the time and expense of waterfall approaches. This focus on speed came with technical debt and lower reliability. In the mid-2000s manual builds and testing were identified as the impediment leading to DevOps, and later to CI/CD.

DevOps allowed development to survive agile and extreme approaches by transforming development—and particularly by automating testing and deployment—while leaving production operations basically unchanged. The operator’s role in maintaining highly available and consistent applications still consisted of waiting for someone or something to tell them a problem existed, after which they would manually push through a solution. Standard operating procedures (SOPs) were introduced to prevent the operator from accidentally making a situation worse for recurring repairs. There were pockets of successful automation (e.g., tuning the network) but mostly the entire response was still reactive. AIOps is now stepping up to allow operations to survive in this complex environment, as DevOps did for the agile transformation.

Reacting to Anomalies

DevOps automation removed a portion of production issues. But in the real world there’s always the unpredictable SQL query, API call, or even the forklift driving through the network cable. The good news is that the lean manufacturing approach that inspired DevOps can be applied to incident management.

To understand how to deploy AIOps, we need to break down the “assembly line” used to address an anomaly. The time spent reacting to an anomaly can be broken into two key areas: problem time and solution time.

Problem time: The period when the anomaly has not yet being addressed.

Anomaly management begins with time spent detecting a problem. The AppDynamics survey found that 58% of enterprises still find out about performance issues or full outages from their users. Calls arrive and service tickets get created, triggering professionals to examine whether there really is a problem or just user error. Once an anomaly is accepted as real, the next step generally is to create a war room (physical or Slack channel), enabling all the stakeholders to begin root cause analysis (RCA). This analysis requires visibility into the current and historical system to answer questions like:

  • How do we recreate the timeline?
  • When did things last work normally or when did the anomaly began?
  • How are the application and underlying systems currently structured?
  • What has changed since then?
  • Are all the errors in the logs the result of one or multiple problems?
  • What can we correlate?
  • Who is impacted?
  • Which change is most likely to have caused this event?

Answering these questions leads to the root cause. During this investigative work, the anomaly is still active and users are still impacted. While the war room is working tirelessly, no action to actually rectify the anomaly has begun.

Solution time: The time spent resolving the issues and verifying return-to-normal state.

With the root cause and impact identified, incident management finally crosses over to spending time on the actual solution. The questions in this phase are:

  • What will fix the issue?
  • Where are these changes to be made?
  • Who will make them?
  • How will we record them?
  • What side effects could there be?
  • When will we do this?
  • How will we know it is fixed?
  • Was it fixed?

Solution time is where we solve the incident rather than merely understanding it. Mean time to resolution (MTTR) is the key metric we use to measure the operational response to application anomalies. After deploying the fix and verifying return-to-normal state, we get to go home and sleep.

Deconstructing MTTR

MTTR originated in the hardware world as “mean time to repair”— the full time from error detection to hardware replacement and reinstatement into full service (e.g., swapping out a hard drive and rebuilding the data stored on it). In the software world, MTTR is the time from software running abnormally (an anomaly) to the time when the software has been verified as functioning normally.

Measuring the value of AIOps requires breaking MTTR into subset components. Different phases in deploying AIOps will improve different portions of MTTR. Tracking these subdivisions before and after deployment allows the value of AIOps to be justified throughout.

With this understanding and measurement of existing processes, the strategic adoption of AIOps can begin, which we discuss in part two of this series.