AIOps: A Self-Healing Mentality

When I first watched Minority Report back in 2002, the film’s premise made me optimistic: Crime could be prevented with the help of Precogs, a trio of mutant psychics capable of “previsualizing” crimes and enabling police to stop murderers before they act. What a great utopia!

I quickly realized, however, that this “utopia” was in fact a dystopian nightmare. I left the theater feeling confident that key elements of Minority Report’s bleak future (city-wide placement of iris scanners, for instance) would never come to pass. Fast forward to today, and ubiquitous iris scanning doesn’t seem so far-fetched. Don’t believe me? Simply glance at your smartphone and the device unlocks.

This isn’t dystopian stuff, however. Rather, today’s consumer is enjoying the benefits that machine learning and artificial intelligence provide. From Amazon’s product recommendations to Netflix’s show suggestions to Lyft’s passenger predictions, these services—while not foreseeing crime—greatly enhance the user experience.

The systems that run these next-generation features are vastly complex, ingesting a large corpus of data and continually learning and adapting to help drive different decisions. Similarly, a new enterprise movement is underway to combine machine learning and AI to support IT operations. Gartner calls it “AIOps,” while Forrester favors “Cognitive Operations.”

A Hypothesis-Driven World

Hypothesis-driven analysis is not new to the business world. It impacts the average consumer in many ways, such as when a credit card vendor tweaks its credit-scoring rules to determine who should receive a promotional offer (and you get another packet in your mailbox). Or when the TSA decides to expand or contract its TSA PreCheck program.

Of course, systems with AI/ML are not new to the enterprise. Some parts of the stack, such as intrusion detection, have been using artificial intelligence and machine learning for some time.

But with AIOps, we are entering an age where the entire process of measuring user sentiment, everything from A/B testing to canary deployment, can be automated. There has been a sharp increase in the number of systems that can take action (CI/CD, IaaS, and container orchestrators are particularly well-suited to taking instructions), but the harder part is drawing conclusions from the data, which is where AIOps systems will come into play.

The ability to make dynamic decisions and test multiple hypotheses without administrative intervention is a huge boon to business. In addition to myriad other skills, AIOps platforms could monitor user sentiment in social collaboration tools like Slack, for instance, to determine if some type of action or deeper introspection is required. This action could be something as simple as redeploying with more verbose logging, or tracing for a limited period of time to tune, heal, or even deploy a new version of an application.
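To make the idea concrete, here is a minimal sketch of how sentiment readings might be turned into one of the actions mentioned above. Everything in it is an assumption for illustration: the thresholds, the action names, and the notion that sentiment arrives as scores between -1 and 1 are hypothetical, and a real AIOps platform would learn or tune these rather than hard-code them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
INVESTIGATE_BELOW = -0.2
DEPLOY_BELOW = -0.6


@dataclass
class Action:
    name: str
    expires_at: Optional[datetime] = None  # None means no time limit


def choose_action(sentiment_scores, now, window=timedelta(minutes=30)):
    """Map recent sentiment scores in [-1, 1] to an operational action."""
    if not sentiment_scores:
        return Action("none")
    avg = mean(sentiment_scores)
    if avg < DEPLOY_BELOW:
        # Strongly negative sentiment: escalate to a new deployment.
        return Action("deploy_new_version")
    if avg < INVESTIGATE_BELOW:
        # Mildly negative: gather more detail, but only for a limited time.
        return Action("enable_verbose_logging", expires_at=now + window)
    return Action("none")


now = datetime(2024, 1, 1, 12, 0)
print(choose_action([-0.1, -0.4, -0.5], now))
```

The time limit on verbose logging reflects the point that deeper introspection should run only for a limited period, so the extra overhead does not become permanent.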

AIOps: Precog, But in a Good Way

AIOps and cognitive operations may sound like two more enterprise software buzzwords to bounce around, but their potential should not be dismissed. According to Google’s Site Reliability Engineering workbook, self-healing and auto-healing infrastructures are critically important to the enterprise. What’s important to remember about AIOps and cognitive operations is that they enable self-healing before a problem occurs.

Of course, this new paradigm is no replacement for good development and operation practices. But more often than not, we take on new projects that may be ill-defined, or find ourselves dropped into the middle of a troubled project (or firestorm). In what I call the “fog of development,” no one person has an unobstructed, 360-degree view of the system.

What if the system could deliver automated insights that you could incorporate into your next software release? Having a systematic record of real-world performance and topology, rather than just tribal knowledge, is a huge plus. As with a runtime application self-protection (RASP) platform in the security world, protection at runtime does not remove the need for engineers to address underlying issues in future versions of the application. In some ways, AIOps and cognitive operations have much in common with the CAMS Model, the core values of the DevOps Movement: culture, automation, measurement and sharing. Wouldn’t it be nice to automate the healing as well?

Forrester Talks Automation and Application Performance Management

Jean-Pierre (JP) Garbani of the analyst firm Forrester recently wrote a research paper titled “Technology Spotlight: Automate Application Performance Management – Automate Your Incident Management Process With Run Book Automation.” In this paper, JP argues that IT must embrace automation if it is to be successful moving forward. He goes on to describe Run Book Automation (RBA) and how it applies to Application Performance Management (APM).

One of the most interesting parts of the research note is a graphical depiction of the intersection between APM and RBA. While I can’t show that image in this blog post, I can say that it is a spot-on depiction of how RBA works within AppDynamics. For those who are not familiar, AppDynamics decided to lead the APM industry in a new direction by listening to our customers and making RBA an important and fully integrated part of our product.

If you take nothing else away from this blog post or from JP’s research paper, you need to understand this key insight: APM-based RBA works so much better than traditional RBA because APM understands exactly which application nodes are impacted at any given time and can perform run book remediation on those nodes without any input from a user.

The Old Way

Traditional RBA requires that you write tests to act as triggers for run book workflows. You would run these tests against pre-defined sets of infrastructure and application components when there is a problem so that your run books can fix any known issues. If your application and infrastructure change, you need to manually modify the list of components that are getting tested. This manual update process has been the downfall of RBA and other technologies (CMDB, anyone?) as people have come to realize the time investment required to keep everything current. This problem is amplified by today’s dynamic application technologies like virtualization and cloud computing.

Classic RBA Process

Process flow and timeline of classic RBA.
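The weakness of the classic approach can be sketched in a few lines. This is an illustrative toy, not any real RBA product: the component names, the disk-usage test, and the remediation step are all hypothetical stand-ins.

```python
# Hand-maintained inventory: someone must edit this whenever the
# environment changes (hypothetical host names).
STATIC_COMPONENTS = ["web-01", "web-02", "db-01"]


def disk_check(component, disk_usage):
    """Trigger test: True if the component passes (usage below 90%)."""
    return disk_usage.get(component, 0) < 90


def clear_temp_files(component):
    """Stand-in run book step; a real one would run a remediation script."""
    return f"cleared temp files on {component}"


def run_classic_rba(disk_usage):
    """Test every listed component and remediate the ones that fail."""
    results = []
    for component in STATIC_COMPONENTS:
        if not disk_check(component, disk_usage):
            results.append(clear_temp_files(component))
    return results


# web-03 was added to the environment but never added to the list.
usage = {"web-01": 95, "web-02": 40, "db-01": 50, "web-03": 99}
print(run_classic_rba(usage))  # ['cleared temp files on web-01']
```

The node in the worst shape, web-03, is never tested or fixed because nobody updated the static list, which is exactly the maintenance burden described above.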

The New Way

Forrester and AppDynamics agree that the answer to the traditional RBA problem described above is to use APM to dynamically track the current state of application and infrastructure components and to identify the problems that trigger run book workflows for resolution. Identifying issues from within the application and in real time is a giant step forward from the pre-determined interval testing of traditional RBA. And when you combine this capability with real-time business metrics, you get a new capability that enables the business to react immediately to problems that have nothing to do with IT.

APM RBA Process

Process flow and timeline of APM based RBA.
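By contrast, the APM-based flow starts from what the monitoring layer currently sees rather than from a hand-maintained list. The sketch below is a simplified illustration under stated assumptions: `NodeHealth`, the 5% error-rate threshold, and the run book names are hypothetical and do not represent the AppDynamics API or data model.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """One node as reported by the monitoring layer (hypothetical shape)."""
    node: str
    tier: str
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


# Hypothetical mapping from application tier to a known fix.
RUNBOOKS = {"web": "restart_app_server", "db": "recycle_connection_pool"}


def impacted_nodes(snapshot, max_error_rate=0.05):
    """Return only the nodes whose error rate exceeds the threshold."""
    return [n for n in snapshot if n.error_rate > max_error_rate]


def remediate(snapshot):
    """Pair each impacted node with the run book for its tier, if any."""
    actions = []
    for n in impacted_nodes(snapshot):
        runbook = RUNBOOKS.get(n.tier)
        if runbook:
            actions.append((n.node, runbook))
    return actions


# The snapshot reflects whatever nodes exist right now, including new ones.
snapshot = [
    NodeHealth("web-01", "web", 0.01),
    NodeHealth("web-03", "web", 0.22),
    NodeHealth("db-01", "db", 0.09),
]
print(remediate(snapshot))
```

Because the list of nodes comes from live monitoring data, a newly added node is covered as soon as it starts reporting, and only the nodes that are actually impacted get remediated.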

Application run book automation can be used by any organization, large or small. If there are issues within your environment that have known fixes, then you can use application RBA to automatically detect and remediate those problems within seconds. Stop wasting your time doing repetitive tasks and try AppDynamics for free today. If you’d like to read the Forrester research paper in its entirety, you can download it for free by clicking here.