
AIOps: A Self-Healing Mentality



Summary
'AIOps' and 'cognitive operations' are more than just industry buzzwords. They enable self-healing before a problem occurs.

When I first watched Minority Report back in 2002, the film’s premise made me optimistic: Crime could be prevented with the help of Precogs, a trio of mutant psychics capable of “previsualizing” crimes and enabling police to stop murderers before they act. What a great utopia!

I quickly realized, however, that this “utopia” was in fact a dystopian nightmare. I left the theater feeling confident that key elements of Minority Report’s bleak future (city-wide placement of iris scanners, for instance) would never come to pass. Fast forward to today, and ubiquitous iris-scanning doesn’t seem so far-fetched. Don’t believe me? Simply glance at your smartphone and the device unlocks.

This isn’t dystopian stuff, though. Rather, today’s consumers enjoy the benefits that machine learning and artificial intelligence provide. From Amazon’s product recommendations to Netflix’s show suggestions to Lyft’s passenger predictions, these services, while not foreseeing crime, greatly enhance the user experience.

The systems that run these next-generation features are vastly complex, ingesting a large corpus of data and continually learning and adapting to help drive different decisions. Similarly, a new enterprise movement is underway to combine machine learning and AI to support IT operations. Gartner calls it “AIOps,” while Forrester favors “Cognitive Operations.”

A Hypothesis-Driven World

Hypothesis-driven analysis is not new to the business world. It impacts the average consumer in many ways, such as when a credit card vendor tweaks its credit-scoring rules to determine who should receive a promotional offer (and you get another packet in your mailbox). Or when the TSA decides to expand or contract its TSA PreCheck program.

Of course, systems with AI/ML are not new to the enterprise. Some parts of the stack, such as intrusion detection, have been using artificial intelligence and machine learning for some time.

But with AIOps, we are entering an age where the entire process of measuring user sentiment, from A/B testing to canary deployment, can be automated. And while there’s a sharp increase in the number of systems that can take action (CI/CD, IaaS, and container orchestrators are particularly well suited to taking instructions), the harder part is drawing conclusions from the data, which is where AIOps systems will come into play.
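To make the "drawing conclusions" step a little more concrete, here is a minimal sketch of a canary verdict in Python. It is not taken from any particular AIOps product: the `Metrics` class, the `canary_verdict` function, the thresholds, and the action names are all hypothetical, and a real platform would likely use a proper statistical test rather than a fixed error-rate margin.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Request and error counts for one deployment over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Guard against division by zero when no traffic was observed.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Metrics, canary: Metrics,
                   max_increase: float = 0.01,
                   min_requests: int = 100) -> str:
    """Return 'promote', 'rollback', or 'wait' (hypothetical action names).

    max_increase and min_requests are illustrative defaults, not recommendations.
    """
    if canary.requests < min_requests:
        return "wait"  # Too little canary traffic to conclude anything.
    if canary.error_rate - baseline.error_rate > max_increase:
        return "rollback"  # Canary is clearly worse than the baseline.
    return "promote"


if __name__ == "__main__":
    baseline = Metrics(requests=10_000, errors=50)
    print(canary_verdict(baseline, Metrics(requests=500, errors=20)))
```

The verdict string is the kind of instruction a CI/CD system or container orchestrator could then act on without an administrator in the loop.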

The ability to make dynamic decisions and test multiple hypotheses without administrative intervention is a huge boon to business. Among myriad other capabilities, AIOps platforms could monitor user sentiment in collaboration tools like Slack, for instance, to determine whether some type of action or deeper inspection is required. This action could be as simple as redeploying with more verbose logging or tracing for a limited period of time, or it could go as far as tuning, healing, or even deploying a new version of an application.
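As a rough illustration of that idea, the sketch below maps chat messages to an operational action. Everything in it is an assumption made for the example: the keyword list stands in for a real sentiment model, the thresholds are arbitrary, and the action names are placeholders rather than calls into Slack or any deployment tool.

```python
import re

# Illustrative keyword list; a real system would use a trained sentiment model.
NEGATIVE_WORDS = {"down", "slow", "broken", "error", "timeout"}


def negative_ratio(messages):
    """Fraction of messages that contain at least one negative keyword."""
    if not messages:
        return 0.0
    hits = 0
    for message in messages:
        # Tokenize on letters only so punctuation ("down!") does not hide a match.
        words = set(re.findall(r"[a-z]+", message.lower()))
        if words & NEGATIVE_WORDS:
            hits += 1
    return hits / len(messages)


def choose_action(messages, investigate_at=0.2, act_at=0.5):
    """Pick a hypothetical action based on how negative recent chatter is."""
    ratio = negative_ratio(messages)
    if ratio >= act_at:
        return "redeploy_with_verbose_logging"
    if ratio >= investigate_at:
        return "enable_tracing"
    return "no_action"


if __name__ == "__main__":
    recent = ["site is down!", "lunch?", "standup in 5", "nice release"]
    print(choose_action(recent))
```

The two thresholds separate "look closer" from "act now," which mirrors the distinction between deeper inspection and taking action.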

AIOps: Precog, But in a Good Way

AIOps and cognitive operations may sound like two more enterprise software buzzwords to bounce around, but their potential should not be dismissed. According to Google’s Site Reliability Engineering workbook, self-healing and auto-healing infrastructures are critically important to the enterprise. What’s important to remember about AIOps and cognitive operations is that they enable self-healing before a problem occurs.

Of course, this new paradigm is no replacement for good development and operation practices. But more often than not, we take on new projects that may be ill-defined, or find ourselves dropped into the middle of a troubled project (or firestorm). In what I call the “fog of development,” no one person has an unobstructed, 360-degree view of the system.

What if the system could deliver automated insights that you could incorporate into your next software release? Having a systematic record of real-world performance and topology, rather than just tribal knowledge, is a huge plus. Much as a runtime application self-protection (RASP) platform in the security world protects a running application without removing the need to fix the code, engineers should still address underlying issues in future versions of the application. In some ways, AIOps and cognitive operations have much in common with the CAMS Model, the core values of the DevOps Movement: culture, automation, measurement, and sharing. Wouldn’t it be nice to automate the healing as well?