Successfully Deploying AIOps, Part 3: The AIOps Apprenticeship

Part one of our series on deploying AIOps identified how the time spent reacting to an anomaly breaks into two broad areas: problem time and solution time. Part two described the first deployment phase, which focuses on reducing problem time. With trust in the AIOps system growing, we’re now ready for part three: taking on solution time by automating actions.

Applying AIOps to Mean Time to Fix (MTTF)

The power of AIOps comes from the continuous enhancement of machine learning, driven by improved algorithms and training data combined with the decreasing cost of processing power. A measured example is Google’s project to accurately read street address numbers from its Street View imagery—a necessity in countries where address numbers don’t run sequentially along the street but instead reflect the age of the buildings. Humans examining photos of street numbers achieve an accuracy of 98%. Back in 2011, the available algorithms and training data produced a trained model with 91% accuracy. By 2013, improvements and retraining boosted this figure to 97.5%. Not bad, though humans still had the edge. In 2015, the latest ML models passed human capability at 98.1%. This potential for continuous enhancement makes AIOps a significant benefit for operational response times.

You Already Trust AI/ML with Your Life

If you’ve flown commercially in the past decade, you’ve trusted the autopilot for part of that flight. At some major airports, even the landings are automated, though taxiing is still left to pilots. Despite already trusting AI/ML to this extent, enterprises need more time to trust AI/ML in newer fields such as AIOps. Let’s discuss how to build that trust.

Apprenticeships allow new employees to learn from experienced workers and avoid making dangerous mistakes. They’ve been used for ages in multiple professions; even police departments have a new academy graduate ride along with a veteran officer. Machine learning works the same way: ML frameworks need to see meaningful quantities of data in order to train the neural networks that form classification models. By treating automation in AIOps like an apprenticeship, you can build trust and gradually weave AIOps into a production environment.

By this stage, you should already be reducing problem time by deploying AIOps, which delivers significant benefits before adding automation to the mix. These advantages include the ability to train the model on live data and to observe the outcomes of baselining. This is the first step toward building trust in AIOps.

Stage One: AIOps-Guided Operations Response

With AIOps in place, operators can address anomalies immediately. At this stage, operations teams are still reviewing anomaly alerts to ensure their validity. Operations is also parsing the root cause(s) identified by AIOps to select the correct issue to address. While remediation is manual at this stage, you should already have a method of tracking common remediations.

In stage one, your operations teams oversee the AIOps system and simultaneously collect data to help determine where auto-remediation is acceptable and necessary.

Stage Two: Automate Low Risk

Automated computer operations began around 1964 with IBM’s OS/360 operating system allowing operators to combine multiple individual commands into a single script, thus automating multiple manual steps into a single command. Initially, the goal was to identify specific, recurring manual tasks and figure out how to automate them. While this approach delivered a short-term benefit, building isolated, automated processes incurred technical debt, both for future updates and eventual integration across multiple domains. Ultimately it became clear that a platform approach to automation could reduce potential tech debt.

Automation in the modern enterprise should be tackled like a microservices architecture: Use a single domain’s management tool to automate small actions, and make these services available to complex, cross-domain remediations. This approach allows your investment in automation to align with the lifespan of the single domain. If your infrastructure moves from VMs to containers, the automated services you created for networking or storage are still valid.
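To make the idea concrete, here is a minimal Python sketch, with entirely hypothetical action names and a made-up registry, of single-domain remediations exposed as small, reusable services that a cross-domain remediation can compose:

```python
# Hypothetical sketch: single-domain actions registered as reusable services.
# The registry, action names and print stand-ins are illustrative only.
from typing import Callable, Dict

ACTIONS: Dict[str, Callable[..., str]] = {}

def action(name: str):
    """Register a single-domain automation as a named, reusable service."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        ACTIONS[name] = fn
        return fn
    return register

@action("network.increase_bandwidth")
def increase_bandwidth(link: str, mbps: int) -> str:
    return f"requested {mbps} Mbps on {link}"   # call the network domain's tool here

@action("compute.scale_web_tier")
def scale_web_tier(replicas: int) -> str:
    return f"scaled web tier to {replicas}"     # call the compute domain's tool here

def remediate(plan):
    """A cross-domain remediation composed from single-domain services."""
    for name, kwargs in plan:
        print(name, "->", ACTIONS[name](**kwargs))

remediate([("compute.scale_web_tier", {"replicas": 6}),
           ("network.increase_bandwidth", {"link": "uplink-1", "mbps": 500})])
```

If the compute domain later moves from VMs to containers, only the body of its registered action changes; the catalog and any cross-domain plans built on top of it remain valid.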

You will not automate every single task. Selecting what to automate can be tricky, so when deciding whether to fully automate an anomaly resolution, use these five questions to identify the potential value (a simple scoring sketch follows the list):

  • Frequency: Does the anomaly resolution occur often enough to warrant automation?
  • Impact: Are you automating the solution to a major issue?
  • Coverage: What proportion of the real-world process can be automated?
  • Probability: Does the process always produce the desired result, or can environmental factors affect it?
  • Latency: Will automating the task achieve a faster resolution?
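As promised, here is a hedged sketch that turns those five questions into a rough score. The weights and the 0.7 cutoff are illustrative assumptions, not an industry standard; tune them to your own environment:

```python
# Illustrative scoring of an automation candidate against the five questions.
# The weights and the 0.7 cutoff are assumptions for this sketch.
def automation_value(frequency, impact, coverage, probability, latency_gain):
    """Each argument is a 0.0-1.0 judgment supplied by the operations team."""
    weights = {"frequency": 0.25, "impact": 0.25, "coverage": 0.2,
               "probability": 0.2, "latency_gain": 0.1}
    return (weights["frequency"] * frequency + weights["impact"] * impact +
            weights["coverage"] * coverage + weights["probability"] * probability +
            weights["latency_gain"] * latency_gain)

score = automation_value(frequency=0.9, impact=0.6, coverage=0.8,
                         probability=0.9, latency_gain=0.7)
print(f"score={score:.2f}", "-> automate" if score >= 0.7 else "-> keep manual")
```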

Existing standard operating procedures (SOPs) are a great place to start. With SOPs, you’ve already decided how you want a task performed, have documented the process, and likely have some form of automation (scripts, etc.) in place. Another early focus is to address resource constraints by adding front-end web servers when traffic is high, or by increasing network bandwidth. Growing available resources is low risk compared to restarting applications. While bandwidth expansion may impact your budget, it’s unlikely to break your apps. And by automating resource constraint remediations, you’re adding a rapid response capability to operations.
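For instance, a low-risk “grow the resources” remediation might look like the sketch below, which scales out a front-end tier with kubectl once sustained traffic crosses a threshold. The deployment name, namespace and sizing numbers are placeholders for whatever your SOP specifies:

```python
# Hedged sketch: scale out a front-end tier when traffic is high.
# Assumes kubectl is configured; "web-frontend", "prod" and the numbers
# are placeholders, not a real environment.
import subprocess

def scale_frontend(current_rps: float, rps_per_replica: float = 500.0,
                   max_replicas: int = 20) -> None:
    desired = min(max_replicas, max(2, round(current_rps / rps_per_replica) + 1))
    subprocess.run(
        ["kubectl", "scale", "deployment/web-frontend",
         f"--replicas={desired}", "--namespace=prod"],
        check=True,  # surface a failed scale command to the operator
    )
    print(f"web-frontend scaled to {desired} replicas for {current_rps:.0f} rps")

scale_frontend(current_rps=4200)
```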

In stage two, you augment your operations teams with automated tasks that can be triggered in response to AIOps-identified anomalies.

Stage Three: Connect Visibility to Action (Trust!)

As you start to use automated root cause analysis (RCA), it’s critical to understand that machine learning deals in probabilities. Unlike most classical computing, ML does not output a binary, 0-or-1 result; it produces a statistical likelihood for each possible outcome. The output sometimes looks definitive only because a coder (or “builder,” in the vocabulary of AWS CEO Andy Jassy) has decided that results above an acceptable probability threshold will be presented as the definitive answer. Under the covers of ML, there is always a percentage likelihood. This means RCA will sometimes return a selection of a few probable causes rather than one. Over time, as the system trains on more data, its probabilities grow more accurate, leading to single outcomes where the root cause is clear.
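Here’s a toy illustration of that thresholding decision, with invented causes and probabilities:

```python
# Illustrative only: how a probability threshold turns ML likelihoods into a
# seemingly definitive answer. The causes and numbers are invented.
likelihoods = {"db-connection-pool-exhausted": 0.81,
               "gc-pause": 0.11,
               "network-saturation": 0.08}

THRESHOLD = 0.75  # the builder's choice of "acceptable probability"

best_cause, best_p = max(likelihoods.items(), key=lambda kv: kv[1])
if best_p >= THRESHOLD:
    print(f"root cause: {best_cause} (p={best_p:.2f})")      # looks definitive
else:
    ranked = sorted(likelihoods.items(), key=lambda kv: -kv[1])
    print("probable causes:", ranked)                         # still a ranked list
```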

Once trust in RCA is established (stage one), and remediation actions are automated (stage two), it’s time to remove the manual operator from the middle. The low-risk remediations identified in stage two can now be connected to the specific root cause as a fully automated action.
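Conceptually, stage three is a lookup from an RCA classification to a pre-approved, low-risk action, gated by the model’s confidence. A hedged sketch, with hypothetical cause names and stand-in actions:

```python
# Hypothetical stage-three glue: fire an automated remediation only when the
# RCA confidence clears a bar and the action was pre-approved in stage two.
APPROVED = {
    "web-tier-saturation":  lambda: print("scaling web tier"),
    "bandwidth-exhaustion": lambda: print("raising link bandwidth"),
}

def on_rca_result(cause: str, confidence: float, min_confidence: float = 0.9):
    action = APPROVED.get(cause)
    if action is None or confidence < min_confidence:
        print(f"escalating to operator: {cause} (p={confidence:.2f})")
        return
    action()  # fully automated, no human in the middle

on_rca_result("web-tier-saturation", confidence=0.94)  # automated
on_rca_result("memory-leak", confidence=0.97)          # unapproved -> operator
```

Anything unapproved or low-confidence still escalates to a human, which keeps the blast radius of full automation small while trust continues to build.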

The benefits of automated operations are often listed as cost reduction, productivity, availability, reliability and performance. While all of these apply, there’s also the significant benefit of expertise time. “The main upshot of automation is more free time to spend on improving other parts of the infrastructure,” according to Google’s SRE project. The less time your experts spend in MTTR steps, the more time they can spend on preemption rather than reaction.

Similar to DevOps, AIOps will require a new mindset. After a successful AIOps deployment, your team will be ready to transition from its existing siloed capabilities. Each team member’s current specialization(s) will need to be accompanied by broader skills in other operational silos.

AIOps augments each operations team, including ITOps, DevOps and SRE. By giving each team ample time to move into preemptive mode, AIOps ensures that applications, architectures and infrastructures are ready for the rapid transformations required by today’s business.

Successfully Deploying AIOps, Part 2: Automating Problem Time

In part one of our Successfully Deploying AIOps series, we identified how the time spent reacting to an anomaly breaks into two broad areas: problem time and solution time. The first phase in deploying AIOps focuses on reducing problem time, with some benefit in solution time as well. This simply requires turning on machine learning within an AIOps-powered APM solution. Existing operations processes will still be defining, selecting and implementing anomaly rectifications. When you automate problem time, solution time commences much sooner, significantly reducing an anomaly’s impact.

AIOps: Not Just for Production

Anomalies in test and quality assurance (QA) environments cost the enterprise time and resources. AIOps can deliver significant benefits here. Applying the anomaly resolution processes used in production will assist developers in navigating the deployment cycle.

Test and QA environments are expected to identify problems before production deployment. Agile and DevOps approaches have introduced rapid, automated building and testing of applications. Though mean time to resolution (MTTR) is commonly not measured in test and QA environments (which aren’t as critical as those supporting customers), the benefits to time and resources still pay off.

Beginning your deployment in test and QA environments allows a lower-risk, yet still valuable, introduction to AIOps. These pre-production environments have less business impact, as they are not visited by customers. Understanding performance changes between application updates is critical to successful deployment. Remember, since test and QA environments will not have the production workload available, it’s best to recreate it with simulated workloads through synthetic testing.

With trust in AIOps built from first applying AIOps to mean time to detect (MTTD), mean time to know (MTTK) and mean time to verify (MTTV) in your test and QA environments, your next step will be to apply these benefits to production. Let’s analyze where you’ll find these initial benefits.

Apply AI/ML to Detection (MTTD)

An anomaly deviates from what is expected or normal. Detecting an anomaly requires a definition of “normal” and a monitoring of live, streaming metrics to see when they become abnormal. A crashing application is clearly an anomaly, as is one that responds poorly or inconsistently after an update.

With legacy monitoring tools, defining “normal” was no easy task. Manually setting thresholds required operations or SRE professionals to guesstimate thresholds for all metrics measured by applications, frameworks, containers, databases, operating systems, virtual machines, hypervisors and underlying storage.

AIOps removes the stress of threshold-setting by letting machine learning baseline your environment. AI/ML applies mathematical algorithms to different data features seeking correlations. With AppDynamics, for example, you simply run APM for a week. AppDynamics observes your application over time and creates baselines, with ML observing existing behavioral metrics and defining a range of normal behavior with time-based and contextual correlation. Time-based correlation removes alerts related to the normal flow of business—for example, the login spike that occurs each morning as the workday begins; or the Black Friday or Guanggun Jie traffic spikes driven by cultural events. Contextual correlation pairs metrics that track together, enabling anomaly identification and alerts later when the metrics don’t track together.

AIOps will define “normal” by letting built-in ML watch the application and automatically create a baseline. So again, install APM and let it run. If you have specific KPIs, you can add these on top of the automatic baselines as health rules. With baselines defining normal, AIOps will watch metric streams in real time, with the model tuned to identify anomalies in real time, too.
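To build intuition for what the ML is doing, here is a deliberately simplified sketch of time-based baselining (not AppDynamics’ actual algorithm): learn a band of normal per hour of week from history, then flag live values outside it. The three-standard-deviation tolerance is an assumption:

```python
# Simplified illustration of time-based baselining, not a vendor algorithm.
# "Normal" is a mean +/- 3 sigma band learned per (weekday, hour) from history.
import statistics
from collections import defaultdict

history = defaultdict(list)   # (weekday, hour) -> observed metric values

def learn(samples):
    """samples: iterable of (weekday, hour, value) tuples from past weeks."""
    for weekday, hour, value in samples:
        history[(weekday, hour)].append(value)

def is_anomalous(weekday, hour, value, sigmas=3.0):
    values = history[(weekday, hour)]
    if len(values) < 2:
        return False          # not enough data to judge yet
    mean, stdev = statistics.mean(values), statistics.stdev(values)
    return abs(value - mean) > sigmas * stdev

learn([(0, 9, v) for v in (100, 110, 95, 105)])  # Monday 9am login spike is normal
print(is_anomalous(0, 9, 108))   # False: inside the learned band
print(is_anomalous(0, 9, 400))   # True: far outside normal for Monday 9am
```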

Apply AI/ML to Root Cause Analysis (MTTK)

The first step to legacy root cause analysis (RCA) is to recreate the timeline: When did the anomaly begin, and what significant events occurred afterward? You could search manually through error logs to uncover the time of the first error. This can be misleading, however, as sometimes the first error is an outcome, not a cause (e.g., a crash caused by a memory overrun is the result of a memory leak running for a period of time before the crash).

In the midst of an anomaly, multiple signifiers often will indicate fault. Logs will show screeds of errors caused by stress introduced by the fault, but fail to identify the underlying defect. The operational challenge is unpacking the layers of resultant faults to identify root cause. By pinpointing this cause, we can move onto identifying the required fix or reconfiguration to resolve the issue.

AIOps creates this anomaly timeline automatically. It observes data streams in real time and uses historical and contextual correlation to identify the anomaly’s origin, as well as any important state changes during the anomaly. Even with a complete timeline, it’s still a challenge to reduce the overall noise level. AIOps addresses this by correlating across domains to filter out symptoms from possible causes.

There’s a good reason why AIOps’ RCA output may not always identify a single cause. Trained AI/ML models do not always produce a zero or one outcome, but rather work in a world of probabilities or likelihoods. The output of a self-taught ML algorithm will be a percentage likelihood that the resulting classification is accurate. As more data is fed to the algorithm, these outcome percentages may change if new data makes a specific output classification more likely. Early snapshots may indicate a priority list of probable causes that later refine down to a single cause, as more data runs through the ML models.
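A toy sketch of that refinement: a ranked list of probable causes whose probabilities sharpen as each new piece of evidence arrives. The causes and likelihood numbers are invented:

```python
# Toy illustration: probable-cause rankings sharpen as evidence accumulates.
# All causes, evidence weights and numbers are invented for this sketch.
def normalize(scores):
    total = sum(scores.values())
    return {cause: s / total for cause, s in scores.items()}

def update(beliefs, evidence_likelihoods):
    """Multiply in how strongly each cause predicts the new evidence."""
    return normalize({cause: beliefs[cause] * evidence_likelihoods[cause]
                      for cause in beliefs})

beliefs = normalize({"memory-leak": 1.0, "slow-query": 1.0, "bad-deploy": 1.0})

# Early snapshot: heap growth observed -> still a priority list of candidates.
beliefs = update(beliefs, {"memory-leak": 0.8, "slow-query": 0.3, "bad-deploy": 0.4})
print(sorted(beliefs.items(), key=lambda kv: -kv[1]))

# More data: no recent deploy, queries healthy -> a single cause emerges.
beliefs = update(beliefs, {"memory-leak": 0.9, "slow-query": 0.1, "bad-deploy": 0.05})
print(sorted(beliefs.items(), key=lambda kv: -kv[1]))
```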

RCA is one area where AI/ML delivers the most value, and the time spent on RCA is the mean time to know (MTTK). While operations is working on RCA, the anomaly is still impacting customers. The pressure to conclude RCA quickly is why war rooms get filled with every possible I-shaped professional (a deep expert in a particular silo of skills) in order to eliminate the noise and get to the signal.

Apply AI/ML to Verification

Mean time to verify (MTTV) is the remaining MTTR portion automated in phase one of an AIOps rollout. An anomaly concludes when the environment returns to normal, or even to a new normal. The same ML mechanisms used for detection will minimize MTTV, as baselines already provide the definition of normal you’re seeking to regain. ML models monitoring live ETL streams of metrics from all sources provide rapid identification when the status returns to normal and the anomaly is over.
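The verification step can be pictured with a simple sketch: declare the anomaly over only after the metric holds inside its baseline band for several consecutive windows. The band and the five-window requirement are assumptions:

```python
# Illustrative return-to-normal check, not a production MTTV implementation.
# The anomaly "ends" only after N consecutive in-band observations.
def verify_recovery(stream, low, high, required_windows=5):
    consecutive = 0
    for i, value in enumerate(stream):
        consecutive = consecutive + 1 if low <= value <= high else 0
        if consecutive >= required_windows:
            return i  # sample index at which return-to-normal is verified
    return None       # still anomalous at the end of the stream

response_ms = [950, 900, 420, 180, 150, 160, 140, 155, 150]
recovered_at = verify_recovery(response_ms, low=100, high=200)
print("verified normal at sample", recovered_at)  # -> 7 (five in-band samples)
```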

Later in your rollout when AIOps is powering fully automated responses, this rapid observation and response is critical, as anomalies are resolved without human intervention.  Part three of this series will discuss connecting this visibility and insight to action.

Successfully Deploying AIOps, Part 1: Deconstructing MTTR

Somewhere between waking up today and reading this blog post, AI/ML has done something for you. Maybe Netflix suggested a show, or DuckDuckGo recommended a website. Perhaps it was your photos application asking you to confirm the tag of a specific friend in your latest photo. In short, AI/ML is already embedded into our lives.

The sheer quantity of metrics generated by development, operations and infrastructure makes IT operations a perfect partner for machine learning. Given this general acceptance of AI/ML, it is surprising that organizations are lagging in implementing machine learning in operations automation, according to Gartner.

The level of responsibility you will assign to AIOps and automation comes from two factors:

  • The level of business risk in the automated action
  • The observed success of AI/ML matching real world experiences

The good news is this is not new territory; there is a tried-and-true path for automating operations that can easily be adjusted for AIOps.

It Feels Like Operations is the Last to Know

The primary goal of the operations team is to keep business applications functional for enterprise customers or users. They design, “rack and stack,” monitor performance, and support infrastructure, operating systems, cloud providers and more. But their ability to focus on this prime directive is undermined by application anomalies that consume time and resources, reducing team bandwidth for preemptive work.

An anomaly deviates from what is expected or normal. A crashing application is clearly an anomaly, yet so too is one that was updated and now responds poorly or inconsistently. Detecting an anomaly requires a definition of “normal,” accompanied by monitoring of live streaming metrics to spot when the environment exhibits abnormal behavior.

The majority of enterprises are alerted to an anomaly by users or non-IT teams before IT detects the problem, according to a recent AppDynamics survey of 6,000 global IT leaders. This disappointing outcome can be traced to three trends:

  • Exponential growth of uncorrelated log and metric data, generated as DevOps and continuous integration/continuous delivery (CI/CD) automate the build and deployment of applications.
  • Exploding application architecture complexity with service architectures, multi-cloud, serverless, isolation of system logic and system state—all adding dynamic qualities defying static or human visualization.
  • Siloed IT operations and operational data within infrastructure teams.

Complexity and data growth overload development, operations and SRE professionals with data rather than insight, while siloed data prevents each team from seeing the full application anomaly picture.

Enterprises adopted agile development methods in the early 2000s to wash away the time and expense of waterfall approaches. This focus on speed came with technical debt and lower reliability. In the mid-2000s, manual builds and testing were identified as the main impediment, leading to DevOps and later to CI/CD.

DevOps allowed development to survive agile and extreme approaches by transforming development—particularly by automating testing and deployment—while leaving production operations basically unchanged. The operator’s role in maintaining highly available and consistent applications still consisted of waiting for someone or something to tell them a problem existed, after which they would manually push through a solution. For recurring repairs, standard operating procedures (SOPs) were introduced to prevent the operator from accidentally making a situation worse. There were pockets of successful automation (e.g., tuning the network), but mostly the response was still reactive. AIOps is now stepping up to let operations survive in this complex environment, as DevOps did for development during the agile transformation.

Reacting to Anomalies

DevOps automation removed a portion of production issues. But in the real world there’s always the unpredictable SQL query, API call, or even the forklift driving through the network cable. The good news is that the lean manufacturing approach that inspired DevOps can be applied to incident management.

To understand how to deploy AIOps, we need to break down the “assembly line” used to address an anomaly. The time spent reacting to an anomaly can be broken into two key areas: problem time and solution time.

Problem time: The period when the anomaly has not yet been addressed.

Anomaly management begins with time spent detecting a problem. The AppDynamics survey found that 58% of enterprises still find out about performance issues or full outages from their users. Calls arrive and service tickets get created, triggering professionals to examine whether there really is a problem or just user error. Once an anomaly is accepted as real, the next step generally is to create a war room (physical or Slack channel), enabling all the stakeholders to begin root cause analysis (RCA). This analysis requires visibility into the current and historical system to answer questions like:

  • How do we recreate the timeline?
  • When did things last work normally, and when did the anomaly begin?
  • How are the application and underlying systems currently structured?
  • What has changed since then?
  • Are all the errors in the logs the result of one or multiple problems?
  • What can we correlate?
  • Who is impacted?
  • Which change is most likely to have caused this event?

Answering these questions leads to the root cause. During this investigative work, the anomaly is still active and users are still impacted. While the war room is working tirelessly, no action to actually rectify the anomaly has begun.

Solution time: The time spent resolving the issues and verifying return-to-normal state.

With the root cause and impact identified, incident management finally crosses over to spending time on the actual solution. The questions in this phase are:

  • What will fix the issue?
  • Where are these changes to be made?
  • Who will make them?
  • How will we record them?
  • What side effects could there be?
  • When will we do this?
  • How will we know it is fixed?
  • Was it fixed?

Solution time is where we solve the incident rather than merely understanding it. Mean time to resolution (MTTR) is the key metric we use to measure the operational response to application anomalies. After deploying the fix and verifying return-to-normal state, we get to go home and sleep.

Deconstructing MTTR

MTTR originated in the hardware world as “mean time to repair”— the full time from error detection to hardware replacement and reinstatement into full service (e.g., swapping out a hard drive and rebuilding the data stored on it). In the software world, MTTR is the time from software running abnormally (an anomaly) to the time when the software has been verified as functioning normally.

Measuring the value of AIOps requires breaking MTTR into subset components. Different phases in deploying AIOps will improve different portions of MTTR. Tracking these subdivisions before and after deployment allows the value of AIOps to be justified throughout.
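A minimal sketch of that measurement, computing each MTTR component from incident timestamps (the field names and times are placeholders):

```python
# Hedged sketch: break MTTR into components from incident timestamps so each
# AIOps deployment phase's improvement is visible. All values are placeholders.
from datetime import datetime

incident = {
    "anomaly_start":   datetime(2019, 3, 1, 9, 0),
    "detected":        datetime(2019, 3, 1, 9, 40),   # MTTD ends here
    "root_cause":      datetime(2019, 3, 1, 10, 30),  # MTTK ends here
    "fix_deployed":    datetime(2019, 3, 1, 11, 0),
    "verified_normal": datetime(2019, 3, 1, 11, 10),  # MTTV ends here
}

def minutes(start, end):
    return (incident[end] - incident[start]).total_seconds() / 60

components = {
    "MTTD (detect)":  minutes("anomaly_start", "detected"),
    "MTTK (know)":    minutes("detected", "root_cause"),
    "fix (solution)": minutes("root_cause", "fix_deployed"),
    "MTTV (verify)":  minutes("fix_deployed", "verified_normal"),
}
for name, mins in components.items():
    print(f"{name}: {mins:.0f} min")
print(f"MTTR total: {sum(components.values()):.0f} min")  # 130 min
```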

With this understanding and measurement of existing processes, the strategic adoption of AIOps can begin, which we discuss in part two of this series.

Think 2019: Embracing the Cognitive Enterprise

Twenty years ago, companies were just beginning to leverage the internet to revolutionize the way they do business. Much has changed since then, of course, with a scrappy online bookseller emerging to become the world’s second largest e-commerce company and one of the biggest cloud providers—not to mention having one of the highest market caps in the financial world.

So what’s next? At Think 2019, a lot of discussion centered on IBM’s concept of the cognitive enterprise, one that uses artificial intelligence and distributed technologies such as blockchain to power—and disrupt—the markets of the future.

Bridging Cloud Native to Your Enterprise

KubeCon + CloudNativeCon might be the epicenter of Cloud Native and Kubernetes, but IBM Think isn’t far off. More than 70 sessions at Think 2019 involved Kubernetes in some capacity, with over 140 Cloud Native technologies represented. A popular topic at the conference was how Cloud Native could enhance traditional enterprise applications, particularly as businesses look to rearchitect their platforms to meet increased demand. This makes sense, as a key component of the cognitive enterprise is the ability to proactively meet customer and system needs.

Is Your Enterprise Cognitive?

Ignoring the cognitive enterprise today might prove as big a strategic blunder as missing the e-business bandwagon back in the 90s. But the path to cognition isn’t easy. For starters, building confidence in artificial intelligence takes time, as enterprises start to adopt technology that impacts their customers and operations.

To keep up with constant change, we need to be more cognizant of the problems our clients face. But replatforming and adopting complex technology is no mean feat. Decisions around paying off technical debt vs. future feature capabilities are not easy for product owners. Validating how, or whether, change is impacting our business is an art form in which the objective and subjective merge. For instance, does providing a better customer and user experience lead to a higher conversion rate?

AppDynamics has the capability to track this customer journey—from mobile phone to mainframe—a fact highlighted by Jonah Kowall, AppDynamics Vice President of Market Development and Insights, in his excellent Think 2019 session. We enable your enterprise to proactively monitor, analyze and repair issues across a large swath of your application infrastructure.


Jonah Kowall’s IBM Think session

Artificial Intelligence Goes Mainstream

The historic 1990s chess matches between human and machine—Garry Kasparov vs. IBM Deep Blue—showed that a computer could beat the best chess player in the world. Artificial intelligence has come a long way since then, with modern enterprises embracing AI/ML technology in a big way. Today, deep learning frameworks are some of the most popular open source packages available. At Think 2019, the number and variety of vendors featuring AI/ML capabilities, from robotic process automation to AppDynamics’ AIOps, showed just how far artificial intelligence has advanced in the enterprise.

AppDynamics Looks Forward with AIOps

As organizations start to navigate the cognitive enterprise, they’ll need to make many critical decisions when upgrading and rearchitecting their platforms. AppDynamics is embarking on an AIOps journey to help businesses strengthen their consumer and enterprise portfolios with cognitive technologies. We are excited to continue and deepen our IBM partnership, and to help each organization become a cognitive enterprise.

AppDynamics and Cisco To Host Virtual Event on AIOps and APM


To mark the two-year anniversary of Cisco’s intent to acquire AppDynamics, the worldwide leader in IT, networking and cybersecurity solutions will join AppDynamics for a one-of-a-kind virtual launch event on January 23, 2019. At AppDynamics Transform: AIOps and the Future of Performance Monitoring, David Wadhwani, CEO of AppDynamics, will share what’s next for the two companies and lead a lively discussion with Cisco executives, Okta’s Chief Information Officer, Mark Settle, and Nancy Gohring, Senior Analyst at 451 Research. At the event, we’ll talk through the challenges leaders face and how they’re preparing for the future of performance monitoring.

Technology Leaders to Weigh In On the Impact of AI and the Future of Performance Monitoring

Today, application infrastructure is increasingly complex. Organizations are building and monitoring public, private, and hybrid cloud infrastructure alongside microservices and third party integrations. And while these developments have made it easier for businesses to scale quickly, they’ve introduced a deluge of data into the IT environment, making it challenging to identify issues and resolve them quickly.

APM solutions like AppDynamics continue to lead the way when it comes to providing real-time business insights to power mission critical business decisions. However, recent research has revealed a potential blind spot for IT teams: A massive 91% of global IT leaders say that monitoring tools only provide data on the performance of their own area of responsibility. For IT teams that want to mitigate risk as a result of performance problems, and business leaders who want to protect their bottom line, this blind spot represents a huge opportunity for improvement.

The Next Chapter in the AppDynamics and Cisco Story

As application environments continue to grow in complexity, so does the need for more comprehensive insight into performance. But technology infrastructure is simply too large and too dynamic for IT operations teams to manage manually. Automation for remediation and optimization is key, and that’s where innovations in artificial intelligence (AI) have the potential to make a huge difference in monitoring activities.

So, what does the future of performance monitoring look like?

Join us at the virtual event on January 23, 2019, to find out. David Wadhwani, alongside Cisco executives, will make an exciting announcement about our next chapter together. During the broadcast, we’ll also feature industry analysts and customers as we engage in a lively conversation about the emerging “AIOps” category, and what impact it will have on the performance monitoring space.

You won’t want to miss this unique virtual event.

Register now for AppDynamics Transform


Gartner Report Reveals Why Your APM Strategy Needs AIOps

Is application performance monitoring (APM) without artificial intelligence a waste of resources?

It turns out, the answer may be yes. Gartner’s newly released report, Artificial Intelligence for IT Operations Delivers Improved Business Outcomes, reveals that using artificial intelligence for IT operations (AIOps) in tandem with APM might be the key to optimizing business performance.

So why exactly do AIOps and APM make such a powerful pair and, perhaps more importantly, how can you start applying an AIOps mindset to APM in your own organization?

AIOps and APM: Great Alone, Better Together

Application performance monitoring (APM) is the key to proactively diagnosing and fixing performance issues, but a new study from Gartner reveals the many incremental benefits IT teams can derive from leveraging AIOps in conjunction with APM. Adding artificial intelligence into the mix gives IT and business leaders visibility into the right data at the right time to make decisions that maximize business impact. The power of AI in relation to APM is that most APM environments generate massive quantities of data that humans can’t possibly parse and derive meaning from fast enough to make it useful. Through machine learning, we can ingest that data, and over time, develop intelligence around what matters within an application ecosystem. As Gartner reveals, “AIOps with APM can deliver the actionable insight needed to achieve business outcomes such as improved revenue, cost and risk.”

Consider the process of assessing customer satisfaction based on customer sentiment data and related service desk data. Without using both AIOps and APM, infrastructure and operations (I&O) leaders might come to the conclusion that customers are delighted based on fast page load times. But by using AI to also ingest and analyze data from order management and customer service applications, I&O leaders can find correlations between IT metrics and business data such as revenue or customer retention. This level of insight offered by AIOps allows business leaders to make informed decisions and prioritize actions that will quickly improve customer satisfaction and, ultimately, the bottom line.

Applying AIOps to APM

Here are three ways I&O leaders can leverage AIOps together with APM to achieve incremental benefits—the step-by-step technical strategies for which can be found in Gartner’s new report:

1. Map application performance metrics to business objectives by using AIOps to detect unsuspected dependencies.

AIOps can be used to help measure IT’s activities in terms of benefits to the business—such as an increase in orders or improved customer satisfaction. To do this, I&O leaders should start by collaborating with key business stakeholders to identify the mission-critical priorities of the business relative to applications. Next, acquire the data supporting the measurement of these selected objectives by capturing the flow of business transactions such as orders, registrations and renewals. After inspecting their payloads, you can then use AIOps algorithms to detect patterns or clusters in the combined business and IT data, infer relationships, and determine causality.
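As a simplified illustration of detecting such a dependency, the sketch below correlates an IT metric with a business metric. Real AIOps algorithms go well beyond pairwise correlation, and all data here is invented:

```python
# Illustrative only: surfacing a dependency between an IT metric and a
# business metric via correlation. The data and the -0.7 cutoff are invented.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

response_ms    = [220, 240, 300, 450, 800, 950, 400, 260]  # IT metric
orders_per_min = [95, 92, 88, 70, 41, 33, 74, 90]          # business metric

r = pearson(response_ms, orders_per_min)
print(f"correlation: {r:.2f}")   # strongly negative: slow pages, fewer orders
if r < -0.7:
    print("candidate dependency: page speed -> order volume")
```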

2. Expand the ability to support prediction by using AIOps to forecast high probability future problems.

“AIOps provides insight into future events using its ability to extrapolate what is likely to happen next, enabling I&O leaders to take action in order to prevent impact,” Gartner states. As such, I&O leaders should take advantage of the many ways machine learning algorithms can provide value: predicting trends, detecting anomalies, determining causality and classifying data. Use AIOps algorithms to predict future values of time-series data such as end-user response time, engage in root-cause analysis of predicted issues to determine the true fault, and take preventive measures to head off the impact of predicted problems.
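As a greatly simplified illustration of that prediction idea, the sketch below extrapolates a response-time series and warns before a threshold is breached. The linear trend model and the 400 ms threshold are assumptions; production models are far more sophisticated:

```python
# Hedged sketch: extrapolate a time series and warn before a threshold breach.
# The linear trend model and the 400 ms SLA threshold are assumptions.
def linear_forecast(series, steps_ahead):
    """Least-squares trend line extrapolated steps_ahead beyond the data."""
    n = len(series)
    mx, my = (n - 1) / 2, sum(series) / n
    slope = (sum((x - mx) * (y - my) for x, y in enumerate(series)) /
             sum((x - mx) ** 2 for x in range(n)))
    intercept = my - slope * mx
    return intercept + slope * (n - 1 + steps_ahead)

response_ms = [210, 215, 224, 240, 251, 266, 280]   # creeping steadily upward
predicted = linear_forecast(response_ms, steps_ahead=12)
print(f"predicted response 12 intervals out: {predicted:.0f} ms")
if predicted > 400:
    print("warn: likely SLA breach ahead; act before users feel it")
```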

3. Improve business outcomes by applying AIOps to customer and transaction data.

The pattern recognition, advanced analytics and machine learning capabilities of an AIOps solution can extend APM’s historical insight into application availability and performance to provide business impact. By using AIOps’ machine learning capabilities—including anomaly detection, classification, clustering and extrapolation—you can analyze behavior (e.g., customer actions during the order process) and relate that behavior to events afflicting the underlying IT infrastructure. Use the clustering and extrapolation algorithms contained within AIOps to detect unexpected patterns or groupings in your data and predict future outcomes. From there, you can correlate IT problems with changes in business metrics and establish how changes in application performance and availability impact customer sentiment.

Augmenting APM with Artificial Intelligence

The verdict is in and the evidence is compelling: AIOps is the key to maximizing the business impact of your APM investment.

Using AIOps together with APM can help I&O leaders more effectively align IT and business objectives, expand the ability to support prediction, and improve business performance. Leveraging AIOps can take your APM strategy to the next level, giving IT and business leaders the deep insight they need to make decisions that increase revenue, reduce costs, and lower risk.

Application performance management is already a critical tool that belongs in every IT leader’s toolbox, and AIOps is a game-changing technology set to transform APM and IT operations in a major way. As one analyst recently wrote for Forbes, “AIOps is gearing up to be the next big thing in IT management…When the power of AI is applied to operations, it will redefine the way infrastructure is managed.” In today’s competitive business landscape, companies need an edge to survive and thrive—and it seems APM with AIOps might just be the golden ticket.

Access the Full Research

For more exclusive insights into Gartner’s research on why—and how—you should apply AIOps to APM, download the report Artificial Intelligence for IT Operations Delivers Improved Business Outcomes.

Gartner, Artificial Intelligence for IT Operations Delivers Improved Business Outcomes, Charley Rich, 12 June 2018

AWS re:Invent Recap: Freedom for Builders

AWS re:Invent continues to build momentum. Amazon’s annual user conference, now in its seventh year, took place last week in Las Vegas with more than 50,000 builders on hand to share stories of cloud successes and challenges. The atmosphere was exciting and fast-paced, with Amazon once again raising the bar in the public cloud space. And AWS, which unveiled a plethora of new capabilities before and during the show, continues to delight and innovate at a rapid pace.

Are We Builders?

In his opening keynote, AWS CEO Andy Jassy shared a new theme that reverberated throughout the event: software engineers are now “builders,” not “developers.” Indeed, the internet recently has been ablaze with discussion and debate on this moniker shift.

Whether you see yourself as a builder or developer, Jassy helped categorize builders into different personas for enterprises that either like or dislike guardrails. (In AWS, a guardrail is a high-level rule that prevents deployment of resources that don’t conform to policies, thereby providing ongoing governance for the overall environment.)

If you like guardrails, you could easily implement machine learning labeling with SageMaker. Not a fan of guardrails? AWS still focuses on core compute, which can autoscale, for example, exactly the GPU capacity a machine learning process needs via Amazon Elastic Inference. Still, the question remains: Are we builders or developers? The debate likely won’t end soon, but one thing is certain: the lower bar of entry for application infrastructure gives us the freedom to pick the best building blocks for our AWS environments.

Simplifying Cloud Migration

Depending on where you are in your cloud migration journey, the move to a public cloud vendor could cannibalize your development stack. AWS certainly has lowered the bar of entry, making diverse application infrastructure available to more individuals and organizations. In fact, you potentially could build a robust API without writing a single line of code on AWS.

Freedom to Transform

Another builder-related term popular at re:Invent was “freedom.” The most notable example was the series of database announcements, including Amazon Timestream, a managed time series database—check out the hashtag #databasefreedom to learn more. Regardless of whether you agree or disagree with Jassy’s “builder” designation, today’s enterprise is ripe for transformation. Partnering with AppDynamics can help you become an Agent of Transformation.

AWS in Your DataCenter

Some big news from the show made migrating your datacenter to AWS even more tangible. AWS has partnered with VMware to introduce AWS Outposts, a combined software and hardware offering with the established management capabilities of VMware. It’s an interesting proposition that addresses the one place AWS was not reaching: the physical datacenter. Microsoft Azure has a similar product in Azure Stack. Innovation in both platforms is sure to drive greater competition.

A Driverless Future

The dream of autonomous vehicles has been building ever since the first automobile hit the road. For re:Invent attendees who used Lyft to get around, this dream is now far more real. Lyft partnered with Aptiv to provide autonomous transport up and down the Vegas Strip, allowing those who opted in to be taken from session to session in a self-driving car.

Amazon Web Services also garnered a lot of buzz by introducing AWS DeepRacer, a scaled-down autonomous model race car. This isn’t just another toy, however. DeepRacer is a great tool for teaching machine learning concepts such as reinforcement learning, and for mastering underlying AWS services such as AWS SageMaker. Coupled with the large number of autonomous rides powered by Aptiv, many attendees will no doubt be inspired to use DeepRacer to study up on ML concepts.

AppD at re:Invent


AppDynamics’ Subarno Mukherjee leading an AWS session.

AppDynamics Senior Solutions Architect Subarno Mukherjee led an AWS session called “Five Ways Application Insights Impact Migration Success.” Subarno offered great insights into the importance of the customer and user experience, and how it impacts the success of your public cloud migration.

AppDynamics ran ongoing presentations in our booth throughout the show, covering the shift of cloud workloads to serverless and containers, maturing DevOps capabilities and processes, and the impending shift to AIOps.

Attendees showed a lot of interest in our Lambda Beta Program, too. Feel free to take a look at our program and, if interested, sign up today!

AppD Social Team

The AppDynamics social team was in full swing at re:Invent. From handing out prizes and swag to helping expand the AppDynamics community, we had a great time meeting our current and future customers. AppDynamics and AWS also co-hosted a fantastic happy hour at The Yardbird after Thursday’s sessions.

See You Next Year

We were thrilled to be a part of this year’s re:Invent. The great conversations and shared insights were fantastic. We’re looking forward to expanding our AWS ecosystem and partnerships—and to returning to re:Invent in 2019!


AIOps: A Self-Healing Mentality

When I first watched Minority Report back in 2002, the film’s premise made me optimistic: Crime could be prevented with the help of Precogs, a trio of mutant psychics capable of “previsualizing” crimes and enabling police to stop murderers before they act. What a great utopia!

I quickly realized, however, that this “utopia” was in fact a dystopian nightmare. I left the theater feeling confident that key elements of Minority Report’s bleak future—city-wide placement of iris scanners, for instance—would never come to pass. Fast forward to today, however, and ubiquitous iris-scanning doesn’t seem so far-fetched. Don’t believe me? Simply glance at your smartphone and the device unlocks.

This isn’t dystopian stuff, however. Rather, today’s consumer is enjoying the benefits that machine learning and artificial intelligence provide. From Amazon’s product recommendations to Netflix’s show suggestions to Lyft’s passenger predictions, these services—while not foreseeing crime—greatly enhance the user experience.

The systems that run these next-generation features are vastly complex, ingesting a large corpus of data and continually learning and adapting to help drive different decisions. Similarly, a new enterprise movement is underway to combine machine learning and AI to support IT operations. Gartner calls it “AIOps,” while Forrester favors “Cognitive Operations.”

A Hypothesis-Driven World

Hypothesis-driven analysis is not new to the business world. It impacts the average consumer in many ways, such as when a credit card vendor tweaks its credit-scoring rules to determine who should receive a promotional offer (and you get another packet in your mailbox). Or when the TSA decides to expand or contract its TSA PreCheck program.

Of course, systems with AI/ML are not new to the enterprise. Some parts of the stack, such as intrusion detection, have been using artificial intelligence and machine learning for some time.

But with potential AIOps use cases, we are entering an age where the entire soup-to-nuts of measuring user sentiment—everything from A/B testing to canary deployment—can be automated. And while there’s a sharp increase in the number of systems that can take action—CI/CD, IaaS, and container orchestrators are particularly well-suited to instruction—the harder part is drawing the conclusions, which is where AIOps systems will come into play.

The ability to make dynamic decisions and test multiple hypotheses without administrative intervention is a huge boon to business. In addition to myriad other skills, AIOps platforms could monitor user sentiment in social collaboration tools like Slack, for instance, to determine if some type of action or deeper introspection is required. This action could be something as simple as redeploying with more verbose logging, or tracing for a limited period of time to tune, heal, or even deploy a new version of an application.

AIOps: Precog, But in a Good Way

AIOps and cognitive operations may sound like two more enterprise software buzzwords to bounce around, but their potential should not be dismissed. According to Google’s Site Reliability Engineering workbook, self-healing and auto-healing infrastructures are critically important to the enterprise. What’s important to remember about AIOps and cognitive operations is that they enable self-healing before a problem occurs.

Of course, this new paradigm is no replacement for good development and operation practices. But more often than not, we take on new projects that may be ill-defined, or find ourselves dropped into the middle of a troubled project (or firestorm). In what I call the “fog of development,” no one person has an unobstructed, 360-degree view of the system.

What if the system could deliver automated insights that you could incorporate into your next software release? Having a systematic record of real-world performance and topology, rather than just tribal knowledge, is a huge plus. Just as the security world uses runtime application self-protection (RASP) platforms to shield a running application while engineers address the underlying issues in future versions, AIOps can keep production healthy while the root causes are engineered away. In some ways, AIOps and cognitive operations have much in common with the CAMS model, the core values of the DevOps movement: culture, automation, measurement and sharing. Wouldn’t it be nice to automate the healing as well?

The Human Masterminds Behind AI at AppDynamics

A renowned data scientist at Bell Laboratories, Tian Bu was ready for a new challenge in early 2015. But of all the places he imagined himself working, Cisco wasn’t on the list. Bu thought of Cisco as a hardware company whose business appeared to lack the very thing that mattered most to him—compelling problems that could be solved through a deep understanding of data. However, at the urging of a friend, Bu agreed to take a closer look.

What he found surprised and intrigued him. Earlier that year, Cisco had begun talking up a more software-centric approach with the announcement of the Cisco ONE software licensing program. But there was a great deal more to the new software-centric strategy than what had been publicly announced. Cisco was planning to disrupt the market and itself with a highly secure, intelligent networking platform designed to continually learn, adapt, automate, and protect. Such a platform would depend on machine learning and artificial intelligence. Cisco was offering Bu an opportunity he had been preparing for his entire career.

Bu had joined the Labs in 2002 as a member of the technical staff after distinguishing himself as a Ph.D. student at the University of Massachusetts, Amherst. With the support of DARPA and in collaboration with the Lawrence Berkeley National Laboratory, he had applied the same tomographic techniques used in medical imaging to the Internet, creating algorithms for predicting bottlenecks and other issues. A paper he co-authored on the project,  “Network Tomography on General Topologies,” was published in the Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems and recognized ten years later with a “Test of Time” award.

In 2007, Bell Labs Ventures approached Bu about creating an internal startup to commercialize his research on analyzing and optimizing wireless networks. Within 18 months, the technology was deployed in several Tier One networks. Momentum continued to build, and the startup was acquired by Alcatel-Lucent’s network intelligence business unit in 2010. In 2012, the Labs lured Bu back with the promise of applied research. For nearly three more years, he delved into questions about wireless networking and data monetization.

Joining Cisco would represent a radical change. If Cisco succeeded in its transformation, Bu would be at the forefront of figuring out how to automate IT and design genuinely self-healing systems. Not all the pieces were in place, but neither Cisco nor Bu could afford to wait. He decided to take a leap of faith and begin building a team.

His first hire was Anne Sauve, an expert in forecasting with a Ph.D. in electrical engineering. Sauve had a unique background, which Bu believed would be useful in finding insights into the millions of metrics per second that were streaming in from modern IT systems. During her doctoral studies at the University of Michigan, Sauve had specialized in statistical signal processing. Since then she had built up six years of experience in bioinformatics and genomics and nine years in medical imaging and 3D modeling. Her last job before joining Cisco was at a startup, where she developed a churn predictor for customer renewals and natural language processing algorithms to derive insights from customer tickets.

“What I liked about Cisco was its culture of rigorous engineering and the fact that it is grounded in reality,” she said. As Sauve dove into her work, producing a time series clustering algorithm to help determine the root cause of performance issues from streaming data and a new ensemble approach to forecasting, a second data scientist named Jiabin Zhao joined the group. An internal transfer from Cisco, Zhao brought more than a decade of experience working with IT data.

When Cisco acquired AppDynamics and Perspica in 2017, the size of the team more than doubled. AppDynamics had two seasoned data scientists: Yuchen Zhao and Yi Hong. Zhao and Hong had both worked for several years applying machine learning to the root cause analysis of problems affecting application performance. Their work included the algorithms that allowed customers to search for the relevant fields that were causing a business transaction to slow down. In addition, Zhao had shared two patents with Arjun Iyer, the senior engineering director, on automating log analysis and anomaly detection.

While AppDynamics’ strength lay in surfacing insights from stored data, Perspica applied machine learning and artificial intelligence to massive amounts of streaming data. Its cloud-based analysis engine could ingest and process millions of data points in real time. It offered the ability to automate threshold management and root cause analysis (RCA) and to predict problems at scale, complementing AppDynamics’ approach to those problems. While the pieces would have to be integrated, together they represented an extremely powerful AI solution.

From Bu’s point of view, the influx of talent from AppD and Perspica was as important as the technology. J.F. Huard, Perspica’s founder, now CTO of Data Science at AppDynamics, and Philip Labo, Perspica’s principal data scientist, were particularly strong additions to the team. Like Bu, Huard had spent time in the early 1990s at Bell Labs while simultaneously earning a doctorate at Columbia University. His research focus in those days was expert systems for network management. After graduating, he pioneered the application of advanced math to provide QoS in programmable networks at a company he co-founded called Xbind. He subsequently started three more companies, including one that managed dynamic resource allocation based on game theory and another that focused on predictive analytics. Perspica was Huard’s fifth company.

Years of experience had brought Bu and Huard to the same conclusion: progress in machine learning and AI came from applying the right solution. It was insight and experience that distinguished one data scientist from another.

Labo was a post-doc at Stanford University when he met Huard to interview for a job at Perspica. He remembered how Huard had enthusiastically described a problem and then asked him to solve it. “I was thinking of elaborate solutions based on my work at Stanford,” Labo recalled. “JF was like, ‘No! Principal Component Analysis.’” PCA was a statistical procedure invented in 1901, and Labo was initially unimpressed. But as he thought about it more,  he realized PCA represented an elegant and simple solution to the problem Huard had posed.

Labo was drawn to the opportunity to put his background in applied math to work solving real-world problems for customers. In graduate school he had developed expertise in real-time multivariate analysis. Though the focus of his work was change point detection in yeast population evolution, the underlying ideas were curiously applicable to multivariate anomaly detection in computer data. “There’s something really funny about math in general and applied math in particular,” he said. “It just kind of works in a lot of different situations.”

Bu said Labo’s training has indeed been useful as the team has doubled down on multivariate anomaly detection. Overall, the diversity of backgrounds and depth of experience ensures that AppD will not blindly apply AI, but will choose the most appropriate solutions—ones that are both high quality and efficient to implement.

Given an industry shortage of senior data scientists, Bu said he feels particularly lucky to have a team that has spent years applying machine learning and AI to the entire stack—from applications to the network and beyond. “The strength of the team is that we are not just data scientists who know our math, we are also very familiar with the IT analytics domain,” he said.

The automation of IT at AppDynamics and Cisco is well on its way, Bu noted, with the right people applying the right solutions to important industry problems. For now, the team is focused on time series analysis, classification, and clustering. AppDynamics will be talking more in the near future about how customers can leverage their progress to spot problems sooner, find the root cause faster, and reduce system downtime.

Until then? “We are full speed ahead,” Bu said.


8 Reasons Enterprises Are Slow to Adopt Machine Learning

As CTO of Data Science at AppDynamics, and in my previous role as co-founder of Perspica, I’ve seen machine learning make huge strides in recent years. ML has helped Netflix perfect binge watching, taught Siri how to sound more human and made Amazon Echo a fashion consultant. But when it comes to machine learning use cases for the enterprise, it gets a whole lot more complicated. It’s easy to apply an algorithm to a one-off use case, but comprehensive enterprise applications of machine learning don’t exist today.

Here are the top 8 challenges standing in the way of widespread adoption of machine learning in the enterprise.

1) Confusion Over What Constitutes Machine Learning

Part of the problem is a lack of understanding around what machine learning is. Machine learning is an application or subset of AI, which is generally thought of as higher-order decision-making intelligence.

Machine learning is really about applying mathematics to different domains. It locates meaning within extremely large volumes of data by canceling out the noise. It uses algorithms to parse the data and draw conclusions about it, such as what constitutes normal behavior.

2) Uncertainty About What Machine Learning Can Do

Machine-learning algorithms don’t enter chess tournaments. What they are really good at is adapting to changing systems without human intervention while continuing to differentiate between expected and anomalous behavior. This makes machine learning useful in all kinds of applications—think everything from security to health care—as well as classification and recommendation engines, and voice and image identification systems.

Consumers interact daily with dozens of machine learning systems including Google Search, Google ads, Facebook ads, Siri and Alexa, as well as virtually any online product recommendation engine from Amazon to Netflix. The challenge for enterprises is understanding how machine learning can add value to their business.

3) Getting Started Can Be Daunting

Machine learning is usually introduced into an enterprise in one of two ways. The first is that one or two employees start applying machine learning to gain insight into data they already have access to. This requires a certain amount of expertise in data science and domain knowledge—skills that are in short supply.

The second is by purchasing a solution that uses machine learning, such as security software or an application performance management solution. This is by far the easiest way to begin to realize some of the benefits of machine learning, but the downside is that the enterprise is dependent on the vendor and is not developing its own machine learning capabilities.

4) The Challenge of Data Preparation

Machine learning can sound deceptively simple. It’s easy to assume that all you have to do is collect the data and run it through some algorithms. The reality is very different. Once you collect the data, you then have to aggregate it. You need to determine if there are any problems with it. Your algorithm needs to be able to adapt to missing data, outlying data, garbage data, and data that’s out of sequence.
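A small pandas sketch of that unglamorous preparation work, handling out-of-sequence, missing and garbage samples (the outlier rule is an illustrative assumption):

```python
# Illustrative data preparation for a metric stream: fix ordering, fill a
# missing sample, and clip garbage. The 3x-median outlier rule is an assumption.
import pandas as pd

raw = pd.DataFrame({
    "ts":    ["09:02", "09:00", "09:01", "09:03", "09:04", "09:05"],
    "value": [101.0,    99.0,   None,    103.0,   9999.0,  100.0],
})

df = raw.sort_values("ts").reset_index(drop=True)  # repair out-of-sequence rows
df["value"] = df["value"].interpolate()            # fill the missing sample

# Replace absurd outliers with a rolling median of their neighborhood.
median = df["value"].rolling(window=3, center=True, min_periods=1).median()
df.loc[df["value"] > 3 * median, "value"] = median

print(df)  # clean, ordered, gap-free series ready for an algorithm
```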

5) The Lack of Public Labelled Datasets

In order for an algorithm to make sense of a collection of data points, it needs to understand what those points represent. In other words, it needs to be able to apply pre-established labels to the data.

The availability of publicly labelled datasets would make it much easier for companies to get started with machine learning. Unfortunately, these do not yet exist, and without them, most companies are looking at a “cold start.”

6) The Need for Domain Knowledge

At its best, machine learning represents the perfect marriage between an algorithm and a problem. This means domain knowledge is a prerequisite for effective machine learning, but there is no off-the-shelf way to obtain domain knowledge. It is built up in organizations over time and includes not just the inner workings of specific companies and industries, but the IT systems they use and the data that is generated by them.

7) Hiring Brilliant Data Scientists Is Not a Panacea

Most data scientists are mathematicians. Depending on their previous job experience, they may have zero domain knowledge that is relevant to their employer’s business. They need to be paired up with analysts and domain experts, which increases the cost of any machine learning project. And these people are hard to find and in high demand. We are lucky at AppDynamics to have a team of data scientists with broad experience in multiple fields who are doing ground-breaking work.

8) Machine Learning Lacks a Shared Vocabulary

One of the challenges encountered by organizations with successful machine learning initiatives is the lack of conventions around communicating findings. They end up with silos of people, each with their own definition of input and their own approach to sampling data. Consequently, they end up with wildly different results. This makes it difficult to inspire confidence in machine learning initiatives and will slow adoption until it is addressed.

At AppDynamics we’re excited to apply our machine learning expertise to solving enterprise IT problems. And you may be interested in my insights on how the arrival of AI and machine learning in the enterprise will have a profound impact on IT departments.