Successfully Deploying AIOps, Part 3: The AIOps Apprenticeship

Part one of our series on deploying AIOps identified how an anomaly breaks into two broad areas: problem time and solution time. Part two described the first deployment phase, which focuses on reducing problem time. With trust in the AIOps system growing, we’re now ready for part three: taking on solution time by automating actions.

Applying AIOps to Mean Time to Fix (MTTF)

The power of AIOps comes from continuous enhancement of machine learning powered by improved algorithms and training data, combined with the decreasing cost of processing power. A measured example was Google’s project for accurately reading street address numbers from its street image systems—a necessity in countries where address numbers don’t run sequentially but rather are based on the age of the buildings. Humans examining photos of street numbers have an accuracy of 98%. Back in 2011, the available algorithms and training data produced a trained model with 91% accuracy. By 2013, improvements and retraining boosted this number to 97.5%. Not bad, though humans still had the edge. In 2015, the latest ML models passed human capability at 98.1%. This potential for continuous enhancement makes AIOps a significant benefit for operational response times.

You Already Trust AI/ML with Your Life

If you’ve flown commercially in the past decade, you’ve trusted the autopilot for part of that flight. At some major airports, even the landings are automated, though taxiing is still left to pilots. Despite already trusting AI/ML to this extent, enterprises need more time to trust AI/ML in newer fields such as AIOps. Let’s discuss how to build that trust.

Apprenticeships allow new employees to learn from experienced workers and avoid making dangerous mistakes. They’ve been used for ages in multiple professions; even police departments have a new academy graduate ride along with a veteran officer. In machine learning, ML frameworks need to see meaningful quantities of data in order to train themselves and create nested neural networks that form classification models. By treating automation in AIOps like an apprenticeship, you can build trust and gradually weave AIOps into a production environment.

By this stage, you should already be reducing problem time by deploying AIOps, which delivers significant benefits before adding automation to the mix. These advantages include the ability to train the model with live data, as well as observe the outcomes of baselining. This is the first step towards building trust in AIOps.

Stage One: AIOps-Guided Operations Response

With AIOps in place, operators can address anomalies immediately. At this stage, operations teams are still reviewing anomaly alerts to ensure their validity. Operations is also parsing the root cause(s) identified by AIOps to select the correct issue to address. While remediation is manual at this stage, you should already have a method of tracking common remediations.

In stage one, your operations teams oversee the AIOps system and simultaneously collect data to help determine where auto-remediation is acceptable and necessary.

Stage Two: Automate Low Risk

Automated computer operations began around 1964 with IBM’s OS/360 operating system, which allowed operators to combine multiple individual commands into a single script, automating what had been a series of manual steps. Initially, the goal was to identify specific, recurring manual tasks and figure out how to automate them. While this approach delivered short-term benefits, building isolated, automated processes incurred technical debt, both for future updates and for eventual integration across multiple domains. Ultimately it became clear that a platform approach to automation could reduce that technical debt.

Automation in the modern enterprise should be tackled like a microservices architecture: Use a single domain’s management tool to automate small actions, and make these services available to complex, cross-domain remediations. This approach allows your investment in automation to align with the lifespan of the single domain. If your infrastructure moves VMs to containers, the automated services you created for networking or storage are still valid.
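As a sketch of this microservices-style approach, the small single-domain actions below (hypothetical stubs standing in for real management-tool calls, not any particular product’s API) can be composed into a cross-domain remediation:

```python
# Sketch of "automation as microservices": each function is a small,
# single-domain action owned by that domain's management tooling.
# The function names and return values are illustrative assumptions.
def scale_web_tier(instances: int) -> str:        # compute domain
    return f"web tier scaled to {instances} instances"

def widen_bandwidth(mbps: int) -> str:            # network domain
    return f"bandwidth raised to {mbps} Mbps"

def flush_cdn_cache(zone: str) -> str:            # CDN domain
    return f"cache flushed for {zone}"

def remediate_traffic_surge() -> list[str]:
    """A cross-domain remediation composed from single-domain services."""
    return [scale_web_tier(12), widen_bandwidth(500), flush_cdn_cache("eu-west")]

for step in remediate_traffic_surge():
    print(step)
```

If the compute domain later moves from VMs to containers, only `scale_web_tier` needs rewriting; the network and CDN actions, and the composition itself, remain valid.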

You will not automate every single task. Selecting what to automate can be tricky, so when deciding whether to fully automate an anomaly resolution, use these five questions to identify the potential value:

  • Frequency: Does the anomaly resolution occur often enough to warrant automation?
  • Impact: Are you automating the solution to a major issue?
  • Coverage: What proportion of the real-world process can be automated?
  • Probability: Does the process always produce the desired result, or can it be affected by environmental factors?
  • Latency: Will automating the task achieve a faster resolution?
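As a rough sketch, the five questions above could be turned into a simple scoring model for ranking automation candidates. The weights, normalization cap, and example remediations here are illustrative assumptions, not part of any AIOps product:

```python
# Hypothetical scoring sketch for ranking anomaly remediations as
# automation candidates, based on the five questions in the text.
from dataclasses import dataclass

@dataclass
class RemediationCandidate:
    name: str
    frequency: float     # occurrences per month
    impact: float        # 0-1, severity of the issue it resolves
    coverage: float      # 0-1, fraction of the process automatable
    probability: float   # 0-1, chance the fix works as intended
    latency_gain: float  # 0-1, relative speedup vs. manual response

    def score(self) -> float:
        # Normalize frequency to 0-1 (cap at 30 occurrences/month),
        # then weight the five factors equally.
        freq = min(self.frequency / 30.0, 1.0)
        return (freq + self.impact + self.coverage
                + self.probability + self.latency_gain) / 5.0

candidates = [
    RemediationCandidate("restart-stuck-worker", 20, 0.6, 0.9, 0.95, 0.8),
    RemediationCandidate("rebuild-prod-database", 1, 0.9, 0.3, 0.5, 0.2),
]
for c in sorted(candidates, key=lambda c: c.score(), reverse=True):
    print(f"{c.name}: {c.score():.2f}")
```

A frequent, reliable, low-risk fix like restarting a stuck worker outranks a rare, partially automatable, risky one like rebuilding a production database, which matches the intuition behind the five questions.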

Existing standard operating procedures (SOPs) are a great place to start. With SOPs, you’ve already decided how you want a task performed, have documented the process, and likely have some form of automation (scripts, etc.) in place. Another early focus is to address resource constraints by adding front-end web servers when traffic is high, or by increasing network bandwidth. Growing available resources is low risk compared to restarting applications. While bandwidth expansion may impact your budget, it’s unlikely to break your apps. And by automating resource constraint remediations, you’re adding a rapid response capability to operations.

In stage two, you augment your operations teams with automated tasks that can be triggered in response to AIOps-identified anomalies.

Stage Three: Connect Visibility to Action (Trust!)

As you start to use automated root cause analysis (RCA), it’s critical to understand the probability concept of machine learning. Surprisingly, for a classical computer technology, ML does not output a binary, 0 or 1 result, but rather produces statistical likelihoods or probabilities of the outcome. The reason this outcome sometimes looks definitive is that a coder or “builder” (the latter if you’re AWS’s Andy Jassy) has decided an acceptable probability will be chosen as the definitive result. But under the covers of ML, there is always a percentage likelihood. The nature of ML means that RCA sometimes will result in a selection of a few probable causes. Over time, the system will train itself on more data and probabilities and grow more accurate, leading to single outcomes where the root cause is clear.
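The thresholding idea can be shown in a few lines. The probabilities and cause names below are invented for illustration; the point is only that a “definitive” answer is a probability that cleared a builder-chosen bar:

```python
# Sketch of how a "definitive" ML answer is really a thresholded
# probability. Causes and probabilities are illustrative assumptions.
def pick_root_cause(probs: dict[str, float], threshold: float = 0.8):
    """Return a single cause if one is confident enough,
    otherwise a ranked shortlist of probable causes."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    top_cause, top_prob = ranked[0]
    if top_prob >= threshold:
        return top_cause        # looks definitive to the operator
    return ranked[:3]           # early in training: a few suspects

# Early model: no cause clears the bar, so a shortlist is returned.
early = {"db-connection-pool": 0.45, "gc-pause": 0.35, "network": 0.20}
# Mature model: one cause dominates and is returned alone.
mature = {"db-connection-pool": 0.92, "gc-pause": 0.05, "network": 0.03}
print(pick_root_cause(early))
print(pick_root_cause(mature))
```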

Once trust in RCA is established (stage one), and remediation actions are automated (stage two), it’s time to remove the manual operator from the middle. The low-risk remediations identified in stage two can now be connected to the specific root cause as a fully automated action.

The benefits of automated operations are often listed as cost reduction, productivity, availability, reliability and performance. While all of these apply, there’s also the significant benefit of expertise time. “The main upshot of automation is more free time to spend on improving other parts of the infrastructure,” according to Google’s SRE project. The less time your experts spend in MTTR steps, the more time they can spend on preemption rather than reaction.

Similar to DevOps, AIOps will require a new mindset. After a successful AIOps deployment, your team will be ready to transition from its existing siloed capabilities. Each team member’s current specialization(s) will need to be accompanied with broader skills in other operational silos.

AIOps augments each operations team, including ITOps, DevOps and SRE. By giving each team ample time to move into preemptive mode, AIOps ensures that applications, architectures and infrastructures are ready for the rapid transformations required by today’s business.

Successfully Deploying AIOps, Part 2: Automating Problem Time

In part one of our Successfully Deploying AIOps series, we identified how an anomaly breaks into two broad areas: problem time and solution time. The first phase in deploying AIOps focuses on reducing problem time, with some benefit in solution time as well. This simply requires turning on machine learning within an AIOps-powered APM solution. Existing operations processes will still be defining, selecting and implementing anomaly rectifications. When you automate problem time, solution time commences much sooner, significantly reducing an anomaly’s impact.

AIOps: Not Just for Production

Anomalies in test and quality assurance (QA) environments cost the enterprise time and resources. AIOps can deliver significant benefits here. Applying the anomaly resolution processes seen in production will assist developers navigating the deployment cycle.

Test and QA environments are expected to identify problems before production deployment. Agile and DevOps approaches have introduced rapid, automated building and testing of applications. Though mean time to resolution (MTTR) is commonly not measured in test and QA environments (which aren’t as critical as those supporting customers), the benefits to time and resources still pay off.

Beginning your deployment in test and QA environments allows a lower-risk, yet still valuable, introduction to AIOps. These pre-production environments have less business impact, as they are not visited by customers. Understanding performance changes between application updates is critical to successful deployment. Remember, as the test and QA environments will not carry the production workload, it’s best to recreate it with simulated workloads through synthetic testing.

With trust in AIOps built from first applying AIOps to mean time to detect (MTTD), mean time to know (MTTK) and mean time to verify (MTTV) in your test and QA environments, your next step will be to apply these benefits to production. Let’s analyze where you’ll find these initial benefits.

Apply AI/ML to Detection (MTTD)

An anomaly deviates from what is expected or normal. Detecting an anomaly requires a definition of “normal” and a monitoring of live, streaming metrics to see when they become abnormal. A crashing application is clearly an anomaly, as is one that responds poorly or inconsistently after an update.

With legacy monitoring tools, defining “normal” was no easy task. Manually setting thresholds required operations or SRE professionals to guesstimate thresholds for all metrics measured by applications, frameworks, containers, databases, operating systems, virtual machines, hypervisors and underlying storage.

AIOps removes the stress of threshold-setting by letting machine learning baseline your environment. AI/ML applies mathematical algorithms to different data features seeking correlations. With AppDynamics, for example, you simply run APM for a week. AppDynamics observes your application over time and creates baselines, with ML observing existing behavioral metrics and defining a range of normal behavior with time-based and contextual correlation. Time-based correlation removes alerts related to the normal flow of business—for example, the login spike that occurs each morning as the workday begins; or the Black Friday or Guanggun Jie traffic spikes driven by cultural events. Contextual correlation pairs metrics that track together, enabling anomaly identification and alerts later when the metrics don’t track together.
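To make time-based correlation concrete, here is a minimal baselining sketch: it learns a per-hour-of-day “normal” range from a week of observations, then flags values outside that range. Real AIOps baselining is far richer than this; the class, tolerance, and traffic numbers are illustrative assumptions:

```python
# Minimal time-based baselining sketch: the 9 a.m. login spike is
# normal at 9 a.m., but the same value at 3 a.m. is an anomaly.
import statistics
from collections import defaultdict

class HourlyBaseline:
    def __init__(self, tolerance: float = 3.0):
        self.samples = defaultdict(list)  # hour-of-day -> observed values
        self.tolerance = tolerance        # allowed deviation in std-devs

    def observe(self, hour: int, value: float):
        self.samples[hour].append(value)

    def is_anomalous(self, hour: int, value: float) -> bool:
        history = self.samples[hour]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history) or 1.0
        return abs(value - mean) / stdev > self.tolerance

baseline = HourlyBaseline()
for day in range(7):                                  # a week of training data
    baseline.observe(hour=9, value=1000 + day * 10)   # morning login spike
    baseline.observe(hour=3, value=50 + day)          # quiet overnight
print(baseline.is_anomalous(hour=9, value=1020))      # normal morning spike
print(baseline.is_anomalous(hour=3, value=1020))      # same load at 3 a.m.
```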

AIOps will define “normal” by letting built-in ML watch the application and automatically create a baseline. So again, install APM and let it run. If you have specific KPIs, you can add these on top of the automatic baselines as health rules. With baselines defining normal, AIOps will watch metric streams in real time, with the model tuned to identify anomalies in real time, too.

Apply AI/ML to Root Cause Analysis (MTTK)

The first step to legacy root cause analysis (RCA) is to recreate the timeline: When did the anomaly begin, and what significant events occurred afterward? You could search manually through error logs to uncover the time of the first error. This can be misleading, however, as sometimes the first error is an outcome, not a cause (e.g., a crash caused by a memory overrun is the result of a memory leak running for a period of time before the crash).

In the midst of an anomaly, multiple signifiers often will indicate fault. Logs will show screeds of errors caused by stress introduced by the fault, but fail to identify the underlying defect. The operational challenge is unpacking the layers of resultant faults to identify root cause. By pinpointing this cause, we can move onto identifying the required fix or reconfiguration to resolve the issue.

AIOps creates this anomaly timeline automatically. It observes data streams in real time and uses historical and contextual correlation to identify the anomaly’s origin, as well as any important state changes during the anomaly. Even with a complete timeline, it’s still a challenge to reduce the overall noise level. AIOps addresses this by correlating across domains to filter out symptoms from possible causes.

There’s a good reason why AIOps’ RCA output may not always identify a single cause. Trained AI/ML models do not always produce a zero or one outcome, but rather work in a world of probabilities or likelihoods. The output of a self-taught ML algorithm will be a percentage likelihood that the resulting classification is accurate. As more data is fed to the algorithm, these outcome percentages may change if new data makes a specific output classification more likely. Early snapshots may indicate a priority list of probable causes that later refine down to a single cause, as more data runs through the ML models.

RCA is one area where AI/ML delivers the most value, and the time spent on RCA is the mean time to know (MTTK). While operations is working on RCA, the anomaly is still impacting customers. The pressure to conclude RCA quickly is why war rooms get filled with every possible I-shaped professional (a deep expert in a particular silo of skills) in order to eliminate the noise and get to the signal.

Apply AI/ML to Verification

Mean time to verify (MTTV) is the remaining MTTR portion automated in phase one of an AIOps rollout. An anomaly concludes when the environment returns to normal, or even to a new normal. The same ML mechanisms used for detection will minimize MTTV, as baselines already provide the definition of normal you’re seeking to regain. ML models monitoring live ETL streams of metrics from all sources provide rapid identification when the status returns to normal and the anomaly is over.
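A toy sketch of automated return-to-normal verification: the anomaly is declared over only after the metric stays inside its baseline band for several consecutive samples, avoiding a premature all-clear on a brief dip. The band, streak length, and latency values are illustrative assumptions:

```python
# Sketch of return-to-normal verification against a baseline band.
def verify_recovery(samples, low, high, required_consecutive=3):
    """Return the sample index at which recovery is confirmed, or None."""
    streak = 0
    for i, value in enumerate(samples):
        streak = streak + 1 if low <= value <= high else 0
        if streak >= required_consecutive:
            return i
    return None

# Response times (ms): anomaly, a brief false recovery, then stable.
latency = [950, 980, 210, 940, 220, 200, 190, 205]
print(verify_recovery(latency, low=100, high=300))  # confirmed at index 6
```

Note that the single in-band sample at index 2 does not end the anomaly; only the sustained run starting at index 4 does.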

Later in your rollout, when AIOps is powering fully automated responses, this rapid observation and response is critical, as anomalies are resolved without human intervention. Part three of this series will discuss connecting this visibility and insight to action.

Successfully Deploying AIOps, Part 1: Deconstructing MTTR

Somewhere between waking up today and reading this blog post, AI/ML has done something for you. Maybe Netflix suggested a show, or DuckDuckGo recommended a website. Perhaps it was your photos application asking you to confirm the tag of a specific friend in your latest photo. In short, AI/ML is already embedded into our lives.

The sheer quantity of metrics in development, operations and infrastructure makes these disciplines perfect partners for machine learning. Given this general acceptance of AI/ML, it is surprising that organizations are lagging in implementing machine learning in operations automation, according to Gartner.

The level of responsibility you will assign to AIOps and automation comes from two factors:

  • The level of business risk in the automated action
  • The observed success of AI/ML matching real world experiences

The good news is this is not new territory; there is a tried-and-true path for automating operations that can easily be adjusted for AIOps.

It Feels Like Operations is the Last to Know

The primary goal of the operations team is to keep business applications functional for enterprise customers or users. They design, “rack and stack,” monitor performance, and support infrastructure, operating systems, cloud providers and more. But their ability to focus on this prime directive is undermined by application anomalies that consume time and resources, reducing team bandwidth for preemptive work.

An anomaly deviates from what is expected or normal. A crashing application is clearly an anomaly, yet so too is one that was updated and now responds poorly or inconsistently. Detecting an anomaly requires a definition of “normal,” accompanied by monitoring of live streaming metrics to spot when the environment exhibits abnormal behavior.

The majority of enterprises are alerted to an anomaly by users or non-IT teams before IT detects the problem, according to a recent AppDynamics survey of 6,000 global IT leaders. This disappointing outcome can be traced to three trends:

  • Exponential growth of uncorrelated log and metric data, triggered by DevOps and continuous integration/continuous delivery (CI/CD) pipelines automating the build and deployment of applications.
  • Exploding application architecture complexity with service architectures, multi-cloud, serverless, isolation of system logic and system state—all adding dynamic qualities defying static or human visualization.
  • Siloed IT operations and operational data within infrastructure teams.

Complexity and data growth overload development, operations and SRE professionals with data rather than insight, while siloed data prevents each team from seeing the full application anomaly picture.

Enterprises adopted agile development methods in the early 2000s to wash away the time and expense of waterfall approaches. This focus on speed came with technical debt and lower reliability. In the mid-2000s, manual builds and testing were identified as the impediment, leading first to DevOps and later to CI/CD.

DevOps allowed development to survive agile and extreme approaches by transforming development—and particularly by automating testing and deployment—while leaving production operations basically unchanged. The operator’s role in maintaining highly available and consistent applications still consisted of waiting for someone or something to tell them a problem existed, after which they would manually push through a solution. Standard operating procedures (SOPs) were introduced to prevent the operator from accidentally making a situation worse for recurring repairs. There were pockets of successful automation (e.g., tuning the network) but mostly the entire response was still reactive. AIOps is now stepping up to allow operations to survive in this complex environment, as DevOps did for the agile transformation.

Reacting to Anomalies

DevOps automation removed a portion of production issues. But in the real world there’s always the unpredictable SQL query, API call, or even the forklift driving through the network cable. The good news is that the lean manufacturing approach that inspired DevOps can be applied to incident management.

To understand how to deploy AIOps, we need to break down the “assembly line” used to address an anomaly. The time spent reacting to an anomaly can be broken into two key areas: problem time and solution time.

Problem time: The period when the anomaly has not yet been addressed.

Anomaly management begins with time spent detecting a problem. The AppDynamics survey found that 58% of enterprises still find out about performance issues or full outages from their users. Calls arrive and service tickets get created, triggering professionals to examine whether there really is a problem or just user error. Once an anomaly is accepted as real, the next step generally is to create a war room (physical or Slack channel), enabling all the stakeholders to begin root cause analysis (RCA). This analysis requires visibility into the current and historical system to answer questions like:

  • How do we recreate the timeline?
  • When did things last work normally, and when did the anomaly begin?
  • How are the application and underlying systems currently structured?
  • What has changed since then?
  • Are all the errors in the logs the result of one or multiple problems?
  • What can we correlate?
  • Who is impacted?
  • Which change is most likely to have caused this event?

Answering these questions leads to the root cause. During this investigative work, the anomaly is still active and users are still impacted. While the war room is working tirelessly, no action to actually rectify the anomaly has begun.

Solution time: The time spent resolving the issues and verifying return-to-normal state.

With the root cause and impact identified, incident management finally crosses over to spending time on the actual solution. The questions in this phase are:

  • What will fix the issue?
  • Where are these changes to be made?
  • Who will make them?
  • How will we record them?
  • What side effects could there be?
  • When will we do this?
  • How will we know it is fixed?
  • Was it fixed?

Solution time is where we solve the incident rather than merely understanding it. Mean time to resolution (MTTR) is the key metric we use to measure the operational response to application anomalies. After deploying the fix and verifying return-to-normal state, we get to go home and sleep.

Deconstructing MTTR

MTTR originated in the hardware world as “mean time to repair”— the full time from error detection to hardware replacement and reinstatement into full service (e.g., swapping out a hard drive and rebuilding the data stored on it). In the software world, MTTR is the time from software running abnormally (an anomaly) to the time when the software has been verified as functioning normally.

Measuring the value of AIOps requires breaking MTTR into subset components. Different phases in deploying AIOps will improve different portions of MTTR. Tracking these subdivisions before and after deployment allows the value of AIOps to be justified throughout.
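An illustrative decomposition makes the point. The minute values below are made up; what matters is that each AIOps phase targets a different component of MTTR, so tracking them separately shows where each phase pays off:

```python
# Illustrative breakdown of one incident's MTTR into the sub-metrics
# this series uses: detect, know, fix, verify. Values are invented.
incident_minutes = {
    "MTTD": 42,   # detect: user call -> confirmed anomaly
    "MTTK": 95,   # know: war room -> root cause identified
    "MTTF": 30,   # fix: remediation applied
    "MTTV": 15,   # verify: return-to-normal confirmed
}
mttr = sum(incident_minutes.values())
print(f"MTTR = {mttr} minutes")
for phase, minutes in incident_minutes.items():
    print(f"{phase}: {minutes:4d} min ({minutes / mttr:.0%} of MTTR)")
```

In this hypothetical incident, root cause analysis (MTTK) dominates, which is exactly the component phase one of an AIOps deployment attacks.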

With this understanding and measurement of existing processes, the strategic adoption of AIOps can begin, which we discuss in part two of this series.

Cognition Engine Unifies AIOps and Application Intelligence

When we welcomed Perspica into the AppDynamics family in 2017, I knew we were going to change the application performance monitoring industry in a big way. And that’s why today is so important for us.

Earlier this morning, we launched Cognition Engine – the next evolution of application performance monitoring that will give customers new levels of insight for a competitive edge in today’s digital-first economy.

When our customers told us that they would spend hours – sometimes days and weeks – to identify the root cause of performance issues, we knew we needed to bring a product to market that would alleviate this pain. And with Cognition Engine, that’s precisely the goal.

You can think of Cognition Engine as a culmination of the best features we’ve brought to market in the past — coupled with new and cutting-edge diagnostic capabilities leveraging the latest in AI/ML technology made possible by our Perspica acquisition. Now, IT teams no longer have to chase symptoms to find the root cause because the top suspects are automatically surfaced.

This level of insight from Cognition completely changes the game for IT, freeing them of tedious tasks and empowering them to focus on projects that will have great business impact. Below are some of Cognition Engine’s core benefits and features:

Avoid Customer-Impacting Performance Issues with Anomaly Detection

Cognition Engine ingests, processes, and analyzes millions of records per second, automatically understanding how metrics correlate, and detecting problems within minutes – giving IT a head start on fixing the problem before it impacts customers.

  • Using ML models, Anomaly Detection automatically evaluates healthy behavior for your application so that you don’t have to manually configure health rules.
  • Get alerts for key Business Transactions to deliver swift diagnostics, root-cause analysis, and remediation down to the line of code, function, thread, or database causing problems.
  • Cognition evaluates data in real-time as it enters the system using streaming analytics technology, allowing teams to analyze metrics and their associated behaviors to evaluate the health of the entire Business Transaction.

Achieve Fastest MTTR with Automated Root Cause Analysis

Cognition Engine automatically isolates metrics that deviate from normal behavior and presents the top suspects of root cause for any application issue – drastically reducing time spent on identifying root cause of performance issues.  

  • Reduce MTTR from minutes to seconds by automating the knowledge of exactly where and when to initiate a performance fix.
  • Understand the contextual insights about application and business health, predict performance deviations, and get alerts before serious customer impact.
  • Self-learning agents take full snapshots of performance anomalies—including code, database calls, and infrastructure metrics—making it easy to determine root-cause.

What Cognition Engine Means for the Enterprise

Cognition Engine ultimately empowers enterprises to embrace an AIOps mindset – valuing proaction over reaction, answers over investigation and, most importantly, never losing focus on customer experience or business performance.

Learn more about Cognition Engine now.

The New Serverless Agent For AWS Lambda

To control costs and reduce the burden of infrastructure management in the cloud, more companies are using services like AWS Lambda to deploy serverless functions. Due to the unpredictable nature of end-user demand in today’s digital-first world, serverless functions that can be spun up as needed can also help resolve unplanned scaling issues.

But that’s not to say these serverless workloads don’t impact the overall performance of your application environment. In fact, since these workloads are transient in nature, they represent a real challenge for teams who need to correlate an issue across their application environment, or see the impact that serverless applications are having on end users—or even on the business itself.

How AppDynamics Helps

Today, we’re announcing a new family of application agents that help our customers who use serverless microservices gain more visibility and insight into the performance of their application and its impact on the broader ecosystem.

In the same way that we collect and baseline metrics and events for traditional applications, we can now help serverless users gain deep insight into response times, throughput and exception rates in applications built on any mixture of serverless and conventional runtimes. This brings our industry-leading ability to visualize end-user and business impact into the serverless realm, helping teams prioritize issue-resolution efforts and optimize the performance of these ephemeral workloads.

What We Do

The first iteration of AppDynamics’ Serverless Agent family targets Java microservices running in AWS Lambda, and is available as a beta program for qualified customers. Here’s how it works:

The Serverless Agent for AWS Lambda allows our customers to instrument their Lambda code at entry (when it is invoked from an external request source) and exit (when it invokes an external downstream service), and to ingest incoming or populate outgoing correlation headers. Our streamlined approach to collecting metrics and events from serverless functions also means you never have to worry about missing an important data point, or about slowing down your otherwise healthy serverless functions.
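The agent’s internals aren’t shown here, but the entry/exit idea can be sketched as a handler wrapper that times the invocation and propagates a correlation header. This is not the AppDynamics agent API: the header name, decorator, and handler are all hypothetical, and the handler is a plain function rather than a deployed AWS Lambda:

```python
# Simplified sketch of entry/exit instrumentation with
# correlation-header propagation for a Lambda-style handler.
import functools
import time
import uuid

CORRELATION_HEADER = "x-correlation-id"   # assumed header name

def instrument(handler):
    @functools.wraps(handler)
    def wrapper(event, context=None):
        # Entry: ingest the caller's correlation id, or start a new one.
        headers = event.get("headers", {})
        correlation_id = headers.get(CORRELATION_HEADER, str(uuid.uuid4()))
        start = time.monotonic()
        try:
            response = handler(event, context)
            # Exit: stamp the outgoing response so downstream services
            # can be correlated into the same business transaction.
            response.setdefault("headers", {})[CORRELATION_HEADER] = correlation_id
            return response
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"[{correlation_id}] handler took {elapsed_ms:.1f} ms")
    return wrapper

@instrument
def my_handler(event, context=None):
    return {"statusCode": 200, "body": "ok"}

resp = my_handler({"headers": {"x-correlation-id": "abc-123"}})
print(resp["headers"]["x-correlation-id"])  # propagated: abc-123
```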

You can find out more and sign up for AppDynamics’ AWS Lambda beta program on our community site.

Our Vision for AIOps: The Central Nervous System for IT

Exactly two years ago, Cisco announced their intent to acquire AppDynamics, and to say that it’s been quite a ride is a huge understatement.

Since the acquisition, we welcomed Perspica to the family to enrich our machine learning capabilities, expanded product coverage into areas like Business IQ, .NET Core, Kubernetes, SAP, and Mainframe, and leveraged new routes to market through Cisco and our partner programs – all of which helped us accelerate the hyper growth in our business. It’s been an amazing journey that has increased our workforce by 50% and made us one of Glassdoor’s Best Places to Work in 2019.

But while there has been a lot of change, there has always been one constant: Commitment to our customers. Together with Cisco, our mission is to empower Agents of Transformation – great leaders who have the ambition and determination to drive positive change for their customers, and in turn, their organizations, teams, and personal careers.

AppDynamics has been empowering Agents of Transformation since the day we were founded, and with Cisco, our ability to inspire change has been multiplied. That’s why today, I couldn’t be more excited to share the next chapter in the AppDynamics and Cisco story: The Central Nervous System – our vision for AIOps.

AIOps is a Mindset

AIOps enables organizations to leverage artificial intelligence and machine learning to derive real-time insights and begin automating tasks to augment technology operations teams.

But much like DevOps, AIOps will require an internal cultural change – a mindset shift for teams to move away from siloed monitoring tools and reject the notion of emergency war rooms.

When teams embrace an AIOps mindset, endless debugging tasks will be a thing of the past. AI-based systems help identify root cause, predict performance, recommend optimizations, and automate fixes in real time. The time once spent on mundane tasks can then be refocused on driving new innovation for the business.

The Central Nervous System

A critical element of embracing the AIOps mindset is to have a platform that can take input from various data sources, analyze it, and automate action in real-time.

Similar to how the central nervous system takes input from all the senses and coordinates action throughout the human body, the Cisco and AppDynamics AIOps strategy is to deliver the “Central Nervous System” for IT operations. This gives customers broader visibility of their complex environments, derives AI-based insights, and automates IT tasks to free up resources to drive new innovation.

Bringing the Central Nervous System to Life

For a system of intelligence to work effectively, it needs to understand how application performance impacts business outcomes and customer experience – and that’s exactly what our Business iQ solution makes possible.

Now, powered by the robust data set generated by our APM and Business iQ offerings, the AppDynamics Cognition Engine brings real-time insights to mission-critical application and business performance by using machine learning to go beyond problem detection to root cause identification. Once root cause is identified and exposed through an API, IT teams can start to develop an automation framework for faster remediation and resource optimization.

And then there’s Cisco – a critical piece needed to power the Central Nervous System. It starts with the breadth and depth of the data. Cisco connects and monitors billions of network devices, lights up data centers for hundreds of thousands of customers, blocks over 20 billion security threats per day, and collects hundreds of trillions of application metrics per year. With a rich automation roadmap in place, bringing this massive data set together for cross-domain correlation with machine learning and AI will deliver insights that no other company can provide.

But that’s not all. Cisco’s diverse partner ecosystem will also help develop innovative offers and scale go-to-market efforts. And it’s all of these elements that will fuel the AIOps journey for our customers.

Empowering Agents of Transformation

Together with Cisco, we’re committed to helping our customers at every stage of their AIOps journey. We want to empower great leaders to drive real business transformation and make them Agents of Transformation for their organization and their industry. And with an AIOps mindset that values prediction over reaction, answers over investigation, and actions over analysis – I know that we will.

How AppDynamics’ Diverse Partner Ecosystem Helps Power the Central Nervous System

Today, we announced the Cisco and AppDynamics AIOps strategy to deliver the “Central Nervous System” for IT operations, which comprises three core pillars:

  • Give customers broad visibility of their complex environments
  • Derive AI-based insights
  • Take action on these insights to optimize IT environments and automate tasks, freeing up resources to drive new innovation

In this post, we’ll dive into the third pillar, Action, and walk through four use cases showing how AppDynamics integrates with strategic partners to act on insights and automate IT tasks: incident response, event correlation, workload optimization, and communication facilitation.

Incident Response

Most incidents are the result of a change, and understanding and resolving incidents is among the most critical parts of operations – especially as we layer on levels of abstraction. The complexity of our systems is beyond what the human mind can track or fully reason about, which makes it critical for IT teams to automate what they can to reduce the likelihood of human error.

To help IT teams get their apps back up and running as quickly as possible when incidents occur, AppDynamics integrates with typical workflows to manage incidents, problems, and changes including ServiceNow ITSM, Cherwell, BMC Remedy, and configuration management systems such as Evolven.

By combining AppDynamics’ granular visibility into applications with incident management capabilities, teams can triage user-impacting events before customers feel the effects.

Event Correlation

Today’s IT teams are inundated with monitoring tools, according to a recent poll conducted at Gartner’s 2018 IT Infrastructure, Operations & Cloud Strategies Conference. Of the more than 200 respondents, 35% said they had over 30 monitoring tools – and this overload only seems to increase over time, with each tool generating its own alerts and making the true root cause of issues harder to find.

Event Management tools were created to help with this challenge by correlating and analyzing alerts from these disparate systems. AppDynamics integrates with the most commonly used Event Management systems, including ServiceNow Event Management and Moogsoft, which import AppDynamics’ topological information (like flow maps and business transactions) to make more informed correlation decisions.

AppDynamics can also create events based on our machine learning baselines and anomaly detection. Paired with health rule violations, these provide more substantive alerts that show user experience degradation, which often leads to customer complaints.
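
The idea behind ML-driven baselining can be illustrated with a rolling-window sketch: flag any point that deviates more than k standard deviations from the recent baseline. This is a deliberate simplification – production baselining accounts for seasonality, trend, and workload patterns – but it shows why baselined alerts are more substantive than static thresholds:

```python
import statistics

def detect_anomalies(series, window=20, k=3.0):
    """Flag points deviating more than k standard deviations from a
    rolling baseline. A minimal sketch of dynamic baselining; real
    products use far more sophisticated seasonal models."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.fmean(baseline)
        std = statistics.pstdev(baseline)
        if std and abs(series[i] - mean) > k * std:
            anomalies.append(i)
    return anomalies

# Latency that oscillates normally between 100 and 104 ms, then spikes.
latency = [100 + (i % 5) for i in range(40)] + [400]
print(detect_anomalies(latency))  # -> [40]
```

Note that the normal oscillation never fires an alert, because the baseline adapts to it; only the genuine spike at the end does.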

Workload Optimization

Under every application, there are various layers of infrastructure, and many of these layers have additional layers constantly being added on top of them.

For example, in most enterprise data centers, we have physical servers with a virtualization layer on top, and, increasingly, a private cloud or orchestrated containers implemented on top of that. These systems allow for easier deployment, scalability, and management – but they also come at a cost: the added complexity makes it difficult to ensure the technologies are delivering the right business outcomes.

To help address this challenge, the AppDynamics platform integrates with and monitors technologies such as Pivotal Cloud Foundry, RedHat OpenShift, and the open source Kubernetes platform. Many of these systems can also take telemetry in the form of metrics and events from AppDynamics to make better decisions on how to scale, when to scale, and what the results are. Additionally, AppDynamics integrates with Turbonomic’s platform to help customers optimize and orchestrate workloads on virtualized servers, private clouds, and public cloud environments.

As organizations continue to invest heavily in the cloud, workload optimization is more critical than ever. According to a Forrester Consulting survey with over 700 respondents, 86% said their organization has a multi-cloud strategy, and almost half of enterprises reported at least $50 million in annual cloud spending. These statistics make one thing clear: multi-cloud or hybrid management is not an option but a requirement.

Facilitate Communication

Many organizations are implementing new ways to move code to production faster through the use of open source systems such as Spinnaker or Jenkins, or more advanced commercial offerings such as Microsoft TFS, Gitlab, CircleCI, or Harness. AppDynamics has integrations with these products to automatically tag and track when code is pushed to systems.

However, as teams accelerate their deployments, they can overlook processes that provide additional checks, such as a handoff for testing and verification. In fact, in the 12th annual State of Agile survey, while the use of continuous deployment increased from 35% in 2017 to 37% in 2018, continuous integration dropped from 61% to 54%.

As a result, these skipped checks can create significant business risk. To avoid it, teams must improve the way they communicate, coordinate, and execute on any issues. Since each organization has different structures with varying roles and expertise, matching the right experts to the incident is often a requirement.

With the Central Nervous System, we integrate with solutions such as PagerDuty, xMatters, and OpsGenie to help facilitate communication and ensure the coordination of subject matter experts or owners. Ownership and accountability are key elements when going through cultural change and implementing one of the most important of the Three Ways of DevOps: feedback and collaboration.

Thank you, Partners!

AppDynamics is proud to work with such an amazing breadth of partners who help us optimize the IT environment, using orchestration and automation systems for intelligent workload placement, cloud cost optimization, incident response, and even security enforcement. And as we continue to innovate on our offerings and solutions for our customers, I can’t wait to see what other partners will join the AppDynamics partner ecosystem.

AppDynamics and Cisco To Host Virtual Event on AIOps and APM


To mark the two-year anniversary of Cisco’s intent to acquire AppDynamics, the worldwide leader in IT, networking, and cybersecurity solutions will join AppDynamics for a one-of-a-kind virtual launch event on January 23, 2019. At AppDynamics Transform: AIOps and the Future of Performance Monitoring, David Wadhwani, CEO of AppDynamics, will share what’s next for the two companies and lead a lively discussion with Cisco executives, Okta Chief Information Officer Mark Settle, and Nancy Gohring, Senior Analyst at 451 Research. At the event, we’ll talk through the challenges leaders face and how they’re preparing for the future of performance monitoring.

Technology Leaders to Weigh In On the Impact of AI and the Future of Performance Monitoring

Today, application infrastructure is increasingly complex. Organizations are building and monitoring public, private, and hybrid cloud infrastructure alongside microservices and third party integrations. And while these developments have made it easier for businesses to scale quickly, they’ve introduced a deluge of data into the IT environment, making it challenging to identify issues and resolve them quickly.

APM solutions like AppDynamics continue to lead the way when it comes to providing real-time business insights to power mission critical business decisions. However, recent research has revealed a potential blind spot for IT teams: A massive 91% of global IT leaders say that monitoring tools only provide data on the performance of their own area of responsibility. For IT teams that want to mitigate risk as a result of performance problems, and business leaders who want to protect their bottom line, this blind spot represents a huge opportunity for improvement.

The Next Chapter in the AppDynamics and Cisco Story

As application environments continue to grow in complexity, so does the need for more comprehensive insight into performance. But technology infrastructure is simply too large and too dynamic for IT operations teams to manage manually. Automation for remediation and optimization is key–and that’s where innovations in artificial intelligence (AI) have the potential to make a huge difference in monitoring activities.

So, what does the future of performance monitoring look like?

Join us at the virtual event on January 23, 2019, to find out. David Wadhwani, alongside Cisco executives, will make an exciting announcement about our next chapter together. During the broadcast, we’ll also feature industry analysts and customers as we engage in a lively conversation about the emerging “AIOps” category, and what impact it will have on the performance monitoring space.

You won’t want to miss this unique virtual event.

Register now for AppDynamics Transform


What Is AIOps? Platforms, Market, Use Cases & The Future of Performance Monitoring

The term “AIOps” stands for “artificial intelligence for IT operations.” Originally coined by Gartner in 2017, the term refers to the way data and information from an IT environment are managed by an IT team–in this case, using AI. This definition from Gartner provides more granular detail related to the concept and explicates the value of an AIOps platform:

“AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight. AIOps platforms enable the concurrent use of multiple data sources, data collection methods, analytical (real-time and deep) technologies, and presentation technologies.”

But why should an enterprise IT team care about AIOps?

To answer that question, let’s dig deeper to understand the story behind AIOps, explore the elements of AIOps platforms, and review three potential use cases.

The Core Elements of An AIOps Platform

Today’s application environments are exploding in complexity. According to the Wall Street Journal, midsize to large companies now use an average of eight different cloud providers for various enterprise applications and services. Compounding this complexity is the sheer volume of data produced by application infrastructure, and the high potential for performance problems each time an update or change is made to that existing infrastructure. While application performance monitoring (APM) solutions provide real-time alerts for performance problems, there’s evidence that IT teams need more support to effectively monitor the increasingly complex landscape.

And that’s where AIOps platforms enter the picture.

Rather than reacting to issues as they arise in the application environment, AIOps platforms allow IT teams to proactively manage performance challenges faster, and in real-time–before they become system-wide problems. That’s because AIOps platforms have the ability to ingest large volumes of data originating from all areas of the application environment, and analyze it using AI to identify areas of remediation and optimization.

AIOps platforms also play a critical role in eliminating the manual component of identifying issues within the IT landscape, a problem that’s compounded by the still siloed nature of the monitoring environment. In fact, recent research from AppDynamics revealed that 91% of global IT leaders said monitoring tools only provide data about how releases impact their own area of responsibility, and not the broader IT environment, or the business. With an AIOps platform, IT doesn’t have to work harder to get smarter about what’s happening within every facet of application infrastructure.

Make no mistake, AIOps platforms have compelling potential. But as of right now, the category itself is emergent and highly fluid. Case in point: Gartner defines AIOps platforms as having several key components; however, those components are broad enough that many tools could potentially fit into the category now or in the future. Here’s how Gartner analyst Pankaj Prasad described AIOps platforms:

“AIOps platform technologies comprise of multiple layers that address data collection, storage, analytical engines and visualization. They enable integration with other applications via application programming interfaces (APIs) allowing for a vendor-agnostic data ingestion capability.”

While Gartner’s elements of an AIOps platform are somewhat broad – as are many others out there in the market – the category will continue to evolve in the years ahead as the technology becomes more rigorous and its use cases more apparent. What’s more, many of these shifts will happen alongside changes in the broader APM space. A more pared-down overview of core AIOps platform components would include:

  • Machine learning
  • Performance baselining
  • Anomaly detection
  • Automated root cause analysis
  • Predictive insights
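
To make “automated root cause analysis” concrete, here is an illustrative sketch that uses topology (akin to a flow map) to localize a fault: when several services alert at once, suspect the downstream dependency they all share. The topology and service names here are invented for the example; real platforms combine topology with timing, metric, and event correlation:

```python
# Hypothetical topology, invented for illustration: caller -> callees.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}

def downstream(service, graph):
    """All services reachable from `service`, including itself."""
    seen, stack = set(), [service]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(graph.get(s, []))
    return seen

def likely_root_cause(alerting_services, graph):
    """Suspect the downstream dependency shared by every alerting service."""
    common = set.intersection(*(downstream(s, graph)
                                for s in alerting_services))
    # Prefer candidates with no further dependencies (leaf services).
    leaves = sorted(s for s in common if not graph.get(s))
    return leaves[0] if leaves else None

# web, api, and cache all alert at once; cache is the dependency they share.
print(likely_root_cause(["web", "api", "cache"], DEPENDS_ON))  # -> cache
```

Even this toy version shows why topology import matters for event correlation: without the dependency map, three simultaneous alerts look like three separate problems.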

What Problems Does An AIOps Strategy Solve?

Growing complexity and the deluge of data within the application environment puts new demands on IT professionals to both synthesize meaning from this influx of information and connect it to broader business objectives. In this highly demanding environment, IT teams need all of the help they can get when it comes to performance optimization.

That’s where AIOps platforms can play a pivotal role in advancing IT organizations and reducing the complexity within the application environment. With AIOps, you can bring all data into a single place, and scale it to understand your environment from every possible angle. This provides teams with the flexibility needed to automate certain tasks when appropriate, and use AI to pinpoint problems faster.

From reducing the cognitive overhead of parsing through volumes of data within the application environment to the potential for self-healing capabilities that help solve major performance problems, AIOps is an exciting space that could help IT professionals in three major areas:

  1. Drive faster and better decision-making. Broadly speaking, AIOps platforms and related AI features have the potential to become smart enough about IT environments in order to surface insights and provide them to leaders for faster and better decision-making.
  2. Decrease MTTR. Outages and performance problems hurt the bottom line of every business, so IT organizations must actively seek out ways to reduce the mean time to resolution (MTTR). With AIOps, it’s possible that IT teams could decrease MTTR and prevent emerging issues, and in doing so, reduce the costs associated with performance problems.
  3. Build a more proactive approach to performance monitoring. According to research from AppDynamics, 74% of IT professionals would like to build a more proactive approach to performance monitoring. With AIOps technology, there’s potential to take it a step further, and respond to issues in real-time. What’s more, by taking in the totality of application environment data, AIOps platforms could connect performance insights to business outcomes. This would finally close the loop on the impact of performance on the business and customers, and it would help organizations take action before small issues become larger problems.

Looking Ahead to the Future of Performance Monitoring and AIOps

Right now, AIOps technology is still relatively new, the terms and concepts relatively fluid, and there’s a great deal of work to be done before anyone can deliver on the promise of AIOps. What is established, however, is that AIOps is already a mindset focused on prediction over reaction, answers over investigation, and actions over analysis. And that’s why IT leaders should keep an eye on the rise of AIOps as a whole, and start preparing for what’s next in monitoring and observability. If history is any indication, there’s enormous potential for transformation in the space in a short period of time.


The Rise of AIOps: How Data, Machine Learning, and AI Will Transform Performance Monitoring

Over the last decade, application environments have exploded in complexity.

Gone are the days of managing monoliths. Today’s IT professionals are tasked with ensuring the performance and reliability of distributed systems across virtualized and multi-cloud environments. And while it may be true that the emergence of this modern application environment has provided the speed and flexibility professionals demand, these numerous services have unleashed a deluge of data on the enterprise IT environment.

Application performance monitoring (APM) solutions have proven essential in helping leaders take back control by providing the real-time insights needed to take action. But as the volume of data in IT ecosystems increases, many professionals are finding it challenging to take a proactive approach to managing it all. While automating tasks has helped teams free up some bandwidth for operations and planning, automation alone is no match for today’s increasingly complex environments. What’s needed is a strategy focused on reducing the burden of mounting IT operations responsibilities and surfacing the insights that matter most so that businesses can take the right action.

So, what are forward-thinking IT professionals doing to stay ahead of the curve?

Many are applying what’s being called an AIOps approach to the challenge of application environment complexity. This approach leverages advances in machine learning and artificial intelligence (AI) to proactively solve problems that arise in the application environment. Even though relatively new, the approach is gaining momentum. And for good reason: Using AI to identify potential challenges within the application environment doesn’t just help IT professionals get ahead of problems — it helps companies avoid revenue-impacting outages that jeopardize the customer experience, the business, and the brand.

In order to fully understand the rise of AIOps and why it has developed the momentum it has, we wanted to dig deeper to uncover the actual challenges faced by IT professionals, and how they’re managing them in an increasingly complex application environment. To accomplish that, AppDynamics undertook a study of 6,000 global IT leaders in Australia, Canada, France, Germany, the United Kingdom, and the United States. Their responses answered three key questions about the shift in the performance space:

(1) What’s the current enterprise approach to managing increasing application environment complexity?

(2) How are global IT leaders taking a proactive approach to identifying problems in the application environment?

(3) How broadly is AI identified as a potential solution to reducing complexity in IT ecosystems?

Let’s see what the research revealed.

The Demand for Proactive Application Performance Monitoring Tools

Today, midsize to large companies use an average of eight different cloud providers for various enterprise applications and services. As a result, IT professionals are managing an ever-increasing set of tasks that have the potential to become disconnected if not managed properly. What’s more, within these highly distributed systems, IT leaders must grapple with the impact of new code being deployed, as well as the virtually infinite potential outcomes associated with doing so. Without a unified view of how all of these elements interact, there’s significant potential for issues to arise that impact performance — and, ultimately — the customer experience.

New research from AppDynamics underscores the cause for concern: 48% of enterprises surveyed say they’re releasing new features or code at least monthly, but their current approach to monitoring only provides a siloed view on the quality and impact of each release. In fact, of those enterprises that release on that cadence, a massive 91% say that monitoring tools only provide data on how each release drives the performance of their own area of responsibility.

Research from AppDynamics indicates performance monitoring remains siloed.

Should these findings raise eyebrows? Absolutely.

That’s because they indicate that for the vast majority of those surveyed, a holistic view of business and customer value is still difficult to achieve. And that puts innovation — as well as modern, best-in-class software development practices like continuous delivery — at serious risk.

But that’s where leveraging data about the application environment using machine learning, as well as AI, can make a massive difference. Instead of merely ingesting data from every dimension of the application environment, these tools can help IT professionals build a more proactive approach to APM.

And, by all accounts, that’s what most global IT leaders want.

According to research findings from AppDynamics, 74% of those surveyed said they want to use monitoring and analytics tools proactively to detect emerging business-impacting issues, optimize user experience, and drive business outcomes like revenue and conversion. Yet 42% of respondents are still using monitoring and analytics tools reactively to find and resolve technical issues. There’s indication, however, that this approach is extremely problematic for businesses. Beyond serving as a pain point for IT professionals in terms of capacity and resource planning, reactive monitoring – in some cases – can cost businesses hundreds of thousands of dollars in lost revenue.

The majority of IT professionals want to use monitoring tools more proactively.

How Reactive Monitoring Hurts Performance, Revenue, and Brand

From e-commerce to banking, booking flights to watching movies on Netflix, applications have proliferated people’s lives. As a result, consumers have high expectations for application performance that businesses must deliver on. If not, they risk jeopardizing brand loyalty and, as our research revealed, their bottom line.

“As the broader technology landscape undergoes its own dramatic change, forcing businesses to double down on their customer focus, managing the performance of applications has never been more critical to the bottom line.” — Jason Bloomberg, The Rebirth of Application Performance Management

IT professionals have long relied on the mean time to repair (MTTR) metric to evaluate the overall health of an application environment. The longer it takes to resolve an issue, the greater the potential for it to turn into a significant business problem, particularly in an increasingly fast-paced digital world. However, in this latest AppDynamics research, we made a startling discovery: most organizations are grappling with a high average MTTR. Respondents reported that it took an average of one business day – seven working hours – to resolve a system-wide issue.

But that wasn’t the most alarming finding.

Our research also revealed that many enterprise IT teams weren’t notified about performance issues via monitoring tools at all. In fact:

  • 58% find out from users calling or emailing their organization’s help desk
  • 55% find out from an executive or non-IT team member at their company who informs IT
  • 38% find out from users posting on social networks

AppDynamics research reveals how performance problems are being discovered in the enterprise.

To fully appreciate the impact of a seven-hour MTTR on a business, AppDynamics asked survey respondents to report the total number of dollars lost during an hour-long outage, and used that figure to extrapolate the typical cost of an average, day-long outage. For the United States and United Kingdom, the cost of an average outage totals $402,542 and $212,254 respectively (the UK figure was converted into US dollars).
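
The extrapolation is straightforward arithmetic: the reported one-hour loss multiplied by the roughly seven-hour average MTTR. The hourly figures below are back-derived from the published totals ($402,542 / 7 = $57,506 for the US; $212,254 / 7 = $30,322 for the UK):

```python
MTTR_HOURS = 7  # average MTTR reported in the survey: one business day

def daylong_outage_cost(hourly_loss, mttr_hours=MTTR_HOURS):
    """Extrapolate a day-long outage cost from an hourly loss figure."""
    return hourly_loss * mttr_hours

print(daylong_outage_cost(57506))  # -> 402542, the reported US total
print(daylong_outage_cost(30322))  # -> 212254, the reported UK total
```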

United States

AppDynamics research revealed that companies in the United States on average lose $402,542 for a single service outage.

United Kingdom

The high cost of a performance outage in the United Kingdom.

It’s important to note that these figures reflect the total cost for a single outage in the enterprise — if a company has more than one, that figure can rise dramatically. In fact, a substantial 97% of global IT leaders surveyed said they’d had performance issues related to business-critical applications in the last six months alone.

Of the 6,000 IT professionals AppDynamics surveyed, 97% said they’d experienced a service outage in the last six months.

In addition to the impact on a company’s bottom line, global IT leaders reported that reactive performance monitoring had created stressful war room situations and damaged their brand. 36% said they had to pull developers and other teams off other work to analyze and fix problems as they presented themselves, and nearly a quarter of respondents said slow root cause analyses drained resources.

The takeaway here is clear: global IT leaders need to build a more proactive approach to APM in order to lower MTTR and protect their bottom line. But in today’s increasingly complex application environment, that’s easier said than done.

Unless, of course, you’re developing an AIOps strategy to manage it.

The Risk of Not Adopting an AIOps Strategy

AppDynamics research showed that the overwhelming majority of IT professionals want a more proactive approach to APM, but one of the main ways of achieving that — through the adoption of an AIOps strategy — isn’t being widely pursued by global IT teams in the near-term.

In fact, the global IT leaders AppDynamics surveyed reported that although they believe AIOps will be critical to their monitoring strategy, only 15% identified it as a top priority for their business in the next two years.

AppDynamics research reveals that the vast majority of IT professionals surveyed don’t have an AIOps strategy in place in the near-term.

What’s more, the capabilities that respondents identified as essential to APM in the next five years are precisely those that AIOps has the potential to help provide. For example:

Intelligent alerting that can be trusted to indicate an emerging issue.
49% of respondents identified this feature as core to their performance monitoring capabilities in the next five years. By ingesting data from any application environment, AIOps platforms and technology can play a pivotal role in not just automating existing IT tasks, but identifying and managing new ones based on potential problems detected in the application environment.

Automated root cause analysis and business impact assessment.
44% of respondents said solving problems quickly and understanding their impact on the business would play a crucial part of their performance management in the years ahead. With the help of AIOps technology, this can be achieved, providing increased agility in the face of potential service disruptions or threats, and without additional drain on resources.

Automated remediation for common issues.
42% of survey respondents said that they needed to build automated remediation into their strategy for performance monitoring. With AIOps, it becomes possible to automate remediation not only for known issues but for unknown ones, too – because the platform doesn’t just ingest data from your application environment, it derives more intelligent insights from it.

Leading The Way With AIOps Strategy and Platforms  

Despite increasingly complex application environments, few of the global IT leaders surveyed are prioritizing the development of an AIOps strategy, which would allow them to implement the platforms and practices that permit proactive identification of issues before they become system-wide problems. Instead, global IT leaders report an average MTTR that hovers at a full business day and has the potential to cost companies hundreds of thousands of dollars in lost revenue with each incident.

What’s more, AppDynamics research findings also make it clear that many global IT leaders are struggling to integrate monitoring activities into the purview of the broader business. This can cause significant delays in MTTR, as noted, as well as make companies vulnerable to service disruptions that can cause irreparable harm to the customer experience, and the enterprise as a whole.

While IT leaders have expressed a desire for a more proactive approach to monitoring, this research indicates that there’s still plenty of work to be done on numerous fronts. But the first step is clear: IT leaders must prioritize the development of an AIOps strategy and related technology. In doing so, they’ll simplify the demands of an increasingly complex application environment and build a stronger connection from IT to the business as a whole.


Editor’s Note: In this piece, the term “global IT leaders” refers to the respondents surveyed for this report. The term “IT professionals” refers to people in the IT or related professions as a whole.