AIOps: A Self-Healing Mentality

When I first watched Minority Report back in 2002, the film’s premise made me optimistic: Crime could be prevented with the help of Precogs, a trio of mutant psychics capable of “previsualizing” crimes and enabling police to stop murderers before they act. What a great utopia!

I quickly realized, however, that this “utopia” was in fact a dystopian nightmare. I left the theater feeling confident that key elements of Minority Report’s bleak future—city-wide placement of iris scanners, for instance—would never come to pass. Fast forward to today, however, and ubiquitous iris-scanning doesn’t seem so far-fetched. Don’t believe me? Simply glance at your smartphone and the device unlocks.

This isn’t dystopian stuff, however. Rather, today’s consumer is enjoying the benefits that machine learning and artificial intelligence provide. From Amazon’s product recommendations to Netflix’s show suggestions to Lyft’s passenger predictions, these services—while not foreseeing crime—greatly enhance the user experience.

The systems that run these next-generation features are vastly complex, ingesting a large corpus of data and continually learning and adapting to help drive different decisions. Similarly, a new enterprise movement is underway to combine machine learning and AI to support IT operations. Gartner calls it “AIOps,” while Forrester favors “Cognitive Operations.”

A Hypothesis-Driven World

Hypothesis-driven analysis is not new to the business world. It impacts the average consumer in many ways, such as when a credit card vendor tweaks its credit-scoring rules to determine who should receive a promotional offer (and you get another packet in your mailbox). Or when the TSA decides to expand or contract its TSA PreCheck program.

Of course, systems with AI/ML are not new to the enterprise. Some parts of the stack, such as intrusion detection, have been using artificial intelligence and machine learning for some time.

But with potential AIOps use cases, we are entering an age where the entire soup-to-nuts process of measuring user sentiment—everything from A/B testing to canary deployments—can be automated. And while there’s a sharp increase in the number of systems that can take action—CI/CD, IaaS, and container orchestrators are particularly well-suited to instruction—the harder part is drawing conclusions, which is where AIOps systems will come into play.

The ability to make dynamic decisions and test multiple hypotheses without administrative intervention is a huge boon to business. In addition to myriad other skills, AIOps platforms could monitor user sentiment in social collaboration tools like Slack, for instance, to determine if some type of action or deeper introspection is required. This action could be something as simple as redeploying with more verbose logging, or tracing for a limited period of time to tune, heal, or even deploy a new version of an application.
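As a toy illustration of that decision logic (not any vendor's API—the scoring scale, window size, threshold, and action names here are all invented for the sketch), the trigger might look like:

```python
# Toy sketch of an AIOps-style trigger: watch sentiment scores from a
# collaboration channel and decide whether deeper introspection is needed.
# Scores run from -1.0 (negative) to 1.0 (positive); the window and
# threshold are arbitrary illustrative choices.

def choose_action(sentiment_scores, window=5, threshold=-0.3):
    """Return an action based on the average of the most recent scores."""
    recent = sentiment_scores[-window:]
    avg = sum(recent) / len(recent)
    if avg < threshold:
        # Something as simple as redeploying with more verbose logging.
        return "redeploy-with-verbose-logging"
    return "no-action"

print(choose_action([0.1, -0.6, -0.5, -0.7, -0.4]))  # negative trend
print(choose_action([0.5, 0.6, 0.2, 0.3, 0.4]))      # healthy chatter
```

A real platform would learn these thresholds rather than hard-code them, but the shape of the decision—observe, aggregate, act—is the same.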

AIOps: Precog, But in a Good Way

AIOps and cognitive operations may sound like two more enterprise software buzzwords to bounce around, but their potential should not be dismissed. According to Google’s Site Reliability Engineering workbook, self-healing and auto-healing infrastructures are critically important to the enterprise. What’s important to remember about AIOps and cognitive operations is that they enable self-healing before a problem occurs.

Of course, this new paradigm is no replacement for good development and operation practices. But more often than not, we take on new projects that may be ill-defined, or find ourselves dropped into the middle of a troubled project (or firestorm). In what I call the “fog of development,” no one person has an unobstructed, 360-degree view of the system.

What if the system could deliver automated insights that you could incorporate into your next software release? Having a systematic record of real-world performance and topology—rather than just tribal knowledge—is a huge plus. Just as the security world uses runtime application self-protection (RASP) platforms to mitigate issues at runtime, engineers should still address the underlying issues in future versions of the application. In some ways, AIOps and cognitive operations have much in common with the CAMS model—the core values of the DevOps movement: culture, automation, measurement, and sharing. Wouldn’t it be nice to automate the healing as well?

Cloud Migration Tips #2: We Should Use the Cloud Because…

Welcome back to my blog series on deploying applications to the cloud.

What’s the point of deploying an application to the cloud versus just hosting it in your own data center? Is it really a good idea? Will it save you money? Will it work better? Will it cause new deployment and management problems? How do you monitor it?


These are all basic questions you should ask yourself before deciding IF your new or existing application will end up in a cloud environment.

The answers might be different for each application supporting your business. Cloud is really a set of architectural patterns available to help you solve business problems using technology. If you’re considering cloud, you’d better have a business problem that you need to solve.

Here are a few business problems that would make me consider a cloud implementation for my application(s):

  • We’re out of space in our data center and most of our applications are used via the internet–should we build another data center or move some applications to public cloud providers?
  • Our new mobile application will need to scale rapidly as it becomes more popular–we have to be able to scale as needed so our customers have a good user experience.
  • We need to accelerate our time to market and make our business more agile–we don’t have time to wait for IT and all of our productivity-sapping processes.

No matter what your business reasons are, you need to come up with quantifiable and measurable success criteria so that you can prove out the benefits (or failure) of your cloud computing initiative. This implies you are already measuring something BEFORE you move to the cloud, so you can compare metrics before and after. Here are some example KPIs that might be applicable:

  • Time to deliver requested environment to developers
  • Number of application impact incidents
  • Infrastructure cost per application
  • Time to scale / cost to scale an application
  • Transaction throughput
  • SLA (yeah, you really should have one of these)
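To make the before-and-after comparison concrete, here is a minimal sketch; the KPI names and numbers are made up purely for illustration:

```python
# Sketch: compare KPIs captured BEFORE and AFTER a cloud migration.
# For cost/time KPIs like these, a negative delta is an improvement.

def kpi_deltas(before, after):
    """Percent change per KPI between the two snapshots."""
    return {k: round((after[k] - before[k]) / before[k] * 100, 1)
            for k in before}

before = {"env_delivery_days": 10.0, "cost_per_app_usd": 4000.0}
after  = {"env_delivery_days": 2.0,  "cost_per_app_usd": 3000.0}
print(kpi_deltas(before, after))
# {'env_delivery_days': -80.0, 'cost_per_app_usd': -25.0}
```

The point isn’t the arithmetic—it’s that you can only compute these deltas if the “before” snapshot actually exists.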

So You’ve Decided To Go For It

You’ve got your business justification nailed down and decided you really do need a cloud-based application. Great! If this is a brand new application you can design it from the ground up and just deploy it, right? No!!! Remember, you need to monitor and manage this application if you stand any chance of providing a good user experience over the long haul.

“My cloud provider has all the monitoring and management tools I need.” – Wrong! Your cloud provider has basic monitoring tools that show you infrastructure metrics (CPU usage, memory usage, I/O usage, etc.). These monitoring tools don’t tell you anything about your application. Here’s what you need to know about your cloud application (at a minimum):

  • Which application nodes are in use at any given time. (Dynamic scaling, provisioning, and de-provisioning can change this picture at any moment.)
  • Application calls to external services, with response times and error rates. (External service calls are performance killers for cloud applications and drive up cost, as most providers charge for network traffic leaving their cloud.)
  • The response time and errors of all of your users’ business transactions. (Applies to any application architecture, but cloud deployments can experience greater variance due to factors outside the application owner’s control: network congestion, regional provider issues, etc.)
  • When a problem occurs, the full application call stack for code analysis. (Applies to any application architecture.)
  • Host-level KPIs correlated with all of the application activity. (Really important in the cloud due to host virtualization, shared resources, and the multiple sizing options when you select a host to deploy. Select the wrong size by mistake and you’ve just limited your maximum application performance.)
  • Historic baselines for everything, so you know what normal behavior looks like. (Critical to identifying problems regardless of architecture.)
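The last item—a historic baseline—can be sketched very simply. This is an illustrative 3-sigma rule over a trailing window, not any particular product’s baselining algorithm:

```python
import statistics

# Sketch: keep a trailing window of response times and flag a new value
# as anomalous if it exceeds mean + 3 standard deviations of the window.
# Window size and the 3-sigma cutoff are illustrative choices.

def is_anomalous(history, value, sigmas=3.0):
    """True if `value` is far outside the window's normal behavior."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return value > mean + sigmas * stdev

history_ms = [100, 105, 98, 102, 101, 99, 103, 100]
print(is_anomalous(history_ms, 104))  # within normal variation
print(is_anomalous(history_ms, 450))  # clear outlier
```

Commercial baselining accounts for seasonality (time of day, day of week), but even this crude version beats eyeballing raw metrics.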

If you’re deploying a new application, you should have a really good idea of any external application dependencies (like calling a payment gateway to process credit card orders). If you are moving an existing application, there is more work that needs to be done up front. In particular, you need to really understand your existing application dependencies. Is there a service or backend database that your application relies upon that you’re not planning to move with the application? If so, you can really screw up the entire cloud implementation by making a bunch of calls to a component that lives outside of your chosen cloud environment.


Modern applications have many external dependencies. You absolutely MUST know what they are before moving to the cloud.
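The boundary check itself is trivial once you have the inventory; the hard part is building that inventory. A sketch, with a dependency list invented for illustration (a real one would come from discovery tooling):

```python
# Sketch: flag dependencies the application calls that are NOT migrating
# with it, i.e. every call to them will cross the cloud boundary.

def boundary_crossers(dependencies, moving_to_cloud):
    """Return dependencies that will remain outside the cloud environment."""
    return sorted(dep for dep in dependencies if dep not in moving_to_cloud)

dependencies = {"orders-db", "payment-gateway", "inventory-service"}
moving_to_cloud = {"orders-db", "inventory-service"}
print(boundary_crossers(dependencies, moving_to_cloud))
# ['payment-gateway'] -- every call to it leaves the cloud environment
```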

If you’re moving an existing application, you’d better deploy a tool that can dynamically detect and show application flow maps. I’m not talking about those agentless tools that scan your hosts every day looking for network connections (those usually miss all of the short-lived service calls). I mean a solution that will give you the entire picture, regardless of whether connections are persistent or transient.

Since you need to monitor your existing environment anyway, you might as well collect performance data and save it so that you have a good point of comparison for your “before and after” application environment. (We’ll discuss this more in a future blog post.)

There are a ton of considerations when you choose to implement your application using cloud computing architecture patterns. In my next post I’ll go into more detail around the planning phase. Having all your ducks in a row before you begin the migration is critical to success.

Top 5 Gotchas for Monitoring Applications in the Cloud

Many IT organizations are migrating some of their applications to the cloud to become more agile, alleviate operational complexity, and spend less time managing infrastructure and servers. If yours is among them, the next question you may ask yourself is, “How will we monitor these applications, and where should we even begin with so many monitoring tools on the market?”

I’m glad you asked. Here is a list of gotchas you should look out for. If you have your own list, feel free to comment below and share with us.

1. Lack of End User or Business Context – With apps running in the cloud, infrastructure metrics tell you very little about your end-user experience or the performance of the applications and business transactions running there. End users experience business transactions, so make sure your monitoring gives you this visibility.

2. Node Churn – How well does your application monitoring solution deal with node churn – the provisioning and de-provisioning of servers and application nodes? The monitoring solution has to work in dynamic, virtual, and elastic environments where change is constant; otherwise you’ll end up with blind spots in your application monitoring. Many of today’s monitoring solutions are unable to adapt to dynamic cloud infrastructure changes, requiring manual intervention by operations so new nodes can be registered and monitored.
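The reconciliation a churn-aware monitoring solution performs continuously can be sketched as a set comparison. The node names here are hypothetical:

```python
# Sketch: reconcile the nodes monitoring knows about with the nodes the
# orchestrator currently reports, so churned nodes don't become blind
# spots or stale registrations.

def reconcile(monitored, live):
    """Return (new nodes to register, stale registrations to retire)."""
    to_register = sorted(live - monitored)
    to_retire = sorted(monitored - live)
    return to_register, to_retire

monitored = {"app-01", "app-02", "app-03"}
live      = {"app-02", "app-03", "app-04", "app-05"}  # 01 gone; 04, 05 new
print(reconcile(monitored, live))
# (['app-04', 'app-05'], ['app-01'])
```

A monitoring product does this automatically via agent registration; the point is that the loop must run without a human in it.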

3. Agent-less is Tough in the Cloud – You may not have any major issues with installing a packet sniffer or network-monitoring appliance in your own private cloud or data center, but you won’t be able to place these kinds of devices in PaaS or IaaS environments to monitor your application performance. Monitoring agents, in comparison, can easily be embedded or piggy-backed as part of an application deployment in the cloud. Agent-less may not be an option when trying to monitor many cloud applications.

4. High Network Bandwidth Costs – Cloud providers typically charge per gigabyte of inbound and outbound traffic. If your cloud application has 100 nodes and you’re collecting megabytes of performance data every minute, all of that data has to be communicated outside of the cloud to your monitoring solution’s management server, which can be on-premises or in another cloud. Monitoring what’s relevant in your application, versus monitoring everything, means you’ll avoid exorbitant bandwidth costs for transferring monitoring data.
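A back-of-the-envelope calculation shows why this adds up. The $0.09/GB rate is a typical published egress price, but check your own provider’s pricing; the node count and per-minute volume follow the example above:

```python
# Rough monthly egress cost of shipping monitoring data out of the cloud.

def monthly_egress_cost(nodes, mb_per_node_per_min, usd_per_gb=0.09):
    minutes = 60 * 24 * 30                       # ~one month
    gb = nodes * mb_per_node_per_min * minutes / 1024
    return round(gb * usd_per_gb, 2)

# 100 nodes, 2 MB of monitoring data per node per minute:
print(monthly_egress_cost(100, 2))
```

At those assumed numbers the monitoring traffic alone runs to hundreds of dollars a month, before any application traffic.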

5. Inflexible Licensing – If you want to monitor specific nodes, will your application monitoring vendor lock each license down to a physical server, hostname, or IP, or can your licenses float to monitor any server or node? This can be a severe limitation, as your agents are then locked down to a specific node indefinitely. Even if you weren’t monitoring applications running in the cloud, it’s still a nuisance to have a monitoring agent handcuffed to a physical server without the licensing flexibility to move agents around to monitor different servers or nodes. As stated above, with node churn occurring frequently in cloud environments, you need a monitoring solution that is as flexible as possible, so you can deploy agents anywhere, at any time.

The good news is, monitoring application performance in the cloud is hardly a new concept for AppDynamics as we address all of these requirements with flying colors. If you’re interested in a robust application monitoring solution in the cloud, you can take a free 30-day trial of AppDynamics Pro.


Storm Clouds in 2012? – Results of AppDynamics APM Survey

We recently finished conducting our annual Application Performance Management survey. Over 250 IT professionals participated, and they shared insights such as:

  • Many Ops and Dev teams are anticipating growth in their applications by 20% or more
  • Over 50% are planning to move to the cloud, and are architecting brand-new applications to be cloud-ready
  • Most teams are using log files to monitor application performance, rather than an Application Performance Management (APM) tool

We’ll release the full report soon, but here’s an infographic that summarizes some of the main findings:

AppDynamics Infographic - Storm Clouds in 2012


What I found personally surprising was the heavy reliance on log files. When you’re troubleshooting distributed architectures, time is of the essence–and there’s no way to cut your MTTR down when you’re relying on log files to identify root cause.

In fact, there’s only one guy who ever made using a log file look cool:

And I think we can all agree that’s a pretty unique use case.

We’ll have the full survey results available soon.



Cloud Migration Won’t Happen Overnight

There is a massive difference between migrating some code to the cloud and migrating an entire application to the cloud. Yes, the heart of any application is indeed its codebase, but code can’t execute without data, and therein lies the biggest challenge of any cloud migration. “No problem,” you say. “We can just move our database to the cloud and our code will be able to access it.” Sounds great, except that most storage services in the cloud tend to run on cheap disk, which is often virtualized and shared across several tenants. It’s no secret that databases store and access data from disk; the problem these days is that disks have gotten bigger and cheaper, but they haven’t gotten much faster. The assumption that cloud will offer your application a better quality of service (QoS) at a cheaper price is therefore not always true once you include the application tiers that manage data. Your code might run faster with cheaper, elastic computing power, but it can only go as fast as the data it retrieves and processes.

A Decade of APM Acquisitions and Evolution

“APM Game Changer” was the headline yesterday with the news that Compuware acquired Dynatrace Software for $256 million. With the APM market worth an estimated $6 billion, it’s not really a surprise to see such a transaction (no pun intended) within the APM market. In fact, over the past ten years, there have been several large transactions made by established IT vendors trying to claim a piece of the APM market.

If we take a look at several of those APM acquisitions (specifically, deep-dive APM) and present the facts, you can draw some interesting conclusions.

 Acquirer    Company     Product           Founded  Acquired
 Compuware   Dynatrace   Dynatrace         2005     2011
 Oracle      ClearApp    QuickVision       2002     2008
 BMC         Identify    AppSight          1996     2006
 CA          Wily Tech   Introscope        1998     2006
 Compuware   DevStream   Vantage Analyzer  2002     2004
 IBM         Cyanea      ITCam             2001     2004
 OpNet       Altaworks   Panorama          1999     2004
 Veritas     Precise     I3                1990     2003
 HP/Mercury  Performant  Diagnostics       2000     2003
 Quest       Sitraka     Foglight          1989     2002