65% of enterprises need more than 3 hours to troubleshoot application problems

Research recently conducted by Enterprise Management Associates (EMA) and AppDynamics showed that IT organizations are spending extensive amounts of time and resources on application support. The research, based on a survey of 302 IT professionals, also suggested that the majority of companies are still trying to manage complex applications using siloed tools and a combination of “all hands on deck” interactive marathons and tribal knowledge.

65% of enterprises need more than 3 hours to troubleshoot application problems

As you can see in the chart below, 65% of enterprises said that it takes them more than 3 hours to determine the root cause of an application-related problem. In fact, it takes more than 6 hours for one-third of enterprises (33%) to isolate an application problem.

[Figure: Time required to determine the root cause of an application-related problem]

The report also suggested that enterprises are using multiple tools to monitor application environments of ever-growing complexity. These siloed tools lack common context, and it becomes very difficult, if not impossible, to connect the dots between the troubleshooting information coming out of these disparate tools.

Let’s consider a scenario where an enterprise support team gets multiple customer calls about the checkout process performing slowly. Their Java profiler points to a JVM performance issue, their database monitoring tools flag problems with certain queries, and their network monitoring tools raise alerts of their own. However, this disjointed data does not help them determine whether these issues are causing the slowdown of the checkout transactions or are merely symptoms. They need common context and the ability to correlate this disparate information with the checkout transaction so they can quickly isolate the problem and take action to resolve the issue, as the sketch below illustrates.
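As a rough illustration of what “common context” buys you, here is a minimal sketch assuming each tool could tag its findings with the business transaction they belong to (the event data, field names, and transaction IDs below are invented for illustration and are not tied to any particular product):

```python
from collections import defaultdict

# Hypothetical events emitted by three siloed tools. In practice each tool
# stores these separately with no shared identifier, which is the core problem.
events = [
    {"source": "jvm_profiler", "txn_id": "checkout-7431", "metric": "gc_pause_ms",   "value": 1800},
    {"source": "db_monitor",   "txn_id": "checkout-7431", "metric": "query_time_ms", "value": 2400},
    {"source": "net_monitor",  "txn_id": "checkout-9912", "metric": "latency_ms",    "value": 35},
]

def correlate_by_transaction(events):
    """Group evidence from every tool under the business transaction it belongs to."""
    by_txn = defaultdict(list)
    for event in events:
        by_txn[event["txn_id"]].append(event)
    return by_txn

for txn, evidence in correlate_by_transaction(events).items():
    print(txn)
    for e in evidence:
        print(f"  {e['source']}: {e['metric']} = {e['value']}")
```

Once every piece of evidence carries the same transaction identifier, the question shifts from “which tool is right?” to “what does all the evidence for this checkout transaction say together?”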

77% of enterprises need more than 5 “people-hours” to resolve application problems

The EMA research also suggested that the total number of “people-hours” necessary to solve a single problem is most commonly between 5 and 7 hours (see the figure below). However, in many cases the process takes much longer: 20% of the time the number is 8-10 hours, and in 8% of cases the process requires more than 20 people-hours.

[Figure: People-hours required to resolve a single application problem]

Given that enterprises are using multiple siloed tools designed for subject matter experts (SMEs) in their respective areas, resolving a problem often requires a long, interactive war-room session where these SMEs can leverage their expertise and tribal knowledge to isolate and resolve the performance issues.

Application slowdown or downtime is very expensive. A critical application failure costs a staggering $500,000 to $1 million per hour, according to recently published research, “DevOps and the Cost of Downtime: Fortune 1000 Best Practice Metrics Quantified,” by IDC Vice President Stephen Elliott. At those rates, the 3-plus hours most enterprises spend isolating a root cause can translate into $1.5 million to $3 million for a single incident.

A unified monitoring solution, like the one recently announced by AppDynamics, can address these challenges observed across the enterprises in the EMA research. AppDynamics Unified Monitoring is an industry-first, application-centric solution that traces and monitors transactions from the end user through the entire application and infrastructure environment, helping teams quickly and proactively solve performance issues and ensure an excellent user experience.
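To give a sense of what transaction-centric tracing means in practice, here is a generic sketch (my own illustration, not the AppDynamics implementation; the header name, service functions, and timings are hypothetical). The idea is that a single transaction identifier travels with the request through every tier, so each tier’s measurements can be attributed back to one business transaction:

```python
import time
import uuid

def handle_checkout(request_headers):
    # Reuse the incoming transaction ID if an upstream tier already set one,
    # otherwise start a new trace at the edge.
    txn_id = request_headers.get("X-Transaction-ID", str(uuid.uuid4()))
    timings = {}

    start = time.perf_counter()
    call_inventory_service(txn_id)  # hypothetical downstream call
    timings["inventory_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    call_payment_service(txn_id)    # hypothetical downstream call
    timings["payment_ms"] = (time.perf_counter() - start) * 1000

    # Every tier logs against the same txn_id, so the timings can later be
    # stitched together into one end-to-end view of the checkout transaction.
    print(txn_id, timings)

def call_inventory_service(txn_id):
    time.sleep(0.05)  # stand-in for an HTTP call that forwards the transaction ID header

def call_payment_service(txn_id):
    time.sleep(0.08)  # stand-in for an HTTP call that forwards the transaction ID header

handle_checkout({})
```

Because every tier reports against the same identifier, a slow checkout can be traced to the tier that actually spent the time, rather than being debated across siloed dashboards.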

Many AppDynamics customers have replaced their siloed tools with this unified monitoring platform to address some of the challenges discussed in this blog. For example, the luxury online fashion retailer Net-A-Porter adopted the AppDynamics Application Intelligence Platform to help reduce application complexity and improve end-to-end visibility into its applications.

According to Hugh Fahy, CIO at the Net-A-Porter Group, “We previously used multiple point-monitoring solutions, which didn’t give us end-to-end visibility. The AppDynamics platform gives us a unified, real-time view of user experience, application performance, and availability. That allows us to optimize the user experience, and it’s hard to imagine our business without it.”

Read the full EMA report on monitoring tools.

The Most Important Lesson I Ever Learned About Solving Performance Problems

I’m an operations guy. I’ve been one for over 15 years. From my time as a Systems Administrator onward, I was always intrigued by application performance and jumped at every opportunity to try to figure out a performance problem. All of that experience has taught me that there is one aspect of troubleshooting that makes the biggest difference in most cases.

My Charts Will Save The Day

Before I jump right in with that single most important lesson, I want to tell the story that set me on the path to learning it. I was sitting at my desk one day when I got called into a P1 issue (also called a Sev 1, meaning customers were impacted by application problems) for an application that had components on some of my servers. This application had many distributed components, like most of the applications at this particular company. I knew I was prepared for this moment, since I had installed OS monitoring that gave me charts on every metric I was interested in, and I had a long history of those charts (daily charts dating back months).

Simply put, I was confident I had the data I needed to solve the problem. So I joined the 20+ other people on the conference call, listened to hear what the problem was and what had already been done, and began digging through my mountains of graphs. Within the first 30 minutes of poring over my never-ending streams of data, I realized that I had no clue where any of the data points should be for each metric at any given time. I had no reference point to distinguish good data points from bad ones. “No problem!” I thought to myself. I have months of this data just waiting for me to look at and determine what’s gone wrong.

Now, I don’t know if you’ve ever tried to manually compare graphs to each other, but I can tell you that comparing 2 charts that represent 2 metrics on 2 different days is pretty easy. Comparing ~50 daily charts against multiple days or weeks of history is a nightmare that consumes a tremendous amount of time. This was the hell I had resigned myself to when I made that fateful statement in my head: “No problem!”

Skip ahead a few hours. I’ve been flipping between multiple workbooks in Excel, trying to visually identify where the charts are different. I’ve been doing this for hours. Click-flip, click-flip, click-flip, click-flip… My eyes are strained and my head is throbbing. I want the pain to end, but I’m a performance geek who doesn’t give up. I’ve looked at so many charts by now that I can no longer remember why I was zeroing in on a particular metric in the first place. I’m starting to think my initial confidence was a bit misguided. I slowly start banging my head on my desk in frustration.

From Hours To Seconds

“What changed?” Isn’t this one of the most commonly asked questions in any troubleshooting scenario? It’s also one of the toughest to answer in a short amount of time. If you want to resolve problems in minutes, you need to know the answer to this question immediately. That leads me to the most important lesson I ever learned about solving performance problems: I need something that will tell me exactly what has changed at any given moment in time.

I need a system that tracks my metrics, automatically baselines their normal behavior, and can tell me when those metrics have deviated from their baselines and by how much. Ideally, I want all of this in the context of a problem that has been identified either by an alert or by an end user calling in a trouble ticket (though I’d rather know about the problem before a customer calls).
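To make the idea concrete, here is a minimal sketch of automatic baselining and deviation ranking (my own illustration, not how AppDynamics or any other product implements it; the metric names, history, and current values are invented):

```python
import statistics

# Hypothetical history: one value per day for each metric over recent weeks.
history = {
    "heap_used_pct":   [55, 58, 54, 57, 56, 55, 59],
    "gc_pause_ms":     [120, 110, 130, 125, 118, 122, 127],
    "db_conns_in_use": [40, 42, 38, 41, 39, 43, 40],
}

# Values observed during the slow transaction we are troubleshooting.
current = {"heap_used_pct": 91, "gc_pause_ms": 900, "db_conns_in_use": 44}

def deviations(history, current):
    """Score each metric by how many standard deviations it sits above its baseline."""
    scored = []
    for metric, values in history.items():
        baseline = statistics.mean(values)
        spread = statistics.stdev(values) or 1.0
        score = (current[metric] - baseline) / spread
        scored.append((metric, round(score, 1)))
    # Most deviated first, mirroring "descending order of most highly deviated".
    return sorted(scored, key=lambda item: item[1], reverse=True)

for metric, score in deviations(history, current):
    print(f"{metric}: {score} standard deviations above baseline")
```

Ranking metrics by how far they sit above their own history is exactly the kind of answer to “What changed?” that no amount of click-flipping through Excel workbooks will give you quickly.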

Thankfully, today this type of system does exist. Within AppDynamics Pro, every metric is automatically baselined and is a candidate for alerting based on deviation from that baseline. By default, all business transactions are classified as slow or very slow based on how much they deviate from their historic baselines, but this is only the tip of the iceberg. The really cool feature is available after you drill down into a business transaction. Take a look at the screen grab below, taken from a single “Product Search” business transaction that was slow. Notice we are in the “Node Problems” area. I’ve asked the software to automatically find any JVM metrics that deviated above their baselines during the time of this slow transaction. The charts on the right side of the screen are the resulting data set, ordered from most highly deviated to least.

[Figure: Node Problems view showing JVM metrics sorted by deviation from baseline during the slow transaction]

Whoa… we just answered the “What changed?” question in 30 seconds instead of spending hours on manual analysis. I wish I’d had this functionality years ago. It would have saved me countless hours and countless forehead bruises. We veterans of the performance wars now have a bigger gun in the battle to restore performance faster. Leave the manual analysis and correlation to the rookies and click here to start your free trial of AppDynamics Pro right now so you can test this out for yourself.