Achieving Rapid Time-to-Value with AppDynamics

At AppDynamics, we are firm believers in demonstrating the value of our platform as quickly as possible. Many of our customers are able to address critical performance issues within minutes of getting their application instrumented. Below is an example of how we use AppDynamics with a fictional customer, AD-Betting, to analyze and troubleshoot the company’s business environment soon after AppD is up and running.

Getting Started

After allowing AppDynamics to observe the AD-Betting environment for a few hours, we could see the company’s architecture, how its components were interacting with one another, and how users were interacting with the application. Here is AD-Betting’s architecture as represented by a close-up view of our Application Flow Map:

First we examined the last hour of AD-Betting’s performance data:

Even without detailed knowledge of the application, we knew for certain that the average response time increase at 3:00 p.m. wasn’t related to a scalability issue, as the load hadn’t increased at the same time.

To investigate the performance issue further, we reduced the time range to cover only the problematic peak. The resulting flow map showed that the HTTP call between the aggregator tier and the external service (ad-soccer-games.com) took four seconds on average:

With a single click on the aggregator tier, it was easy to identify the impacted business transaction (/quotes):

Drilling down into the View Business Transaction Dashboard, we could see how “/quotes” was impacted by this issue. The scorecard for this transaction (below) uncovered a direct correlation between the spike in application response time and an increase in “Slow” and “Very Slow” transactions:
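Conceptually, a scorecard like this bands each transaction by comparing its response time against what is normal for that transaction. AppDynamics derives its “Slow” and “Very Slow” bands from the baseline it learns automatically; the sketch below uses made-up fixed multiples of a baseline purely to illustrate the idea:

```java
import java.util.List;

public class TransactionScorecard {
    enum Band { NORMAL, SLOW, VERY_SLOW }

    // Illustrative thresholds only: real APM tools derive these bands
    // from a learned baseline, not hard-coded multiples.
    static Band classify(double responseMillis, double baselineMillis) {
        if (responseMillis > baselineMillis * 4) return Band.VERY_SLOW;
        if (responseMillis > baselineMillis * 2) return Band.SLOW;
        return Band.NORMAL;
    }

    public static void main(String[] args) {
        double baseline = 300; // hypothetical learned average for /quotes, in ms
        List<Double> samples = List.of(280.0, 310.0, 700.0, 4000.0, 290.0);
        long slow = samples.stream()
                .filter(s -> classify(s, baseline) == Band.SLOW).count();
        long verySlow = samples.stream()
                .filter(s -> classify(s, baseline) == Band.VERY_SLOW).count();
        System.out.println("slow=" + slow + " verySlow=" + verySlow); // slow=1 verySlow=1
    }
}
```

A spike like the one at 3:00 p.m. pushes many samples into the upper bands at once, which is exactly the correlation the scorecard makes visible.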

To find the root cause, we switched to the Transaction Snapshots view and filtered for slow and very slow snapshots, which AppDynamics had collected automatically. We then picked one snapshot as an example and looked at the Call Graph, which showed that the call to the method aggregateWithoutCaching at line 217 took five seconds. The connected HTTP exit call terminated with SocketTimeoutException: Read timed out. We verified this against other snapshots from the same time window and, voila, AppDynamics had uncovered the root cause of the performance issue:
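The failure mode here is a client-side read timeout: the aggregator connects and sends its request, then blocks waiting for a response that never arrives before the socket’s read timeout fires. A minimal, self-contained reproduction of that behavior (the request path and timeout values are invented for illustration; this is not AD-Betting’s actual code):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    // Returns true if the read timed out -- the same failure mode the
    // Call Graph exposed on the aggregator's HTTP exit call.
    public static boolean callSlowBackend(int port, int timeoutMillis) throws IOException {
        try (Socket socket = new Socket("localhost", port)) {
            socket.setSoTimeout(timeoutMillis); // client-side read timeout
            socket.getOutputStream().write("GET /odds HTTP/1.0\r\n\r\n".getBytes());
            socket.getInputStream().read(); // blocks until data arrives or the timeout fires
            return false;
        } catch (SocketTimeoutException e) {
            return true; // JDK message: "Read timed out"
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an unresponsive external service: the listener
        // never accepts or replies, so every read on the client stalls.
        try (ServerSocket server = new ServerSocket(0)) {
            boolean timedOut = callSlowBackend(server.getLocalPort(), 200);
            System.out.println(timedOut ? "SocketTimeoutException: Read timed out"
                                        : "response received");
        }
    }
}
```

In AD-Betting’s case the timeout sat inside aggregateWithoutCaching, so every uncached aggregation paid the full wait before failing.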

To better understand the issue’s impact, we used AppDynamics’ End User Monitoring to analyze AD-Betting’s most important page requests, sorted by end-user response time. We immediately spotted the slow end-user experience for FIFA 2018 – All Games, an extremely critical page for the company’s business, particularly with the World Cup underway. This was one issue we needed to analyze further.

Looking at the waterfall diagram of all views for this particular page, we discovered that most of the transaction time was spent on the backend, not on the network or browser side:

We then picked a single browser snapshot for closer examination. Since AppDynamics correlates the frontend (browser) and backend (Java) automatically, we got an associated backend snapshot as well. In this case, we examined the Call Graph and again found the same performance-impacting issue. As you can see, the problem was affecting one of the company’s most important pages:

We hope this simple walk-through of AppDynamics’ powerful monitoring and diagnostic capabilities helps you analyze and troubleshoot your own business-critical environments. Take the AppDynamics Guided Tour to learn more!

Why Alerts Suck and Monitoring Solutions Need to Become Smarter

I have yet to meet anyone in Dev or Ops who likes alerts. I’ve also yet to meet anyone who was fast enough to acknowledge an alert in time to prevent an application from slowing down or crashing. In the real world, alerts just don’t work: nobody has the time or patience for them anymore, and no one trusts them. The most effective alert today is an angry end-user phone call, because Dev and Ops physically hear and feel the pain of someone suffering 🙂

Why? Because there is little or no intelligence in how a monitoring solution determines what is normal or abnormal for application performance. Today, monitoring solutions are only as good as the users who configure them, which is bad news: humans make mistakes, configuration takes time, and time is something many of us have little of.

It’s therefore no surprise that behavioral learning and analytics are becoming key requirements for modern application performance monitoring (APM) solutions. In fact, Will Capelli of Gartner recently published a report on IT operational analytics and pattern-based strategies in the data center. The report covered the role of Complex Event Processing (CEP), behavior learning engines (BLEs), and analytics as means for monitoring solutions to deliver better intelligence and higher-quality information to Dev and Ops. Rather than just collecting, storing, and reporting data, monitoring solutions must now learn from and make sense of the data they collect, enabling them to become smarter and deliver better intelligence back to their users.

Change is constant for applications and infrastructure thanks to agile release cycles, so monitoring solutions must also change to adapt and stay relevant. For example, if the response time of a business transaction is 2.5 seconds one week and drops to 200 ms the week after because of a development fix, then 200 ms should become the new performance baseline for that transaction; otherwise the monitoring solution won’t learn the improvement or alert on any subsequent regression. When the end-user experience of a transaction goes from 2.5 seconds to 200 ms, user expectations change instantly, and users become accustomed to an instant response. Monitoring solutions have to keep up with those expectations, or IT will become blind to the one thing that impacts customer loyalty and experience the most.
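The baselining behavior described above can be sketched with an exponentially weighted moving average: each new sample pulls the baseline toward current reality, so last week’s 2.5-second norm fades away and a regression to 800 ms stands out even though it would have looked healthy against the old baseline. The weighting, tolerance factor, and class below are illustrative assumptions, not AppDynamics’ actual algorithm:

```java
public class AdaptiveBaseline {
    private double baselineMillis;
    private final double alpha; // weight given to each new observation

    public AdaptiveBaseline(double initialMillis, double alpha) {
        this.baselineMillis = initialMillis;
        this.alpha = alpha;
    }

    // Fold a new response-time sample into the baseline
    // (exponentially weighted moving average).
    public void observe(double sampleMillis) {
        baselineMillis = alpha * sampleMillis + (1 - alpha) * baselineMillis;
    }

    public double baseline() { return baselineMillis; }

    // A sample is anomalous if it exceeds the learned baseline
    // by a tolerance factor.
    public boolean isAnomalous(double sampleMillis, double toleranceFactor) {
        return sampleMillis > baselineMillis * toleranceFactor;
    }

    public static void main(String[] args) {
        AdaptiveBaseline b = new AdaptiveBaseline(2500, 0.3); // starts at the old 2.5 s norm
        // After the fix ships, a stream of ~200 ms samples pulls the baseline down.
        for (int i = 0; i < 50; i++) b.observe(200);
        System.out.printf("baseline ~ %.0f ms%n", b.baseline());
        // A regression to 800 ms now trips the check, even though
        // it still beats the old 2.5 s baseline.
        System.out.println(b.isAnomalous(800, 2.0)); // true
    }
}
```

A static threshold fixed at 2.5 seconds would never fire for that 800 ms regression; an adapting baseline catches it immediately.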