The End of my Affair with Apdex

A decade ago, when I first learned of Apdex, it was thanks to a wonderful technology partner, Coradiant. At the time, I was running IT operations and web operations, and brought Coradiant into the fold. Coradiant was ahead of its time, providing end-user experience monitoring capabilities via packet analysis. The network-based approach was effective in a day when the web was less rich. Coradiant was one of the first companies to embed Apdex in its products.

As a user of APM tools, I was looking for the ultimate KPI, and the concept of Apdex resonated with me and my senior management. A single magical number gave us an idea of how well development, QA, and operations were doing in terms of user experience and performance. Had I found the metric to rule all metrics? I thought I had, and I was a fan of Apdex for many years leading up to 2012, when I started to dig into the true calculations behind this magical number.

As my colleague Jim Hirschauer pointed out in a 2013 blog post, the Apdex index is calculated by feeding the counts of satisfied and tolerating requests into a formula. The definition of a user being “satisfied” or “tolerating” involves a lot more than just performance, but the applied use cases for Apdex unfortunately focus on performance only. Performance is still a critical criterion, but the definition of satisfied or tolerating is situational.

I’m currently writing this from 28,000 feet above northern Florida, over barely usable in-flight internet that makes me wish I had a 56k modem. I am tolerating the latency and bandwidth, but not the $32 I paid for this horrible experience; hey, at least Twitter and email work. I self-classify as an “un-tolerating” user, but I am happy with some connectivity. People who know me will tell you I have a bandwidth and network problem, hence my threshold for a tolerable network connection is abnormal. My Apdex score would be far different from the average user’s due to my personal perspective, just as a business user’s would differ from a consumer’s based on the specific situation in which they use an application. Other criteria that affect satisfaction include the type of device in use and that device’s connection type.

The thing that is missing from Apdex is the notion of a service level. There are two ways to manage service level agreements. First, a service level may be calculated, as we do at AppDynamics with our baselines. Second, it may be a static threshold that the customer expects; we support this use case in our analytics product. Together, these two approaches cover the right ways to measure and score performance.
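As a rough sketch of the distinction (illustrative only — the numbers are invented and this is not AppDynamics’ implementation), the two SLA styles might look like this: a bound calculated from historical behavior versus a bound the customer specifies up front.

```python
# Illustrative sketch of the two SLA styles: a dynamic, baseline-derived
# bound computed from history, and a static, customer-specified bound.
from statistics import mean, stdev

def dynamic_sla(history_ms, n_sigmas=2.0):
    """Derive an SLA bound from historical response times (mean + N sigmas)."""
    return mean(history_ms) + n_sigmas * stdev(history_ms)

def in_violation(response_ms, sla_ms):
    return response_ms > sla_ms

history = [120, 130, 115, 140, 125, 135, 128, 122]  # invented samples, in ms
calculated = dynamic_sla(history)  # baseline-derived bound (~143 ms here)
static = 500.0                     # customer-specified bound, in ms

# A 450 ms request violates its own baseline but not the static SLA.
print(in_violation(450, calculated))  # True
print(in_violation(450, static))      # False
```

The same request can be fine against one bound and a violation against the other, which is why supporting both matters.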

This is AppDynamics’ Transaction Analytics breakdown for users who had errors or a poor user experience over the last week, along with their SLA class:

Simplistic SLAs are in the core APM product. Here is a view of requests that fell outside the calculated baseline, indicating which were in SLA violation.

Combining an SLA with Apdex would result in a genuinely meaningful number. Unfortunately, I cannot take credit for this idea. Alain Cohen, one of the brightest minds in performance analysis, was the co-founder and CTO (almost co-CEO) of OPNET. Alain discussed with me his ideas around a new performance index concept called OpDex, which fixes many of Apdex’s flaws by applying an SLA. Unfortunately, Alain is no longer solving performance problems for customers; he’s decided to take his skills and talents elsewhere after a nice payout.

Alain shared his OpDex plan with me in 2011; thankfully, all of the details are outlined in this patent, which was granted in 2013. OPNET’s great run of innovation has ended, and Riverbed has failed to pick up where OPNET left off, but at least the patents remain as a record of these good ideas and concepts.

The other issue with Apdex is that individual users are ignored by the formula. CoScale outlined this issue in a detailed blog post, explaining that histograms are a far better way to analyze a variant population. This is no different from looking at performance metrics coming from the infrastructure layer, where histograms and heat charts tend to provide much better visual analysis.
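A toy example of that point, with invented latency samples: a bimodal population — a fast majority and a badly slow tail — is obvious in a histogram, while any single index number averages it away.

```python
# A histogram exposes a bimodal latency population that a single score hides.
from collections import Counter

# Invented samples: 90 fast requests and a 10-request slow tail (in ms).
latencies_ms = [90] * 70 + [110] * 20 + [4000] * 10

def histogram(samples, bucket_ms=500):
    """Bucket samples into fixed-width bins and count each bin."""
    counts = Counter((s // bucket_ms) * bucket_ms for s in samples)
    return dict(sorted(counts.items()))

print(histogram(latencies_ms))  # {0: 90, 4000: 10}
```

The two clusters are unmistakable in the bucketed view; a mean or an index would report a single middling number and hide the unhappy 10%.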

AppDynamics employs automated baselines for every metric collected and measures deviations out of the box. We also support static SLA thresholds as needed. Visually, AppDynamics offers many options, including histograms, percentiles, and an advanced analytics platform for whatever use cases our users come up with. We believe these approaches address the downsides of relying heavily on Apdex in a product.

Apdex is Fatally Flawed

I’ve been looking into a lot of different statistical methods and algorithms lately, and one particularly interesting model is Apdex. If you haven’t heard of it, Apdex has been adopted by many companies that sell monitoring software. The latest Apdex specification (v1.1) was released on January 22, 2007.

Here is the purpose of Apdex as quoted from the specification document referenced above: “Apdex is a numerical measure of user satisfaction with the performance of enterprise applications, intended to reflect the effectiveness of IT investments in contributing to business objectives.” While this is a noble goal, Apdex can lead you to believe your application is working fine when users are really frustrated. If Apdex is all you have to analyze your application performance, then it is better than nothing, but I’m going to show you its drawbacks and a better way to manage your application performance.

The Formula

The core fatal flaw in the Apdex specification is the formula used to derive your Apdex index. Apdex is an index from 0-1 derived using the formula shown below.

Apdex_T = (Satisfied count + Tolerating count / 2) / Total samples

At the heart of Apdex is a static threshold (defined as T). The T threshold is set as a measure of the ideal response time for the application, and it applies to the entire application. If you’ve been around application performance for a while, you should immediately realize that different application functions perform at very different levels. For example, my “Search for Flights” function response time will be much greater than my “View Cart” response time. These two distinctly different functionalities should not be subjected to the same base threshold (T) for analysis and alerting purposes.
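To make that concrete, here is a hypothetical illustration (the function names and timings are invented): a single global T, tuned for the slowest function, happily classifies a badly degraded fast function as "satisfied".

```python
# A global T chosen for the slow function masks degradation in the fast one.
T_GLOBAL = 4.0  # seconds, chosen with "Search for Flights" in mind

typical = {"search_flights": 3.5, "view_cart": 0.4}   # normal response times (s)
observed = {"search_flights": 3.8, "view_cart": 3.9}  # after a bad release

for fn, rt in observed.items():
    satisfied = rt <= T_GLOBAL        # Apdex's verdict under the global T
    degraded = rt > 2 * typical[fn]   # verdict relative to the function's own norm
    print(f"{fn}: satisfied={satisfied}, degraded={degraded}")
```

“View Cart” is running nearly ten times slower than normal, yet under the global T it still counts as a satisfied sample.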

Another thing that application performance veterans should pick up on here is that static thresholds stink. You either set them too high or too low and have to keep adjusting them to a level you are comfortable with. Static thresholds are notorious for causing alert storms (set too low) or for missing problems (set too high). This manual-adjustment philosophy leads to a lack of consistency and ultimately makes historical Apdex charts useless if there has been any manual modification of “T”.

So let’s take a look at the Apdex formula now that we understand “T” and its multiple drawbacks:

  • Satisfied count = the number of transactions with a response time between 0 and T
  • Tolerating count = the number of transactions with a response time between T and F (F is defined next)
  • F = 4 times T (if T=3 seconds then F=12 seconds)
  • Frustrated count (not shown in the formula but represented in Total samples) = the number of transactions with a response time greater than F
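Transcribing the definitions above directly into code makes the formula easy to check (the formula itself comes from the spec; the sample response times are invented):

```python
def apdex(response_times, t):
    """Apdex_T = (satisfied + tolerating / 2) / total, with F = 4 * T."""
    f = 4 * t
    satisfied = sum(1 for rt in response_times if rt <= t)
    tolerating = sum(1 for rt in response_times if t < rt <= f)
    total = len(response_times)
    return (satisfied + tolerating / 2) / total

# With T = 1s: two satisfied, one tolerating, one frustrated sample.
print(apdex([0.5, 0.9, 2.0, 9.0], t=1.0))  # (2 + 0.5) / 4 = 0.625
```

Note that frustrated samples never appear in the numerator; they only drag the score down through the total in the denominator.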

When my colleague Ian Withrow looked at the formula above he had some interesting thoughts on Apdex: “When was the last time you ‘tolerated’ a site loading 4 times slower than you expected it to? If T=1 I’ll tolerate 4 seconds but if T=3 no way am I waiting 12 seconds on a fast connection… which brings me to another point. The notion of a universal threshold for satisfied/tolerating is bunk even for the same page (let alone different pages). My tolerance for how a website loads at work on our fast connection is paper thin. It better be 1-2 seconds or I’m reloading/going elsewhere. On the train to work where I know signal is spotty on my iPhone I’ll load something and then wait for a few seconds.”

Getting back to the Apdex formula itself: assuming every transaction completes in less than the static threshold (T), the Apdex will have a value of 1.00, which is perfect. But the real world is never perfect, so let’s explore a more plausible scenario.

An Example

You run an e-Commerce website. Your customers perform 15 clicks while shopping for every 1 click of the checkout functionality. During your last agile code release, the response time of the checkout function went from 4 seconds to over 40 seconds due to an issue with your production configuration. Users are abandoning your website and shopping elsewhere since their checkouts are lost in oblivion. What does Apdex say about this scenario? Let’s see…

T = 4 seconds
Total samples = 1500
Satisfied count = 1300 (transactions under 4 seconds)
Tolerating count = 100 (transactions between 4 and 16 seconds, since F = 4T = 16 seconds)
Frustrated count = 100 (the checkout transactions now taking over 40 seconds)

Apdex = (1300 + 100/2) / 1500 = 0.90
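Plugging those counts straight into the formula confirms the score:

```python
# The e-Commerce scenario above: every checkout is broken, yet Apdex = 0.90.
satisfied, tolerating, total = 1300, 100, 1500
score = (satisfied + tolerating / 2) / total
print(round(score, 2))  # 0.9
```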

The Apdex rating scale from the specification:

Excellent: 0.94 – 1.00
Good: 0.85 – 0.93
Fair: 0.70 – 0.84
Poor: 0.50 – 0.69
Unacceptable: 0.00 – 0.49

According to the Apdex specification, anything from 0.85 to 0.93 is considered “Good”. We know there is a horrible issue with checkout that is crushing our revenue stream, but Apdex doesn’t care. Apdex is not a good way to tell whether your application is serving your business properly. Instead, it is a high-level reporting mechanism that shows overall application trends via an index. The usefulness of this index is questionable, given that the score can be manipulated simply by changing the T threshold. Apdex, like most other forms of analytics, must be understood for what it is and applied to the appropriate problem.

A Better Way – Dynamic Baselines and Analytics

In my experience (and I have a lot of enterprise production monitoring experience), static thresholds and high-level overviews of application health have limited value when you want your application to be fast and stable regardless of the functionality being used. The highest-value methodology I have seen in a production environment is automatic baselining, with alerts based upon deviation from normal behavior. With this method you never have to set a static threshold unless you have a really good reason to (like ensuring you don’t breach a static SLA). The monitoring system automatically figures out the normal response time, compares each transaction’s actual response time to the baseline, and classifies each transaction as normal, slow, or very slow. Alerts are based upon how far each transaction has deviated from normal behavior, rather than an index value built on a rigid four-times-the-global-static-threshold rule as with Apdex.
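A minimal sketch of deviation-based classification — assumed logic for illustration, not AppDynamics’ actual baselining algorithm — shows the idea: each transaction is judged against its own history rather than a global T.

```python
# Classify a transaction by how far it deviates from its own baseline.
# The sigma cutoffs and sample history are invented for illustration.
from statistics import mean, stdev

def classify(response_ms, history_ms):
    baseline, sd = mean(history_ms), stdev(history_ms)
    if response_ms <= baseline + 2 * sd:
        return "normal"
    if response_ms <= baseline + 4 * sd:
        return "slow"
    return "very slow"

# Checkout normally takes ~4 seconds, so 4.1s is normal for *this* transaction.
checkout_history = [4000, 4200, 3900, 4100, 4050, 3950]
print(classify(4100, checkout_history))   # normal
print(classify(40000, checkout_history))  # very slow — the broken release
```

The broken 40-second checkout is flagged immediately because it is compared against checkout’s own baseline, even though a shopping click at the same 4-second mark would be perfectly normal.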

Load and response time charts. Baselines are represented by dashed lines.

Automatic baseline of business metrics (books sold). Baseline represented by dashed line.

If you want to make yourself look great to your bosses (momentarily), then set your T threshold to 10 seconds and let the Apdex algorithm make your application look awesome. If you want to be a hero to the business, then show them the problems with Apdex and take a free trial of AppDynamics Pro today. AppDynamics will show you how every individual business transaction is performing and can even tell you whether your application changes are making or costing you money.