Technical deep dive on what’s impacting Healthcare.gov

There has been a wealth of speculation about what might be wrong with the healthcare.gov website from some of the best and brightest technology has to offer. It seems like everyone has an opinion about what is wrong yet there are still very few facts available to the public. One of our Engineers (Evan Kyriazopoulos) at AppDynamics did some research using our End User Monitoring platform and has some interesting findings that I will share with you today.

What We Already Know

Healthcare.gov is an architecture of distributed services built by a multitude of contractors, and integrates with many legacy systems and insurance companies. CGI Federal (the main contractor) said that it was working with the government and other contractors “around the clock” to improve the system, which it called “complex, ambitious and unprecedented.” We’ve put together a logical architecture diagram based upon the information that is currently in the public domain. This logical diagram could represent hundreds or even thousands of application components and physical devices that work in coordination to service the website.

healthcare-gov-architecture-diagram

How We Tested

Since we don’t have access to the server side of healthcare.gov to inject our JavaScript end user agent, we had to inject it from the browser side, which gives a single-user perspective. Using Greasemonkey for Firefox, we injected our small, lightweight JavaScript code. This code allowed us to measure both server and browser response times for every page we visited. All data was sent to our AppDynamics controller for analysis and visualization.

The Findings

What we found after visiting multiple pages was that the problems with response time are pretty evenly divided between server response time issues and browser processing time issues. In the screenshot below you can see the average of all response times for the web pages we visited during this exploratory session. The “First Byte Time” part of the graphic is the measure of how long the server took to get its initial response to the end user browser. Everything shown in blue/green hues represents server side response times and everything in purple represents browser client processing and rendering. You can see really bad performance on both the client and server side on average.

SessionOverview

Now let’s take a look at some individual page requests to see where their problems were. The image below shows an individual request of the registration page and the breakdown of response times across the server and client sides. This page took almost 71 seconds to load, with 59 seconds attributed to the server side and almost 12 seconds attributed to client side data loading and rendering. There is an important caveat when looking at this data. Without access to the server side to measure its response time, or to all of the network segments connecting the server to the client to measure latency, we cannot fully determine why the client side is slow. We know there are JavaScript and CSS file optimizations that should be made, but if the server side and/or network connections are slow then the impact will be seen and measured at the end user browser.

PoorServerResponse
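The server/client split described above can be sketched in a few lines of Python. This is only an illustration, not the AppDynamics agent: the `PageTiming` type and `split_response_time` function are hypothetical names, and the sketch assumes that everything up to the first byte counts as "server side" (which, as noted above, also includes network latency when measured from the browser).

```python
from dataclasses import dataclass


@dataclass
class PageTiming:
    """Timestamps in milliseconds, relative to the start of navigation."""
    first_byte_ms: float   # first byte of the response reaches the browser
    page_loaded_ms: float  # browser has finished loading and rendering


def split_response_time(timing: PageTiming) -> dict:
    """Split total load time into a server/network part and a browser part."""
    if not 0 <= timing.first_byte_ms <= timing.page_loaded_ms:
        raise ValueError("expected 0 <= first byte <= page loaded")
    total_ms = timing.page_loaded_ms
    server_ms = timing.first_byte_ms              # assumption: includes network time
    browser_ms = total_ms - timing.first_byte_ms  # download, parsing, rendering
    return {
        "total_s": total_ms / 1000,
        "server_s": server_ms / 1000,
        "browser_s": browser_ms / 1000,
        "server_share": server_ms / total_ms if total_ms else 0.0,
    }


# Rounded figures from the registration page request described above.
registration = split_response_time(
    PageTiming(first_byte_ms=59_000, page_loaded_ms=71_000)
)
print(registration)
```

With these rounded numbers, roughly five sixths of the wait happens before the browser has received anything to work with.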

Moving past registration… The image below shows a “successful” load of a blank profile page. This is the type of frustrating web application behavior that drives end users crazy. It appears as though the page has loaded successfully (to the browser at least) but that is not really the case.

BlankProfilePage

When we look at the list of our end user requests we see that there are AJAX errors associated with the myProfile web page (right below the line highlighted in blue).

myProfile-PageView

So the myProfile page successfully loaded in ~3.4 seconds and it’s blank. Those AJAX requests are to blame since they are actually responsible for calling other service components that provide the profile details (SOA issues rearing their ugly head). Let’s take a look at one of the failed AJAX requests to see what happened.

Looking at the image below you can see there is an HTTP 503 status code as the response to the AJAX request. For those of you who don’t have all of the HTTP status codes memorized, this is a “Service Unavailable” response. It means that the server received the request but is currently unable to handle it, typically because it is overloaded or a service it depends on is down. Ugh… more server side problems.

ajax-error-marked
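The "loaded but blank" situation above is easy to miss if you only look at the status of the page itself. As a minimal sketch, assuming the page returned a 200 and its AJAX calls returned 503s (the function name `page_view_health` is hypothetical, and the exact statuses are an assumption based on the screenshots), a check might look like this:

```python
from http import HTTPStatus
from typing import Sequence


def page_view_health(page_status: int, ajax_statuses: Sequence[int]) -> str:
    """Classify a page view from the page response and its AJAX responses."""
    if page_status >= 400:
        return "page failed"
    failed = sorted({status for status in ajax_statuses if status >= 400})
    if failed:
        # HTTPStatus(...) raises ValueError for codes it does not know;
        # that is acceptable for this illustration.
        reasons = ", ".join(f"{s} {HTTPStatus(s).phrase}" for s in failed)
        return f"page loaded but incomplete ({reasons})"
    return "ok"


result = page_view_health(200, [200, 503, 503])
print(result)  # page loaded but incomplete (503 Service Unavailable)
```

The point of the sketch is that a page view should only count as successful when the requests it depends on succeed as well.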

Recommendations

Now that we have some good data about where the problems really lie, what can be done to fix them? The server side of the house must be addressed first as it will have an impact on all client side measurements. There are 2 key problems from the server side that need resolving immediately.

1) Functional Integration Errors

When hundreds of services developed by different teams come together, they exchange data through APIs using XML (or similar) formats (note the AJAX failures shown above). Ensuring that all data exchange is accurate and that errors are not occurring requires proper monitoring and extensive integration testing. Healthcare.gov was obviously launched without proper monitoring and testing, and that’s a major reason why many user interactions are failing or throwing errors and why insurance companies are getting incomplete data forms.

To be clear, this isn’t an unusual problem. We see this time and time again with many companies who eventually come to us looking for help solving their application performance issues.

To find and fix these problems at least 5X faster, a product like AppDynamics is required. We identify the source of errors within and across services much faster in test or production environments so that developers can fix them right away. The screenshots below show an example of an application making a failed service call to a third party and just how easy it is to identify with AppDynamics. In this case there was a socket timeout exception while waiting for PayPal (i.e. the PayPal service never answered our web services call at all).

Transaction Flow

Screen Shot 2013-10-14 at 2.59.50 PM

PayPal socket exception

2) Scalability bottlenecks

The second big problem is that there are performance bottlenecks in the software that need to be identified and tuned quickly. Here are some additional points that almost certainly apply to healthcare.gov:

a) Website workloads can be unpredictable, especially so with a brand new site like healthcare.gov. Bottlenecks will occur in some services as they get overwhelmed by requests or start to have resource conflicts. When you are dealing with many interconnected services you must use monitoring that tracks all of your transactions by tagging them with a unique identifier and following them across any application components or services that are touched. In this way you will be able to immediately see when, where, and why a request has slowed down.
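To make the idea of tagging transactions concrete, here is a minimal Python sketch. It is not how AppDynamics implements this: the header name `X-Trace-Id`, the service names, and the in-memory `spans` list are all assumptions made for illustration.

```python
import time
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name
spans = []  # a real system would ship these records to a monitoring backend


def traced_call(service_name, headers, work):
    """Run `work`, recording its duration under the request's trace ID."""
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    trace_id = headers.setdefault(TRACE_HEADER, str(uuid.uuid4()))
    start = time.perf_counter()
    try:
        return work(headers)
    finally:
        spans.append({
            "trace_id": trace_id,
            "service": service_name,
            "seconds": time.perf_counter() - start,
        })


def eligibility_service(headers):
    time.sleep(0.05)  # simulated slow downstream dependency
    return "eligible"


def registration_service(headers):
    # The same headers, and so the same trace ID, travel to the next hop.
    return traced_call("eligibility", headers, eligibility_service)


traced_call("registration", {}, registration_service)
for span in spans:
    print(f'{span["trace_id"]}  {span["service"]:<12}  {span["seconds"]:.3f}s')
```

Because both records share one identifier, the time spent in each service for a single user request can be lined up, which is what lets you see where a request slowed down.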

b) Each web service is really a mini application of its own. There will be bottlenecks in the code running on every server and service. To find and remediate these code issues quickly you must have a monitoring tool that automatically finds the problematic code and shows it to you in an intuitive manner. We put together a list of the most common bottlenecks in an earlier blog post.

Finding and fixing bottlenecks and errors in custom applications is why AppDynamics exists. Our customers typically resolve their application problems in minutes instead of taking days or weeks trying to use log files and infrastructure monitoring tools.

AppDynamics has offered to help fix the healthcare.gov website for free because we think the American people deserve a system that functions properly and doesn’t waste their time. Modern, service oriented application architectures require monitoring tools designed specifically to operate in those complex environments. AppDynamics is ready to answer the call to action.

Apdex is Fatally Flawed

I’ve been looking into a lot of different statistical methods and algorithms lately and one particularly interesting model is Apdex. If you haven’t heard of it yet, Apdex has been adopted by many companies that sell monitoring software. The latest Apdex specification (v1.1) was released on January 22, 2007.

Here is the purpose of Apdex as quoted from the specification document referenced above: “Apdex is a numerical measure of user satisfaction with the performance of enterprise applications, intended to reflect the effectiveness of IT investments in contributing to business objectives.” While this is a noble goal, Apdex can lead you to believe your application is working fine when users are really frustrated. If Apdex is all you have to analyze your application performance then it is better than nothing but I’m going to show you the drawbacks and a better way to manage your application performance.

The Formula

The core fatal flaw in the Apdex specification is the formula used to derive your Apdex index. Apdex is an index from 0-1 derived using the formula shown below.

Apdex = (Satisfied count + (Tolerating count / 2)) / Total samples

At the heart of Apdex is a static threshold (defined as T). The T threshold is set as a measure of the ideal response time for the application. This T threshold is set for an entire application. If you’ve been around application performance for a while you should immediately realize that all application functions perform at very different levels comparatively. For example, my “Search for Flights” function response time will be much greater than my “View Cart” response time. These 2 distinctly different functionalities should not be subjected to the same base threshold (T) for analysis and alerting purposes.

Another thing that application performance veterans should pick up on here is that static thresholds stink. You either set them too high or too low and have to keep adjusting them to a level you are comfortable with. Static thresholds are notorious for causing alert storms (set too low) or for missing problems (set too high). This manual adjustment philosophy leads to a lack of consistency and ultimately makes historical Apdex charts useless if there have been any manual modifications of “T”.

So let’s take a look at the Apdex formula now that we understand “T” and its multiple drawbacks:

  • Satisfied count = the number of transactions with a response time between 0 and T
  • Tolerating count = the number of transactions with a response time between T and F (F is defined next)
  • F = 4 times T (if T=3 seconds then F=12 seconds)
  • Frustrated count (not shown in the formula but represented in Total samples) = the number of transactions with a response time greater than F
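The definitions above can be turned into a short Python sketch. The function name `apdex` is my own, and the handling of samples that fall exactly on T or F (counted as satisfied and tolerating respectively) is an assumption about the boundaries.

```python
def apdex(response_times, t):
    """Return (score, counts) for response times in seconds and threshold T."""
    if t <= 0:
        raise ValueError("T must be positive")
    if not response_times:
        raise ValueError("at least one sample is required")
    f = 4 * t  # frustration threshold F
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= f)
    frustrated = len(response_times) - satisfied - tolerating
    # Tolerating samples count for half; frustrated samples count for nothing.
    score = (satisfied + tolerating / 2) / len(response_times)
    counts = {
        "satisfied": satisfied,
        "tolerating": tolerating,
        "frustrated": frustrated,
    }
    return score, counts


# T = 3 seconds, so F = 12 seconds.
score, counts = apdex([1.0, 2.0, 3.0, 5.0, 13.0], t=3)
print(score, counts)
```

Note that frustrated samples only lower the score by enlarging the total, which is why a modest number of terrible responses can hide behind a large number of fast ones.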

When my colleague Ian Withrow looked at the formula above he had some interesting thoughts on Apdex: “When was the last time you ‘tolerated’ a site loading 4 times slower than you expected it to? If T=1 I’ll tolerate 4 seconds but if T=3 no way am I waiting 12 seconds on a fast connection… which brings me to another point. The notion of a universal threshold for satisfied/tolerating is bunk even for the same page (let alone different pages). My tolerance for how a website loads at work on our fast connection is paper thin. It better be 1-2 seconds or I’m reloading/going elsewhere. On the train to work where I know signal is spotty on my iPhone I’ll load something and then wait for a few seconds.”

Getting back to the Apdex formula itself: assuming every transaction completes in less than the static threshold (T), the Apdex will have a value of 1.00, which is perfect. The real world is never perfect, though, so let’s explore a more plausible scenario.

An Example

You run an e-Commerce website. Your customers perform 15 clicks while shopping for every 1 click of the checkout functionality. During your last agile code release the response time of the checkout function went from 4 seconds to over 40 seconds due to an issue with your production configuration. Users are abandoning your website and shopping elsewhere since their checkouts are lost in oblivion. What does Apdex say about this scenario? Let’s see…

T = 4 seconds
Total samples = 1500
Satisfied count = 1300 (assuming most transactions were under 4 seconds)
Tolerating count = 100 (the assumed number of transactions between 4 and 16 seconds (F=4T))
Frustrated count = 100 (the remaining transactions, slower than 16 seconds)

Apdex = (1300 + 100/2) / 1500 = .90
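To see what the overall index hides, the same numbers can be run through a small Python sketch. The function name is hypothetical, and the sketch assumes that the 100 samples that are neither satisfied nor tolerating (1500 - 1300 - 100) are the broken checkout requests, which is roughly what a 15-to-1 click ratio would produce.

```python
def apdex_from_counts(satisfied, tolerating, total):
    """Apdex score from zone counts; frustrated samples only add to the total."""
    if total <= 0:
        raise ValueError("total samples must be positive")
    return (satisfied + tolerating / 2) / total


# Whole application, using the figures from the example above.
overall = apdex_from_counts(satisfied=1300, tolerating=100, total=1500)

# Assumption: the remaining 100 samples are checkout requests taking over
# 40 seconds, well past F = 16 seconds, so every one of them is frustrated.
checkout_only = apdex_from_counts(satisfied=0, tolerating=0, total=100)

print(f"overall Apdex:       {overall:.2f}")        # 0.90
print(f"checkout-only Apdex: {checkout_only:.2f}")  # 0.00
```

Under that assumption the application as a whole scores 0.90 while the one function that earns the revenue scores 0.00.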

Apdex Index Table

According to the Apdex specification anything from .85 to .93 is considered “Good”. We know there is a horrible issue with checkout that is crushing our revenue stream but Apdex doesn’t care. Apdex is not a good way to tell if your application is servicing your business properly. Instead it is a high level reporting mechanism that shows overall application trends via an index. The usefulness of this index is questionable because Apdex can easily be manipulated by manually changing the T threshold. Apdex, like most other forms of analytics, must be understood for what it is and applied to the appropriate problem.

A Better Way – Dynamic Baselines and Analytics

In my experience (and I have a lot of enterprise production monitoring experience) static thresholds and high level overviews of application health have limited value when you want your application to be fast and stable regardless of the functionality being used. The highest value methodology I have seen in a production environment is automatic baselining with alerts based upon deviation from normal behavior. With this method you never have to set a static threshold unless you have a really good reason to (like ensuring you don’t breach a static SLA). The monitoring system automatically figures out normal response time, compares each transaction’s actual response time to the baseline and classifies each transaction as normal, slow, or very slow. Alerts are based upon how far each transaction has deviated from normal behavior instead of an index value based upon a rigid 4 times the global static threshold like with Apdex.

baselines

Load and response time charts. Baselines are represented by dashed lines.

business baseline

Automatic baseline of business metrics (books sold). Baseline represented by dashed line.
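The baselining approach described above can be sketched in Python as follows. This is a simplified illustration rather than the AppDynamics algorithm: it assumes the baseline is a plain mean over recent samples, and the cut-offs of 3 and 4 standard deviations for "slow" and "very slow" are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def classify(response_time, history, slow_sigmas=3.0, very_slow_sigmas=4.0):
    """Classify one response time against that transaction's own baseline."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # No variation in the history: anything slower than normal stands out.
        return "normal" if response_time <= baseline else "very slow"
    deviations = (response_time - baseline) / spread
    if deviations >= very_slow_sigmas:
        return "very slow"
    if deviations >= slow_sigmas:
        return "slow"
    return "normal"


# Each business transaction gets its own baseline (times in seconds).
checkout_history = [3.8, 4.1, 4.0, 4.2, 3.9]
view_cart_history = [0.4, 0.5, 0.5, 0.6, 0.5]

print(classify(4.2, checkout_history))    # normal for checkout
print(classify(4.2, view_cart_history))   # very slow for view cart
print(classify(40.0, checkout_history))   # very slow
```

The same 4.2 second response is unremarkable for checkout and alarming for viewing the cart, which is exactly the distinction a single global T cannot make.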

If you want to make yourself look great to your bosses (momentarily) then set your T threshold to 10 seconds and let the Apdex algorithm make your application look awesome. If you want to be a hero to the business then show them the problems with Apdex and take a free trial of AppDynamics Pro today. AppDynamics will show you how every individual business transaction is performing and can even tell you if your application changes are making or costing you money.