Technical deep dive on what’s impacting Healthcare.gov

There has been a wealth of speculation about what might be wrong with the healthcare.gov website from some of the best and brightest technology has to offer. It seems like everyone has an opinion about what is wrong, yet there are still very few facts available to the public. One of our engineers at AppDynamics (Evan Kyriazopoulos) did some research using our End User Monitoring platform and has some interesting findings that I will share with you today.

What We Already Know

Healthcare.gov is an architecture of distributed services built by a multitude of contractors, and integrates with many legacy systems and insurance companies. CGI Federal (the main contractor) said that it was working with the government and other contractors “around the clock” to improve the system, which it called “complex, ambitious and unprecedented.” We’ve put together a logical architecture diagram based upon the information that is currently in the public domain. This logical diagram could represent hundreds or even thousands of application components and physical devices that work in coordination to service the website.

healthcare-gov-architecture-diagram

How We Tested

Since we don’t have access to the server side of healthcare.gov to inject our JavaScript end user agent, we had to inject it from the browser side, which gives us a single-user perspective. Using GreaseMonkey for Firefox, we injected our small and lightweight JavaScript code. This code allowed us to measure both server and browser response times of any pages that we visited. All data was sent to our AppDynamics controller for analysis and visualization.

The Findings

What we found after visiting multiple pages was that the problems with response time are pretty evenly divided between server response time issues and browser processing time issues. In the screenshot below you can see the average of all response times for the web pages we visited during this exploratory session. The “First Byte Time” part of the graphic is the measure of how long the server took to get its initial response to the end user browser. Everything shown in blue/green hues represents server side response times and everything in purple represents browser client processing and rendering. You can see really bad performance on both the client and server side on average.

SessionOverview

Now let’s take a look at some individual page requests to see where their problems were. The image below shows an individual request of the registration page and the breakdown of response times across the server and client sides. This page took almost 71 seconds to load, with 59 seconds attributed to the server side and almost 12 seconds attributed to client side data loading and rendering. There is an important consideration when looking at this data. Without access to the server side to measure its response time, or to all of the network segments connecting the server to the client to measure latency, we cannot fully determine why the client side is slow. We know there are JS and CSS file optimizations that should be made, but if the server side and/or network connections are slow then the impact will be seen and measured at the end user browser.
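
To make the arithmetic behind that breakdown concrete, here is a minimal Python sketch. It assumes a simplified model in which everything up to the first byte counts as server (and network) time and everything after it counts as browser time. The `PageTiming` structure and `split_page_time` helper are hypothetical names invented for this example, not part of any AppDynamics API, and the numbers are rounded from the figures quoted above.

```python
from dataclasses import dataclass


@dataclass
class PageTiming:
    """Timestamps in milliseconds from the start of navigation (hypothetical structure)."""
    first_byte_ms: float     # when the first byte of the response reached the browser
    page_complete_ms: float  # when the browser finished loading and rendering


def split_page_time(t: PageTiming) -> dict:
    """Attribute total load time to the server/network side and the browser side."""
    if t.page_complete_ms < t.first_byte_ms:
        raise ValueError("page cannot complete before the first byte arrives")
    server_ms = t.first_byte_ms
    client_ms = t.page_complete_ms - t.first_byte_ms
    total_ms = t.page_complete_ms
    return {
        "server_ms": server_ms,
        "client_ms": client_ms,
        "server_share": server_ms / total_ms if total_ms else 0.0,
    }


# Rounded values for the registration page request described above (assumed, not exact).
registration = PageTiming(first_byte_ms=59_000, page_complete_ms=71_000)
breakdown = split_page_time(registration)
print(breakdown)
```

Because the first byte time is measured at the browser, it also includes network latency, which is exactly why a browser-only measurement cannot fully separate server problems from network problems.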

PoorServerResponse

Moving past registration… The image below shows a “successful” load of a blank profile page. This is the type of frustrating web application behavior that drives end users crazy. It appears as though the page has loaded successfully (to the browser at least) but that is not really the case.

BlankProfilePage

When we look at the list of our end user requests we see that there are AJAX errors associated with the myProfile web page (right below the line highlighted in blue).

myProfile-PageView

So the myProfile page successfully loaded in ~3.4 seconds and it’s blank. Those AJAX requests are to blame since they are actually responsible for calling other service components that provide the profile details (SOA issues rearing their ugly head). Let’s take a look at one of the failed AJAX requests to see what happened.

Looking at the image below you can see there is an HTTP 503 status code as the response to the AJAX request. For those of you who don’t have all of the HTTP status codes memorized this is a “Service Unavailable” response. It means that the HTTP server was able to accept the request but can’t actually do anything with it because there are server side problems. Ugh… more server side problems.
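
For readers who want to classify failures like this one in their own tooling, here is a small, hedged Python sketch. The `describe_ajax_failure` function is a hypothetical helper written for this post; it relies only on the standard library's `http.HTTPStatus` enumeration to look up the reason phrase.

```python
from http import HTTPStatus


def describe_ajax_failure(status_code: int) -> str:
    """Return a short description of an HTTP status code and which side it points to."""
    status = HTTPStatus(status_code)  # raises ValueError for unknown codes
    if 500 <= status_code < 600:
        side = "server-side error"
    elif 400 <= status_code < 500:
        side = "client-side error"
    else:
        side = "not an error"
    return f"{status_code} {status.phrase}: {side}"


summary = describe_ajax_failure(503)
print(summary)  # 503 Service Unavailable: server-side error
```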

ajax-error-marked

Recommendations

Now that we have some good data about where the problems really lie, what can be done to fix them? The server side of the house must be addressed first, as it will have an impact on all client side measurements. There are two key problems on the server side that need resolving immediately.

1) Functional Integration Errors

When hundreds of services developed by different teams come together, they exchange data through APIs using XML (or similar) formats (note the AJAX failures shown above). Ensuring that all data exchange is accurate and that errors are not occurring requires proper monitoring and extensive integration testing. Healthcare.gov was obviously launched without proper monitoring and testing, and that’s a major reason why many user interactions are failing or throwing errors and why insurance companies are getting incomplete data forms.

To be clear, this isn’t an unusual problem. We see this time and time again with many companies who eventually come to us looking for help solving their application performance issues.

To find and fix these problems at least 5X faster, a product like AppDynamics is required. We identify the source of errors within and across services much faster in test or production environments so that developers can fix them right away. The screenshots below show an example of an application making a failed service call to a third party and just how easy it is to identify with AppDynamics. In this case there was a socket timeout exception while waiting for PayPal (i.e. the PayPal service never answered our web services call at all).

Transaction Flow

Screen Shot 2013-10-14 at 2.59.50 PM

paypal socket exception

2) Scalability bottlenecks

The second big problem is that there are performance bottlenecks in the software that need to be identified and tuned quickly. Here are some additional considerations that almost certainly apply to healthcare.gov:

a) Website workloads can be unpredictable, especially so with a brand new site like healthcare.gov. Bottlenecks will occur in some services as they get overwhelmed by requests or start to have resource conflicts. When you are dealing with many interconnected services you must use monitoring that tracks all of your transactions by tagging them with a unique identifier and following them across any application components or services that are touched. In this way you will be able to immediately see when, where, and why a request has slowed down.

b) Each web service is really a mini application of its own. There will be bottlenecks in the code running on every server and service. To find and remediate these code issues quickly you must have a monitoring tool that automatically finds the problematic code and shows it to you in an intuitive manner. We put together a list of the most common bottlenecks in an earlier blog post.
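
The transaction tagging described in (a) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how AppDynamics implements it: the `Trace` class, the `identity_service` and `eligibility_service` functions, and the sleep durations are all hypothetical stand-ins for real services.

```python
import time
import uuid


def new_transaction_id() -> str:
    """Create a unique identifier that travels with one user request."""
    return uuid.uuid4().hex


class Trace:
    """Collects one timing record per component that a transaction touches."""

    def __init__(self):
        self.records = []  # list of (transaction_id, component, elapsed_seconds)

    def call(self, txn_id, component, func):
        """Invoke a service with the transaction id and record how long it took."""
        start = time.perf_counter()
        try:
            return func(txn_id)
        finally:
            elapsed = time.perf_counter() - start
            self.records.append((txn_id, component, elapsed))

    def slowest(self, txn_id):
        """Name the component that took longest for the given transaction."""
        own = [r for r in self.records if r[0] == txn_id]
        return max(own, key=lambda r: r[2])[1]


# Hypothetical downstream services; sleeps stand in for real work.
def identity_service(txn_id):
    time.sleep(0.01)
    return "identity verified"


def eligibility_service(txn_id):
    time.sleep(0.05)
    return "eligible"


def handle_request(trace):
    """Front-end handler: tags the request once and passes the tag to every service."""
    txn_id = new_transaction_id()
    trace.call(txn_id, "identity", identity_service)
    trace.call(txn_id, "eligibility", eligibility_service)
    return txn_id


trace = Trace()
txn = handle_request(trace)
print(trace.slowest(txn))
```

Because every record carries the same identifier, the records for one request can be pulled out of the combined data from many services and compared, which is what makes it possible to say where that particular request slowed down.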

Finding and fixing bottlenecks and errors in custom applications is why AppDynamics exists. Our customers typically resolve their application problems in minutes instead of taking days or weeks trying to use log files and infrastructure monitoring tools.

AppDynamics has offered to help fix the healthcare.gov website for free because we think the American people deserve a system that functions properly and doesn’t waste their time. Modern, service oriented application architectures require monitoring tools designed specifically to operate in those complex environments. AppDynamics is ready to answer the call to action.

Making Healthcare.gov scale isn’t a quick fix

Like many of you, I’ve been following the news about the Obamacare situation. I’m not going to blame the President or his administration for the performance issues that Healthcare.gov is experiencing. I’m not going to blame the contractors that did the integration work. Instead I’m going to blame the colleges and universities around the world who teach computer science and programming.

I became a Bachelor of Computer Science many moons ago. While I wasn’t in a bar drinking Guinness, I studied the application development lifecycle from a bible-like book called “Software Engineering,” which detailed all the steps I needed to write modern applications. It covered everything from use cases to requirements to development, testing and maintenance. I graduated from college and started as a Java developer for a well-known U.S.-based systems integrator. I soon realized I had learned very little about developing applications that would work in the real world. Making an application function or work per spec is easy; making it perform and scale is an entirely different skillset.

But surely my “Software Engineering” book had a section called “Non-Functional Requirements” which included references to Performance and Scalability? Yes, it did, but no lecturer deemed those sections critical to my learning. Not once did someone tell me, “By the way, here are the top 25 things that will make your application run like a dog with 3 legs.” Building efficient code, designing scalable architectures and dealing with over 100,000 transactions per minute were massive gaps in my education. That’s why most developers today ignore the performance and scalability non-functional requirements, and leave them to QA teams to deal with once the application has been designed and developed. Simply put, performance is an afterthought.

I’ve spoken at a lot of conferences in the past two years for AppDynamics. In nearly all of them I’ve asked the audience, “How many of you performance test your code?” Usually one or two people raise their hands. This isn’t good enough, and it’s evidence of why websites and applications frequently slow down and break. Performance and scalability are not something you can just “bolt on” to an application; these non-functional requirements must be designed in from the start of any application architecture. Any experienced developer or architect knows this. The catch is that requirements will always change throughout the development lifecycle, especially now that agile is the de facto development process. The secret is to design an architecture that can scale and perform through change without needing to rewrite large numbers of components or services. Service-Oriented Architecture (SOA) allows developers to do this, because it provides the abstraction needed so new application services can be plugged in and scaled horizontally when required. You then add virtualization and cloud into the mix and you’ve got on-demand computing power for those SOA services.

A key problem with SOA is that it creates service fragmentation, distribution and complexity. Developers and architects no longer manage just a few servers; they manage hundreds or thousands of servers that are shared and have lots of dependencies. If you truly want to test these huge SOA environments for performance and scalability you have to reproduce them in test, and that’s not easy. Imagine the Obama HealthCare.gov team trying to build a test environment and simulate millions of users across hundreds of different internal and remote third-party services: an impossible task. It turns out it was an impossible task; their team only tested the website’s health and performance a week before go-live. Now, imagine if your application looked like this:

productionCraziness

The screenshot above is from an AppDynamics customer that leveraged SOA design principles. Healthcare.gov could be equally complex to manage and test.

Building fast and scalable applications is not easy. The answer isn’t simply to do more testing before you deploy your application in production. The answer is to design performance and scalability into your architecture right from the start. So yes, performance testing is better than no performance testing, but when HealthCare.gov has millions of concurrent users you really need an architecture that was built to scale from its inception.

Application performance monitoring like AppDynamics can definitely help identify the key architecture bottlenecks of HealthCare.gov in production, but the reality is the government is going to need a team of smart, dedicated architects and developers to make the required architecture changes and fixes. Unfortunately this isn’t something that can be achieved in hours or days.

If anyone needs to learn a lesson it should be the colleges and universities around the world who educate students on how to build applications, so that when students like me graduate, they are aware of what it takes to build scalable applications in the real world. Non-functional requirements like performance and scalability are just as important as the functional requirements we’ve all come to understand and practice in application development.

Healthcare.gov is just another example of what happens when non-functional requirements are ignored.

Steve.

AppDynamics Offers to Monitor Healthcare.gov for Free

The Affordable Care Act and its many affiliated health exchanges have been online now for 11 days. To say this first week and a half has been challenging from an IT perspective is an understatement. Persistent “glitches” in these applications continue to prevent consumers from enrolling in health care programs in many states, especially those that rely on the federal site, healthcare.gov.

There are many reasons why these sites have had a rough start, as I outlined in my previous blog. It’s not surprising to anyone that these websites struggled to meet expectations, given the complexity of the applications underneath. And things are beginning to get better. But these glitches won’t go away for good until the engineers responsible for these applications get visibility into what’s going wrong.

That’s why AppDynamics has decided to offer its software to the federal government for free for three months. I believe that our application performance management software can help the engineers working tirelessly to improve the sites by revealing to them where the bottlenecks in their applications are. Even more importantly, I believe that the insight provided by AppDynamics will help the government to dramatically improve the performance of these applications for end users and ultimately allow people to enroll in the programs more easily.

No matter what your politics are, I think we can all agree that no one likes a slow web application. So we’d like to make life a little better for everyone involved by helping those apps get up to speed.

Why is my state’s ACA healthcare exchange site slow?

Today marked the launch of the Affordable Care Act (Obamacare), which included the rollout of online health insurance exchanges in every US state and Washington, DC. Ahead of the launch, several states reported difficulties getting these new websites to function properly, and many experts and pundits anticipated slow performance as the whole system went live today. So why are these sites proving so problematic? Well, we can’t know for sure, but here are a few educated guesses (based on quite a bit of experience dealing with slow and troubled web applications):

  • Integrating with existing systems is difficult. Each state has its own healthcare systems already in place that rely on a whole host of different technologies ranging from Java to COBOL. Interfacing with these existing applications can be difficult, especially if they’re slow or unresponsive.

  • There wasn’t enough time to test. Testing an application really well requires a lot of time and resources to complete. With a hard deadline of October 1 these applications might not be completely ready for the big leagues.
  • They’re dealing with a lot of load… all at once. The uninsured as well as many curious Americans swarmed these exchanges en masse as soon as they became available this morning. If these applications receive more load than anticipated, the applications could simply crash.
  • There are thousands of corner cases to account for. Families’ health care eligibility scenarios could vary in literally thousands of ways, and the teams responsible for designing the application had to account for and test all of those different situations. When the permutations climb up into the hundreds of thousands it becomes extremely difficult not only to account for these situations in the application code but also to simulate these situations in a test environment.

  • Federal and state governments don’t have experience launching web apps on this scale. Tech companies and eCommerce giants like Twitter, Facebook or Amazon have been building applications with massive scale for years, so when it comes time to build and deploy a new one they have experienced people to do it and tried and true processes to fall back on. Most state governments don’t have the same experience, which makes it more difficult for them to “get it right.” (And, we should mention, even the tech and eCommerce companies experience slowdowns and outages pretty regularly).
  • Checking for eligibility is a lot more complicated than buying an airplane ticket. Consumers may expect all websites to respond as quickly as their favorite travel site, but as Richard Onizuka, CEO of the Washington Health Benefit Exchange, commented, “Filling out an application for insurance is a much longer process than just buying a ticket to San Francisco on Orbitz.”
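
To give a feel for how quickly the corner cases mentioned above multiply, here is a small Python sketch. The eligibility dimensions and their option counts are invented for illustration; they are assumptions, not the real rules of any exchange.

```python
from itertools import product
from math import prod

# Hypothetical eligibility dimensions; real exchanges have many more.
dimensions = {
    "household_size": range(1, 9),   # 8 options
    "income_bracket": range(10),     # 10 options
    "state": range(51),              # 50 states plus Washington, DC
    "employment_status": ["employed", "self-employed", "unemployed", "retired"],
    "current_coverage": ["none", "employer", "individual", "public"],
}

# The number of distinct scenarios is the product of the option counts.
total_scenarios = prod(len(values) for values in dimensions.values())
print(total_scenarios)  # 65280

# Each scenario is one combination of values and is a candidate test case.
cases = product(*dimensions.values())
first_case = dict(zip(dimensions, next(cases)))
print(first_case)
```

Even these five small, made-up dimensions produce 65,280 combinations, and adding one more dimension with ten options would push the count past 650,000.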

So if you’re experiencing poor web performance from your state’s health insurance exchange, take a deep breath and try again later. There are a lot of reasons for the site to be slow, especially on its first day out of the gate. So be patient.