Technical deep dive on what’s impacting Healthcare.gov

There has been a wealth of speculation about what might be wrong with the healthcare.gov website from some of the best and brightest that technology has to offer. It seems like everyone has an opinion about what is wrong, yet there are still very few facts available to the public. One of our engineers at AppDynamics, Evan Kyriazopoulos, did some research using our End User Monitoring platform, and today I will share his interesting findings with you.

What We Already Know

Healthcare.gov is built on an architecture of distributed services developed by a multitude of contractors, and it integrates with many legacy systems and insurance companies. CGI Federal (the main contractor) said that it was working with the government and other contractors “around the clock” to improve the system, which it called “complex, ambitious and unprecedented.” We’ve put together a logical architecture diagram based upon the information that is currently in the public domain. This logical diagram could represent hundreds or even thousands of application components and physical devices that work in coordination to service the website.

healthcare-gov-architecture-diagram

How We Tested

Since we don’t have access to the server side of healthcare.gov to inject our JavaScript end-user agent, we had to inject it from the browser side for a single-user perspective. Using GreaseMonkey for Firefox, we injected our small and lightweight JavaScript code. This code allowed us to measure both server and browser response times for any pages we visited. All data was sent to our AppDynamics controller for analysis and visualization.
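For the curious, here is a minimal sketch of the kind of userscript this approach involves. It is not the actual AppDynamics agent: the collector URL is a placeholder, and the real agent captures far more than the handful of Navigation Timing fields beaconed here.

// ==UserScript==
// @name         healthcare.gov timing probe
// @include      https://www.healthcare.gov/*
// @grant        none
// ==/UserScript==

// Ship a few raw Navigation Timing fields to a collector once the page has
// finished loading. The collector URL below is a placeholder, not the real
// AppDynamics controller endpoint.
(function () {
  'use strict';
  window.addEventListener('load', function () {
    // Defer one tick so loadEventEnd has been populated.
    setTimeout(function () {
      var t = window.performance && window.performance.timing;
      if (!t) { return; }
      var payload = {
        url: window.location.href,
        navigationStart: t.navigationStart,
        responseStart: t.responseStart,
        responseEnd: t.responseEnd,
        domComplete: t.domComplete,
        loadEventEnd: t.loadEventEnd
      };
      // A simple image beacon keeps the probe dependency-free.
      var beacon = new Image();
      beacon.src = 'https://collector.example.com/beacon?d=' +
        encodeURIComponent(JSON.stringify(payload));
    }, 0);
  }, false);
})();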

The Findings

What we found after visiting multiple pages is that the response time problems are fairly evenly divided between server response time issues and browser processing time issues. In the screenshot below you can see the average of all response times for the web pages we visited during this exploratory session. The “First Byte Time” portion of the graphic measures how long it took the server’s initial response to reach the end user’s browser. Everything shown in blue/green hues represents server-side response times and everything in purple represents browser client processing and rendering. On average, performance is very poor on both the client and server sides.

SessionOverview
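To make the chart above concrete, the server-versus-browser split can be approximated from the Navigation Timing fields collected by the userscript. The buckets AppDynamics actually reports are richer than this, so treat these formulas as a simplified illustration.

// Approximate split of a page load using Navigation Timing fields.
// All values are in milliseconds relative to navigationStart.
function splitPageLoad(t) {
  return {
    // "First Byte Time": DNS + connect + request + server think time,
    // i.e. everything before the first byte of HTML reaches the browser.
    firstByteTime: t.responseStart - t.navigationStart,

    // Time spent streaming the HTML response down to the browser.
    htmlDownloadTime: t.responseEnd - t.responseStart,

    // Client-side work: DOM construction, CSS/JS processing, rendering.
    browserProcessingTime: t.loadEventEnd - t.responseEnd,

    // End-to-end page load as the user experiences it.
    totalPageLoad: t.loadEventEnd - t.navigationStart
  };
}

// Example: splitPageLoad(window.performance.timing);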

Now let’s take a look at some individual page requests to see where their problems were. The image below shows an individual request to the registration page and the breakdown of response times across the server and client sides. This page took almost 71 seconds to load, with 59 seconds attributed to the server side and almost 12 seconds attributed to client-side data loading and rendering. There is an important consideration when looking at this data. Without access to the server side to measure its response time, or to all of the network segments connecting the server to the client to measure latency, we cannot fully determine why the client side is slow. We know there are JavaScript and CSS file optimizations that should be made, but if the server side and/or network connections are slow, the impact will still be seen and measured at the end user’s browser.

PoorServerResponse

Moving past registration… The image below shows a “successful” load of a blank profile page. This is the type of frustrating web application behavior that drives end users crazy. It appears as though the page has loaded successfully (to the browser, at least), but that is not really the case.

BlankProfilePage

When we look at the list of our end user requests, we see that there are AJAX errors associated with the myProfile web page (right below the line highlighted in blue).

myProfile-PageView

So the myProfile page “successfully” loaded in ~3.4 seconds, yet it’s blank. The failed AJAX requests are to blame, since they are responsible for calling the other service components that provide the profile details (SOA issues rearing their ugly head). Let’s take a look at one of the failed AJAX requests to see what happened.

Looking at the image below, you can see there is an HTTP 503 status code as the response to the AJAX request. For those of you who don’t have all of the HTTP status codes memorized, this is a “Service Unavailable” response. It means that the HTTP server was able to accept the request but couldn’t actually do anything with it because of server-side problems. Ugh… more server-side problems.

ajax-error-marked
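Our end user agent surfaces these AJAX failures automatically. If you want to spot the same thing yourself without any tooling, a much cruder approach is to wrap XMLHttpRequest and log any error-class responses; the sketch below only illustrates the idea and is not how the agent is implemented.

// Log every AJAX response that comes back with an HTTP error status (4xx/5xx).
// A real EUM agent also correlates these errors with the page view and
// reports them to a collector; this only prints them to the console.
(function () {
  var originalOpen = XMLHttpRequest.prototype.open;
  XMLHttpRequest.prototype.open = function (method, url) {
    this.addEventListener('load', function () {
      if (this.status >= 400) {
        // e.g. "AJAX error: GET /some/service returned HTTP 503"
        console.warn('AJAX error: ' + method + ' ' + url +
                     ' returned HTTP ' + this.status);
      }
    });
    return originalOpen.apply(this, arguments);
  };
})();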

Recommendations

Now that we have some good data about where the problems really lie, what can be done to fix them? The server side of the house must be addressed first, as it will have an impact on all client-side measurements. There are two key server-side problems that need resolving immediately.

1) Functional Integration Errors

When hundreds of services developed by different teams come together, they exchange data through APIs using XML (or similar) formats (note the AJAX failures shown above). Ensuring that all data exchange is accurate and error-free requires proper monitoring and extensive integration testing. Healthcare.gov was obviously launched without proper monitoring and testing, and that’s a major reason why many user interactions are failing, errors are being thrown, and insurance companies are receiving incomplete data forms.
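As an illustration of the kind of integration check that catches this class of failure before launch, the sketch below validates a hypothetical profile-service response. The field names and downstream consumer are made up for the example, but every service-to-service interface needs a check along these lines.

// Minimal contract check for a (hypothetical) profile service response.
// An integration test suite would run checks like this against every
// service-to-service interface before go-live.
function validateProfileResponse(status, body) {
  var errors = [];

  if (status !== 200) {
    errors.push('expected HTTP 200, got ' + status);
  }

  var profile;
  try {
    profile = JSON.parse(body);
  } catch (e) {
    errors.push('response body is not valid JSON');
    return errors;
  }

  // Fields a downstream consumer (e.g. an insurer data feed) depends on.
  ['firstName', 'lastName', 'applicationId'].forEach(function (field) {
    if (!profile[field]) {
      errors.push('missing required field: ' + field);
    }
  });

  return errors;   // an empty array means the contract is satisfied
}

// Example: validateProfileResponse(503, '')
//   -> ['expected HTTP 200, got 503', 'response body is not valid JSON']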

To be clear, this isn’t an unusual problem. We see this time and time again with many companies who eventually come to us looking for help solving their application performance issues.

To find and fix these problems at least 5x faster, a product like AppDynamics is required. We identify the source of errors within and across services much faster in test or production environments so that developers can fix them right away. The screenshots below show an example of an application making a failed service call to a third party and just how easy it is to identify with AppDynamics. In this case there was a socket timeout exception while waiting for PayPal (i.e., the PayPal service never answered our web service call at all).

Transaction Flow

Screen Shot 2013-10-14 at 2.59.50 PM

paypal socket exception
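The same failure mode is easy to reproduce in a few lines. The sketch below (Node.js, with a placeholder host rather than PayPal’s actual API) shows a third-party call with an explicit timeout, so a dependency that never answers surfaces as an error instead of silently stalling the transaction.

// Node.js sketch: call a third-party service with an explicit timeout so a
// hung dependency becomes a visible error instead of an open-ended wait.
var https = require('https');

function callThirdParty(path, timeoutMs, callback) {
  var done = false;
  function finish(err, status, body) {
    if (done) { return; }   // guard against double callbacks (timeout then error)
    done = true;
    callback(err, status, body);
  }

  var req = https.get({ host: 'api.thirdparty.example.com', path: path }, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { finish(null, res.statusCode, body); });
  });

  // If no response arrives in time, abort and report a timeout; this is
  // roughly the socket read timeout shown in the screenshots above.
  req.setTimeout(timeoutMs, function () {
    req.abort();
    finish(new Error('timed out after ' + timeoutMs + 'ms waiting on third party'));
  });

  req.on('error', function (err) { finish(err); });
}

// Example: callThirdParty('/v1/payments', 5000, function (err, status) {
//   if (err) { console.error(err.message); }
// });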

2) Scalability Bottlenecks

The second big problem is that there are performance bottlenecks in the software that need to be identified and tuned quickly. Here is some additional information that almost certainly applies to healthcare.gov:

  a) Website workloads can be unpredictable, especially so with a brand new site like healthcare.gov. Bottlenecks will occur in some services as they get overwhelmed by requests or start to have resource conflicts. When you are dealing with many interconnected services, you must use monitoring that tracks all of your transactions by tagging them with a unique identifier and following them across every application component or service that is touched (a sketch of this idea follows this list). In this way you will be able to immediately see when, where, and why a request has slowed down.

  b) Each web service is really a mini application of its own. There will be bottlenecks in the code running on every server and service. To find and remediate these code issues quickly, you must have a monitoring tool that automatically finds the problematic code and shows it to you in an intuitive manner. We put together a list of the most common bottlenecks in an earlier blog post that you can access by clicking here.
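Here is the sketch promised in point (a): Express-style middleware that tags every incoming request with a unique identifier and forwards it on outbound calls, so a slow or failed request can be followed across components. The header name and helper are invented for the example; commercial APM agents do this automatically and with far more context.

// Tag every incoming request with a correlation ID and forward it downstream
// so log entries and traces from different services can be stitched together.
var crypto = require('crypto');

function correlationId(req, res, next) {
  // Reuse the ID if an upstream service already assigned one.
  req.correlationId = req.headers['x-correlation-id'] ||
                      crypto.randomBytes(16).toString('hex');
  res.setHeader('X-Correlation-Id', req.correlationId);
  next();
}

// When this service calls another service, send the same ID along.
function outboundHeaders(req) {
  return { 'X-Correlation-Id': req.correlationId };
}

module.exports = { correlationId: correlationId, outboundHeaders: outboundHeaders };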

Finding and fixing bottlenecks and errors in custom applications is why AppDynamics exists. Our customers typically resolve their application problems in minutes instead of taking days or weeks trying to use log files and infrastructure monitoring tools.

AppDynamics has offered to help fix the healthcare.gov website for free because we think the American people deserve a system that functions properly and doesn’t waste their time. Modern, service-oriented application architectures require monitoring tools designed specifically to operate in those complex environments. AppDynamics is ready to answer the call to action.

DevOps Scares Me – Part 3

Hey, who invited the developer to our operations conversation? In DevOps Scares Me – Part 2, my colleague Dustin Whittle shared his developer point of view on DevOps. I found his viewpoint quite interesting, and it made me realize that I take for granted the knowledge I have about what it takes to get an application into production in a large enterprise. As Dustin called out, there are many considerations, including but not limited to code management, infrastructure management, configuration management, event management, log management, performance management, and general monitoring. In his blog post Dustin went on to cover some of the many tools available to help automate and manage all of the considerations previously mentioned. In this post I plan to explore whether DevOps is only for loosely managed e-commerce providers or whether it can really be applied to more traditional and highly regulated enterprises.

Out With the Old

In the operations environments I have worked in, there were always strict controls on who could access production environments, who could make changes, when changes could be made, who could physically touch hardware, who could access which data centers, etc… In these highly regulated and process-oriented enterprises, the thought of blurring the lines between development and operations seems like a non-starter. There is so much process and tradition standing in the way of using a DevOps approach that it seems nearly impossible. Let’s break it down into small pieces and see if it could be feasible.

Here are the basic steps to getting a new application built and deployed from scratch (from an operations perspective) in a stodgy financial services environment. If you’ve never worked in this type of environment, some of the timing of these steps might surprise you (or be very familiar to you). We are going to assume this new application project has already been approved by management and we have the green light to proceed.

  1. Place order for dev, test, UAT, prod, DR, etc… infrastructure. (~8-week lead time, all hardware ordered up front)
  2. Development team does dev stuff while we ops personnel fill out miles of virtual paperwork to get the infrastructure in place. Much discussion occurs about failover, redundancy, disaster recovery, data center locations, storage requirements, etc… None of this discussion includes developers, just operations and architects…oops.
  3. New application is added to CMDB (or similar) to include new infrastructure components, application components, and dependencies.
  4. Operations is hopeful that the developers are making good progress during the 8-week lead time created by the operational request process (actually the ops teams don’t usually even think about what dev might be working on). Servers have landed and are being racked and stacked. Hopefully we guessed right when we estimated the number of users, efficiency of the code, storage requirements, etc… that were used to size this hardware. In reality we will have to see what happens during load testing and make adjustments (i.e. tell the developers to make the application use fewer resources, or order more hardware).
  5. We’re closing in on one week until the scheduled go-live date, but the application isn’t ready for testing yet. It’s not the developers’ fault that the functional requirements keep changing, but it is going to squeeze the testing and deployment phases.
  6. The monitoring team has installed their standard monitoring agents (usually just traditional server monitoring) and marked off that checkbox from the deployment checklist.
  7. It’s 2 days before go-live and we have an application to test. The load test team has coded some form of synthetic load to be applied to the servers. Functional testing showed that the application worked. Load testing shows slow response times and lots of errors. Another test is scheduled for tomorrow while the development team works frantically to figure out what went wrong with this test.
  8. One day until go-live, load test session 2: still some slow response times and a few errors, but nothing that will stop this application from going into production. We call the load test a “success” and give the green light to deploy the application onto the production servers. The app is deployed, functional testing looks good, and we wait until tomorrow for the real test…production users!
  9. Go-Live … Users hit the application, the application pukes and falls over, the operations team checks the infrastructure and gets the developers the log files to look at. Management is upset. Everyone is asking if we have any monitoring tools that can show what is happening in the application.
  10. Week one is a mess, with the application working, crashing, restarting, working again, and new emergency code releases going into production to fix the problems. Week two and each subsequent week will get better until new functionality gets released in the next major change window.

Nobody wins with a “toss it over the wall” mentality.

In With the New

Part of the problem with the scenario above is that the development and operations teams are so far removed from each other that there is little to no communication during the build and test phases of the development lifecycle. What if we took a small step towards a more collaborative approach as recommended by DevOps? How would this process change? Let’s explore (modified process steps are highlighted using bold font)…

  1. Place order for dev, test, UAT, prod, DR, etc… infrastructure. (~8-week lead time, all hardware ordered up front)
  2. Development and operations personnel fill out virtual paperwork together, which creates a much more accurate picture of infrastructure requirements. Discussions about failover, redundancy, disaster recovery, data center locations, storage requirements, etc… progress more quickly with better sizing estimates and understanding of the overall environment.
  3. New application is added to CMDB (or similar) to include new infrastructure components, application components, and dependencies.
  4. Operations is fully aware of the progress the developers are making. This gives the operations staff an opportunity to discuss monitoring requirements from both a business and IT perspective with the developers. Operations starts designing the monitoring architecture while the servers arrive and are being racked and stacked. Both the development and operations teams are comfortable with the hardware requirement estimates but understand that they will have to see what happens during load testing and make adjustments (i.e. tell the developers to make the application use fewer resources, or order more hardware). Developers start using the monitoring tools in their dev environment to identify issues before the application ever makes it to test.
  5. We’re closing in on one week until the scheduled go-live date, but the application isn’t ready for testing yet. It’s not the developers’ fault that the functional requirements keep changing, but it is going to squeeze the testing and deployment phases.
  6. The monitoring team has installed their standard monitoring agents (usually just traditional server monitoring) as well as the more advanced application performance monitoring (APM) agents across all environments. This provides the foundation for rapid triage during development, load testing, and production.
  7. It’s 2 days before go-live and we have an application to test. The load test team has coded a robust synthetic load based upon application monitoring data gathered during development. This load is applied to the application, which reveals some slow response times and some errors. The developers and operations staff use the APM tool together during the load test to immediately identify the problematic code and have a new release available by the end of the original load test. This process is repeated until the slow response times and errors are resolved.
  8. One day until go-live, we were able to stress test overnight and everything looks good. We have the green light to deploy the application onto the production servers. The app is deployed, functional testing looks good, business and IT metric dashboard looks good, and we wait until tomorrow for the real test…production users!
  9. Go-Live … Users hit the application and it works well for the most part. The APM tool shows the developers and the operations staff some slow response times and a couple of errors. The team agrees to implement a fix after business hours, as the business dashboard shows that things are generally going well. After hours, the development and operations teams collaborate on the build, test, and deployment of the new code to fix the issues identified that day. Management is happy.
  10. Week one is highly successful, with issues being rapidly identified and dealt with as they come up. Week two and each subsequent week are business as usual, and the development team is actively focused on releasing new functionality while operations adapts monitoring and dashboards as needed.
DevAndOps

Developers and operations personnel living together in harmony!

So which scenario sounds better to you? Have you ever been in a situation where increased collaboration caused more problems than it solved? In this example the overall process was kept mostly intact to ensure compliance with regulatory audit procedures. Developers were never granted access to production (a regulatory issue for financial services companies), but by being tightly coupled with operations they had access to all of the information they needed to solve the issues.

It seems to me that you can make a big impact across the lifecycle of an application by implementing parts of the DevOps philosophy in even a minor way. In this example we didn’t even touch the automation aspects of DevOps. That’s where all of those fun and useful tools come into play, so that is where we will pick up next time.

If you’re interested in adding an APM tool to your DevOps, development, or operations toolbox, you can take a free, self-guided trial by clicking here and following the prompts.

Click here for DevOps Scares Me Part 4.