IT Operations and the Team USA Speed Skating Disaster

Photo Credit: http://sports.yahoo.com/news/speed-skating-u-miss-medals-again-172417859–oly.html

In case you weren't aware, Team USA speed skating came home from the Olympic Games in Sochi with zero medals. Not just zero gold medals, zero medals in total (not counting short track, which netted only one silver medal). Now, I'm not a big speed skating fan, but what happened during those Games reminded me of what I had seen all too often while working in operations at some very large enterprises. So that's the topic for today's blog… what IT organizations can learn from the USA speed skating meltdown in Sochi (and vice versa).

As the Team USA speed skaters were turning in poor performance after poor performance on the ice, their coaches were trying to figure out why their athletes were not competing at the level they expected and how to fix the issue. The same thing happens in IT organizations when things go wrong with application performance. The business leans on the IT organization, which tries to figure out what is going wrong so it can fix the issue.

Team USA decided that their new suits could be the reason why no medals had been won yet and requested to switch back to the suits they had been using before. Did they not test the new suits at all? Of course they tested them. The important question is: how did they perform these tests?

Testing Parameters

Altitude, air density, humidity, varying velocity, body positions, body shapes, and so on: there are a ton of different factors that may or may not have been accounted for during the testing of these suits. It's not possible to test every variation of every parameter before using the suits in a race, just as it's not possible to properly test an application for every scenario before it gets released into production.

The fact is, at some point you have to transition from testing to using in a real life scenario. For Team USA, that means using the new suits in races. For IT professionals, it means deploying your application into production.

Cover Up the Air Vent (Blame It On the Database)

The initial reaction by Team USA was to guess that the performance problem was a result of the back air vent system creating turbulent air flow during the race. They proceeded to cover up this air vent with no improvement in results. This is the IT equivalent of blaming application performance problems on the database or the network. This is a natural reaction in the absence of data that can help isolate the location of the problem.

Isolating the bottleneck in a production application can be easy. Here we see there is a very slow call to the database.

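AppDynamics surfaces this kind of slow database call automatically, but the underlying idea is simple: measure each tier of the business transaction and see where the time goes. Purely for illustration, here is a minimal hand-rolled sketch of that idea (the call names and the 500 ms threshold are made up):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

@contextmanager
def timed_call(name, slow_threshold_ms=500):
    """Time a block of work and flag it when it crosses the (illustrative) threshold."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        level = logging.WARNING if elapsed_ms > slow_threshold_ms else logging.INFO
        logging.log(level, "%s took %.0f ms", name, elapsed_ms)

# Wrap each tier of a business transaction to see where the time goes.
with timed_call("app.render_order_page"):
    with timed_call("db.get_order_history"):
        time.sleep(0.8)   # stand-in for a slow database query
    with timed_call("cache.get_user_profile"):
        time.sleep(0.05)  # stand-in for a fast cache lookup
```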

When it was obvious that a simple closure of the air vent did not have the desired effect, Team USA then decided it was time to switch back to their old suits. In the IT world we call this the back out plan. Roll back the code to the last known good version and hope for the best.

It’s All About the Race Results (aka Production)

No matter how well you test speed skating suits or application releases, the measure of success or failure is how well you do on race day (or in production for the IT crowd). As Team USA found out, you can’t predict how things will go in production from your tests alone. Similar to an IT organization holding off on application changes before major events (like end of year financials, Black Friday, Cyber Monday, etc…), Team USA should have proven their new suits in some races leading up to their biggest event instead of making a suit change right before it.

I feel bad for the amazing athletes of Team USA Speed Skating who had to deal with so much drama during the Olympics. I would imagine it was difficult to perform at the highest level with a major distraction like the new suits hanging over them. In the end, there was no difference in results between the new suits and the old ones. The suits were just a distraction from whatever the real issue was.

IT professionals have the luxury of tools that help them find and fix problems in production.


Fortunately for IT professionals, we have tools that help us find problems in production instead of just guessing at the root cause. If you're a fan of USA Speed Skating, sorry for bringing up a sore subject, but hopefully there is a lesson to be learned for next time. If you're an IT professional without the proper monitoring tools to figure out the root cause of your production issues, sign up today for a free trial of AppDynamics and see what you've been missing.

How Garmin attacks IT and business problems with Application Performance Management (APM)

If you’ve used a GPS product before you’re probably familiar with Garmin. They are the leading worldwide provider of navigation devices and technologies. In addition to their popular consumer GPS products, Garmin also offers devices for planes, boats, outdoor activities, fitness, and cars. Because end user experience is so important to Garmin and its customers, ensuring the speed of its web and mobile applications is one of its highest priorities.

This blog post summarizes the content of a case study that can be found by clicking here.

The reality in today's IT-driven world is that there are many different issues to solve and therefore many different use cases for the products designed to help with them. Garmin has shown its industry leadership by taking advantage of AppDynamics to address many use cases with a single product. This reflects a level of maturity that most organizations strive to achieve.

“We never had anything that would give us a good deep dive into what was going on in the transactions inside our application – we had no historical data, and we had no insight into database calls, threads, etc.,” – Doug Strick


AppDynamics flow map showing part of Garmin’s production environment.

Real-Time and Historic Visibility in Production

You just can't test everything before your application goes to production. Real users will find ways to break your application and/or slow it to a crawl. When this happens, you need historical data to compare against as well as real-time information so you can remediate problems quickly.
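To make that concrete, here is a minimal sketch of the kind of baseline comparison a monitoring tool performs for you, assuming you already have historical response-time samples for a business transaction (the numbers below are made up):

```python
from statistics import mean, stdev

def is_anomalous(current_ms, history_ms, sigmas=3):
    """Flag a response time that deviates from the historical baseline.

    history_ms is assumed to hold response times collected for the same
    business transaction over previous days or weeks.
    """
    baseline = mean(history_ms)
    spread = stdev(history_ms)
    return current_ms > baseline + sigmas * spread

# Hypothetical data: last week's checkout response times vs. right now.
last_week = [210, 225, 198, 240, 230, 215, 205, 220]
print(is_anomalous(650, last_week))  # True  -> worth investigating
print(is_anomalous(235, last_week))  # False -> within normal variation
```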

Garmin found that AppDynamics, an application performance management solution designed for high-volume production environments, best suited its needs. Garmin’s operations team feels comfortable leaving AppDynamics on in the production environment, allowing them to respond more quickly and collect historical data.

The following use cases are examples of how Garmin is attacking IT and business problems to better serve the business and its customers.

Use Case 1: Memory Monitoring

Before it started using AppDynamics, Garmin was unable to track application memory usage over time. With AppDynamics, the team was able to identify a major memory issue impacting customers and ensure a fast, stable user experience. Click here to read the details in the case study.


Chart of heap usage over time.
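This isn't how an APM agent collects JVM heap data, but as a rough illustration of trending memory over time, here is a minimal sketch that samples a process's resident memory on an interval (it uses the third-party psutil package):

```python
import time

import psutil  # third-party: pip install psutil

def sample_memory(interval_s=60, samples=5):
    """Record resident memory on an interval so usage can be trended over time.

    A real APM agent reports JVM heap pools; this sketch just samples the
    current process's RSS as a stand-in for the same idea.
    """
    proc = psutil.Process()
    history = []
    for _ in range(samples):
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        history.append((time.time(), rss_mb))
        print(f"rss={rss_mb:.1f} MB")
        time.sleep(interval_s)
    return history

if __name__ == "__main__":
    sample_memory(interval_s=1, samples=3)
```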

Use Case 2: Automated Remediation Using Application Run Book Automation

AppDynamics' Application Run Book Automation feature allows organizations to trigger certain tasks, such as restarting a JVM or running a script, when thresholds in AppDynamics are crossed. Garmin was able to use this feature very effectively one weekend while waiting for a fix from a third party. Click here to read the details in the case study.
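AppDynamics fires the action when a threshold is crossed; the action itself is typically a small remediation script. The sketch below is only a hypothetical example of such a script, restarting a service when its health check fails (the endpoint and service name are made up):

```python
import subprocess
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical health-check endpoint
SERVICE = "my-java-app"                      # hypothetical systemd service name

def healthy(url, timeout=5):
    """Return True if the application answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def remediate():
    """The kind of action a monitoring policy might invoke: restart the JVM."""
    print(f"{SERVICE} looks unhealthy, restarting...")
    subprocess.run(["systemctl", "restart", SERVICE], check=True)

if __name__ == "__main__":
    if not healthy(HEALTH_URL):
        remediate()
```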

Use Case 3: Code-level Diagnostics

“We knew we needed a tool that could constantly monitor our production environment, allowing us to collect historical data and trend performance over time,” said Strick. “Also, we needed something that would give us a better view of what was going on inside the application at the code level.” – Doug Strick

With AppDynamics, Garmin’s application teams have been able to rapidly resolve several issues originating in the application code.  Click here to read the details in the case study.

Use Case 4: Bad Configuration in Production

Garmin also uses AppDynamics to verify that new code releases perform well. Because Garmin has an agile release schedule, new code releases are very frequent (every 2 weeks) and it’s important to ensure that each release goes smoothly. One common problem that occurs when deploying new code is that configurations for test are not updated for the production environment, resulting in performance issues. Click here to read the details in the case study.


AppDynamics flow map showing production nodes making service calls to test environments.
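One cheap safeguard against this class of problem is a sanity check in the deployment pipeline that refuses to ship a configuration still pointing at test systems. This isn't Garmin's approach from the case study, just a minimal sketch of the idea (the config format and forbidden hostname patterns are hypothetical):

```python
import json
import sys

# Hypothetical substrings that should never appear in a production config.
FORBIDDEN_IN_PROD = ("test.", "staging.", "localhost", "127.0.0.1")

def check_production_config(path):
    """Return a list of config entries that still point at test environments."""
    with open(path) as f:
        config = json.load(f)
    return [
        f"{key} -> {value}"
        for key, value in config.items()
        if isinstance(value, str) and any(bad in value for bad in FORBIDDEN_IN_PROD)
    ]

if __name__ == "__main__":
    issues = check_production_config(sys.argv[1])
    if issues:
        print("Refusing to deploy, test endpoints found in production config:")
        print("\n".join(issues))
        sys.exit(1)
    print("Config looks production-ready.")
```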

Use Case 5: Real-time Business Metrics (RTBM)

IT metrics alone don't tell the whole story. There can be business problems, due to pricing issues or issues with external service providers, that impact the business and will never show up as an IT metric. That's why AppDynamics created RTBM and why Garmin adopted it so early in its deployment of AppDynamics. With Real-Time Business Metrics, organizations like Garmin can collect, monitor, and visualize transaction payload data from their applications. Strick used this capability to measure the revenue flow in Garmin's eCommerce application. Click here to read the details in the case study.


Real-time order totals for web orders.
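Conceptually, the idea looks like the sketch below: pull a business value (such as an order total) out of the transaction payload and aggregate it over time. This is only an illustration of the concept, not how RTBM is configured, and the payload shape is hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timezone

# Running totals keyed by minute, the kind of series a dashboard would chart.
revenue_per_minute = defaultdict(float)

def record_order(payload):
    """Pull the business metric (order total) out of the transaction payload."""
    total = float(payload["order"]["total"])  # hypothetical payload shape
    minute = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    revenue_per_minute[minute] += total
    return total

# Example payloads as they might flow through the checkout transaction.
record_order({"order": {"id": "A123", "total": "59.99", "currency": "USD"}})
record_order({"order": {"id": "A124", "total": "120.00", "currency": "USD"}})
print(dict(revenue_per_minute))
```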

Conclusion

“AppDynamics has given us visibility into areas we never had visibility into before… Some issues that used to take several hours now take under 30 minutes to resolve, or a matter of seconds if we’re using Run Book Automation.” – Doug Strick

Garmin has shown how to get real value out of its APM tool of choice and you can do the same. Click here to try AppDynamics for free in your environment today.

DevOps Scares Me – Part 4: Dev and Ops Collaborate Across the Lifecycle

Today we are blogging something a little different from the norm. I'm Jim Hirschauer, the Operations Guy, and this is my esteemed colleague Dustin Whittle, the Developer. In this blog post we're going to discuss how we would take an application from inception through development, testing, QA, and into production. We'll each comment on the different stages and provide our perspective on the tools we need at each stage and how they help with automation, testing, and monitoring. Along the way we'll call out the collaboration points to identify the areas where the DevOps approach provides the most value.

The software development loop looks like this:

DevOps Cycle

Inception and working with a product team

From an operational perspective, my first instinct is to understand the application architecture so that I can start thinking about the proper deployment model for the infrastructure components. Here are some of my operational questions and considerations for this stage:

  • Are we using a public or private cloud?
  • What is the lead time for spinning up each component and ensuring it complies with my company's regulations?
  • When do I need to provide a development environment to my dev team or will they handle it themselves?
  • Does this application perform functions that other applications or services already handle? Operations should have high-level visibility into the application and service portfolio.

From a development perspective, my first milestone is to make sure the ops team fully understands the application and what it takes to deploy it to a pre-production environment. This is where we, the developers, sync with the product and ops teams and make sure we are all aligned.

Planning for the product team:

  • Is the project scope well defined? Is there a product requirements document?
  • Do we have a well defined product backlog?
  • Are there mocks of the user experience?

Planning for the ops team:

  • What tools will we use for deployment and configuration management?
  • How will we automate the deployment process and does the ops team understand the manual steps?
  • How will we integrate our builds with our continuous integration server?
  • How will we automate the provisioning of new environments?
  • Capacity Planning – Do we know the expected production load?

There's not a ton of activity at this stage for the operations team. This is really where the DevOps synergy comes into play. DevOps is simply operations working together with engineers to get things done faster in an automated and repeatable way. When it comes to scaling, the more automation in place, the easier things will be in the long run.

Development and scoping production

This should start with a conversation between the dev and ops teams to establish domain ownership. Depending on your organization's and your peers' strengths, this is a good time to decide who will be responsible for automating the provisioning and deployment of the application. The ops questions for deploying complex web applications are:

  • How do you provision virtual machines?
  • How do you configure network devices and servers?
  • How do you deploy applications?
  • How do you collect and aggregate logs?
  • How do you monitor services?
  • How do you monitor network performance?
  • How do you monitor application performance?
  • How do you alert and remediate when there are problems?

During the development phase, the operations-focused staff normally make sure the development environment is managed and actively work to set up the test, QA, and production environments. This can take a lot of time if automation tools aren't used.
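Purpose-built configuration management and provisioning tools are the right answer here; the sketch below only illustrates the underlying principle of scripted, repeatable environment setup (the hostnames and commands are hypothetical):

```python
import subprocess

# Hypothetical hosts for each promoted environment.
ENVIRONMENTS = {
    "test": ["test-app01.example.com"],
    "qa":   ["qa-app01.example.com", "qa-app02.example.com"],
    "prod": ["prod-app01.example.com", "prod-app02.example.com"],
}

# The same steps run everywhere, so every environment is built identically.
SETUP_COMMANDS = [
    "sudo yum -y install java-1.8.0-openjdk",
    "sudo mkdir -p /opt/myapp",
]

def provision(env):
    """Apply the same setup commands to every host in the environment over SSH."""
    for host in ENVIRONMENTS[env]:
        for cmd in SETUP_COMMANDS:
            print(f"[{host}] {cmd}")
            subprocess.run(["ssh", host, cmd], check=True)

if __name__ == "__main__":
    provision("qa")
```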

Here are some tools you can use to automate server build and configuration:

Meanwhile, the operations staff should also make sure that the developers have access to tools which will help them with release management and application monitoring and troubleshooting. Here are some of those tools:

Deployment Automation:

Infrastructure Monitoring:

Logging:

Application + Network Performance Management:

Testing and Quality Assurance

Once developers have built unit and functional tests, we need to ensure the tests run after every commit and that we don't allow regressions into our promoted environments. In theory, developers should do this before they commit any code, but oftentimes problems don't show up until you have production traffic running on production infrastructure. The goal of this step is to simulate, as much as possible, everything that can go wrong, find out what happens, and learn how to remediate it.
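The mechanics of this are usually handled by a continuous integration server, but the gate itself is simple. Here is a minimal sketch of the kind of check a CI job might run on every commit, assuming a pytest-based test suite:

```python
import subprocess
import sys

def run_test_gate():
    """Run the unit/functional test suite; a non-zero exit code fails the build."""
    return subprocess.run(["python", "-m", "pytest", "-q"]).returncode

if __name__ == "__main__":
    # A CI server would run this (or pytest directly) on every commit so that
    # regressions never reach the promoted environments.
    sys.exit(run_test_gate())
```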


The next step is to do capacity planning and load testing to be confident the application doesn’t fall over when it is needed most. There are a variety of tools for load testing:

  • Apica Load Test – Cloud-based load testing for web and mobile applications
  • Soasta – Build, execute, and analyze performance tests on a single, powerful, intuitive platform.
  • Bees with Machine Guns – A utility for arming (creating) many bees (micro EC2 instances) to attack (load test) targets (web applications).
  • MultiMechanize – Multi-Mechanize is an open source framework for performance and load testing. It runs concurrent Python scripts to generate load (synthetic transactions) against a remote site or service. Multi-Mechanize is most commonly used for web performance and scalability testing, but can be used to generate workload against any remote API accessible from Python.
  • Google PageSpeed Insights – PageSpeed Insights analyzes the content of a web page, then generates suggestions to make that page faster. Reducing page load times can reduce bounce rates and increase conversion rates.
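Any of the tools above will do the heavy lifting for you. Purely for illustration, here is a bare-bones load generator that fires concurrent requests and reports an error count and 95th-percentile latency (the target URL and request counts are placeholders, and you should only point something like this at systems you own):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://localhost:8080/"  # placeholder target
CONCURRENCY = 20
REQUESTS = 200

def hit(_):
    """Issue one request and return (success, latency in ms)."""
    start = time.perf_counter()
    try:
        with urlopen(URL, timeout=10) as resp:
            resp.read()
        ok = True
    except OSError:
        ok = False
    return ok, (time.perf_counter() - start) * 1000

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(hit, range(REQUESTS)))
    latencies = sorted(ms for ok, ms in results if ok)
    errors = sum(1 for ok, _ in results if not ok)
    if latencies:
        p95 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.95))]
        print(f"errors={errors}, p95={p95:.0f} ms")
    else:
        print(f"errors={errors}, no successful requests")
```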

The last step of testing is discovering all of the possible failure scenarios and coming up with a disaster recovery plan. For example, what happens if we lose a database, lose a data center, or see a 100x surge in traffic?

During the test and QA stages, operations needs to play a prominent role. This is often overlooked by ops teams, but their participation in test and QA can make a meaningful difference in the quality of the release into production. Here's how.

If the application is already in production (and monitored properly), operations has access to production usage and load patterns. These patterns are essential to the QA team for creating a load test that properly exercises the application. I once watched a functional test in which the application support team manually exercised more than 20 business transactions. Directly afterward, I watched a load test that ran just 2 business transactions over and over again. Do you think that load test was an accurate representation of production load? No way! When I asked the QA team why there were only 2 transactions, they said, “Because that is what the application team told us to model.”

The development and application support teams usually don't have time to sit with the QA team and give them an accurate assessment of what needs to be modeled for load testing. Operations teams should act as the middleman, providing business transaction information from production (or from development, if the application has never seen production load).

Here are some of the operational tasks during testing and QA:

  • Ensure monitoring tools are in place.
  • Ensure environments are properly configured.
  • Participate in functional, load, stress, leak, and other tests, providing analysis and support.
  • Provide guidance to the QA team.

Production

Production is traditionally the domain of the operations team. For as long as I can remember, development teams have thrown applications over the production wall for the operations staff to deal with when there are problems. Sure, some problems like hardware issues, network issues, and cooling issues are purely on the shoulders of operations, but what about all of those application-specific problems? For example, there are problems where the application is consuming way too many resources, or when the application has connection issues with the database due to a misconfiguration, or when the application just locks up and has to be restarted.

I recall getting paged in the middle of the night for application-related issues and thinking how much better each release would be if the developers had to support their applications once they made it to production. It was really difficult back in those days to say with any certainty that the problem was application related and that a developer needed to be involved. Today’s monitoring tools have changed that and allow for problem isolation in just minutes. Since developers in financial services organizations are not allowed access to production servers, it makes having the proper tools all the more important.

Production DevOps is all about:

  • deploying code in a fast, repeatable, scalable manner
  • rapidly identifying performance and stability problems
  • alerting the proper team when a problem is detected
  • rapidly isolating the root cause of problems
  • automatic remediation of known problems and rapid manual remediation of new problems (runbooks and runbook automation)

Your application must always be available and operating correctly during business hours (this may be 24×7 for your specific application).

Alerting Tools:

In case of failures, alerting tools are crucial for notifying the ops team of serious issues. The operations team will usually have a runbook to turn to when things go wrong. A best practice is to collaborate on incident response plans.
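Real alerting tools handle escalation, de-duplication, and on-call schedules; the sketch below only illustrates the core loop of checking a threshold and notifying a webhook when it is crossed (the webhook URL and the 5% error-rate threshold are hypothetical):

```python
import json
import urllib.request

WEBHOOK_URL = "https://alerts.example.com/hooks/ops"  # hypothetical incident webhook
ERROR_RATE_THRESHOLD = 0.05                           # hypothetical: alert above 5% errors

def send_alert(message):
    """POST a simple JSON alert to the (hypothetical) incident webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)

def check_error_rate(errors, requests):
    """Fire an alert if the observed error rate crosses the threshold."""
    rate = errors / requests if requests else 0.0
    if rate > ERROR_RATE_THRESHOLD:
        send_alert(f"Checkout error rate {rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}")

if __name__ == "__main__":
    check_error_rate(errors=87, requests=1200)  # about 7% -> would notify the on-call team
```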

Maintenance

Finally, we've made it to the last major category of the SDLC: maintenance. As an operations guy, my mind focuses on the following tasks:

  • Capacity planning – Do we have enough resources available to the application? If we use dynamic scaling, this is not an issue but a task to ensure the scaling is working properly.
  • Patching – are we up to date with patches on the infrastructure and application components? This is supposed to help with performance and/or security and/or stability but it doesn’t always work out that way.
  • Support – are we current with our software support levels (aka, have we paid and are we on supported versions)?
  • New releases (application updates) – New releases always made me cringe since I assumed the release would have issues the first week. I learned this reaction from some very late nights immediately following those new releases.

As a developer, the biggest issue during the maintenance phase is working with the operations team to deploy new versions and make critical bug fixes. The other primary concern is troubleshooting production problems. Even when no new code has been deployed, failures sometimes happen. If you have a great process, application performance monitoring, and a DevOps mentality, collaborating with ops to resolve the root cause of failures becomes easy.

As you can see, the dev and ops perspectives are pretty different, but that's exactly why those two sides of the house need to tear down the walls and work together. DevOps isn't just a set of tools; it's a philosophical shift that requires buy-in from everyone involved to really succeed. It's only through a high level of collaboration that things will change for the better. AppDynamics can't change the mindset of your organization, but it is a great way to foster collaboration across all of your organizational silos. Sign up for your free trial today and make a difference for your organization.