Quantifying the value of DevOps

In my experience when you work in IT the executive team rarely focuses on your team until you experience a catastrophic failure – once you do you are the center of attention until services are back to normal. It is easy to ignore the background work that IT teams spend most of their days on just to keep everything running smoothly. In this post I will discuss how to quantify the value of DevOps to organizations. The notion of DevOps is simple: Developers working together with Operations to get things done faster in an automated and repeatable way. If the process is working the cycle looks like:

DevOps

DevOps consists of tools, processes, and the cultural change to apply both across an organization. In my experience in large companies this is usually driven from the top down, and in smaller companies this comes organically from the bottom up.

When I started in IT I worked as a NOC engineer for a datacenter. Most my days were spent helping colocation customers install or upgrade their servers. If one of our managed servers failed it was my responsibility to fix it as fast as possible. Other days were spent as a consultant helping companies manage their applications. This is when most web applications were simple with only two servers – a database and an app server:

monolithic_app

As I grew in my career I moved to the engineering side and worked developing very large web applications. The applications I worked on were much more complex then what I was used to in my datacenter days. It is not just the architecture and code that is more complex, but the operational overhead to manage such large infrastructure requires an evolved attitude and better tools.

distributed_app

When I built and deployed applications we had to build our servers from the ground up. In the age of the cloud you get to choose which problems you want to spend time solving. If you choose an Infrastructure as a service provider you own not only your application and data, but the middleware and operating system as well. If you pick a platform as a service you just have to support your application and data. The traditional on-premise option while giving you the most freedom, also carries the responsibility for managing the hardware, network, and power. Pick your battles wisely:

Screen Shot 2014-03-12 at 11.50.15 AM

As an application owner on a large team you find out quickly how well a team works together. In the pre-DevOps days the typical process to resolve an operational issues looked like this:

Screen Shot 2014-03-12 at 11.49.50 AM

1)     Support creates a ticket and assigns a relative priority
2)     Operations begins to investigate and blames developers
3)     Developer say its not possible as it works in development and bounces the ticket back to operations
4)     Operations team escalates the issue to management until operations and developers are working side by side to find the root cause
5)     Both argue that the issue isn’t as severe as being stated so they reprioritize
6)     Management hears about the ticket and assigns it Severity or Priority 1
7)     Operations and Developers find the root cause together and fix the issue
8)     Support closes the ticket

Many times we wasted a lot of time investigating support tickets that weren’t actually issues. We investigated them because we couldn’t rely on the health checks and monitoring tools to determine if the issue was valid. Either the ticket couldn’t be reproduced or the issues were with a third-party. Either way we had to invest the time required to figure it out. Never once did we calculate how much money the false positives cost the company in man-hours.

Screen Shot 2014-03-12 at 11.50.35 AM

With better application monitoring tools we are able to reduce the number of false positive and the wasted money the company spent.

How much revenue did the business lose?

noidea

I never once was able to articulate how much money our team saved the company by adding tools and improving processes. In the age of DevOps there are a lot of tools in the DevOps toolchain.

By adopting infrastructure automation with tools like Chef, Puppet, and Ansible you can treat your infrastructure as code so that it is automated, versioned, testable, and most importantly repeatable. The next time a server goes down it takes seconds to spin up an identical instance. How much time have you saved the company by having a consistent way to manage configuration changes?

By adopting deployment automation with tools like Jenkins, Fabric, and Capistrano you can confidently and consistently deploy applications across your environments. How much time have you saved the company by reducing build and deployment issues?

By adopting log automation using tools such as Logstash, Splunk, SumoLogic and Loggly you can aggregate and index all of your logs across every service. How much time have you saved the company by not having to manually find the machine causing the problem and retrieve the associated logs in a single click?

By adopting application performance management tools like AppDynamics you can easily get code level visibility into production problems and understand exactly what nodes are causing problems. How much time have you saved the company by adopting APM to decrease the mean time to resolution?

By adoption run book automation through tools like AppDynamics you can automate responses to common application problems and auto-scale up and down in the cloud. How much time have you saved the company by automatically fixing common application failures with out even clicking a button?

Understanding the value these tools and processes have on your organization is straightforward:

devops_tasks

DevOps = Automation & Collaboration = Time = Money 

When applying DevOps across your organization the most valuable advice I can give is to automate everything and always plan to fail. A survey from RebelLabs/ZeroTurnaround shows that:

1)     DevOps teams spend more time improving things and less time fixing things
2)     DevOps teams recover from failures faster
3)     DevOps teams release apps more than twice as fast

 

How much does an outage cost in your company?

This post was inspired by a tech talk I have given in the past: https://speakerdeck.com/dustinwhittle/devops-pay-raise-devnexus

 

 

Take five minutes to get complete visibility into the performance of your production applications with AppDynamics today.

How Garmin attacks IT and business problems with Application Performance Management (APM)

If you’ve used a GPS product before you’re probably familiar with Garmin. They are the leading worldwide provider of navigation devices and technologies. In addition to their popular consumer GPS products, Garmin also offers devices for planes, boats, outdoor activities, fitness, and cars. Because end user experience is so important to Garmin and its customers, ensuring the speed of its web and mobile applications is one of its highest priorities.

This blog post summarizes the content of a case study that can be found by clicking here.

The reality in today’s IT driven world is that there are many different issues that need to be solved and therefore many different use cases for the products designed to help with these issues. Garmin has shown it’s industry leadership by taking advantage of AppDynamics to address many use cases with a single product. This shows a high level of maturity that most organizations strive to achieve.

“We never had anything that would give us a good deep dive into what was going on in the transactions inside our application – we had no historical data, and we had no insight into database calls, threads, etc.,” – Doug Strick

Garmin-FlowMap

AppDynamics flow map showing part of Garmin’s production environment.

Real-Time and Historic Visibility in Production

You just can’t test everything before your application goes to production. Real users will find ways to break your application and/or make it slow to a crawl. When this happens you need historical data to compare against as well as real-time information to quickly remediate problems.

Garmin found that AppDynamics, an application performance management solution designed for high-volume production environments, best suited its needs. Garmin’s operations team feels comfortable leaving AppDynamics on in the production environment, allowing them to respond more quickly and collect historical data.

The following use cases are examples of how Garmin is attacking IT and business problems to better serve their company and their customers.

Use Case 1: Memory Monitoring

Before they started using AppDynamics, Garmin was unable to track application memory usage over time. Using AppDynamics they were able to identify a major memory issue impacting customers and ensure a fast and stable user experience. Click here to read the details in the case study.

Garmin-Heap

Chart of heap usage over time.

Use Case 2: Automated Remediation Using Application Run Book Automation

AppDynamics’ Application Run Book Automation feature allows organizations to trigger certain tasks, such as restarting a JVM or running a script, with thresholds in AppDynamics. Garmin was able to use this feature very effectively one weekend while waiting for a fix from a third party. Click here to read the details in the case study.

Use Case 3: Code-level Diagnostics

“We knew we needed a tool that could constantly monitor our production environment, allowing us to collect historical data and trend performance over time,” said Strick. “Also, we needed something that would give us a better view of what was going on inside the application at the code level.” – Doug Strick

With AppDynamics, Garmin’s application teams have been able to rapidly resolve several issues originating in the application code.  Click here to read the details in the case study.

Use Case 4: Bad Configuration in Production

Garmin also uses AppDynamics to verify that new code releases perform well. Because Garmin has an agile release schedule, new code releases are very frequent (every 2 weeks) and it’s important to ensure that each release goes smoothly. One common problem that occurs when deploying new code is that configurations for test are not updated for the production environment, resulting in performance issues. Click here to read the details in the case study.

Garmin-Verification

AppDynamics flow map showing production nodes making services calls to test environments.

Use Case 5: Real-time Business Metrics (RTBM)

IT metrics alone don’t tell the whole story. There can be business problems due to pricing issues our issues with external service providers that impact the business and will never show up as an IT metric. That’s why AppDynamics created RTBM and why Garmin adopted it so early on in their deployment of AppDynamics. With Real-Time Business Metrics, organizations like Garmin can collect, monitor, and visualize transaction payload data from their applications. Strick used this capability to measure the revenue flow in Garmin’s eCommerce application.  Click here to read the details in the case study.

Garmin-RTBM

Real-time order totals for web orders.

Conclusion

“AppDynamics has given us visibility into areas we never had visibility into before… Some issues that used to take several hours now take under 30 minutes to resolve, or a matter of seconds if we’re using Run Book Automation.” – Doug Strick

Garmin has shown how to get real value out of its APM tool of choice and you can do the same. Click here to try AppDynamics for free in your environment today.

Forrester Talks Automation and Application Performance Management

Jean-Pierre (JP) Garbani of the analyst firm Forrester recently wrote a research paper titled “Technology Spotlight: Automate Application Performance Management – Automate Your Incident Management Process With Run Book Automation“. In this paper JP discusses the fact that IT must embrace automation if they are to be successful moving forward. He goes on to describe Run Book Automation (RBA) and how it applies to Application Performance Management (APM).

One of the most interesting parts of the research note is a graphical depiction of the intersection between APM and RBA. While I can’t show that image in this blog post I can say that it is a spot on depiction of how RBA works within AppDynamics. For those who are not familiar, AppDynamics decided to lead the APM industry in a new direction by listening to our customers and making RBA an important and fully integrated part of our product.

If you take nothing else away from this blog post or from JPs research paper you need to understand this key insight… The reason APM based RBA works so much better than traditional RBA is because APM understands exactly which application nodes are impacted at any given time and can perform run book remediation on those nodes without any input from a user.

The Old Way

Traditional RBA requires that you write tests to act as triggers for run book workflows. You would run these tests against pre-defined sets of infrastructure and application components when there is a problem so that your runbooks can fix any known issues. If your application and infrastructure change you need to manually modify the list components that are getting tested. This manual update process has been the downfall of RBA and other technologies (CMDB anyone?) as people have come to realize the time investment required to keep everything current. This problem is amplified by todays dynamic application technologies like virtualization and cloud computing.

Classic RBA Process

Process flow and timeline of classic RBA.

The New Way

Forrester and AppDynamics agree that the answer to the traditional RBA problem described above is by using APM to dynamically track the current state of application and infrastructure components as well as identify problems that trigger run book workflows for resolution. Identification of issues from within the application and in real time is a giant step forward from the pre-determined interval testing of traditional RBA. And when you combine this capability with real time business metrics you get a new capability that enables the business to react immediately to problems that have nothing to do with IT.

APM RBA Process

Process flow and timeline of APM based RBA.

Application run book automation can be used by any organization large or small. If there are issues within your environment that have known fixes then you can use application RBA to automatically detect and remediate those problems within seconds. Stop wasting your time doing repetitive tasks and try AppDynamics for free today. If you’d like to read the Forrester research paper in its entirety you can download it for free by clicking here.

My Top 3 Automated Tasks for Finding and Fixing Problems

Gears-BrainAutomation sets apart organizations at the top of their game from the rest of the pack. The limiting factor in most organizations is that they are usually too busy putting out fires and keeping up with all of their other obligations to expend the effort required to envision and build out their automation strategy. With that in mind I have created a small list of the automation tasks that I feel provide the most value to an organization. Along with these tasks I explain the type of effort involved and reward associated with each one. All of the information presented is based upon my 15 years of troubleshooting within enterprise operations and applications environments.

IT Operations

1. Collect troubleshooting metrics – To me this is the no-brainer of automation tasks. For each particular type of problem you always want certain information to help resolve that issue. This is also the easiest of my top three to implement but may provide less value than the other two. Here are some examples…

  • Hung/Unresponsive JVM/CLR – Initiate and store thread dump, restart application on offending node.
  • Slow transactions plus high server CPU utilization – Collect process listing to determine CPU contributors and spin up extra instance of application.
  • Transactions throwing excessive errors – search log files for errors and send list to appropriate personnel, based upon error type possibly probe individial components deeper (see #2 below)

Application Operations

2. Probe application components – This one is really useful for figuring out difficult application problems but requires more effort to set up than #1. The basic concept is that you need to find out from the application support team what steps they would take manually to trouble shoot their application if it were slow or broken. The usual responses are things like “Check this log file for this word”, “Run this query against this database and if the output is -3 then I know what the problem is”, “Hit this URL to see what the response looks like. If the page does not return properly I know there is a problem with this component”, etc…

It may seem like a lot of work to set up this type of automated probing at first but once it is set up it becomes an invaluable troubleshooting tool. Imagine that the application response times get slow so these troubleshooting measures are automatically invoked and you get an email with the exact root cause within minutes of performance degradation. With this type of automation there is usually a known resolution based upon the probing results so that could be automated too. How handy is that?

Business Operations

Blame_Flowchart3. Alert the business to changing conditions – This is where the IT staff has the ability to really make an impression and impact on the business. One of the most overlooked aspects of monitoring and automation is the ability to gather, baseline, alert, and act based upon pure business metrics. Here’s an example scenario…

Bobs business has an e-commerce website that sells many different products. They use AppDynamics Pro to track the quantity of each item sold along with the total revenue of all sales throughout the business day (using the information point functionality). One day Bob gets an alert that the latest greatest widgets are selling way below their typical volume but the automation engine searched the prices of major competitors to find that one competitor lowered prices and is undercutting business. With this actionable information Bob is able to immediately match the pricing of the competition and sales rates return to normal before too much damage is done.

Obviously business operations automation can be the most complex but also the most rewarding. Reaching out to the business and having a conversation about their activities and processes can seem daunting but I can tell you from personal experience that the business is more than willing to participate in initiatives that save them time and money. This type of conversation also normally leads to the business asking about other ways to collaborate to make them more effective so it is a great way to improve the overall level of communication with the business.

AppDynamics Pro enables a new level of automation because it knows exactly what is going on inside of your applications from both a business and IT perspective. Your level of automation is limited only by your imagination. I recommend that you start out small with a single automation use case and build outward from there. You can use AppDynamics Pro for free to try out our application runbook automation functionality on your specific use case by clicking here.

DevOps Scares Me – Part 2

What is DevOps?

In the first post of this series, my colleague Jim Hirschauer defined what DevOps means and how it impacts organizations. He concluded DevOps is defined as “A software development method that stresses communication, collaboration and integration between software developers and information technology professionals with the goal of automating as much as possible different operational processes.”

I like to think DevOps can be explained simply as operations working together with engineers to get things done faster in an automated and repeatable way.

DevOps Cycle

From developer to operations – one and the same?

As a developer I have always dabbled lightly in operations. I always wanted to focus on making my code great and let an operations team worry about setting up the production infrastructure. It used to be easy! I could just ftp my files to production, and voila! My app was live and it was time for a beer. Real applications are much more complex. As I evolved my skillset I started to do more and expand my operations knowledge.

Infrastructure Automation

When I was young you actually had to build a server from scratch, buy power and connectivity in a data center, and manually plug a machine into the network. After wearing the operations hat for a few years I have learned many operations tasks are mundane, manual, and often have to be done at two in the morning once something has gone wrong. DevOps is predicated on the idea that all elements of technology infrastructure can be controlled through code. With the rise of the cloud it can all be done in real-time via a web service.

Growing pains

When you are responsible for large distributed applications the operations complexity grows quickly.

  • How do you provision virtual machines?
  • How do you configure network devices and servers?
  • How do you deploy applications?
  • How do you collect and aggregate logs?
  • How do you monitor services?
  • How do you monitor network performance?
  • How do you monitor application performance?
  • How do you alert and remediate when there are problems?

Combining the power of developers and operations

The focus on the developer/operations collaboration enables a new approach to managing the complexity of real world operations. I believe the operations complexity breaks down into a few main categories: infrastructure automation, configuration management, deployment automation, log management, performance management, and monitoring. Below are some tools I have used to help solve these tasks.

Infrastructure Automation

Infrastructure automation solves the problem of having to be physically present in a data center to provision hardware and make network changes. The benefits of using cloud services is that costs scale linearly with demand and you can provision automatically as needed without having to pay for hardware up front.

Configuration Management

Configuration management solves the problem of having to manually install and configure packages once the hardware is in place. The benefit of using configuration automation solutions is that servers are deployed exactly the same way every time. If you need to make a change across ten thousand servers you only need to make the change in one place.

There are other vendor-specific DevOps tools as well:

Deployment Automation

Deployment automation solves the problem of deploying an application with an automated and repeatable process.

Log Management

Log management solves the problem of aggregating, storing, and analyzing all logs in one place.

Performance Management

Performance management is about ensuring your network and application are performing as expected and providing intelligence when you encounter problems.

Monitoring

Monitoring and alerting are a crucial piece to managing operations and making sure people are notified when infrastructure and related services go down.

In my next post of this series we will dive into each of these categories and explore the best tools available in the devops space. As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

Click here for DevOps Scares Me Part 3.

DevOps Scares Me – Part 1

Call of Duty: DevOpsDevOps is scary stuff for us pure Ops folks that thought they left coding behind a long, long time ago. Most of us Ops people can hack out some basic (or maybe even advanced) shell scripts in Perl, ksh, bash, csh, etc… But the term DevOps alone makes me cringe and think I might really need to know how to write code for real (which I don’t enjoy, that’s why I’m an ops guy in the first place).

So here’s my plan. I’m going to do a bunch of research, play with relevant tools (what fun is IT without tools?), and document everything I discover here in a series of blog posts. My goal is to educate myself and others so that we operations types can get more comfortable with DevOps. By breaking down this concept and figuring out what it really means, hopefully we can figure out how to transition pure Ops guys into this new IT management paradigm.

What is DevOps

Here we go, I’m probably about to open up Pandoras Box by trying to define what DevOps means but to me that is the foundation of everything else I will discuss in this series. I started my research by asking Google “what is devops”. Naturally, Wikipedia was the first result so that is where we will begin. The first sentence on Wikipedia defines DevOps as “a software development method that stresses communication, collaboration and integration between software developers and information technology (IT) professionals.” Hmmm… This is not a great start for us Ops folks who don’t really want anything to do with programming.

Reading further on down the page I see something more interesting to me … “The goal is to automate as much as possible different operational processes.” Now that is an idea I can stand behind. I have always been a fan of automating whatever repetitive processes that I can (usually by way of shell scripts).

Taking Comfort

My next stop on this DevOps train lead me to a very interesting blog post by the folks at the agile admin. In it they discuss the definition and history of DevOps. Here are some of the nuggets that were of particular interest to me:

  • “Effectively, you can define DevOps as system administrators participating in an agile development process alongside developers and using many of the same agile techniques for their systems work.”
  • “It’s a misconception that DevOps is coming from the development side of the house – DevOps, and its antecedents in agile operations, are largely being initiated out of operations teams.”
  • “The point is that all the participants in creating a product or system should collaborate from the beginning – business folks of various stripes, developers of various stripes, and operations folks of various stripes, and all this includes security, network, and whoever else.”

Wow, that’s a lot more comforting to my fragile psyche. The idea that DevOps is being largely initiated out of the operations side of the house makes me feel like I misunderstood the whole concept right from the start.

For even more perspective I read a great article on O’Reilly Radar from Mike Loukides. In it he explains the origins of dev and ops and shows how operations has been changing over the years to include much more automation of tasks and configurations. He also explains how there is no expectation of all knowing developer/operations super humans but instead that operations staff needs to work closely or even be in the same group as the development team.

When it comes right down to it there are developers and there are operations staff. The two groups have worked too far apart for far too long. The DevOps movement is an attempt to bring these worlds together so that they can achieve the effectiveness and efficiency that the business deserves. I really do feel a lot better about DevOps now that I have done more research into the basic meaning and I hope this helps some of you who were feeling intimidated like I was. In my next post I plan to break down common operations tasks and talk about the tools that are available to help automate those tasks and their associated processes.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

Click here for DevOps Scares Me Part 2.

Application Runbook Automation – A Detailed Walk Through

On Monday AppDynamics announced a new feature called Application Runbook Automation (RBA). The response to this announcement has been great and many people want to see the details on how we implement RBA within AppDynamics, so I’m going to walk you through it step by step in this blog post.

If you don’t already know WHY we built Application RBA please read “Don’t Be An “Also-Ran” – Application Runbook Automation for World Class IT”

Let’s jump right in…

Don’t be an “Also-Ran” – Application Runbook Automation for World Class IT

Your applications are the lifeblood of your business–the culmination of countless hours of development, testing, bug fixes, more testing, and finally deployment to production. No matter how good your coding and testing practices are, though, eventually your applications will slow down, start to throw errors, and crash. How quickly and effectively you remediate these business impacting problems is the difference between world class IT organizations and the “also rans.”

Application Runbook Automation from AppDynamics is a game changing technology that can remediate business impact in a matter of seconds. It can represent a savings of hundreds of thousands of dollars per incident. Read more to find out how…

rba revenue risk 2

This table shows the business impact in USD of reducing MTTR from hours to minutes to seconds by taking advantage of AppDynamics and application run book automation.

 

So What’s a Runbook Really?

In case you’ve been locked in an IT dungeon for your whole career, let’s take a minute to talk about runbooks. Runbooks are the cure for insomnia, the crappy part of application support where you document how to stop and restart your awesome application that will never, ever fail. But just in case, some of you are also forced by the evil management regime to document troubleshooting procedures along with remediation steps for the common problems that could cause your application to misbehave.

All of this documentation (Runbook) is handed off to your friendly operations center staff so that it can be stuffed away in a giant repository never to be heard from again. After all, the job of the NOC (network operations center) is just to receive and route alerts to the proper people–so why even write out those runbooks in the first place?

You know how to restart your application when you get paged at 3 AM, no documentation required. You can recover your application within 30-120 minutes of initial impact half asleep with one hand tied behind your back. Unfortunately, during that time your company just lost revenue, customers, and tarnished their reputation. Tough break, but that’s just the way it is today. But there’s a better way!

Hasn’t Runbook Automation (RBA) Been Around for a While?

Yep, RBA is that expensive software sitting on your corporate shelf right now. It can’t help you with the problem described above. It’s become niche software for infrastructure support personnel who know nothing about your applications (and are offended they consume so much of their resources, kidding of course). Traditional RBA is a colossal failure that knows nothing about your user, business transactions, and applications. It’s this lack of application awareness that has relegated RBA to being used by a handful of infrastructure support personnel in your company.

Application Runbook Automation is a Business Differentiator

The time has come to fulfill the promise of automated remediation of business impact. Consider this:

  • AppDynamics understands exactly what components are being used at any given time by your applications.
  • It knows exactly what your end users are doing all the time and tracks and measures that activity through your application components.
  • It knows what transactions are slow or failing and exactly what code, configuration, and nodes are causing the problems.

Using this granular level of problem detection and isolation, AppDynamics application RBA understands business impact and takes the appropriate action to remediate within seconds of detection. There are many problems that get detected automatically and many remediation’s already available out of the box to take advantage of so you don’t have to code your own. You can also re-use your existing RBA processes and scripts to take advantage of all that hard work that has been done in the past. You can even integrate your Puppet and Chef tools directly with AppDynamics. Why re-invent the wheel?!

Application RBA is powerful and flexible. You have the choice to take action automatically or to be prompted for authorization if you’re not ready to trust in the judgement of machines just yet. Your troubleshooting and remediation actions can be executed on only the impacted nodes, on all associated nodes in a tier, or at the overall application level. Application awareness and flexibility empower you to manage your applications the way you need to for your unique business.

Monitoring, Meet Management

Management has long been the missing dimension from the promise of APM. The lack of focus on this problem by the monitoring industry needs to change and AppDynamics is leading the way. When most applications were run on a handful of servers it was acceptable to overlook the management aspect and just focus on pointing out where the problems were. With modern applications scaling to hundreds and thousands of nodes it’s becoming impossible for humans to keep ahead of the management challenges without using machine automation.

If you need to remediate business impact caused by slow or crashed applications as fast as possible, Application Runbook Automation from AppDynamics is just what you need.

Try AppDynamics with Application Runbook Automation for free today.