6 Strategic Steps to Rock Solid IT Service Assurance

In most organizations, managing the service level of critical applications is still a challenge. For some the issue is a lack of strategic planning; for others, it’s simply not applying the proper tools and methodology to their everyday work. Regardless of the reason, there are steps every organization can take to avoid costly and damaging service disruptions.

We’ve Stopped Making Money

One day, while working at an investment bank, I got a phone call requesting my help (it was really more like a plea and an order at the same time) with troubleshooting a business-critical application. I had even heard chatter in the office about how this application had been unusable for days before I was asked to participate. My role at the bank was as a monitoring architect who, among other responsibilities, tested, reviewed, purchased, and onboarded new tools. As a result, I was one of the people who would get a phone call when difficult problems went unsolved for too long, so I could apply my tools and expertise.

This was a time of great instability in the stock market and our traders were very active. It was also the time when the traders needed this particular application the most, and when the bank should have been earning a small fee on every transaction completed through it. Simply put, the bank was losing millions of dollars while this application performed so poorly.

I started my work with the development team by getting a breakdown of the problem, the conditions leading up to the problem, an overview of the technology, and a demonstration of them recreating the problem in a test environment. Next, I deployed some application monitoring tools into their test environment — since they only had basic OS monitoring and the data that was coming from their load test tool — and watched as they ran more load tests. I could see certain parts of their code degrading as load ramped up and this led me to ask a lot of questions about the logic associated with these parts of the code.

I worked together with the development team for 2 days: asking questions, seeing the mental light bulbs go off through the look in their eyes, and testing the new code they feverishly created after each bottleneck was removed and a new one discovered. After all was said and done, the application was upgraded in production at the end of my 2 days of involvement. Now capable of handling 5 times the throughput, it helped the traders do their jobs, and most importantly, the bank was ringing the cash register again for each transaction.

Strategic Planning

The worst part is the situation could have been completely avoided. By following a few key rules, the application team could have detected this problem in its infancy and minimized or avoided the lengthy and embarrassing production impact.

  1. Where the rubber meets the road – Application performance monitoring IN PRODUCTION is a requirement for any business-, mission-, or revenue-critical application.
  2. Dev and QA monitoring – Using application monitoring tools in PRE-PRODUCTION will dramatically improve quality of production application releases.
  3. Feedback loop – Constantly apply the information gained in production to your pre-production environment. Use production load and performance patterns shown by your monitoring tools to prioritize development work and to create more realistic load tests.
  4. Collaboration is king – Development AND Operations personnel should have access to and use the same monitoring tools during load tests to gain the most benefit.
  5. Think strategic instead of tactical – Implement a well thought out monitoring and management strategy starting with your most critical revenue generating applications and working down from there (after rigorous testing of course).
  6. Identify and fix small problems before they turn into big problems – Alerting should be based on deviation from normal (baseline) behavior in most situations. Minimize the number of static thresholds you use to trigger alerts and make an investment in analytics-driven monitoring platforms. Static thresholds should mostly be reserved for identifying service level breaches (see the sketch after this list).
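
To make point 6 concrete, here is a minimal sketch of baseline-deviation alerting in Python. It learns a rolling baseline from recent samples and alerts only when a new sample strays several standard deviations from it; the window size and sigma threshold are illustrative assumptions, and commercial analytics platforms also model seasonality and trends.

```python
# A minimal sketch of baseline-deviation alerting, assuming a stream of
# metric samples (e.g. response times). Window size and sigma threshold
# are illustrative; real analytics platforms also model seasonality.
from collections import deque
from statistics import mean, stdev

class BaselineAlerter:
    def __init__(self, window=100, sigmas=3.0, min_history=30):
        self.samples = deque(maxlen=window)  # rolling baseline window
        self.sigmas = sigmas                 # allowed deviation from normal
        self.min_history = min_history       # samples needed before alerting

    def observe(self, value):
        """Record a sample; return True if it deviates from the baseline."""
        alert = False
        if len(self.samples) >= self.min_history:
            mu, sigma = mean(self.samples), stdev(self.samples)
            alert = abs(value - mu) > self.sigmas * max(sigma, 1e-9)
        self.samples.append(value)
        return alert

# Usage: alerter = BaselineAlerter()
#        if alerter.observe(response_time_ms): page_the_on_call()
```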

The reality of the 6 points outlined above is that it takes some initial effort to make the required organizational and process changes, as well as to get the right tools in place. However, the investment is well worth it for business-critical applications. I’ve seen so many groups think they don’t have enough time to invest in strategic initiatives, then constantly run around firefighting battles that should have been avoided in the first place. It’s a vicious cycle that needs to end. Consider the tips listed above and break the cycle starting right now.

Beyond DevOps … APM as a Collaboration Engine

In the beginning there was a simple acronym: MTTI (mean time to innocence). Weary after years of costly and time-consuming war-room battles, IT organisations turned to AppDynamics for an objective application-level view of production incidents. As a result, application issues are swiftly pinpointed and fixed, accelerating time to repair by up to 90%.

In fact, gravitation towards fact-based constructive issue management spawned a whole new movement – DevOps – with the goal of ingraining this maturity and cooperative spirit into IT organisations from the ground up. The movement was discussed by Jim in a previous blog post. Of course, AppDynamics (or at least, easily accessible fact-based information about application behaviour in production) is a necessary prerequisite to this.

Looking back, before DevOps or even MTTI were topical buzzwords, this basic ability to foster communication between teams proved to be an invaluable benefit to the more drab and well-worn business realities of offshoring and outsourcing.

This blog reviews 3 real-life examples of this:

  • Managing external offshore development organisations
  • Facilitating near shore development teams
  • Bringing external developments in-house

Managing external offshore development organisations

Some months ago, I did some work with a then prospect who had started a self-service trial of AppDynamics.

When I spoke to them, they were delighted with the visibility that AppDynamics provided out of the box for their .NET application, a SaaS Learning Management System.

Digging into what had sparked their interest in AppDynamics, they told me they had commissioned an outside development firm to rewrite their flagship application, which was somewhat dated and not architecturally fit to support some newer services the business wanted to offer to customers.

The good news was the new version of the app was live and supporting around 10% of their existing customer base. The bad news? That 10% used the same hardware footprint as the remaining 90% on the old system. Extrapolating this hardware requirement to the entire user base would not only require a new datacentre, but also entirely break the business model for the application from an operational cost perspective (not the first time that hardware savings alone could pay for AppDynamics!).

For months prior to trying AppDynamics, the external developers had been under huge pressure to optimize the application footprint (and some pretty lacklustre performance too). Armed with only Windows performance counters and intuition, they had spent weeks optimizing slow database queries, which turned out to account for only 5% of the errant response times at a transaction level.

Having put AppDynamics in place, the prospect:

  • Easily found specific application bottlenecks, allowing them to focus developers on high-impact remediation
  • Could verify the developers had made the required improvements with each new release

Clearly, huge benefits at a technical level.

At a higher level, this helped lead to a more constructive relationship between the development shop and their customer – moving things away from the edge of litigation, constant finger-pointing, and blame shifting.

Facilitating near shore development teams

Another group I have worked with recently is responsible for a settlement system within a large global investment bank based in London. The system is developed in-house, and, as is typical of most financial services institutions, the development team itself is located ‘near-shore’ in Eastern Europe to cut costs. The development processes are Agile, with new releases every few weeks.

Inevitably, with new releases can come new production issues and – of course – the best people to deal with these during the “bedding in” period are the developers themselves.

Another thing that is very common in the financial services industry is regulation, and this poses a problem in this scenario. Nobody is permitted to directly access the production systems from outside the UK due to data privacy regulations.

This means hands-on troubleshooting must be left to the on-shore architect staff, who are not only expensive but also not as well-equipped as the developers themselves to dig into issues in new code.

Enter AppDynamics. Our agents deployed in production make all the performance data readily available to anyone with the appropriate credentials, but – critically – having access to this does not expose ANY business data from the production system. Now the near-shore development team can look directly at the non-functional behaviour of their code in production, eliminating the time spent gathering enough log data to reproduce issues in test environments. Bingo, the business case for the AppDynamics purchase is made!
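
As a toy illustration of that separation (not how the AppDynamics agent actually works), the sketch below records only transaction names, timings, and status – never arguments or payloads, which may contain regulated business data:

```python
# Hypothetical sketch: capture non-functional metadata only. The function
# name, status, and timing are logged; arguments and return values, which
# may hold customer data, never are.
import time
from functools import wraps

def timed_transaction(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "error"
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Payload-free telemetry: safe to share with off-shore teams.
            print(f"txn={fn.__name__} status={status} ms={elapsed_ms:.1f}")
    return wrapper

@timed_transaction
def settle_trade(trade_id):
    ...  # business logic; its data stays out of the telemetry
```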

There is an interesting side note to this, which applies much more widely too. Many customers have observed an “organic” improvement in service levels once AppDynamics is installed in production. For the first time, developers can see how their code is actually working in the wild. Developer pride kicks in, and suddenly non-functional stories are added to development backlogs to fix latent issues that get observed, which would previously have gone unnoticed.

Bringing external developments in-house

Of course, as we all know, the only constant in life is change, so no outsourcing is a one-way journey. As a result, I have come across several organisations that are now working in-house on projects which were previously outsourced. Once these customers have completed the initial challenge of recruiting a new development team, they then need to get their arms around the existing codebase. Handover workshops can help with this, but in many cases these systems have been out- and in-sourced several times, with many changes of personnel along the way. There is only so much you can distill onto a whiteboard in a brain-dump session, however long and well-intentioned.

It is here that the high-level visibility AppDynamics provides can be invaluable. Out of the box, AppDynamics instruments previously unseen systems, automatically detecting and following transactions and drawing up flow maps. The end-to-end visibility of the entire system greatly eases the handover process. In fact, this system overview (and the ability to view how it changes over time) has proved invaluable to many customers for reasons beyond wholesale in- (or out-) sourcing, such as onboarding new team members, verifying that externally developed code changes comply with architectural governance, and so forth.

Conclusion

In summary, AppDynamics does not have to be all about troubleshooting and MTTI. Nor even necessarily about DevOps and brave new worlds. The easily configured deep insight that we provide into the dynamic behaviour of your applications has many uses – and business cases – beyond the traditional MTTI/MTTR domain. APM is, after all, just one use case (albeit an important one) for our Application Intelligence Platform.

Take five minutes to get complete visibility into and control over the performance of your production applications with AppDynamics Pro today.

Quantifying the value of DevOps

In my experience, when you work in IT the executive team rarely focuses on your team until you experience a catastrophic failure – and once you do, you are the center of attention until services are back to normal. It is easy to ignore the background work IT teams spend most of their days on just to keep everything running smoothly. In this post I will discuss how to quantify the value of DevOps to organizations. The notion of DevOps is simple: developers working together with operations to get things done faster in an automated and repeatable way. If the process is working, the cycle looks like this:

[Figure: the DevOps cycle]

DevOps consists of tools, processes, and the cultural change needed to apply both across an organization. In my experience, in large companies this is usually driven from the top down, while in smaller companies it comes organically from the bottom up.

When I started in IT, I worked as a NOC engineer for a datacenter. Most of my days were spent helping colocation customers install or upgrade their servers. If one of our managed servers failed, it was my responsibility to fix it as fast as possible. Other days were spent as a consultant helping companies manage their applications. This was when most web applications were simple, with only two servers – a database and an app server:

[Figure: a monolithic two-server application]

As I grew in my career, I moved to the engineering side and worked on developing very large web applications. The applications I worked on were much more complex than what I was used to in my datacenter days. It is not just the architecture and code that are more complex; the operational overhead of managing such a large infrastructure requires an evolved attitude and better tools.

[Figure: a distributed application architecture]

When I built and deployed applications, we had to build our servers from the ground up. In the age of the cloud, you get to choose which problems you want to spend time solving. If you choose an Infrastructure as a Service provider, you own not only your application and data, but the middleware and operating system as well. If you pick a Platform as a Service, you just have to support your application and data. The traditional on-premises option, while giving you the most freedom, also carries the responsibility for managing the hardware, network, and power. Pick your battles wisely:

[Figure: the responsibility split across on-premises, IaaS, and PaaS]

As an application owner on a large team, you find out quickly how well a team works together. In the pre-DevOps days, the typical process to resolve an operational issue looked like this:

[Figure: the pre-DevOps issue resolution workflow]

1) Support creates a ticket and assigns a relative priority
2) Operations begins to investigate and blames the developers
3) Developers say it isn’t possible since it works in development, and bounce the ticket back to operations
4) Operations escalates the issue to management until operations and developers are working side by side to find the root cause
5) Both argue that the issue isn’t as severe as stated, so they reprioritize it
6) Management hears about the ticket and assigns it Severity/Priority 1
7) Operations and developers find the root cause together and fix the issue
8) Support closes the ticket

Many times we wasted a lot of time investigating support tickets that weren’t actual issues. We investigated them because we couldn’t rely on the health checks and monitoring tools to determine whether an issue was valid. Either the ticket couldn’t be reproduced or the issue was with a third party. Either way, we had to invest the time required to figure it out. Never once did we calculate how much money the false positives cost the company in man-hours.


With better application monitoring tools, we were able to reduce the number of false positives and the money the company wasted on them.

How much revenue did the business lose?


I was never once able to articulate how much money our team saved the company by adding tools and improving processes. In the age of DevOps, the toolchain offers many places to look for that value.

By adopting infrastructure automation with tools like Chef, Puppet, and Ansible you can treat your infrastructure as code so that it is automated, versioned, testable, and most importantly repeatable. The next time a server goes down it takes seconds to spin up an identical instance. How much time have you saved the company by having a consistent way to manage configuration changes?
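
As a toy illustration of the idea (a sketch in plain Python rather than Chef, Puppet, or Ansible themselves), the loop below declares a desired state and converges towards it idempotently, so it is safe to run repeatedly; the package names and Debian-specific commands are assumptions:

```python
# Hypothetical convergence sketch: declare desired state, then only change
# what has drifted, so reruns are harmless (the core of "infrastructure
# as code"). Debian/Ubuntu commands and package names are illustrative.
import subprocess

DESIRED_PACKAGES = ["nginx", "postgresql"]  # hypothetical desired state

def is_installed(pkg):
    # dpkg-query exits non-zero when the package is unknown to dpkg
    result = subprocess.run(["dpkg-query", "-W", pkg], capture_output=True)
    return result.returncode == 0

def converge():
    for pkg in DESIRED_PACKAGES:
        if not is_installed(pkg):
            subprocess.run(["apt-get", "install", "-y", pkg], check=True)

if __name__ == "__main__":
    converge()  # idempotent: reruns only install what is missing
```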

By adopting deployment automation with tools like Jenkins, Fabric, and Capistrano you can confidently and consistently deploy applications across your environments. How much time have you saved the company by reducing build and deployment issues?
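
Fabric is itself a Python library, so a deployment sketch can stay in Python; the host, repository path, release tag, and service name below are hypothetical:

```python
# A minimal deployment sketch using Fabric's Connection API. The host,
# paths, tag, and service name are assumptions for illustration.
from fabric import Connection

def deploy(host, tag):
    c = Connection(host)
    c.run("git -C /srv/app fetch --tags")     # pull the new release
    c.run(f"git -C /srv/app checkout {tag}")  # pin to an exact version
    c.run("sudo systemctl restart app")       # assumes passwordless sudo

# The same function runs consistently against every environment:
# deploy("staging-1", "v1.4.2"), then deploy("prod-1", "v1.4.2")
```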

By adopting log automation using tools such as Logstash, Splunk, SumoLogic, and Loggly, you can aggregate and index all of your logs across every service. How much time have you saved the company by being able to find the machine causing a problem and retrieve the associated logs in a single click, rather than hunting for them manually?

By adopting application performance management tools like AppDynamics you can easily get code level visibility into production problems and understand exactly what nodes are causing problems. How much time have you saved the company by adopting APM to decrease the mean time to resolution?

By adopting run book automation through tools like AppDynamics, you can automate responses to common application problems and auto-scale up and down in the cloud. How much time have you saved the company by automatically fixing common application failures without even clicking a button?
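
AppDynamics exposes this idea through policies and actions; as a generic, hypothetical sketch of the same pattern, the loop below polls a health endpoint and runs a scripted remediation after repeated failures (the endpoint, threshold, and restart command are all assumptions):

```python
# Hypothetical run book automation sketch: the scripted response an
# operator would otherwise perform by hand, triggered on repeated failure.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical endpoint

def healthy():
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def remediate():
    subprocess.run(["systemctl", "restart", "app"], check=True)

failures = 0
while True:
    failures = 0 if healthy() else failures + 1
    if failures >= 3:  # don't react to a single blip
        remediate()
        failures = 0
    time.sleep(30)
```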

Understanding the value these tools and processes have on your organization is straightforward:

[Figure: DevOps automation tasks and the time they save]

DevOps = Automation & Collaboration = Time = Money 

When applying DevOps across your organization, the most valuable advice I can give is to automate everything and always plan for failure. A survey from RebelLabs/ZeroTurnaround shows that:

1) DevOps teams spend more time improving things and less time fixing things
2) DevOps teams recover from failures faster
3) DevOps teams release apps more than twice as fast

How much does an outage cost in your company?
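
A back-of-the-envelope calculation makes the question concrete; every number below is made up, so substitute your own revenue and staffing figures:

```python
# Hypothetical outage cost estimate -- all inputs are illustrative.
revenue_per_hour = 50_000   # hourly revenue through the application
outage_hours = 2.5          # duration of the incident
engineers = 6               # people pulled into the war room
loaded_rate = 120           # fully loaded hourly cost per engineer

lost_revenue = revenue_per_hour * outage_hours          # 125,000
labour_cost = engineers * loaded_rate * outage_hours    # 1,800
print(f"Estimated outage cost: ${lost_revenue + labour_cost:,.0f}")
```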

This post was inspired by a tech talk I have given in the past: https://speakerdeck.com/dustinwhittle/devops-pay-raise-devnexus

Take five minutes to get complete visibility into the performance of your production applications with AppDynamics today.

A Real Example of the Database to Storage Performance Relationship

Most enterprise databases today run on shared storage volumes (SAN, NAS, etc.) that are accessed over the network or via a Fibre Channel connection. The shared storage concept is great for keeping storage infrastructure and management costs relatively low, but it creates cross-silo finger-pointing when there are performance issues. In this blog post we will explore a real-world example of how to avoid the finger-pointing and get right down to figuring out how to fix the problem.

One Rotten Apple Can Ruin The Whole Bunch

This story dates back to June of 2012, but I just came across it, so it is new to me. One of our customers had an event which impacted the performance of multiple databases, all of which were connected to the same NetApp storage array. Often when there is an issue with database performance, the DBAs will point the finger at the storage team, and the storage team will tell the DBA team that everything looks good on their side. This finger-pointing between silos is a common occurrence among the various groups (network, storage, database, application support, etc.) within enterprise organizations.

In the chart below (screen grab taken from AppDynamics for Databases) you can see that there was a significant increase in I/O activity on dw_logvol. This issue impacted the performance of the entire NetApp storage array.

[Chart: spike in I/O activity on dw_logvol, from AppDynamics for Databases]

As it turns out, dw_logvol was used as a temporary storage location for web logs. A process would copy log files to this location, decompress them, and insert them into an Oracle data warehouse for long-term storage. This process normally would not impact the performance of anything else connected to the same storage array, but in this case there happened to be corrupted log files that could not be properly decompressed. The result was repeated attempts to retransmit and decompress the same files.
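
One defensive fix on the ingest side (a hypothetical sketch, separate from the monitoring story) is to verify each compressed file up front and quarantine corrupt ones instead of retrying them, since the endless retry loop is what drove the I/O spike; the paths below are assumptions:

```python
# Hypothetical sketch: validate a gzip'd log file before loading it, and
# quarantine corrupt files rather than retrying them forever.
import gzip
import shutil
import zlib
from pathlib import Path

QUARANTINE = Path("/data/logs/quarantine")  # hypothetical path

def is_valid_gzip(path):
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):  # stream to the end to force a full check
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False

def ingest(path):
    path = Path(path)
    if not is_valid_gzip(path):
        QUARANTINE.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), QUARANTINE / path.name)
        return
    # ...decompress and bulk-insert into the data warehouse...
```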

Context and Collaboration to the Rescue

Storage teams normally don’t have access to application context, and application teams normally don’t have access to storage metrics. In this case, though, both teams were able to collaborate and quickly realize what the problem was, because they had a monitoring solution available to everyone. The fix was easy: remove the corrupted files and replace them with uncorrupted versions. You can see activity return to normal in the chart below.

[Chart: I/O activity on dw_logvol returning to normal after the fix]

Modern application architectures require collaboration across all silos in order to identify and fix issues in a timely manner. One of the key enablers of cross-silo collaboration is intelligent monitoring at each layer of the application and of the infrastructure components that provide the underlying resources. AppDynamics provides end-to-end visibility in an analytics-based solution that helps you identify, isolate, and remediate issues. Try AppDynamics for Databases and Storage for free today and bring a new level of collaboration to your organization.