Best Practice for ITSM Professionals Using Monitoring and Alerting

I’ve only been with AppDynamics a few months, and I wish I had known 10 years ago what I know now. Businesses have come on in leaps and bounds through technology, but the fundamental enterprise IT challenges remain the same, just with increased complexity.

Ten years ago, as the Director of IT Business and Service Management at a large European investment bank, I was responsible for the IT governance and control environment. My main goal was to ensure IT had full visibility of the service it provided to the business.

The goal of the program was to ensure IT managed every business-critical application issue:

  • Restoring service while informing the customer about the problem and its business impact
  • Notifying the customer of application issues and the expected mean-time-to-resolution (MTTR)

This seems so easy when condensed into two sentences, and it really should be in the modern world of IT in an investment bank. However, as anyone who has experienced it will know, it is anything but, and the tasks were more accurately:

  • Knowing all the technical intricacies of every business service
  • Knowing the underlying configuration items (CIs) that supported each technical service
  • Monitoring the performance and status of every CI
  • Knowing, every time the configuration of the IT estate changed, what impact the change would have on the business service

Historically, in an ideal world:

To help with our role, we deployed an application discovery and dependency-mapping tool, which continually monitored the topology of our estate. This tool populated our configuration management database (CMDB) with every change and reconciled it against the approved state of the estate, alerting us to any deviations.
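
As a rough illustration of that reconciliation step, here is a minimal sketch that diffs a discovered topology against the approved state and reports deviations. The CI names and attributes are hypothetical, and a real discovery tool would of course do this at far greater depth:

```python
# Minimal sketch of CMDB reconciliation: compare the discovered topology
# against the approved state and report deviations. CI names and attributes
# are hypothetical, for illustration only.

approved = {
    "payments-db":  {"version": "11.2", "host": "lon-db-01"},
    "payments-app": {"version": "4.7",  "host": "lon-app-03"},
}

discovered = {
    "payments-db":    {"version": "11.2", "host": "lon-db-01"},
    "payments-app":   {"version": "4.8",  "host": "lon-app-03"},  # unapproved upgrade
    "payments-cache": {"version": "6.0",  "host": "lon-app-04"},  # unknown CI
}

def reconcile(approved, discovered):
    """Yield human-readable deviations between the approved and discovered CIs."""
    for ci, attrs in discovered.items():
        if ci not in approved:
            yield f"New CI not in the approved state: {ci}"
        elif attrs != approved[ci]:
            yield f"Drift on {ci}: approved={approved[ci]} discovered={attrs}"
    for ci in approved:
        if ci not in discovered:
            yield f"Missing CI (approved but not discovered): {ci}"

for deviation in reconcile(approved, discovered):
    print(deviation)
```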

We implemented monitoring tools on all of the CIs to ensure they were performing properly, and configured them to alert us any time there was a performance issue with any CI, in theory keeping the technical and business services updated. The IT service owner would then confirm the service was restored and create a problem record.

Once the problem record was created, the IT team would analyze the issue, look for its root cause and create (or log) an error. This ideal procedure would keep IT at the bank in a healthy, balanced state. However, the situation was anything but ideal.

In principle, this all seems relatively simple, but maintaining and manually controlling the environment proved unachievable:

  • The CMDB was not updated accurately
  • The alerting system was not continuously integrated
  • The technical services were not updated with changes
  • Often, the root cause analysis was not confirmed
  • It was unlikely the errors were logged

Why was this the case, considering we had (in essence) deployed an out-of-the-box ITSM environment based on ITIL best practice? Simply put, here’s why:

  • Alerting was based on static thresholds (see the sketch after this list)
  • The estate changed so rapidly that we couldn’t model the CMDB quickly enough
  • The lack of dynamic baselines produced inaccurate alert storms and made root cause analysis all but impossible
  • Without knowing the root cause, we couldn’t correctly log the errors
  • No changes were made without authenticating the errors
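
To make the static-threshold problem concrete, here is a minimal sketch contrasting a fixed threshold with a simple rolling baseline. The response times, window size and sigma multiplier are made up for illustration; real APM tools compute baselines far more robustly:

```python
# Minimal sketch: static threshold vs. a simple dynamic baseline.
# The response times, window size and sigma multiplier are illustrative only.
from statistics import mean, stdev

STATIC_THRESHOLD_MS = 500          # fixed limit: fires constantly once load shifts
response_times_ms = [120, 130, 125, 640, 650, 655, 648, 652, 2100]

def static_alerts(samples, threshold):
    """Alert on every sample above the fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]

def baseline_alerts(samples, window=5, sigmas=3):
    """Alert only when a sample deviates far from the recent rolling baseline."""
    alerts = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        limit = mean(recent) + sigmas * stdev(recent)
        if samples[i] > limit:
            alerts.append(i)
    return alerts

print("Static threshold alerts at indexes:", static_alerts(response_times_ms, STATIC_THRESHOLD_MS))
print("Dynamic baseline alerts at indexes:", baseline_alerts(response_times_ms))
```

Against this sample, the static threshold fires on every request once normal response times shift above 500 ms, while the rolling baseline only flags the genuine outlier at the end.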

How AppDynamics helps IT

Don’t get me wrong, we weren’t completely incompetent; we still had manual governance and control over the critical business processes and services. However, even with a state-of-the-art ITSM solution adhering to ITIL best practice, we went about our day jobs in pretty much the same way as we had before. It was like having a Ferrari sitting in your garage collecting dust.

So this brings me back to where I am today, working at AppDynamics and a little smarter than I was 10 years ago. With AppDynamics, you can:

  • Monitor business transactions at a code level
  • Provide a continuously updated topology of the business service
  • Receive alerts based on dynamic baselines
  • Using the AppDynamics flow map, update the business and technical services to improve the overall quality
  • Easily see the root cause within the environment
  • Update the problem records in a service management toolset

If we had had AppDynamics at the bank, our lives would have been much easier and the bank would have been performing optimally, instead of living with the bottlenecks and broken flows we had mapped out.

    This is the benefit of next generation application intelligence tools. They make the important measurable, not the measurable important. Please check out my white paper on dynamic ticketing with our integration with ServiceNow here.
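
To give a feel for what that dynamic ticketing can look like, here is a minimal sketch that raises an incident through ServiceNow’s Table API when an application alert fires. The instance URL, credentials and field values are placeholders, and a production setup would use the supported integration described in the white paper rather than hand-rolled calls:

```python
# Minimal sketch: open a ServiceNow incident when an application alert fires.
# Instance URL, credentials and field values are placeholders for illustration.
import requests

SNOW_INSTANCE = "https://example.service-now.com"   # hypothetical instance
SNOW_USER = "integration_user"
SNOW_PASSWORD = "change-me"

def open_incident(short_description: str, description: str, urgency: str = "2") -> str:
    """Create an incident via the ServiceNow Table API and return its number."""
    response = requests.post(
        f"{SNOW_INSTANCE}/api/now/table/incident",
        auth=(SNOW_USER, SNOW_PASSWORD),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        json={
            "short_description": short_description,
            "description": description,
            "urgency": urgency,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["result"]["number"]

if __name__ == "__main__":
    number = open_incident(
        "Checkout response time breached its baseline",
        "Business transaction 'Checkout' deviated from its dynamic baseline at 14:05 UTC.",
    )
    print("Opened incident", number)
```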

    Why BSM Fails to Provide Timely Business Insight

Business Service Management (BSM) projects have always had a reputation for over-promising and under-delivering. Most people know BSM as the alleged “manager of managers” or “single source of truth.” According to the latest ITIL definition, BSM is described as “the management of business services delivered to business customers.” Like much of ITIL, this description is rather ambiguous.

Wikipedia, however, currently describes BSM’s purpose as facilitating a “cultural change from one which is very technology-focused to a position which understands and focuses on business objectives and benefits.” Nearly every organization I talk to highlights being technology-focused as one of its biggest challenges, along with a desire for greater alignment to business goals. BSM should therefore be the answer everyone is looking for… it’s just a shame BSM has always been such a challenge to deliver.

Some years ago I worked as a consultant for a small startup which provided BSM software and services. I got to work with many large organizations that all had one common goal: “to make sense of how well IT was supporting their business.” It was a tremendous learning experience for me, and I frequently witnessed just how little most organizations really understood the impact major IT events had on their business. For example, I remember working with a major European telco which held an exec review meeting on the 15th calendar day of each month to review the previous month’s IT performance. The meeting was held on that date because it was taking 4 people 2 weeks to collate all the information and crunch it into a “mega-spreadsheet.” That’s 40 man-days of effort to report on the previous 30-day period!

    As organizations continue to collect an increasing amount of data from a growing list of data sources, more and more organizations I talk to are looking for solutions to manage this type of “information-fogginess,” but are skeptical about undertaking large scale BSM projects due to the implementation timescale and overall cost.

    Implementing BSM:

    I’m sure the person who first coined the term “scope creep” must have been involved in implementing BSM, as most BSM projects have a nasty habit of growing arms and legs during the implementation phase. I dread to think how many BSM projects have actually provided a return on their substantial investments.

BSM has always been a heavily services-led undertaking, as it attempts to uniquely model and report on an organization. No two organizations are structured in quite the same way; each has its own IT architecture, operating model, tools, challenges and business goals. This is why BSM projects almost always begin with a team of consultants conducting lots of interviews.

Let’s look at the cost of implementation for a typical deployment such as the European telco example I described earlier. A project of this type could easily require 100 – 200 man-days of professional services to deliver. Factoring in software license, training, and support and maintenance costs, the project needs to deliver a pretty substantial return to justify the spend:

Cost of BSM implementation:

  Professional services (100 – 200 days @ $1,800 per day)    $180,000 – $360,000
  Software license                                            $200,000 – $500,000
  Annual support and maintenance                              $40,000 – $100,000
  Training                                                    $25,000
  TOTAL                                                       $445,000 – $985,000

    Now if we compare to the pre-existing cost of manually producing the monthly analysis:

Existing manual process costs:

  Days per month creating reports      10
  Number of people                     4
  Total man-days of effort per year    480
  Average annual salary                $45,000
  Total working days per year          225
  Annual cost to generate reports      $96,000

Even with the most conservative estimates, it would take almost 5 years before this organization saw a return on its investment, by which time things would probably have changed enough to require a bunch of additional service days to update the BSM implementation. This high cost of implementation is one reason there is such reluctance to take the leap of faith needed to adopt these technologies.
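
The payback arithmetic behind that estimate is easy to check; here is a small sketch using only the figures from the two tables above:

```python
# Payback arithmetic using the figures from the two tables above.
bsm_cost_low, bsm_cost_high = 445_000, 985_000   # one-off BSM implementation cost
manual_days_per_year = 10 * 4 * 12                # 10 days/month x 4 people x 12 months
day_rate = 45_000 / 225                           # average salary / working days per year
manual_cost_per_year = manual_days_per_year * day_rate

print(f"Annual cost of manual reporting: ${manual_cost_per_year:,.0f}")
print(f"Payback period: {bsm_cost_low / manual_cost_per_year:.1f} "
      f"to {bsm_cost_high / manual_cost_per_year:.1f} years")
```

Even the low-end estimate gives a payback period of roughly 4.6 years, which is where the “almost 5 years” above comes from.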

The most successful BSM implementations I am aware of have typically been the smaller projects, primarily focused on data visualization; but with powerful open-source reporting tools such as Graphite, Graphiti or Plotly available for free, I wonder whether BSM still has a place even in these small projects today.

    What does success look like?

Fundamentally, BSM is about mapping business services to their supporting IT components. However, modern IT environments have become highly distributed, with service-oriented architectures and data dispersed across numerous cloud environments, and it is simply no longer feasible to map basic 1:1 relationships between business and IT functions. This growing complexity only increases the time and money it takes to complete a traditional BSM implementation. A simplified, more achievable approach is needed to provide meaningful business insight from today’s complex IT environments.

In 2011, Netscape founder Marc Andreessen famously described how heavily today’s businesses depend on applications when he wrote that “software is eating the world.” These applications are built to support whatever the individual business goals are. It seems logical, then, that organizations should look into the heart of these applications to get a true understanding of how well the business is functioning.

In a previous post I described how this can be achieved using AppDynamics Real-time Business Metrics (RtBM) to give multiple parts of an IT organization access to business metrics from within these applications. By instrumenting the key information points within your application code and gathering business metrics in real time, such as the number of orders being placed or the amount of revenue per transaction, AppDynamics can enable everyone in your organization to focus on the success or failure of the most important business metrics.
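
To make “instrumenting key information points” concrete, here is a minimal, tool-agnostic sketch of emitting business metrics from application code. The metric names and the order-placement function are hypothetical, and an APM agent such as AppDynamics would normally collect equivalents of these through its own instrumentation rather than a hand-written helper:

```python
# Tool-agnostic sketch of emitting business metrics from application code.
# Metric names and the order function are hypothetical; an APM agent would
# normally capture these through its own instrumentation rather than print().
import time

def record_metric(name: str, value: float) -> None:
    """Stand-in for a metrics/APM client call (e.g. a counter or gauge)."""
    print(f"{int(time.time())} {name}={value}")

def place_order(items: list) -> None:
    order_value = sum(item["price"] * item["qty"] for item in items)
    # ... persist the order, charge the customer, etc. ...
    record_metric("orders.placed", 1)              # how many orders are going through
    record_metric("orders.revenue", order_value)   # revenue per transaction

place_order([{"price": 19.99, "qty": 2}, {"price": 5.00, "qty": 1}])
```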

These goals are very similar to those of a traditional BSM project. However, in stark contrast to every BSM project I have ever heard of, AppDynamics can be deployed in under an hour, without the need for any costly services, as detailed in a previous blog post introducing Real-time Business Metrics.

Instead of interviewing dozens of people, architecting and building complex dependency models, then gathering data and analyzing it all to make sense of what is happening, AppDynamics Real-time Business Metrics focuses on the key metrics that matter to your business, providing focus and a common measure of success across IT and the business.

    So before you embark on a long and costly BSM project to understand what is happening in your business, why not download a free trial of AppDynamics and see for yourself; there is an easier way!

    DevOps Is No Replacement for Ops

DevOps is gaining traction, according to a study by Puppet Labs at the end of 2012, which concluded that DevOps adoption within companies grew 26% year over year. DevOps is still misunderstood and still has tremendous room for greater adoption, but let’s be clear about one very important thing: DevOps is not a replacement for operations!

    If you’re not as well versed in the term DevOps as you want to be I suggest you read my “DevOps Scares Me” series.

    Worst Developer on the Planet

The real question in my mind is where you draw the line between dev and ops. Before you get upset and start yelling that dev and ops should collaborate and not have lines between them, let me explain. If developers are to take on more aspects of operations, and operations personnel are to work more closely with developers, which duties can be shared and which must be held separate?

To illustrate this point I’ll use myself as an example. I am a battle-hardened operations guy. I know how to get servers racked, stacked, cabled, loaded, partitioned, secured, tuned, monitored, and so on. What I don’t know how to do is write code. Simply put, I’m probably the worst developer on the planet. I would never trust myself to write code in any language to run a business. In this particular case it comes down to the fact that I simply don’t currently possess the proper skills to be a developer, but that’s exactly my point: for DevOps to succeed there needs to be a clear delineation of duties where expertise can be developed and applied.

    Experts Needed

Operations is a mix of implementing standard processes, planning for the future, and sometimes working feverishly to solve problems that are impacting the business. Just like writing good code, providing good operational support takes experience. There are many areas where you need to develop expertise, and you shouldn’t just start doing operations work unless you have some experience and a peer review of your work plan.

You do have a work plan, don’t you? A work plan could be something as formal as a change control (the ITIL literature is worth reading here) or as informal as a set of steps written on a bar napkin (hopefully not, but it’s possible). The point is that you need a plan, and you need someone to review your plan to see if it makes sense. Here are some of the things you need to consider when creating your plan (a small sketch of such a plan follows the list):

    • Implementation process – What are you going to actually do?
    • Test plan – How will you be able to tell if everything is working properly after you implement your change?
    • Backout process – How do you revert back if things don’t go right?
    • Impacted application – What application are you actually working on?
    • Dependent application(s) – What applications are dependent upon the application you are changing?
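
As a rough illustration, the same checklist can be captured as a structured record so a reviewer can spot a missing backout process at a glance. The fields and values below are hypothetical:

```python
# Sketch of a change plan captured as a structured record, mirroring the
# checklist above. Field values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ChangePlan:
    implementation_process: str
    test_plan: str
    backout_process: str
    impacted_application: str
    dependent_applications: list = field(default_factory=list)

    def missing_items(self) -> list:
        """Return the checklist items a reviewer should flag as incomplete."""
        return [name for name, value in vars(self).items()
                if name != "dependent_applications" and not value]

plan = ChangePlan(
    implementation_process="Upgrade the payments service to v4.8 on lon-app-03",
    test_plan="Run smoke tests against /health and place a test order",
    backout_process="",                      # oops: no way back yet
    impacted_application="payments",
    dependent_applications=["checkout", "reporting"],
)
print("Reviewer should flag:", plan.missing_items())
```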

The point I’m trying to make here is that DevOps does not change the fact that well-defined operations processes and practices are what keep companies’ IT departments working smoothly. Adopting DevOps doesn’t mean you can revert to the days of the wild west and run around making changes however you see fit. If you are going to adopt DevOps as a philosophy within your organization, you need to blend the best of what each practice has to offer and figure out how to combine them in a way that best meets your organization’s requirements and goals.

    Bad operations decisions can cost companies a lot of money and significant reputation damage in a short period of time. Bad code written by inexperienced developers can have the same effect but can be detected before causing impact. Did someone just release new code for your application to production without testing it properly? Find the application problems and bad code before they find you by trying AppDynamics for free today.