Big Data Monitoring

The term “Big Data” is quite possibly one of the most difficult IT-related terms to pin down. There are so many potential types of, and applications for, Big Data that it can be a bit daunting to consider all of the possibilities. Thankfully, for IT operations staff, Big Data is mostly a bunch of new technologies that are being used together to solve some sort of business problem. In this blog post I’m going to focus on what IT Operations teams need to know about big data technology and support.

Big Data Repositories

At the heart of any big data architecture is going to be some sort of NoSQL data repository. If you’re not very familiar with the various types of NoSQL databases that are out there today I recommend reading this article on the MongoDB website. These repositories are designed to run in a distributed/clustered manner so that they can process incoming queries as fast as possible on extremely large data sets.

MongoDB Request Diagram

Source: MongoDB

An important concept to understand when discussing big data repositories is sharding. Sharding is when you take a large database and break it down into smaller sets of data which are distributed across server instances. This is done to improve performance: the work is spread across many servers, and each query has less data to scan than it would in the same database without sharding. It also allows you to keep scaling horizontally, which is usually much easier than having to scale vertically. If you want more details on sharding you can reference this Wikipedia page.
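
To make the idea concrete, here is a minimal sketch of hash-based sharding in Python. It is only an illustration: the function names (shard_for, distribute) and the key format are made up for this example, and real repositories such as MongoDB use their own, more sophisticated schemes for placing and rebalancing data.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash.

    Hypothetical helper for illustration; not any product's API.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap
    # it into the range [0, num_shards).
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard that would own them."""
    shards = {index: [] for index in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards
```

A query for a single key only needs to visit the shard returned by shard_for, which is why each server ends up with less data to search.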

Application Performance Considerations

Monitoring the performance of big data repositories is just as important as monitoring the performance of any other type of database. Applications that want to use the data stored in these repositories will submit queries in much the same way as traditional applications querying relational databases like Oracle, SQL Server, Sybase, DB2, MySQL, PostgreSQL, etc… Let’s take a look at more information from the MongoDB website. In their documentation there is a section on monitoring MongoDB that states “Monitoring is a critical component of all database administration.” This is a simple statement that is overlooked all too often when deploying new technology in most organizations. Monitoring is usually only considered once major problems start to crop up and by that time there has already been impact to the users and the business.

RedisDashboard

Dashboard showing Redis key metrics.

One thing that we can’t forget is just how important it is to monitor not only the big data repository, but to also monitor the applications that are querying the repository. After all, those applications are the direct clients that could be responsible for creating a performance issue and that certainly rely on the repository to perform well when queried. The application viewpoint is where you will first discover if there is a problem with the data repository that is actually impacting the performance and/or functionality of the app itself.

Monitoring Examples

So now that we have built a quick foundation of big data knowledge, how do we monitor these technologies in the real world?

End to end flow – As we already discussed, you need to understand if your big data applications are being impacted by the performance of your big data repositories. You do that by tracking all of the application transactions across all of the application tiers and analyzing their response times. Given this information it’s easy to identify exactly which components are experiencing problems at any given time.

FS Transaction View

Code level details – When you’ve identified that there is a performance problem in your big data application you need to understand what portion of the code is responsible for the problems. The only way to do this is by using a tool that provides deep code diagnostics and is capable of showing you the call stack of your problematic transactions.

Cassandra_Call_Stack

Back end processing – Tracing transactions from the end user, through the application tier, and into the backend repository is required to properly identify and isolate performance problems. Identification of poorly performing backend tiers (big data repositories, relational databases, etc…) is easy if you have the proper tools in place to provide the view of your transactions.

Backend_Detection

AppDynamics detects and measures the response time of all backend calls.

Big data metrics – Each big data technology has its own set of relevant KPIs just like any other technology used in the enterprise. The important part is to understand what is normal behavior for each metric while performance is good and then identify when KPIs are deviating from normal. This combined with the end to end transaction tracking will tell you if there is a problem, where the problem is, and possibly the root cause. AppDynamics currently has monitoring extensions for HBase, MongoDB, Redis, Hadoop, Cassandra, CouchBase, and CouchDB. You can find all AppDynamics platform extensions by clicking here.

HadoopDashboard

Hadoop KPI Dashboard 1

HadoopDashboard2

Hadoop KPI Dashboard 2

Big data deep dive – Sometimes KPIs aren’t enough to help solve your big data performance issues. That’s when you need to pull out the big guns and use a deep dive tool to assist with troubleshooting. Deep dive tools will be very detailed and very specific to the big data repository type that you are using/monitoring. In the screen shots below you can see details of AppDynamics monitoring for MongoDB.

MongoDB Monitoring 1

MongoDB Monitoring 2

MongoDB Monitoring 3

MongoDB Monitoring 4

If your company is using big data technology, it’s IT operations’ responsibility to deploy and support a cohesive performance monitoring strategy for the inevitable performance degradation that will cause business impact. See what AppDynamics has to offer by signing up for our free trial today.

A Newbie Guide to APM

Today’s blog post is headed back to the basics. I’ve been using and talking about APM tools for so many years sometimes it’s hard to remember that feeling of not knowing the associated terms and concepts. So for anyone who is looking to learn about APM, this blog is for you.

What does the term APM stand for?

APM is an acronym for Application Performance Management. You’ll also hear the term Application Performance Monitoring used interchangeably and that is just fine. Some will debate the details of monitoring versus management and in reality there is an important difference but from a terminology perspective it’s a bit nit-picky.

What’s the difference between monitoring and management?

Monitoring is a term used when you are collecting data and presenting it to the end user. Management is when you have the ability to take action on your monitored systems. Management tasks can include restarting components, making configuration changes, collecting more information through the execution of scripts, etc… If you want to read more about the management functionality in APM tools click here.

What is APM?

There is a lot of confusion about the term APM. Most of this confusion is caused by software vendors trying to convince people that their software is useful for monitoring applications. In an effort to create a standard definition for grouping software products, Gartner introduced a definition that we will review here.

Gartner lists five key dimensions of APM in their terms glossary found here… http://www.gartner.com/it-glossary/application-performance-monitoring-apm

End user experience monitoring – EUM and RUM are the common acronyms for this dimension of monitoring. This type of monitoring provides information about the response times and errors end users are seeing on their device (mobile, browser, etc…). This information is very useful for identifying compatibility issues (website doesn’t work properly with IE8), regional issues (users in northern California are seeing slow response times), and issues with certain pages and functions (the JavaScript is throwing an error on the search page).

Runtime application architecture discovery modeling and display – This is a graphical representation of the components in an application or group of applications that communicate with each other to deliver business functionality. APM tools should automatically discover these relationships and update the graphical representation as soon as anything changes. This graphical view is a great starting point for understanding how applications have been deployed and for identifying and troubleshooting problems.

User-defined transaction profiling – This is functionality that tracks the user activity within your applications across all of the components that service those transactions. A common term associated with transaction profiling is business transactions (BT’s). A BT is very different from a web page. Here’s an example… As a user of a website I go to the login page, type in my username and password, then hit the submit button. As soon as I hit submit a BT is started on the application servers. The app servers may communicate with many different components (LDAP, Database, message queue, etc…) in order to authenticate my credentials. All of this activity is tracked and measured and associated with a single “login” BT. This is a very important concept in APM and is shown in the screenshots below.
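
As a rough illustration of the login example above, the sketch below rolls up per-component timings into a single record per business transaction. Everything in it is an assumption made for illustration: the ComponentCall fields, the correlation id, and the idea that each elapsed time covers only that component’s own work. It is not how any particular APM product stores its data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ComponentCall:
    txn_id: str        # correlation id shared by every call in one user request
    bt_name: str       # business transaction name, e.g. "login"
    component: str     # tier that did the work, e.g. "LDAP"
    elapsed_ms: float  # time spent in this component only (assumed exclusive)


def summarize_business_transactions(calls):
    """Roll component-level timings up into one record per transaction."""
    summary = {}
    for call in calls:
        key = (call.bt_name, call.txn_id)
        record = summary.setdefault(
            key, {"total_ms": 0.0, "by_component": defaultdict(float)}
        )
        record["total_ms"] += call.elapsed_ms
        record["by_component"][call.component] += call.elapsed_ms
    return summary
```

With records like these, one slow “login” can be traced back to the component that contributed most of its time.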

Screen_Shot_2014-08-04_at_1.30.15_PM-960x0

Component deep-dive monitoring in application context – Deep dive monitoring is when you record and measure the internal workings of application components. For application servers, this would entail recording the call stack of code execution and the timing associated with each method. For a database server this would entail recording all of the SQL queries, stored procedure executions, and database statistics. This information is used to troubleshoot complex code issues that are responsible for poor performance or errors.

Analytics – This term leaves a lot to be desired since it can be and often is very liberally interpreted. To me, analytics (in the context of APM) means baselining, and correlating data to provide actionable information. To others analytics can be as basic as providing reporting capabilities that simply format the raw data in a more consumable manner. I think analytics should help identify and solve problems and be more than just reporting but that is my personal opinion.

business-impact-analytics2-1-960x0

performance-analytics-960x0

Do I need APM?

APM tools have many use cases. If you provide support for application components or the infrastructure components that service the applications then APM is an invaluable tool for your job. If you are a developer then absolutely yes, APM fits right in with the entire software development lifecycle. If your company is adopting a DevOps philosophy, APM is a tool that is collaborative at its core and enables developers and operations staff to work more effectively. Companies that are using APM tools consider them a competitive advantage because they resolve problems faster, solve more issues over time, and provide meaningful business insight.

How can I get started with APM?

First off you need an application to monitor. Assuming you have access to one, you can try AppDynamics for free. If you want to understand more about the process used in most companies to purchase APM tools you can read about it by clicking here.

Hopefully this introduction has provided you with a foundation for starting an APM journey. If there are more related topics that you want me to write about please let me know in the comments section below.

Check out our complimentary ebook, Top 10 Java Performance Problems!

Insights from an Investment Banking Monitoring Architect

To put it very simply, Financial Services companies have a unique set of challenges that they have to deal with every day. They are a high priority target for hackers, they are highly regulated by federal and state governments, they deal with and employ some of the most demanding people on the planet, and problems with their applications can have an impact on every other industry across the globe. I know this from firsthand experience; I was an Architect at a major investment bank for over 5 years.

In this blog post I’m going to show you what’s really important when Financial Services companies consider application monitoring solutions and warn you about the hidden realities that only expose themselves after you’ve installed a large enough monitoring footprint.

1 – Product Architecture Plays a Major Role in Long Term Success or Failure

Every monitoring tool has a different core architecture. On the surface these architectures may look similar but it is imperative to dive deeper into the details of how all monitoring products work. We’ll use two real product architectures as examples.

Monitoring Architecture

“APM Solution A” is an agent-based solution. This means that a piece of vendor code is deployed to gather monitoring information from your running applications. This agent is intelligent and knows exactly what to monitor, how to monitor, and when to dial itself back to do no harm. The agent sends data back to a central collector (called a controller) where this data is correlated, analyzed, and categorized automatically to provide actionable intelligence to the user. With this architecture the agent and the controller are very loosely coupled, which lends itself well to highly distributed, virtualized environments like you see in modern application architectures.

“APM Solution B” is also agent-based. It has a three-tiered architecture which consists of agents, collectors, and servers. On the surface this architecture seems reasonable but when we look at the details a different story emerges. The agent is not intelligent, so it does not know how to instrument an application. This means that every time an application is restarted, the agent must send all of the methods to the collector so that the collector can tell the agent how and what to instrument. This places a large load on the network, delays application startup time, and adds to the amount of hardware required to run your monitoring tool. After the collector has told the agent what to monitor, the collector’s job is to gather the monitoring data from the agent and pass it back to the server where it is stored and viewed. For a single application this architecture may seem acceptable but you must consider the implications across a larger deployment.

Choosing a solution with the wrong product architecture will impact your ability to monitor and manage your applications in production. Production monitoring is a requirement for rapid identification, isolation and repair of problems.

2 – Monitoring Philosophy

Monitoring isn’t as straightforward as collecting, storing, and showing data. You could use that approach but it would not provide much value. When looking at monitoring tools it’s really important to understand the impact of monitoring philosophy on your overall project and goals. When I was looking at monitoring tools I needed to be able to solve problems fast and I didn’t want to spend all of my time managing the monitoring tools. Let’s use examples to illustrate again.

Application Monitoring Philosophy

“APM Solution A” monitors every business transaction flowing through whatever application it is monitoring. Whenever any business transaction has a problem (slow or error) it automatically collects all of the data (deep diagnostics) you need to figure out what caused the problem. This, combined with periodic deep diagnostic sessions at regular intervals, allows you to solve problems while keeping network, storage, and CPU overhead low. It also keeps clutter down (as compared to collecting everything all the time) so that you solve problems as fast as possible.
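
The capture policy described for “APM Solution A” can be sketched in a few lines of Python. This is a hedged illustration only: the function name, parameters, and the idea of tracking seconds since the last diagnostic session are assumptions made for the example, not a description of any vendor’s agent.

```python
def capture_reason(elapsed_ms, had_error, slow_threshold_ms,
                   seconds_since_last_session, session_interval_s):
    """Decide whether to collect deep diagnostics for one transaction.

    Returns the reason for capturing, or None to record only
    lightweight timing data. All names here are hypothetical.
    """
    if had_error:
        return "error"
    if elapsed_ms > slow_threshold_ms:
        return "slow"
    # Even when everything is healthy, take an occasional detailed
    # sample so there is something to compare against later.
    if seconds_since_last_session >= session_interval_s:
        return "periodic"
    return None
```

Most healthy transactions fall through to None, which is what keeps the overhead low compared with collecting everything all the time.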

“APM Solution B” also monitors every transaction for each monitored application but collects deep diagnostic data for all transactions all the time. This monitoring philosophy greatly increases network, storage, and CPU overhead while providing massive amounts of data to work with regardless of whether or not there are application problems.

When I was actively using monitoring tools in the Investment Bank I never looked at deep diagnostic data unless I was working on resolving a problem.

3 – Analytics Approach

Analytics comes in many shapes and sizes these days. Regardless of the business or technical application, analytics does what humans could never do. It creates actionable intelligence from massive amounts of data and allows us to solve problems much faster than ever before. Part of my process for evaluating monitoring solutions has always been determining just how much extra help each tool would provide in identifying and isolating (root cause) application problems using analytics. Back to my example…

“APM Solution A” is an analytics product at its core. Every business transaction is analyzed to create a picture of “normal” response time (a baseline). When new business transactions deviate from this baseline they are automatically classified as either slow or very slow and deep diagnostic information is collected, stored, and analyzed to help identify and isolate the root cause. Static thresholds can be set for alerting but by default, alerts are based upon deviation from normal so that you can proactively identify service degradation instead of waiting for small problems to become major business impact.
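
Here is a minimal sketch of what baselining and classification could look like. It assumes a very simple model, the mean and standard deviation of recent response times, with cut-offs of 3 and 4 standard deviations chosen purely for illustration. Real products use their own baselining algorithms, which are not described here.

```python
import statistics


def build_baseline(history_ms):
    """Summarize "normal" as the mean and standard deviation of past timings."""
    return statistics.mean(history_ms), statistics.pstdev(history_ms)


def classify(elapsed_ms, baseline, slow_sigmas=3.0, very_slow_sigmas=4.0):
    """Label a response time relative to the baseline.

    The sigma multipliers are illustrative defaults, not vendor values.
    """
    mean, stdev = baseline
    if elapsed_ms > mean + very_slow_sigmas * stdev:
        return "very slow"
    if elapsed_ms > mean + slow_sigmas * stdev:
        return "slow"
    return "normal"
```

Because the thresholds move with the baseline, the same code flags a 125 ms response on a normally 100 ms transaction while ignoring it on one that normally takes 500 ms.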

“APM Solution B” only provides baselines for the business transactions you have specified. You have to manually configure the business transactions for each application. Again, on a small scale this methodology is usable but it quickly becomes a problem when managing the configuration of tens, hundreds, or thousands of applications that keep changing as development continues. Searching through a large set of data for a problem is much slower without the assistance of analytics.

Monitoring Analytics

4 – Vendor Focus

When you purchase software from a vendor you are also committing to working with that vendor. I always evaluated how responsive every vendor was during the pre-sales phase but it was hard to get a good measure of what the relationship would be like after the sale. No matter how good the developers are, there are going to be issues with software products. What matters the most is the response you get from the vendor after you have made the purchase.

5 – Ease of Use

This might seem obvious but ease of use is a major factor in software delivering a solid return on investment or becoming shelf-ware. Modern APM software is powerful AND easy to use at the same time. One of the worst mistakes I made as an Architect was not paying enough attention to ease of use during product evaluation and selection. If only a few people in a company are capable of using a product then it will never reach its full potential and that is exactly what happened with one of the products I selected. Two weeks after training a team on product usage, almost nobody remembered how to use it. That is a major issue with legacy products.

Enterprise software is undergoing a major disruption. If you already have monitoring tools in place, now is the right time to explore the marketplace and see how your environment can benefit from modern tools. If you don’t have any APM software in place yet you need to catch up to your competition since most of them are already evaluating or have already implemented APM for their critical applications. Either way, you can get started today with a free trial of AppDynamics.

How Garmin attacks IT and business problems with Application Performance Management (APM)

If you’ve used a GPS product before you’re probably familiar with Garmin. They are the leading worldwide provider of navigation devices and technologies. In addition to their popular consumer GPS products, Garmin also offers devices for planes, boats, outdoor activities, fitness, and cars. Because end user experience is so important to Garmin and its customers, ensuring the speed of its web and mobile applications is one of its highest priorities.

This blog post summarizes the content of a case study that can be found by clicking here.

The reality in today’s IT driven world is that there are many different issues that need to be solved and therefore many different use cases for the products designed to help with these issues. Garmin has shown its industry leadership by taking advantage of AppDynamics to address many use cases with a single product. This shows a high level of maturity that most organizations strive to achieve.

“We never had anything that would give us a good deep dive into what was going on in the transactions inside our application – we had no historical data, and we had no insight into database calls, threads, etc.,” – Doug Strick

Garmin-FlowMap

AppDynamics flow map showing part of Garmin’s production environment.

Real-Time and Historic Visibility in Production

You just can’t test everything before your application goes to production. Real users will find ways to break your application and/or make it slow to a crawl. When this happens you need historical data to compare against as well as real-time information to quickly remediate problems.

Garmin found that AppDynamics, an application performance management solution designed for high-volume production environments, best suited its needs. Garmin’s operations team feels comfortable leaving AppDynamics on in the production environment, allowing them to respond more quickly and collect historical data.

The following use cases are examples of how Garmin is attacking IT and business problems to better serve their company and their customers.

Use Case 1: Memory Monitoring

Before they started using AppDynamics, Garmin was unable to track application memory usage over time. Using AppDynamics they were able to identify a major memory issue impacting customers and ensure a fast and stable user experience. Click here to read the details in the case study.

Garmin-Heap

Chart of heap usage over time.

Use Case 2: Automated Remediation Using Application Run Book Automation

AppDynamics’ Application Run Book Automation feature allows organizations to trigger certain tasks, such as restarting a JVM or running a script, with thresholds in AppDynamics. Garmin was able to use this feature very effectively one weekend while waiting for a fix from a third party. Click here to read the details in the case study.
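
The general pattern behind threshold-triggered remediation can be sketched as below. To be clear, this is not the AppDynamics Run Book Automation API; the Rule class, the metric name, and the action callables are all hypothetical and exist only to show the idea of “when a metric crosses a threshold, run a task.”

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    metric: str                 # name of the metric to watch (hypothetical)
    threshold: float            # fire when the metric exceeds this value
    action: Callable[[], str]   # task to run, e.g. request a JVM restart


def evaluate(rules: List[Rule], metrics: Dict[str, float]) -> List[str]:
    """Run the action of every rule whose metric is over its threshold."""
    results = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            results.append(rule.action())
    return results
```

In practice the action would call out to a script or orchestration tool; here it simply returns a string so the flow is easy to follow.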

Use Case 3: Code-level Diagnostics

“We knew we needed a tool that could constantly monitor our production environment, allowing us to collect historical data and trend performance over time. Also, we needed something that would give us a better view of what was going on inside the application at the code level.” – Doug Strick

With AppDynamics, Garmin’s application teams have been able to rapidly resolve several issues originating in the application code. Click here to read the details in the case study.

Use Case 4: Bad Configuration in Production

Garmin also uses AppDynamics to verify that new code releases perform well. Because Garmin has an agile release schedule, new code releases are very frequent (every 2 weeks) and it’s important to ensure that each release goes smoothly. One common problem that occurs when deploying new code is that configurations for test are not updated for the production environment, resulting in performance issues. Click here to read the details in the case study.

Garmin-Verification

AppDynamics flow map showing production nodes making services calls to test environments.

Use Case 5: Real-time Business Metrics (RTBM)

IT metrics alone don’t tell the whole story. There can be business problems due to pricing issues or issues with external service providers that impact the business and will never show up as an IT metric. That’s why AppDynamics created RTBM and why Garmin adopted it so early on in their deployment of AppDynamics. With Real-Time Business Metrics, organizations like Garmin can collect, monitor, and visualize transaction payload data from their applications. Strick used this capability to measure the revenue flow in Garmin’s eCommerce application. Click here to read the details in the case study.

Garmin-RTBM

Real-time order totals for web orders.

Conclusion

“AppDynamics has given us visibility into areas we never had visibility into before… Some issues that used to take several hours now take under 30 minutes to resolve, or a matter of seconds if we’re using Run Book Automation.” – Doug Strick

Garmin has shown how to get real value out of its APM tool of choice and you can do the same. Click here to try AppDynamics for free in your environment today.

How I failed my company as a monitoring architect.

picard-facepalm

I didn’t realize it until recently but now I know that I failed one of my previous companies when I was their monitoring architect. It’s not easy to admit past failures but it’s an important part of growing as a professional and as a person. Before I explain how I failed let me provide some short background.

Filling Gaps the Best Way Possible

As a monitoring architect it was my job to make sure that we had monitoring tools, training and processes in place to properly support the business. I was extremely passionate about my responsibilities and worked tirelessly to understand the gaps in our strategy and plug those gaps in the best way possible. And there’s the rub, my idea of the “best way possible” was flawed.

During my time as a monitoring architect I got to test out many different technologies. I put the tools through their paces in my lab. I put the companies through their paces by testing the sales and support organizations before making any purchases. I figured out how each new technology would fit into our existing processes and created new processes where they were needed. I documented everything thoroughly and ultimately ended up bringing in multiple new tools over the course of a few years.

Ignorance is Bliss, and Then It’s Not

The tools and processes were successful from the perspective of helping to reduce customer and business impact. Unaware of my failure I was proud of what I had accomplished and I moved on to other things and new companies. Then I had a conversation this week that burst my bubble and woke me up like a slap to the face. With one sentence I realized my failure… “Our users only open the tool a few times a year when there are problems and they can’t remember how to use it!” (Thankfully we weren’t discussing my current company’s software.)

ignorance

Smack!!!! It hit me fast and without mercy. I realized that for years I had overlooked an absolutely essential aspect to successfully implementing technology. As a technologist I am well trained and very experienced in using “less than intuitive” products. But, I’m the exception to the rule in the real world. Most people don’t have the time nor the desire to become technology experts. They simply want and need a tool that solves their problems and is easy to use.

I had been so focused on features and processes and filling gaps that I overlooked the fact that power + ease of use = successful adoption. I had unwittingly robbed my company of the possibility of massive return on investment by selecting tools that were okay for me to use but very difficult for most of the rest of the company to use.

Immediately after this disturbing realization I recalled the timeframe shortly after I brought a new tool into my organization. I had set up a vendor-led, one-week training session for the people on my team to learn our fancy new toy. The training session went well with only a few bumps in the road where the usage concepts were somewhat difficult. The problem was that within 2 weeks of the end of training most of the team had forgotten how to do most of what they were taught except for the simplest of tasks. That should have clued me in but I was blinded by pride and continued along blissfully ignorant until this week.

Never Repeat the Same Mistake

I hate losing. I hate getting something wrong. I hate the fact that it took me years to realize my failure but I’m grateful for having realized it at all. It’s a mistake I will never repeat and hopefully one that you will never have to make now that you have read this story. Please, when making decisions for a large group of people, really think about the people that will be impacted by your decision and make sure you consider the fact that everyone is different and what works for you won’t necessarily work for them.

These days I’m happy to work for a company that is disrupting the enterprise technology market with powerful AND easy to use software. I didn’t truly realize how important it was until this week. If you’d like to check it out for yourself click here and start a free trial today.

DevOps Scares Me – Part 3

Hey, who invited the developer to our operations conversation? In DevOps Scares Me – Part 2 my colleague Dustin Whittle shared his developer point of view on DevOps. I found his viewpoint quite interesting and it made me realize that I take for granted the knowledge I have about what it takes to get an application into production in a large enterprise. As Dustin called out, there are many considerations including but not limited to code management, infrastructure management, configuration management, event management, log management, performance management, and general monitoring. In his blog post Dustin went on to cover some of the many tools available to help automate and manage all of the considerations previously mentioned. In my post I plan to explore whether DevOps is only for loosely managed e-commerce providers or whether it can really be applied to more traditional and highly regulated enterprises.

Out With the Old

In the operations environments I have worked in there were always strict controls on who could access production environments, who could make changes, when changes could be made, who could physically touch hardware, who could access what data centers, etc… In these highly regulated and process oriented enterprises the thought of blurring the lines between development and operations seems like a non-starter. There is so much process and tradition standing in the way of using a DevOps approach that it seems nearly impossible. Let’s break it down into small pieces and see if it could be feasible.

Here are the basic steps to getting a new application built and deployed from scratch (from an operations perspective) in a stodgy Financial Services environment. If you’ve never worked in this type of environment some of the timing of these steps might surprise you (or be very familiar to you). We are going to assume this new application project has already been approved by management and we have the green light to proceed.

  1. Place order for dev, test, uat, prod, dr, etc… infrastructure. (~8 weeks lead time, all hardware ordered up front)
  2. Development team does dev stuff while we ops personnel are filling out miles of virtual paperwork to get the infrastructure in place. Much discussion occurs about failover, redundancy, disaster recovery, data center locations, storage requirements, etc… None of this discussion includes developers, just operations and architects…oops.
  3. New application is added to CMDB (or similar) to include new infrastructure components, application components, and dependencies.
  4. Operations is hopeful that the developers are making good progress in the 8 weeks lead time provided by the operational request process (actually the ops teams don’t usually even think about what dev might be working on). Servers have landed and are being racked and stacked. Hopefully we guessed right when we estimated the number of users, efficiency of code, storage requirements, etc… that were used to size this hardware. In reality we will have to see what happens during load testing and make adjustments (i.e. tell the developers to make it use fewer resources or order more hardware).
  5. We’re closing in on one week until the scheduled go-live date but the application isn’t ready for testing yet. It’s not the developers’ fault that the functional requirements keep changing but it is going to squeeze the testing and deployment phases.
  6. The monitoring team has installed their standard monitoring agents (usually just traditional server monitoring) and marked off that checkbox from the deployment checklist.
  7. It’s 2 days before go-live and we have an application to test. The load test team has coded some form of synthetic load to be applied to the servers. Functional testing showed that the application worked. Load testing shows slow response times and lots of errors. Another test is scheduled for tomorrow while the development team works frantically to figure out what went wrong with this test.
  8. One day until go-live, load test session 2, still some slow response time and a few errors but nothing that will stop this application from going into production. We call the load test a “success” and give the green light to deploy the application onto the production servers. The app is deployed, functional testing looks good, and we wait until tomorrow for the real test…production users!
  9. Go-Live … Users hit the application, the application pukes and falls over, the operations team checks the infrastructure and gets the developers the log files to look at. Management is upset. Everyone is asking if we have any monitoring tools that can show what is happening in the application.
  10. Week one is a mess with the application working, crashing, restarting, working again, and new emergency code releases going into production to fix the problems. Week 2 and each subsequent week will get better until new functionality gets released in the next major change window.

Nobody wins with a “toss it over the wall” mentality.

In With the New

Part of the problem with the scenario above is that the development and operations teams are so far removed from each other that there is little to no communication during the build and test phases of the development lifecycle. What if we took a small step towards a more collaborative approach as recommended by DevOps? How would this process change? Let’s explore (modified process steps are highlighted using bold font)…

  1. Place order for dev, test, uat, prod, dr, etc… infrastructure. (~8 weeks lead time, all hardware ordered up front)
  2. Development and operations personnel fill out virtual paperwork together which creates a much more accurate picture of infrastructure requirements. Discussions about failover, redundancy, disaster recovery, data center locations, storage requirements, etc… progress more quickly with better estimations of sizing and understanding of overall environment.
  3. New application is added to CMDB (or similar) to include new infrastructure components, application components, and dependencies.
  4. Operations is fully aware of the progress the developers are making. This gives the operations staff an opportunity to discuss monitoring requirements from both a business and IT perspective with the developers. Operations starts designing the monitoring architecture while the servers arrive and are racked and stacked. Both the development and operations teams are comfortable with the hardware requirement estimates but understand that they will have to see what happens during load testing and make adjustments (i.e. tell the developers to make it use fewer resources or order more hardware). Developers start using the monitoring tools in their dev environment to identify issues before the application ever makes it to test.
  5. We’re closing in on one week until the scheduled go-live date but the application isn’t ready for testing yet. It’s not the developers’ fault that the functional requirements keep changing but it is going to squeeze the testing and deployment phases.
  6. The monitoring team has installed their standard monitoring agents (usually just traditional server monitoring) as well as the more advanced application performance monitoring (APM) agents across all environments. This provides the foundation for rapid triage during development, load testing, and production.
  7. It’s 2 days before go-live and we have an application to test. The load test team has coded a robust set of synthetic load scenarios based upon application monitoring data gathered during development. This load is applied to the application, which reveals some slow response times and some errors. The developers and operations staff use the APM tool together during the load test to immediately identify the problematic code and have a new release available by the end of the original load test. This process is repeated until the slow response times and errors are resolved.
  8. One day until go-live, we were able to stress test overnight and everything looks good. We have the green light to deploy the application onto the production servers. The app is deployed, functional testing looks good, business and IT metric dashboard looks good, and we wait until tomorrow for the real test…production users!
  9. Go-Live … Users hit the application and the application works well for the most part. The APM tool is showing some slow response times and a couple of errors to the developers and the operations staff. The team agrees to implement a fix after business hours as the business dashboard shows that things are generally going well. After hours the development and operations teams collaborate on the build, test, and deploy of the new code to fix the issues identified that day. Management is happy.
  10. Week one is highly successful with issues being rapidly identified and dealt with as they come up. Week 2 and each subsequent week are business as usual and the development team is actively focused on releasing new functionality while operations adapts monitoring and dashboards when needed.
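
To make the "repeat until the slow response times and errors are resolved" loop in step 7 a little more concrete, here is a minimal Python sketch of a load test pass/fail gate that both teams could agree on ahead of time. Everything in it is an assumption for illustration only: the `Sample` record, the function names, and the thresholds (95th percentile under 500 ms, error rate at or below 1%) are hypothetical and do not come from any particular APM or load testing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One synthetic request from a load test run (hypothetical record)."""
    response_ms: float
    ok: bool


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples to summarize")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def load_test_passes(samples: List[Sample],
                     max_p95_ms: float = 500.0,
                     max_error_rate: float = 0.01) -> bool:
    """Return True when both the p95 latency and the error rate are in bounds.

    The default thresholds are placeholders, not recommendations.
    """
    p95 = percentile([s.response_ms for s in samples], 95)
    error_rate = sum(1 for s in samples if not s.ok) / len(samples)
    return p95 <= max_p95_ms and error_rate <= max_error_rate
```

A gate like this only replaces the judgment call of declaring a shaky test a "success" if development and operations agree on the numbers before the test runs; the real thresholds would come from the business and IT monitoring requirements discussed in step 4.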

Developers and operations personnel living together in harmony!

So what scenario sounds better to you? Have you ever been in a situation where increased collaboration caused more problems than it solved? In this example the overall process was kept mostly intact to ensure compliance with regulatory audit procedures. Developers were never granted access to production (regulatory issue for Financial Services companies) but by being tightly coupled with operations they had access to all of the information they needed to solve the issues.

It seems to me that you can make a big impact across the lifecycle of an application by implementing parts of the DevOps philosophy in even a minor way. In this example we didn’t even touch the automation aspects of DevOps. That’s where all of those fun and useful tools come into play so that is where we will pick up next time.

If you’re interested in adding an APM tool to your DevOps, development, or operations toolbox you can take a free self-guided trial by clicking here and following the prompts.

Click here for DevOps Scares Me Part 4.