Server Monitoring – Think You’re Covered? You’re not!

Server monitoring is a foundational component of any data center monitoring architecture, but it has become a crutch and a deterrent to building out a holistic monitoring platform. Servers exist to run applications, and you will never properly monitor applications with server monitoring alone. In this blog we will explore what server monitoring is, some of the available tools, their benefits and drawbacks, and how to move to the next level.

What is server monitoring?

Server monitoring consists of monitoring operating system and associated hardware metrics. It’s the view of the world from the perspective of the server, never from inside the running processes. It covers the basics of CPU, memory, disk, and I/O, but those are simplified categories for the real underlying metrics: CPU sys time, CPU wait time, used memory, free memory, disk queue length, % disk used, network collisions, adapter transmit rate, and so on. Server monitoring is used by every IT organization in some shape or form.
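As a rough illustration of where numbers like CPU sys time and CPU wait time come from, here is a minimal Python sketch that derives them from two samples of the aggregate `cpu` line in Linux's `/proc/stat`. The function names (`parse_cpu_line`, `cpu_percentages`, `sample_cpu`) are hypothetical, and tools like vmstat or sar do considerably more than this.

```python
import time

# Counter names for the first eight fields of the aggregate "cpu" line.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines are "cpu0", "cpu1", ...
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Convert two counter snapshots into percentages of elapsed CPU time."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between samples")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_cpu(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return cpu_percentages(before, after)
```

Calling `sample_cpu()` on a Linux host returns a dict with keys such as `system` and `iowait`. Notice that everything in it describes the machine; nothing in it describes a single application request.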

What are some server monitoring tools?

When I was a Unix System Administrator I used tools like sar, vmstat, nmon, top, topas, and netstat to monitor my servers in real time. In the Windows world I used perfmon for my real-time monitoring needs. I also had other tools at my disposal for alerting and keeping historical metrics. Some popular examples of those tools are BMC Patrol, HP OpenView, Nagios, Zenoss, Cacti, Zabbix, Ganglia, GroundWork, and Hyperic.

All of these tools are useful. All of these tools also fall woefully short of achieving the goal of minimizing application downtime and maximizing application performance.

vmstat output from a Linux server. Is there any problem with the running application?

What’s the problem?

The problem is that none of the server monitoring tools are capable of knowing how your applications are performing. Some of them can probe your application to see whether it is available, but none can tell you why your application has ceased to function. No server monitoring tool can tell you any of the following:

  • What is the response time of every request to my application?
  • What components of my application are involved in any of my transactions and where is the slowdown?
  • How does the application code execute at run time?
  • What part of the application code is slow?
  • What application functionality is used, how often, and how does it perform?
  • What application functionality is throwing exceptions and what are they?
  • Did a slow external service call impact my application response time and by how much?
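To make a couple of these questions concrete, here is a minimal Python sketch of measuring from inside the running process: a decorator that records call counts, response time, and exceptions per function. Everything in it (`CALL_STATS`, `monitored`, and the `checkout` function with its 8% markup) is hypothetical and only for illustration. APM agents typically instrument code automatically rather than asking you to decorate it by hand.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store; a real agent would ship this data to a collector.
CALL_STATS = defaultdict(
    lambda: {"count": 0, "errors": 0, "total_seconds": 0.0, "last_exception": None}
)


def monitored(func):
    """Record call count, elapsed time, and exceptions for the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = CALL_STATS[func.__name__]
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Note the failure, then re-raise so application behavior is unchanged.
            stats["errors"] += 1
            stats["last_exception"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call is counted and timed.
            stats["count"] += 1
            stats["total_seconds"] += time.perf_counter() - start
    return wrapper


@monitored
def checkout(cart_total):
    """Hypothetical piece of application functionality."""
    if cart_total < 0:
        raise ValueError("cart total cannot be negative")
    return round(cart_total * 1.08, 2)
```

After a few calls to `checkout`, `CALL_STATS["checkout"]` holds how often that functionality was used, how long it took in total, and the last exception it threw. None of that is visible from CPU, memory, or disk metrics.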

Without answering those fundamental questions you don’t stand a chance of restoring application service in minutes instead of hours or days. Jonah Kowall of Gartner recently wrote a post titled “Got Nagios? Get Rid of It.” I’d suggest you take a moment to read it when you can.

Nagios HTTP check response times. Is the application experiencing problems as a whole? Problems with individual functions? Are there application errors?

Nagios server monitoring charts. What does this tell us about our application?

What’s the solution?

Many companies have turned to log monitoring and analytics as a solution to this problem, but as my colleague Dustin Whittle explained, “If all you have is logs, you are doing it wrong!” Log file monitoring is nice to have, but it also can’t answer many of the questions posed in the above section without a lot of customization. The best solution to the problem at hand is to use the latest generation of Application Performance Monitoring (APM) tools. APM tools understand the inner workings of your applications. They can see the code executing, the entry and exit calls to the application, the transactions flowing through and across multiple application components, exceptions and their associated impact, and much, much more.
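The idea of following a transaction through its entry and exit calls can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular APM product works: the `Trace` class, the `handle_request` function, and the component names are all hypothetical, and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager


class Trace:
    """Toy record of one transaction: nested spans listed in entry order."""

    def __init__(self, name):
        self.name = name
        self.spans = []   # each span: {"name", "depth", "seconds"}
        self._depth = 0

    @contextmanager
    def span(self, name):
        # Entry call: reserve a slot so spans stay in the order they began.
        record = {"name": name, "depth": self._depth, "seconds": None}
        self.spans.append(record)
        self._depth += 1
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Exit call: store the elapsed time even if the body raised.
            record["seconds"] = time.perf_counter() - start
            self._depth -= 1

    def report(self):
        """Return one indented line per span, like a very small call graph."""
        return [
            "{}{}: {:.1f} ms".format("  " * s["depth"], s["name"], s["seconds"] * 1000)
            for s in self.spans
        ]


def handle_request(trace):
    """Hypothetical transaction that touches three components."""
    with trace.span("web: /checkout"):
        with trace.span("service: price cart"):
            time.sleep(0.01)   # stand-in for application code
        with trace.span("db: save order"):
            time.sleep(0.03)   # stand-in for a slow remote call
```

Running `handle_request(Trace("checkout"))` and printing the trace's `report()` lines gives an indented breakdown of where the time went inside that one transaction, which is the kind of answer a server-level metric cannot give.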

Application Flow Map

Dynamic application flow map showing all application components.

Business Transactions

Business transactions automatically detected, tracked, and classified.

Call Graph

Call graph of a single business transaction with all methods, timing, and remote calls.

What’s the impact?

Ultimately APM products offer a tremendous amount of value. Here are just a few of the benefits of using APM:

  • Reducing MTTR from hours/days to minutes.
  • Faster development due to less time tracking down bugs.
  • Fewer bugs released because they are easier to identify and remediate.
  • Faster QA cycle due to rapid problem detection, isolation, and resolution.
  • More stable production environment due to better development and QA.

If you’re still relying on server monitoring and log analysis alone to figure out your application problems, you’re wasting valuable time and doing a disservice to the business you support. Server monitoring and log analytics complement a good APM tool and strategy; they do not replace one. There are still too many organizations today that are doing the bare minimum of server monitoring. Do yourself and your business a favor and start down the path of a holistic monitoring strategy by trying AppDynamics for free today.

Software that “Just Works”

This is my first blog. I’ve been a sales engineer for three application performance management (APM) products over the last 7 years (CA Wily Introscope, SpringSource/Hyperic, and now AppDynamics). I hadn’t really considered myself much of a “blogger” because I have always thought that actions speak louder than words. So you may be wondering why I would start now. You could say I was inspired by a recent experience at a customer site. It was quite a bit different from what I’ve been used to in my earlier APM career.

We recently had an experience with a customer who called and inquired about AppDynamics for monitoring several of their mission-critical applications. As always, our sales team kicked into high gear and had a conversation with the customer. In less than 15 minutes we agreed to start a Proof of Concept for AppDynamics running on a critical application. Later that day, we set up an online conference with the customer and commenced an installation that took about 15 minutes.