Why Alerts Suck and Monitoring Solutions need to become Smarter

I have yet to meet anyone in Dev or Ops who likes alerts. I’ve also yet to meet anyone who was fast enough to acknowledge an alert, so they could prevent an application from slowing down or crashing. In the real world alerts just don’t work, nobody has the time or patience anymore, alerts are truly evil and no-one trusts them. The most efficient alert today is an angry end user phone call, because Dev and Ops physically hear and feel the pain of someone suffering 🙂

Why? There is little or no intelligence in how a monitoring solution determines what is normal or abnormal for application performance. Today, monitoring solutions are only as good as the users that configure them, which is bad news because humans make mistakes, configuration takes time, and time is something many of us have little of.

Its therefore no surprise to learn that behavioral learning and analytics are becoming key requirements for modern application performance monitoring (APM) solutions. In fact, Will Capelli from Gartner recently published a report on IT Operational Analytics and pattern based strategies in the data center. The report covered the role of Complex Event Processing (CEP), behavior learning engines (BLEs) and analytics as a means for monitoring solutions to deliver better intelligence and quality information to Dev and Ops. Rather than just collect, store and report data, monitoring solutions must now learn and make sense of the data they collect, thus enabling them to become smarter and deliver better intelligence back to their users.

Change is constant for applications and infrastructure thanks to agile cycles, therefore monitoring solutions must also change so they can adapt and stay relevant. For example, if the performance of a business transaction in an application is 2.5 secs one week, and that drops to 200ms the week after because of a development fix. 200ms should become the new performance baseline for that same transaction, otherwise the monitoring solution won’t learn or alert of any performance regression. If the end user experience of a business transaction goes from 2.5 secs to 200ms, then end user expectations change instantly, and users become used to an instant response. Monitoring solutions have to keep up with user expectations, otherwise IT will become blind to the one thing that impacts customer loyalty and experience the most.