If all you have is logs, you are doing it wrong

Ten years ago, the standard way to troubleshoot an application issue was to look at the logs. Users would complain about a problem, you’d go to operations and ask for a thread dump, and then you’d spend some time poring over log files looking for errors, exceptions, or anything that might indicate a problem. There are some people who still use this approach today with some success, but for most modern applications logging is simply not enough. If you’re depending on log files to find and troubleshoot performance problems, then chances are your users are suffering – and you’re losing money for your business. In this blog we’ll look at how and why logging is no longer enough for managing application performance.

The Legacy Approach

The typical legacy web application was monolithic and fairly static, with a single application tier talking to a single database that was updated every six months. The legacy approach to monitoring production web applications was essentially a customer support loop. A customer would contact the support team to report an outage or bug, the customer support team reports the incident to the operations team, and then the operations team would investigate by looking at the logs with whatever useful information they had from the customer (username, timestamps, etc.). If the operations team was lucky and the application had ample logging, the operations team would spot the error and bring in developers to find the root cause and provide a resolution. This is the ideal scenario, but more often than not the logs were of very little use and the operations team would have to wait for another user to complain about a similar problem and kick off the process again. Ten years ago, this was what production monitoring looked like. Apart from some rudimentary server monitoring tools that could alert the operations team if a server was unavailable, it was the end users who were counted on to report problems.

Monolithic App

Logging is inherently reactive

The most important reason that logging was never truly an application performance management strategy is that logging is an inherently reactive approach to performance. Typically this means an end user is the one alerting you to a problem, which means that they were affected by the issue – and (therefore) so was your business. A reactive approach to application performance loses you money and damages your reputation. So logging isn’t going to cut it in production.

You’re looking for a needle in a haystack

Another reason why logging was never a perfect strategy is that system logs have a particularly low signal to noise ratio. This means that most of the data you’re looking at (which can amount to terabytes for some organizations) isn’t helpful. Sifting through log files can be a very time-consuming process, especially as your application scales, and every minute you spend looking for a problem is time that your customers are being affected by a performance issue. Of course, newer tools like Splunk, Loggly, SumoLogic and others have made sorting through log files easier, but you’re still looking for a needle in a haystack.

Logging requires an application expert

Which brings us to another reason logging never worked: Even with tools like Loggly and Splunk, you need to know exactly what to search for before you start, whether it’s a specific string, a time range, or a particular file. This means the person searching needs to be someone who knows the application well, usually a developer or an architect. Even then, their hunches could be wrong, especially if it’s a performance issue that you’ve never encountered before.

Not everyone has access to logs

Logging is a great tool for developers to debug their code on their laptops, but things get more complicated in production, especially if the application is dealing with sensitive data like credit card numbers. There are usually restrictions on the production system that prevent people like developers from accessing the production logs. In some organizations, these can be requested from the operations team, but this step can take a while. In a crisis, every second counts, and these costly processes (while important) can cost organization money if your application is down.

It doesn’t work in production

Even in a perfect world where you have complete access to your application’s log files, you still won’t have complete visibility into what’s going on in your application. The developer who wrote the code is ultimately the one who decides what gets logged, and the verbosity of those logs is often limited by performance constraints in production. So even if you do everything right there’s still a chance you’ll never find what you’re looking for.

The Modern Approach

Today, enterprise web applications are much more complex than they were ten years ago. The new normal for these applications includes multiple application tiers communicating via a service-oriented architecture (SOA) that interacts with several databases and third-party web services while processing items out of caches and queues. The modern application has multiple clients from browser-based desktops to native applications on mobile. As a result, it can be difficult just to know where to start if you’re depending on log files for troubleshooting performance issues.

Distributed App

Large, complex and distributed applications require a new approach to monitoring. Even if you have centralized log collection, finding which tier a problem is on is often a large challenge on its own. Usually increasing logging verbosity is not an option, as production does not run with debugging enabled due to performance constraints. First the operations team must figure out which log file to look through and once finding something that resembles an error, reach out to the developers to find out what it means and how to resolve it. Furthermore with the evolution of web 2.0 applications which are heavily dependent on JavaScript and the capabilities of the browser in order to properly monitor the application is operating as expected you must have end user monitoring that is capable of capturing errors on the client-side and in native mobile applications.

Logging is simply not enough

Logging is not enough – modern applications require application performance management to enable application owners to stay informed to minimize the business impact of performance degradation and downtime

Logging is simply not enough information to get to the root cause of problems in modern distributed applications. The problems of production monitoring have changed and so has the solution. Your end users are demanding and fickle, and you can’t afford to let them down. This means you need the fastest and most effective way to troubleshoot and solve performance problems, and you can’t rely on the chance that you might find the message in the log. Business owners, developers, and operations need in-depth visibility into the app, and the only way to get that is by using application performance monitoring.

Get started with AppDynamics Pro today for in-depth application performance management.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

AppDynamics & Splunk – Better Together

AppD & Splunk LogoA few months ago I saw an interesting partnership announcement from Foursquare and OpenTable.  Users can now make OpenTable reservations at participating restaurants from directly within the Foursquare mobile app.  My first thought was, “What the hell took you guys so long?” That integration makes sense on so many levels, I’m surprised it hadn’t already been done.

So when AppDynamics recently announced a partnership with Splunk, I viewed that as another no-brainer.  Two companies with complementary solutions making it easier for customers to use their products together – makes sense right?  It does to me, and I’m not alone.

I’ve been demoing a prototype of the integration for a few months now at different events across the country, and at the conclusion of each walk-through I’d get some variation of the same question, “How do I get my hands on this?”  Well, I’m glad to say the wait is over – the integration is available today as an App download on Splunkbase.  You’ll need a Splunk and AppDynamics license to get started – if you don’t already have one, you can sign up for free trials of Splunk and AppDynamics online.

Gartner recognizes AppDynamics as APM Innovator

It’s been a great start to 2012 for us at AppDynamics. Last week, we were recognized by Forrester Research in their APM market overview, and at the end of 2011, Gartner included us in their report “APM Innovators: Driving APM Technology and Delivery Evolution” which was co-written by Will Capelli and Jonah Kowall.

According to Gartner’s report, APM is evolving into four key market requirements:

1. Complex and varied End Points

2. Cloud Services

3. Packaged Applications

4. Big Data

Storm Clouds in 2012? – Results of AppDynamics APM Survey

We recently finished conducting our annual Application Performance Management survey. Over 250 IT professionals participated, and they shared insights such as:
– Many Ops and Dev teams are anticipating growth in their applications by 20% or more
– Over 50% are planning to move to the cloud, and are architecting brand-new applications to be cloud-ready
– Most teams are using log files to monitor application performance, rather than an Application Performance Management (APM) tool.

We’ll release the full report soon, but here’s an infographic that summarizes some of the main findings:

AppDynamics Inforgraphic - Storm Clouds in 2012

Embed this image on your site:

What I found personally surprising was the heavy reliance on log files. When you’re troubleshooting distributed architectures, time is of the essence–and there’s no way to cut your MTTR down when you’re relying on log files to identify root cause.

In fact, there’s only one guy who ever made using a log file look cool:

And I think we can all agree that’s a pretty unique use case.

We’ll have the full survey results available soon.