How Do You Monitor Your Logging?

Applications typically log additional data, such as exceptions, to different data sources. Windows event logs, local files, and SQL databases are the most commonly used in production. New applications can leverage big data stores instead of individual files or SQL.

One of the most surprising discoveries when we start monitoring applications is that logging is often not configured properly in production environments. Two types of misconfiguration come up regularly in the field:

  1. The logging configuration is copied directly from staging settings.

  2. During deployment to the production environment, logging isn’t fully configured and fails to log any data.

To take a closer look, I built a couple of sample applications that show how these problems can manifest themselves. The samples were implemented with MVC5, run in Windows Azure, and use the Microsoft Enterprise Library Exception Handling and Logging blocks to log exceptions to a SQL database. There is no specific preference for logging framework or storage here; I just wanted to demonstrate problems similar to what we’ve seen with different customers.
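For context, the exception-handling path in such an application typically looks something like this (a sketch, not the exact sample code; the policy name is a hypothetical stand-in for whatever is configured in web.config):

```csharp
using System;
using Microsoft.Practices.EnterpriseLibrary.ExceptionHandling;

public class CheckoutService
{
    public void Checkout(int orderId)
    {
        try
        {
            // ... business logic that may throw ...
        }
        catch (Exception ex)
        {
            // "LogAndRethrow" is a hypothetical policy name. The policy is
            // configured to hand the exception to the Logging block, which
            // writes it to the SQL logging database.
            bool rethrow = ExceptionPolicy.HandleException(ex, "LogAndRethrow");
            if (rethrow) throw;
        }
    }
}
```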

Situation #1: Logging configuration was copied from staging to production and points to the staging SQL database

When we installed AppDynamics and it automatically detected the application flow map, I noticed the application talks to the production UserData database and… to a staging database for logging.

The other issue was the extremely slow response time when calling the logging database. The following snapshot explains the slow performance: as you can see, an exception is thrown while trying to run an ADO.NET query:

Exception details confirm the application was not able to connect to the database, which is expected — the production environment is located in a DMZ and usually can’t reach a staging network.

To restate what we see above: this is a failure while trying to log the original exception, which could be anything from a user being unable to log in to the website to a failed checkout.

The impact is even higher because the application spends 15 seconds trying to connect to the logging database before timing out, all while the user is waiting.
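Incidentally, that 15-second delay matches the default ADO.NET connection timeout. Here is a minimal sketch of what every logging attempt ends up doing (the connection string is hypothetical):

```csharp
using System.Data.SqlClient;

public static class LoggingConnectionDemo
{
    public static void Main()
    {
        // Hypothetical connection string mistakenly pointing at the staging
        // server. With no explicit setting, SqlConnection uses its default
        // 15-second connect timeout, so every logging attempt blocks the
        // request for ~15 seconds before failing.
        var connectionString =
            "Server=staging-sql;Database=Logging;Integrated Security=True";

        using (var connection = new SqlConnection(connectionString))
        {
            // Throws SqlException after ~15 seconds when the staging server
            // is unreachable from the production DMZ.
            connection.Open();
        }
    }
}
```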

Situation #2: During deployment the service account wasn’t granted permissions to write to the logging database

This looks similar to the example above, but when we drill into the error we can see the request had an internal exception during processing:

The exception says the service account didn’t have permission to run the stored procedure “WriteLog”, which writes entries to the logging database. From a performance perspective, the overhead of a security failure is lower than that of the timeouts in the example above, but the result is the same — we won’t be able to see the originating exception.
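A simple guard against this class of problem is a post-deployment check that runs under the application’s service account and verifies it can actually execute the logging stored procedure. A minimal sketch, assuming a hypothetical connection string:

```csharp
using System;
using System.Data.SqlClient;

// Post-deployment smoke test: run under the application's service account
// and verify it can execute the logging stored procedure before going live.
public static class LoggingPermissionCheck
{
    public static void Main()
    {
        // Hypothetical connection string for the production logging database.
        var connectionString =
            "Server=prod-sql;Database=Logging;Integrated Security=True";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT COUNT(*) FROM fn_my_permissions('dbo.WriteLog', 'OBJECT') " +
            "WHERE permission_name = 'EXECUTE'", connection))
        {
            connection.Open();
            var canExecute = (int)command.ExecuteScalar() > 0;
            Console.WriteLine(canExecute
                ? "Service account can execute WriteLog"
                : "MISSING EXECUTE permission on WriteLog");
            if (!canExecute) Environment.Exit(1);
        }
    }
}
```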

Such problems are usually caused by an application deployment/configuration process that isn’t fully documented or automated.

These are one-time issues: once you fix them, that machine will keep working. However, the next time you deploy the application to a new server or VM, the same thing will happen again until you fix the deployment process.

Let’s check the EntLibLogging database — it has no rows.

Here’s some analysis to explain why this happened:

  1. We found exceptions thrown while the application was logging an error

  2. This means there was an original error, and the application was trying to report it through logging

  3. Logging failed, which means the original error was never reported!

  4. And… logging doesn’t record its own failures anywhere, which means that from a logging perspective the application has no problems!

This is logically correct — if you can’t write data to the logging database, you can’t log anything, including the logging failure itself. Typically, loggers are implemented similarly to the following example:
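A minimal sketch of that pattern, with illustrative names:

```csharp
using System;

public interface ILogger
{
    void LogError(Exception ex);
}

public class OrderProcessor
{
    private readonly ILogger _logger;

    public OrderProcessor(ILogger logger) { _logger = logger; }

    public void ProcessCheckout(int orderId)
    {
        try
        {
            // ... business logic that may throw ...
        }
        catch (Exception ex)
        {
            try
            {
                // Attempt to record the original failure, e.g. in the
                // logging database.
                _logger.LogError(ex);
            }
            catch
            {
                // Logging is the last resort: when it fails, the failure is
                // swallowed here, and the original exception is never
                // reported anywhere.
            }
        }
    }
}
```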

Logging is the last line of defense in this case, and when it fails nothing else happens, as you can see in the code above.

Just to clarify, AppDynamics was able to report these exceptions because the agent instruments common methods like ADO.NET calls, HTTP calls, and other exit calls as well as error handlers, which helped in identifying the problem.

Going back to our examples: what if the deployment and configuration process is now fixed and fully automated, so there can’t be a manual mistake? Do you still need to worry? Unfortunately, these issues happen more often than you’d expect. Here is another real example.

Situation #3: What happens when the logging database fills up?

Everything is configured correctly, but at some point the logging database fills up. In the screenshot above you can see this happened around 10:15pm. As a result, response times and error rates spiked.

Here is one of the snapshots collected at that time:

You can see that in this situation the application spent over 32 seconds trying to log data. Here are the exception details:

The worst part: from 10:15pm on, the application was unable to report its own problems because the database was completely full. This may incorrectly be interpreted to mean the application is healthy, since there are no new log entries and it is “not failing”.

We’ve seen time and again that the logging database isn’t treated as a critical piece of the application, so it gets pushed down the priority list and overlooked. Logging is part of your application logic and should fall into the same category as the application itself. It’s essential to document, test, properly deploy, and monitor your logging.
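One lightweight way to monitor the logging pipeline itself is a scheduled heartbeat that writes a marker entry through the same path the application logs to and alerts loudly when the write fails. A minimal sketch, assuming a hypothetical table and connection string:

```csharp
using System;
using System.Data.SqlClient;

// Scheduled heartbeat (e.g. every minute): writes a marker row into the
// logging database and fails loudly if the write does not succeed.
public static class LoggingHeartbeat
{
    public static void Main()
    {
        // Hypothetical connection string and table name.
        var connectionString =
            "Server=prod-sql;Database=Logging;Integrated Security=True";
        try
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "INSERT INTO dbo.Heartbeat (LoggedAt) VALUES (@now)", connection))
            {
                command.Parameters.AddWithValue("@now", DateTime.UtcNow);
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
        catch (Exception ex)
        {
            // Raise an alert through a channel that does NOT depend on the
            // logging database (monitoring API, email, event log, ...).
            Console.Error.WriteLine("Logging pipeline is broken: " + ex);
            Environment.Exit(1);
        }
    }
}
```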

With capacity planning this problem can be avoided almost entirely; the main remaining risk is an unexpected surge of traffic due to a sales event, new release, marketing campaign, etc. Other than the rare Slashdotting effect, your database should never reach full capacity and stop logging. Without sufficient room in your database, your application’s performance is in jeopardy, and you won’t know it, since your monitoring framework isn’t notifying you. Because these issues are still possible, albeit most likely during a large load surge, it’s important to continuously monitor your logging; you wouldn’t want an issue to occur during an important event.

Key points:

  • Logging adds a new dependency to the application

  • Logging can fail to log the data – there could be several reasons why

  • When this happens, you won’t be notified about the original problem or the logging failure, and the performance issues will compound

This would never happen to your application, would it?

If you’d like to try AppDynamics check out our free trial and start monitoring your apps today! Also, be sure to check out my previous post, The Real Cost of Logging.

The Real Cost of Logging

In order to manage today’s highly dynamic application environments, many organizations turn to their logging system for answers – but reliance on these systems may be having an undesired impact on the applications themselves.

The vast majority of organizations use some sort of logging system — it could log errors, traces, informational messages, or debugging information. Applications can write logs to a file, database, Windows event log, or big data store, and there are many logging frameworks and practices in use.

Logging provides good insight into application behavior, especially failures. However, by being part of the application, logging also participates in the execution chain, which can have its disadvantages. While working with customers, we often see cases where logging alone introduced an adverse impact on the application.

Most of the time the overhead of logging is negligible. It only matters when the application is under significant load — but those are the times when it matters the most. Think about Walmart or Best Buy during Black Friday and Cyber Monday. Online sales are particularly crucial for these retail companies during this period, and this is when their applications are under the most stress.

To better explain the logging overhead, I created a lightweight .NET application that:

  1. is implemented using ASP.NET
  2. performs lightweight processing
  3. has an exception built in
  4. always handles exceptions within a try…catch statement
  5. either logs exceptions using log4net or ignores them, depending on the test

In my example I used log4net, as I recently diagnosed a similar problem with a customer who was using log4net; however, it could be replaced with any other framework you use.
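The hot path of the test looks roughly like the following sketch (not the exact test code; the class name and toggle are illustrative):

```csharp
using System;
using log4net;

public class ProcessingPage
{
    private static readonly ILog Log =
        LogManager.GetLogger(typeof(ProcessingPage));

    // Toggled per test run: false for the baseline (Test #1),
    // true when exceptions are logged (Test #2).
    private static readonly bool LogExceptions = true;

    public void Process()
    {
        try
        {
            // Lightweight processing with a built-in failure.
            throw new InvalidOperationException("Built-in test exception");
        }
        catch (InvalidOperationException ex)
        {
            if (LogExceptions)
            {
                Log.Error("Processing failed", ex);
            }
            // Otherwise the exception is handled and ignored.
        }
    }
}
```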

Test #1

First, we set a baseline by running the application with exceptions not being logged from the catch statement.

Test #2

Next, I enabled exception logging to a local file and ran the same load test.
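The file logging for this kind of test can be wired up along these lines (a sketch using log4net’s programmatic configuration; the file path and layout are illustrative):

```csharp
using log4net.Appender;
using log4net.Config;
using log4net.Layout;

public static class LoggingSetup
{
    public static void Configure()
    {
        // PatternLayout must be activated before use.
        var layout = new PatternLayout(
            "%date [%thread] %-5level %logger - %message%newline");
        layout.ActivateOptions();

        // FileAppender writes each log event to a local file through a
        // FileStream, which is where the contention shows up under load.
        var appender = new FileAppender
        {
            File = @"C:\logs\app.log",   // illustrative path
            AppendToFile = true,
            Layout = layout
        };
        appender.ActivateOptions();

        BasicConfigurator.Configure(appender);
    }
}
```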

As you can see, not only is the average response time significantly higher, but the throughput of the application is also lower.

An AppDynamics snapshot is collected automatically whenever there is a performance problem or failure, and it includes the full call graph with timings for each executed method.

By investigating the call graph AppDynamics produces, we can see that log4net’s FileAppender renders the error information to a file using a FileStream. On the right you can see the duration of each call; most of the time was spent in WriteFileNative, as it competed with similar requests trying to append error details to the log file.

Test #3

I often come across attempts to make exception logging asynchronous by using the ThreadPool. Below is how the performance looks in this setup under exactly the same load.

This is a clever concept and works adequately for low-throughput applications, but as you can see the average response time is still in a similar range to the non-asynchronous version, while throughput is slightly lower.

Why is this? Running logging on separate threads means the resources are still consumed, fewer threads are available, and the number of context switches is greater.
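For reference, the pattern under test looks roughly like this sketch (illustrative names; the logger is the same log4net setup as before):

```csharp
using System;
using System.Threading;
using log4net;

public class AsyncLoggingHandler
{
    private static readonly ILog Log =
        LogManager.GetLogger(typeof(AsyncLoggingHandler));

    public void Handle(Exception ex)
    {
        // Hand the write off to a ThreadPool thread so the request thread
        // is not blocked on file I/O. The work (and the contention on the
        // log file) still happens, just on different threads.
        ThreadPool.QueueUserWorkItem(_ => Log.Error("Processing failed", ex));
    }
}
```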

In my experience, logging to a file is exceptionally common. Other types of storage may perform better, but they still need to be tested, and logging to a file remains the easier solution.

Summary

While logging is important and helps with application support and troubleshooting, logging should be treated as part of the application logic. This means logging has to be designed, implemented, tested, monitored and managed. In short, it should become a part of full application lifecycle management.

How well do you understand your logging framework and its performance cost?

Take five minutes to get complete visibility into the performance of your production applications with AppDynamics today.

If all you have is logs, you are doing it wrong

Ten years ago, the standard way to troubleshoot an application issue was to look at the logs. Users would complain about a problem, you’d go to operations and ask for a thread dump, and then you’d spend some time poring over log files looking for errors, exceptions, or anything that might indicate a problem. There are some people who still use this approach today with some success, but for most modern applications logging is simply not enough. If you’re depending on log files to find and troubleshoot performance problems, then chances are your users are suffering – and you’re losing money for your business. In this blog we’ll look at how and why logging is no longer enough for managing application performance.

The Legacy Approach

The typical legacy web application was monolithic and fairly static, with a single application tier talking to a single database that was updated every six months. The legacy approach to monitoring production web applications was essentially a customer support loop. A customer would contact the support team to report an outage or bug, the customer support team would report the incident to the operations team, and then the operations team would investigate by looking at the logs with whatever useful information the customer provided (username, timestamps, etc.). If the operations team was lucky and the application had ample logging, they would spot the error and bring in developers to find the root cause and provide a resolution. That is the ideal scenario; more often than not, the logs were of very little use, and the operations team would have to wait for another user to complain about a similar problem and kick off the process again. Ten years ago, this was what production monitoring looked like. Apart from some rudimentary server monitoring tools that could alert the operations team when a server was unavailable, it was the end users who were counted on to report problems.

Monolithic App

Logging is inherently reactive

The most important reason that logging was never truly an application performance management strategy is that logging is an inherently reactive approach to performance. Typically this means an end user is the one alerting you to a problem, which means that they were affected by the issue – and (therefore) so was your business. A reactive approach to application performance loses you money and damages your reputation. So logging isn’t going to cut it in production.

You’re looking for a needle in a haystack

Another reason why logging was never a perfect strategy is that system logs have a particularly low signal-to-noise ratio. This means that most of the data you’re looking at (which can amount to terabytes for some organizations) isn’t helpful. Sifting through log files can be a very time-consuming process, especially as your application scales, and every minute you spend looking for a problem is a minute your customers are affected by a performance issue. Of course, newer tools like Splunk, Loggly, SumoLogic, and others have made sorting through log files easier, but you’re still looking for a needle in a haystack.

Logging requires an application expert

Which brings us to another reason logging never worked: Even with tools like Loggly and Splunk, you need to know exactly what to search for before you start, whether it’s a specific string, a time range, or a particular file. This means the person searching needs to be someone who knows the application well, usually a developer or an architect. Even then, their hunches could be wrong, especially if it’s a performance issue that you’ve never encountered before.

Not everyone has access to logs

Logging is a great tool for developers debugging their code on their laptops, but things get more complicated in production, especially if the application deals with sensitive data like credit card numbers. There are usually restrictions on the production system that prevent people like developers from accessing the production logs. In some organizations, access can be requested from the operations team, but this step can take a while. In a crisis, every second counts, and these processes (while important) can cost the organization money if your application is down.

It doesn’t work in production

Even in a perfect world where you have complete access to your application’s log files, you still won’t have complete visibility into what’s going on in your application. The developer who wrote the code is ultimately the one who decides what gets logged, and the verbosity of those logs is often limited by performance constraints in production. So even if you do everything right there’s still a chance you’ll never find what you’re looking for.

The Modern Approach

Today, enterprise web applications are much more complex than they were ten years ago. The new normal for these applications includes multiple application tiers communicating via a service-oriented architecture (SOA) that interacts with several databases and third-party web services while processing items out of caches and queues. The modern application has multiple clients from browser-based desktops to native applications on mobile. As a result, it can be difficult just to know where to start if you’re depending on log files for troubleshooting performance issues.

Distributed App

Large, complex, and distributed applications require a new approach to monitoring. Even if you have centralized log collection, finding which tier a problem is on is often a large challenge on its own. Increasing logging verbosity is usually not an option, as production does not run with debugging enabled due to performance constraints. First the operations team must figure out which log file to look through, and once they find something that resembles an error, they reach out to the developers to learn what it means and how to resolve it. Furthermore, with the evolution of web 2.0 applications, which are heavily dependent on JavaScript and the capabilities of the browser, properly verifying that the application is operating as expected requires end-user monitoring capable of capturing errors on the client side and in native mobile applications.

Logging is simply not enough

Logging is not enough: modern applications require application performance management to keep application owners informed and to minimize the business impact of performance degradation and downtime.

Logging simply does not provide enough information to get to the root cause of problems in modern distributed applications. The problems of production monitoring have changed, and so have the solutions. Your end users are demanding and fickle, and you can’t afford to let them down. This means you need the fastest and most effective way to troubleshoot and solve performance problems, and you can’t rely on the chance that you might find the message in the log. Business owners, developers, and operations need in-depth visibility into the app, and the only way to get that is with application performance monitoring.

Get started with AppDynamics Pro today for in-depth application performance management.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.