Tales from the Field: Migrating from Log4J to Log4J2

The AppDynamics Java agent, like many Java applications, uses logging extensively. We have used Log4J as our logging framework for many years. Although the last release of Log4J was in 2012, and the Apache Foundation announced end-of-life for Log4J in August 2015, we didn’t upgrade to Log4J2 because we needed to maintain support for the Java 5 VM and had other competing priorities. However, our recent move from a monolithic repository to a product-specific repository has made the upgrade possible.

Log4J2 is full of enticing features. For example, the framework dramatically increases the speed of logging and reduces memory usage by providing garbage-free logging. Through native support for async logging, we can further reduce the already small amount of time we spend logging when running on customer applications. Since compression is also a native feature, our agent can tolerate more logging while simultaneously reducing file storage requirements. Together, these features allow us to add more frequent and better-quality logging that contains actionable information for our customer success and dev teams to help our customers.
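
For illustration, here is a minimal sketch (not our agent’s actual bootstrap code) of how Log4J2’s async loggers are switched on globally; this is standard Log4J2 usage and assumes the LMAX Disruptor library is on the classpath:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public final class AsyncLoggingDemo {
    public static void main(String[] args) {
        // Route every logger through Log4J2's async context selector.
        // Must be set before the first LogManager.getLogger() call.
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");

        Logger log = LogManager.getLogger(AsyncLoggingDemo.class);
        log.info("This call returns quickly; file I/O happens on a background thread.");
    }
}
```

Compression, by contrast, needs no code at all: giving a RollingFileAppender a filePattern that ends in .gz makes Log4J2 compress each rotated file automatically.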

Migration Goals and Challenges

So, what did we want to accomplish with the migration? And what were the challenges we faced during the move?

Let’s start with challenges:

– We had to namespace the framework packages to isolate our use of Log4J from our customers’ logging frameworks. We also needed to make the source Java 5 compatible, because standard Log4J2 requires Java 6 and above.

– Since almost every class uses logging, we had to find a way to make these changes incremental and (relatively) easy to review to maintain the high quality that’s necessary in a production monitoring agent.

– We had to be able to fall back to Log4J (which is proven to work) in case Log4J2 initialization fails.

Our first step was to repackage the jar with Java 5-compatible source. This step was easy: we programmatically refactored all the classes to namespace their packages, and we manually fixed a few issues involving APIs that only Java 6 and above supports, such as String.isEmpty().
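
As a concrete (if hypothetical) example of the kind of fix involved: String.isEmpty() was added in Java 6, so call sites had to be rewritten against a Java 5-compatible helper along these lines:

```java
// Java 5-compatible stand-in for String.isEmpty(), which only exists on Java 6+.
// Illustrative only; the class and method names here are made up.
final class Strings {
    private Strings() {
    }

    static boolean isEmpty(String s) {
        return s.length() == 0;
    }
}
```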

The second step was to test the framework in a compatible environment. We used a Docker container with Java 5 installed and created a test application mirroring our agent structure. This step took time because we needed to figure out how configuration and customization would work with our agent. For example, one of our features is agent error safety: we silence the logs and remove instrumentation if our agent code experiences too many internal errors. Another is reusing node names: we buffer the logging events and only write them to file after we learn the node name from our UI. Using the test application, we were able to mimic all of these features in preparation for the migration.
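
To make the node-name buffering concrete, here is a minimal sketch of what such an appender could look like on the Log4J2 API. It is illustrative only (our real appender also has to handle overflow, shutdown, and error safety), and it uses the older, simpler AbstractAppender constructor:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.logging.log4j.core.Appender;
import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.appender.AbstractAppender;

// Sketch: hold events in memory until the node name is known, then replay
// them to the real file appender. Class and method names are hypothetical.
final class NodeNameBufferingAppender extends AbstractAppender {

    private final List<LogEvent> buffer = new ArrayList<LogEvent>();
    private Appender delegate; // the real appender, set once the node name arrives

    NodeNameBufferingAppender(String name) {
        super(name, null, null); // no filter, no layout: nothing is formatted here
    }

    @Override
    public synchronized void append(LogEvent event) {
        if (delegate != null) {
            delegate.append(event);
        } else {
            // Copy the event: Log4J2 may reuse mutable event objects.
            buffer.add(event.toImmutable());
        }
    }

    // Called once the UI has assigned the node name and the file appender exists.
    synchronized void drainTo(Appender fileAppender) {
        for (LogEvent event : buffer) {
            fileAppender.append(event);
        }
        buffer.clear();
        delegate = fileAppender;
    }
}
```

(As with any Log4J2 appender, the instance must be started with start() before logging.)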

To make the change reversible, we kept both frameworks present at the same time. We used the bridge pattern to extract logging into a separate shared package. This allows us to have multiple logging frameworks in the code base and to switch between them easily at runtime. It also allows us to upgrade logging frameworks in the future, giving us flexibility to change. This step was significant because we had to change the build scripts and touch every single file that uses a logger.
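
Here is a minimal sketch of that bridge, with hypothetical names: agent code logs only against a shared interface, and a factory decides at runtime which framework backs it, which is also what makes the Log4J fallback described above possible:

```java
// All names here are hypothetical; this shows the shape of the bridge, not our code.
interface AgentLogger {
    void info(String message);
    void error(String message, Throwable t);
}

final class Log4J2AgentLogger implements AgentLogger {
    private final org.apache.logging.log4j.Logger delegate;

    Log4J2AgentLogger(Class<?> owner) {
        delegate = org.apache.logging.log4j.LogManager.getLogger(owner);
    }

    public void info(String message) { delegate.info(message); }
    public void error(String message, Throwable t) { delegate.error(message, t); }
}

final class Log4J1AgentLogger implements AgentLogger {
    private final org.apache.log4j.Logger delegate;

    Log4J1AgentLogger(Class<?> owner) {
        delegate = org.apache.log4j.Logger.getLogger(owner);
    }

    public void info(String message) { delegate.info(message); }
    public void error(String message, Throwable t) { delegate.error(message, t); }
}

final class AgentLoggerFactory {
    private static volatile boolean useLog4J2 = true; // flipped off if Log4J2 init fails

    static AgentLogger getLogger(Class<?> owner) {
        return useLog4J2 ? new Log4J2AgentLogger(owner)
                         : new Log4J1AgentLogger(owner);
    }
}
```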

Lastly, we moved over the Log4J2 versions of the custom appenders we created in the second step, copied the configuration code, and with that we successfully upgraded our logging framework!

Log4J2 Log Correlation Support in 4.4

While working with Log4J2, we also took the opportunity to add support for it to our log correlation feature.

Log Correlation enables the user to designate a spot in their log appender pattern for us to insert our business transaction (BT) request GUID at runtime. Any call made to the logger in the context of a BT will dynamically inject the GUID, regardless of whether the line ultimately ends up in a file or on the console. The presence of these GUIDs in the log output empowers log processing applications, including our own Log Analytics product as well as others such as Splunk. Using them, we can correlate the lines logged by an individual transaction to the snapshot data we collected about that request on the APM side, all without requiring any changes to the customer application. Conversely, it also enables users of our Controller to easily jump from a BT snapshot to the exact lines that were logged during that BT request.
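
The effect is similar to what you would get from Log4J2’s ThreadContext (its mapped diagnostic context), though the agent does the injection transparently via instrumentation rather than requiring application code. As a rough illustration (the key name and GUID below are made up):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.ThreadContext;

public final class CorrelationSketch {
    public static void main(String[] args) {
        // With an appender pattern such as "%d %p [%X{requestGUID}] %m%n",
        // every line logged while this key is set carries the request GUID.
        ThreadContext.put("requestGUID", "1234-abcd-example");
        try {
            LogManager.getLogger(CorrelationSketch.class)
                      .info("handled /checkout in 42 ms");
        } finally {
            ThreadContext.remove("requestGUID"); // don't leak onto pooled threads
        }
    }
}
```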

In addition to the new Log4J2 support, we continue to support Log4J, Logback, and SLF4J.

Final Thoughts

Working toward a product-wide upgrade is daunting at first. However, once it’s broken up into small, independent steps, it becomes much more manageable: running a 10k seems harder than running ten 1ks. The upgrade went smoothly because each step changed the product while keeping it functional and ready to ship, which also made build verification and code review faster.

To learn more, see our documentation on business transaction and log correlation.

Want to see how AppDynamics Log Analytics works? Start a free trial today.

Haojun Li is a co-author of this blog post. Haojun is a software engineer who has been with AppDynamics for approximately 5 months. He is a recent graduate of UC Berkeley with a degree in Computer Science and Statistics. He enjoys sailing and road biking on weekends.

If all you have is logs, you are doing it wrong

Ten years ago, the standard way to troubleshoot an application issue was to look at the logs. Users would complain about a problem, you’d go to operations and ask for a thread dump, and then you’d spend some time poring over log files looking for errors, exceptions, or anything that might indicate a problem. There are some people who still use this approach today with some success, but for most modern applications logging is simply not enough. If you’re depending on log files to find and troubleshoot performance problems, then chances are your users are suffering – and you’re losing money for your business. In this blog we’ll look at how and why logging is no longer enough for managing application performance.

The Legacy Approach

The typical legacy web application was monolithic and fairly static, with a single application tier talking to a single database and updates shipping every six months. The legacy approach to monitoring production web applications was essentially a customer support loop. A customer would contact the support team to report an outage or bug, the customer support team would report the incident to the operations team, and the operations team would investigate by looking at the logs with whatever useful information they had from the customer (username, timestamps, etc.). If the operations team was lucky and the application had ample logging, they would spot the error and bring in developers to find the root cause and provide a resolution. That was the ideal scenario; more often than not, the logs were of very little use, and the operations team would have to wait for another user to complain about a similar problem and kick off the process again. Ten years ago, this was what production monitoring looked like. Apart from some rudimentary server monitoring tools that could alert the operations team if a server was unavailable, it was the end users who were counted on to report problems.

[Figure: Monolithic App]

Logging is inherently reactive

The most important reason that logging was never truly an application performance management strategy is that it is an inherently reactive approach to performance. Typically an end user is the one alerting you to a problem, which means they were already affected by the issue, and therefore so was your business. A reactive approach to application performance loses you money and damages your reputation. So logging isn’t going to cut it in production.

You’re looking for a needle in a haystack

Another reason why logging was never a perfect strategy is that system logs have a particularly low signal-to-noise ratio. Most of the data you’re looking at (which can amount to terabytes for some organizations) isn’t helpful. Sifting through log files can be a very time-consuming process, especially as your application scales, and every minute you spend looking for a problem is another minute your customers are affected by the performance issue. Of course, newer tools like Splunk, Loggly, SumoLogic, and others have made sorting through log files easier, but you’re still looking for a needle in a haystack.

Logging requires an application expert

Which brings us to another reason logging never worked: Even with tools like Loggly and Splunk, you need to know exactly what to search for before you start, whether it’s a specific string, a time range, or a particular file. This means the person searching needs to be someone who knows the application well, usually a developer or an architect. Even then, their hunches could be wrong, especially if it’s a performance issue that you’ve never encountered before.

Not everyone has access to logs

Logging is a great tool for developers to debug their code on their laptops, but things get more complicated in production, especially if the application is dealing with sensitive data like credit card numbers. There are usually restrictions on the production system that prevent people like developers from accessing the production logs. In some organizations, the logs can be requested from the operations team, but this step can take a while. In a crisis, every second counts, and these processes (while important) can cost your organization money if your application is down.

It doesn’t work in production

Even in a perfect world where you have complete access to your application’s log files, you still won’t have complete visibility into what’s going on in your application. The developer who wrote the code is ultimately the one who decides what gets logged, and the verbosity of those logs is often limited by performance constraints in production. So even if you do everything right, there’s still a chance you’ll never find what you’re looking for.

The Modern Approach

Today, enterprise web applications are much more complex than they were ten years ago. The new normal for these applications is multiple application tiers communicating via a service-oriented architecture (SOA), interacting with several databases and third-party web services while processing items out of caches and queues. The modern application also has multiple clients, from browser-based desktops to native mobile applications. As a result, it can be difficult just to know where to start if you’re depending on log files to troubleshoot performance issues.

[Figure: Distributed App]

Large, complex, distributed applications require a new approach to monitoring. Even if you have centralized log collection, finding which tier a problem is on is often a large challenge on its own. Increasing logging verbosity is usually not an option, because production does not run with debugging enabled due to performance constraints. First the operations team must figure out which log file to look through; then, once they find something that resembles an error, they must reach out to the developers to find out what it means and how to resolve it. Furthermore, web 2.0 applications depend heavily on JavaScript and the capabilities of the browser, so verifying that the application is operating as expected also requires end-user monitoring that can capture errors on the client side and in native mobile applications.

Logging is simply not enough

Logging is not enough: modern applications require application performance management, which keeps application owners informed and minimizes the business impact of performance degradation and downtime.

Logging simply does not provide enough information to get to the root cause of problems in modern distributed applications. The problems of production monitoring have changed, and so has the solution. Your end users are demanding and fickle, and you can’t afford to let them down. This means you need the fastest and most effective way to troubleshoot and solve performance problems, and you can’t rely on the chance that you might find the right message in the log. Business owners, developers, and operations need in-depth visibility into the app, and the only way to get that is with application performance monitoring.

Get started with AppDynamics Pro today for in-depth application performance management.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.