How Hallmark was able to proactively avoid application outages and stalls

Last week I was able to catch up with Chris Tranter, Technical Lead at Hallmark UK. cutsomer_carousel_logo_hallmark

Hallmark’s engineering team runs on tight deadlines, and like all popular websites, places a huge importance on customer experience. As you can imagine, their site has abnormal and inconsistent load times, especially through the holiday season. Ensuring a smooth, seamless experience without stalls, outages, or crashes is extremely important.

It was great to see how AppDynamics APM can become a necessary solution on a developer’s toolkit.

Hannah Current: What challenges did you suffer before using APM? And how did you troubleshoot before using an APM tool?

Chris Tranter: Hallmark UK undertook a business transformation project under very tight timescales, specifically in the area of technical planning and design. We had to move fast to deliver an enterprise solution and while confident we could deliver, I was nervous about hitting stalling points as you do with any development. I knew from experience that troubleshooting issues could take time, sometimes days. I needed something to help in this area, so I started to search for system monitoring tools on the internet.

HC: What was your APM selection process/criteria?

CT: I needed something that could oversee a .net/sql based development environment. Previously I’d used smaller scale tools for memory analysis, resource usage etc. They didn’t appear to fit the scale of the requirement here as we have many components working together under a service based architecture. I wasn’t sure if it was possible to monitor everything working together with one solution, so it was purely a research exercise to see what I could find; I’d expected to find a collection of different tools which we could use to help us through the process, In our case this needed to cover .net web services, web sites, windows services, MSMQ queues and server performance. Utopia was having everything captured in one central location.

HC: Why AppDynamics over the other solutions out there?

CT: After reading the promotional info and investigating AppDynamics, I was happy to see I’d found a solution which answered all my requirements. I looked around for similar products but nothing seemed to fit as well as AppDynamics. As mentioned earlier, it’s possible to put a toolkit together from various vendors, but the plus points of AppDynamics were that it was constantly monitoring and alerting and we didn’t have to do much to capture that information once it was up and running. I was also very interested in being able to move back in time and see what was happening on the system when something had occurred as debugging user test systems on the fly was out of the question, we’d have to replicate issues manually which given the nature of the testing and the required positioning of staged data, was often difficult and time consuming. Using the cloud Saas controller, it’s possible to access this data while out of the office, something which has proved invaluable with having contract resource working off site.

HC: How has AppDynamics helped to solve some critical problems?

CT: It’s difficult to quantify this answer. All I know is that there have been significant issues which on reflection would have caused long stalls in the project had we not have had this solution in place. As we have very tight timelines, the impact of these stalls could have severely affected the project plan. The issues came and went with minimal delay as we were able to easily pinpoint the cause of problems in a matter of minutes. It really is that good! … for example.. Our service based solution comprises of a number of WCF services all receiving calls from desktop applications and web applications. Under load, we noticed that we would encounter stalls which would freeze up all clients for minutes at a time. With so many different incoming calls from the estate coming through a central point, debugging was a difficult task. Only it isn’t with AppDynamics, as it’s capturing the data all the time. I staged a few tests and forced the stalls, all the time data was being captured by the AppDynamics agents. I was then able to review the snapshots and immediately see the cause of the problem and where it had occurred. Finding this information is not always easy, sifting through event logs, IIS logs etc… no need anymore, it’s all there for you.

image008
Want to see how AppDynamics APM can help you stay ahead of your performance problems? Check out a FREE trail today!

How Do you Monitor Your Logging?

Applications typically log additional data such as exceptions to different data sources. Windows event logs, local files, and SQL databases are most commonly used in production. New applications can take advantage of leveraging big data instead of individual files or SQL.

One of the most surprising experiences when we start monitoring applications is noticing the logging is not configured properly in production environments. There have been two types of misconfiguration errors we’ve seen often in the field:

  1. logging configuration is copied from staging settings

  2. while deploying the application to production environment, logging wasn’t fully configured and the logging failed to log any data

To take a closer look, I have a couple of sample applications to show how the problems could manifest themselves. These sample applications were implemented using MVC5 and are running in Windows Azure and using Microsoft Enterprise Library Exception Handling and Logging blocks to log exceptions to the SQL database. There is no specific preference regarding logging framework or storage, just wanted to demonstrate problems similar to what we’ve seen with different customers.

Situation #1 Logging configuration was copied from staging to production and points to the staging SQL database

When we installed AppDynamics and it automatically detected the application flowmap, I noticed the application talks to the production UserData database and… a staging database for logging.

The other issue was the extremely slow response time while calling the logging database. The following snapshot can explain the slow performance, as you see there’s an exception happening while trying to run an ADO.NET query:

Exception details confirm the application was not able to connect to a database, which is expected — the production environment in located in DMZ and usually can’t reach a staging network.

To restate what we see above — this is a failure while trying to log the original exception which could be anything from a user not being able to log into the website to failing to checkout.

At the same time the impact is even higher because the application spends 15 seconds trying to connect to the logging database and timeout, all while the user is waiting.

Situation #2 During deployment the service account wasn’t granted permissions to write to the logging database

This looks similar to the example above but when we drill inside the error we can see the request has an internal exception happened during the processing:

The exception says the service account didn’t have permissions to run the stored procedure “WriteLog” which logs entries to the logging database. From the performance perspective, the overhead of security failure is less from timeouts in the example above but the result is the same — we won’t be able to see the originating exception.

Not fully documenting or automating the application deployment/configuration process usually causes such problems.

These are one-time issues that once you fix it will work on the machine. However, next time you deploy the application to a new server or VM this will happen again until you fix the deployment.

Let’s check the EntLigLogging database — it has no rows

Here’s some analysis to explain why this happened:

  1. We found exceptions when the application was logging the error

  2. This means there was an original error and the application was trying to report it using logging

  3. Logging failed which means the original error was never reported!

  4. And… logging doesn’t log anywhere about its failures, which means from a logging perspective the application has no problems!!

This is logically correct — if you can’t log data to the storage database you can’t log anything. Typically, loggers are implemented similar to the following example:

Logging is the last option in this case and when it fails nothing else happens as you see in the code above.

Just to clarify, AppDynamics was able to report these exceptions because the agent instruments common methods like ADO.NET calls, HTTP calls, and other exit calls as well as error handlers, which helped in identifying the problem.

Going back to our examples, what if the deployment and configuration process is now fixed and fully automated so there can’t be a manual mistake? Do you still need to worry? Unfortunately, these issues happen more often than you’d expect, here is another real example.

Situation #3 What happens when the logging database fills up?

Everything is configured correctly but at some point the logging database fills up. In the screenshot above you can this this happened around 10:15pm. As a result, the response time and error rates have spiked.

Here is one of the snapshots collected at that time:

You can see that in this situation it took over 32 seconds trying to log data. Here are the exception details:

The worst part is at 10:15pm the application was not able to report about its own problems due to the database being completely full, which may incorrectly be translated that the application is healthy since it is “not failing” because there are no new log entries.

We’ve seen enough times that the logging database isn’t seen as a critical piece of the application therefore it gets pushed down the priority list and often overlooked. Logging is part of your application logic and it should fall into the same category as the application. It’s essential to document, test, properly deploy and monitor the logging.

This problem could be avoided entirely unless your application receives an unexpected surge of traffic due to a sales event, new release, marketing campaign, etc. Other than the rare Slashdotting effect, your database should never get to full capacity and result in a lack of logging. Without sufficient room in your database, your application’s performance is in jeopardy and you won’t know since your monitoring framework isn’t notifying you. Because these issues are still possible, albeit during a large load surge, it’s important to continuously monitor your loggingn as you wouldn’t want an issue to occur during an important event.

Key points:

  • Logging adds a new dependency to the application

  • Logging can fail to log the data – there could be several reasons why

  • When this happens you won’t be notified about the original problem or a logging failure and the performance issues will compound

This would never happen to your application, would it?

If you’d like to try AppDynamics check out our free trial and start monitoring your apps today! Also, be sure to check out my previous post, The Real Cost of Logging.

Instrumenting .NET applications with AppDynamics using NuGet

Introduction

One of the coolest things to come out of the .NET stable at AppD this week was the NuGet package for Azure Cloud Services. NuGet makes it a breeze to deploy our .NET agent along with your web and worker roles from inside Visual Studio. For those unfamiliar with NuGet, more information can be found here.

Our NuGet package ensures that the .NET agent is deployed at the same time when the role is published to the cloud. After adding it to the project you’ll never have to worry about deploying the agent when you swap your hosting environment from staging to production in Azure or when Azure changes the machine from under your instance. For the remainder of the post I’ll use a web role to demonstrate how to quickly install our NuGet package, changes it makes to your solution and how to edit the configuration by hand if needed. Even though I’ll use a web role, things work exactly the same way for a worker role.

Installation

So, without further ado, let’s take a look at how to quickly instrument .NET code in Azure using AppD’s NuGet package for Windows Azure Cloud Services. NuGet packages can be added via the command line or the GUI. In order to use the command line, we need to bring up the package manager console in Visual Studio as shown below

PackageManager

In the console, type ‘install-package AppDynamics.WindowsAzure.CloudServices’ to install the package. This will bring up the following UI where you can enter the information needed by the agent to talk to the controller and upload metrics. You should have received this information in the welcome email from AppDynamics.

Azure

The ‘Application Name’ is the name of the application in the controller under which the metrics reported by this agent will be stored. When ‘Test Connection’ is checked we will check the information entered by trying to connect to the controller. An error message will be displayed if the test connection is unsuccessful. That’s it, enter the information, click apply and we’re done. Easy Peasy. No more adding files one by one or modifying scripts by hand. Once deployed, instances of this web role will start reporting metrics as soon as they experience any traffic. Oh, and by the way, if you prefer to use a GUI instead of typing commands on the console, the same thing can be done by right-clicking on the solution in Visual Studio and choosing ‘Manage NuGet Package’.

Anatomy of the package

If you look closely at the solution explorer you’ll notice that a new folder called ‘AppDynamics’ has been created in the solution explorer. On expanding the folder you’ll find the following two files:

  • Installer of the latest and greatest .NET agent.
  • Startup.cmd
The startup script makes sure that the agent gets installed as a part of the deployment process on Azure. Other than adding these files we also change the ServiceDefinition.csdef file to add a startup task as shown below.

Screen Shot 2013-11-27 at 8.11.27 PM

In case, you need to change the controller information you entered in the GUI while installing the package, it can be done by editing the startup section of the csdef file shown above. Application name, controller URL, port, account key etc. can all be changed. On re-deploying the role to Azure, these new values will come into effect.

Next Steps

Microsoft Developer Evangelist, Bruno Terkaly blogged about monitoring the performance of multi-tiered Windows Azure based web applications. Find out more on Microsoft Developer Network.

Find out more in our step-by-step guide on instrumenting .NET applications using AppDynamics Pro. Take five minutes to get complete visibility into the performance of your production applications with AppDynamics Pro today.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

AppDynamics Pro on the Windows Azure Store

Over a year ago, AppDynamics announced a partnership with Microsoft and launched AppDynamics Lite on the Windows Azure Store. With AppDynamics Lite, Windows Azure users were able to easily monitor their applications at the code level, allowing them to identify and diagnose performance bottlenecks in real time. Today we’re happy to announce that AppDynamics Pro is now available as an addon in the Windows Azure store, which makes it easier for developers to get complete visibility into their mission-critical applications running on Windows Azure.

  • Easier/simplified buying experience in Windows Azure Store
  • Tiered pricing based on number of agents and VM size
  • Easy deployment from Visual Studio with NuGet
  • Out-of-the-box support for more Windows Azure services

“AppDynamics is one of only a handful of application monitoring solutions that works on Windows Azure, and the only one that provides the level of visibility required in our distributed and complex application environments,” said James Graham, project manager at MacMillan Publishers. “The AppDynamics monitoring solution provides insight into how our .NET applications perform at a code level, which is invaluable in the creation of a dynamic, fulfilling user experience for our students.”

Easy buying experience

Purchasing the AppDynamics Pro add-on in the Windows Azure Store only takes a couple of minutes. In the Azure portal click NEW at the bottom left of the screen and then select STORE. Search for AppDynamics, choose your plan, add-on name and region.

4-choose-appdynamics

Tiered pricing

AppDynamics Pro for Windows Azure features new tiered pricing based on the size of your VM (extra small, small or medium, large, or extra large) and the number of agents required (1, 5 or 10). This new pricing allows organizations with smaller applications to pay less to store their monitoring data than those with larger, more heavily trafficked apps. The cost is added to your monthly Windows Azure bill, and you can cancel or change your plan at any time.

AppDynamics on Windows Azure Pricing

Deploying with NuGet

Use the AppDynamics NuGet package to deploy AppDynamics Pro with your solution from Visual Studio. For detailed instructions check out the how-to guide.

2-vs-package-search

Monitoring with AppDynamics

  • Monitor the health of Windows Azure applications
  • Troubleshoot performance problems in real time
  • Rapidly diagnose root cause of performance problems
  • Dynamically scale up and scale down their Windows Azure application based on performance metrics

AppDynamics .Net App

Additional platform support

AppDynamics Pro automatically detects and monitors most Azure services out-of-the-box, including web and worker roles, SQL, Azure Blob, Azure Queue and Windows Azure Service Bus. In addition, AppDynamics Pro now supports MVC 4. Find out more in our getting started guide for Windows Azure.

Get started monitoring your Windows Azure app by adding the AppDynamics Pro add-on in the Windows Azure Store.

Top Tips for Managing .NET Application Performance

There are many technical articles/blogs on the web that jump straight into areas of .NET code you can instantly optimize and tune. Before we get to some of those areas, it’s good to take a step back and ask yourself, “Why am I here?” Are you interested in tuning your app, which is slow and keeps breaking, or are you looking to prevent these things from happening in the future? When you start down the path of Application Performance Management (APM), it is worth asking yourself another important question – what is success? This is especially important if you’re looking to tune or optimize your application. Knowing when to stop is as important as knowing when to start.

A single code or configuration change can have a dramatic impact on your application’s performance. It’s therefore important that you only change or tune what you need to – less is often more when it comes to improving application performance. I’ve been working with customers in APM for over a decade and it always amazes me how dev teams will browse through packages of code and rewrite several classes/methods at the same time with no real evidence that what they are changing will actually make an impact. For me, I learned the most about writing efficient code in code reviews with peers, despite how humbling it was. What I lacked the most as a developer, though, was visibility into how my code actually ran in a live production environment. Tuning in development and test is not enough if the application still runs slow in production. When manufacturers design and build cars they don’t just rely on simulation tests – they actually monitor their cars in the real world. They drive them for hundreds of thousands of miles to see how their cars will cope in all conditions they’ll encounter. It should be the same with application performance. You can’t simulate every use case or condition in dev and test, so you must understand your application performance in the real world.