The Digital Enterprise – Problems and Solutions

According to a recent article in Wall Street & Technology, Financial Services (FS) companies have a problem. The article explains that FS companies built more data center capacity than they needed when profits were up and demand was rising. Now that profits are lower and demand has not risen as expected, those data centers are partially empty and very costly to operate.

FS companies are starting to outsource their IT infrastructure, and this brings a new problem to light…

“It will take a decade to complete the move to a digital enterprise, especially in financial services, because of the complexity of software and existing IT architecture. “Legacy data and applications are hard to move” to a third party, Bishop says, adding that a single application may touch and interact with numerous other applications. Removing one system from a datacenter may disrupt the entire ecosystem.”

Serious Problems

The article calls out a significant problem that FS companies are facing now and will continue to face for the next decade, but it doesn't mention a solution.

The problem is that you can't just pick up an application and move it without impacting other applications. Based on my experience working with FS applications, I see three related problems:

  1. Disruption of other applications
  2. Baselining performance and availability before the move
  3. Ensuring performance and availability after the move

All of these problems increase risk and the chance that users will be impacted.

Solutions

1. Disruption of other applications – The solution to this problem is easy in theory and traditionally difficult in practice. The theory is simple: you need to understand all of the external interactions with the application you want to move.

One solution is to use ADDM (Application Discovery and Dependency Mapping) tools that scan your infrastructure looking for application components and the various communications to and from them. This method works okay (I have used it in the past), but it typically requires a lot of manual data manipulation after the fact to improve the accuracy of the discovered information.

ADDM product view of application dependencies.
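To make the discovery idea concrete, here is a minimal sketch (my own illustration, not something from the article or any specific ADDM product) of the raw data such a tool starts from: which local processes hold established TCP connections to which remote endpoints. It assumes the Python psutil package and, on some operating systems, elevated privileges to see other processes' connections.

```python
# Minimal sketch of ADDM-style dependency discovery: map which local
# processes hold established TCP connections to which remote endpoints.
# Assumes the 'psutil' package; real ADDM tools add protocol awareness,
# scheduled scans across many hosts, and dependency-graph reconciliation.
from collections import defaultdict

import psutil


def discover_dependencies():
    """Return {process_name: {(remote_ip, remote_port), ...}}."""
    deps = defaultdict(set)
    for conn in psutil.net_connections(kind="tcp"):
        # Keep only live outbound/inbound connections we can attribute to a process.
        if conn.status != psutil.CONN_ESTABLISHED or not conn.raddr or conn.pid is None:
            continue
        try:
            name = psutil.Process(conn.pid).name()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        deps[name].add((conn.raddr.ip, conn.raddr.port))
    return deps


if __name__ == "__main__":
    for proc, remotes in sorted(discover_dependencies().items()):
        print(proc)
        for ip, port in sorted(remotes):
            print(f"  -> {ip}:{port}")
```

Run repeatedly over time and across hosts, this kind of snapshot is what an ADDM product stitches into the dependency map shown above.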

Another solution is to use an APM (Application Performance Management) tool to gather the information from within the running application. The right APM tool will automatically see all application instances (even in a dynamically scaled environment) as well as all of the communications into and out of the monitored application.

APM visualization of an application and its components with remote service calls.

APM tool list of remote application calls with response times, throughput, and errors.
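To illustrate what that looks like from inside the application, here is a small, hypothetical sketch of the bookkeeping an APM agent performs around each remote call: timing, call counts, and errors. The names (track_remote_call, "orders-db") are invented for the example; a real agent instruments bytecode or library calls automatically rather than asking you to decorate functions.

```python
# Hypothetical sketch of per-remote-call metrics an APM agent collects:
# call count (throughput), errors, and total time for averaging.
import time
from collections import defaultdict
from functools import wraps

metrics = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0.0})


def track_remote_call(remote_name):
    """Wrap an outbound call so its timing and errors are recorded."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                metrics[remote_name]["errors"] += 1
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                m = metrics[remote_name]
                m["calls"] += 1
                m["total_ms"] += elapsed_ms
        return wrapper
    return decorator


@track_remote_call("orders-db")
def query_orders(customer_id):
    time.sleep(0.02)  # stand-in for a real database call
    return [{"customer": customer_id, "total": 42.0}]
```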


A combination of these two types of tools would provide accurate, easy-to-consume information (an APM strength) along with the flexibility to cover the one-off custom application processes that an APM tool might not support (an ADDM strength).

2. Baselining performance and availability before the move – It's critically important to understand the performance characteristics of your application before you move it. This provides the baseline required for comparison after the move. The last thing you want to do is degrade application performance and user satisfaction by moving an application. The solution here is to leverage the APM tool referenced in solution #1. Baselining is a core strength of APM and should be applied from multiple perspectives (a rough sketch of this kind of baselining follows the screenshots below):

  1. Overall application throughput, response times, and availability
  2. Individual business transaction throughput and response times
  3. External dependency throughput and response times
  4. Application error rate and type

Application overview with baseline information.

Business transaction overview and baseline information.
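As a rough illustration of what a pre-move baseline boils down to, the sketch below computes the average and 95th-percentile response time, throughput, and error rate for one business transaction from historical samples. The data and field names are made up; an APM tool collects and rolls these up automatically per application, per business transaction, and per external dependency.

```python
# Minimal sketch of baselining one business transaction before a move:
# average and 95th-percentile response time, throughput, and error rate
# computed from historical samples. Sample data is invented for illustration.
import statistics


def baseline(samples, window_minutes):
    """samples: list of (response_ms, is_error) tuples from the time window."""
    times = [ms for ms, _ in samples]
    errors = sum(1 for _, err in samples if err)
    return {
        "avg_ms": statistics.mean(times),
        "p95_ms": statistics.quantiles(times, n=20)[-1],  # ~95th percentile
        "calls_per_min": len(samples) / window_minutes,
        "error_rate": errors / len(samples),
    }


history = [(120, False), (135, False), (410, True), (128, False), (150, False)]
print(baseline(history, window_minutes=5))
```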

3. Ensuring performance and availability after the move – Now that your application has moved to an outsourcer, it's more important than ever to understand performance and availability. Invariably your application's performance will degrade and the finger-pointing between you and your outsourcer will begin. That is, unless you are using an APM tool to monitor your application. The whole point of APM tools is to end finger-pointing and to reduce mean time to restore service (MTRS) as much as possible. By using APM after the application move, you can continue to provide the highest possible level of service to your customers.

Comparison of two application releases (Key Performance Indicators): granular comparison to understand before and after states.

Comparison of two application releases (Load, Response Time, Errors): granular comparison to understand before and after states.
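Conceptually, the before/after comparison shown in the release-comparison views above is just a percent-change calculation per key performance indicator, as in this small sketch (the numbers are invented for illustration):

```python
# Minimal sketch of a before/after comparison: percent change per KPI.
# The baseline and post-move numbers below are made up for illustration.
def compare(before, after):
    """Return percent change per KPI between two measurement sets."""
    return {
        kpi: round((after[kpi] - before[kpi]) / before[kpi] * 100, 1)
        for kpi in before
    }


before_move = {"calls_per_min": 1200, "avg_response_ms": 145, "errors_per_min": 3}
after_move = {"calls_per_min": 1180, "avg_response_ms": 210, "errors_per_min": 9}

for kpi, pct in compare(before_move, after_move).items():
    print(f"{kpi}: {pct:+}%")
```

In practice you would compare against the baseline captured in step 2, broken down per business transaction, rather than a handful of aggregate numbers.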

If you're considering, or are already in the process of, transitioning to a digital enterprise, you should seriously consider using APM to solve a multitude of problems. You can click here to sign up for a free trial of AppDynamics and get started today.

The Most Important Lesson I Ever Learned About Solving Performance Problems

I'm an operations guy, and I've been one for over 15 years. From my days as a Systems Administrator I was always intrigued by application performance and jumped at every opportunity to figure out a performance problem. All of that experience has taught me that there is one aspect of troubleshooting that makes the biggest difference in the most cases.

My Charts Will Save The Day

Before I jump right in with that single most important lesson, I want to tell the story that set me on the path to learning it. I was sitting at my desk one day when I got called into a P1 issue (also called a Sev 1, meaning customers were impacted by application problems) for an application that had components on some of my servers. Like most of the applications at this particular company, it had many distributed components. I knew I was prepared for this moment since I had installed OS monitoring that gave me charts on every metric I was interested in, along with a long history of those charts (daily charts dating back months).

Simply put, I was confident I had the data I needed to solve the problem. So I joined the 20+ other people on the conference call, listened to hear what the problem was and what had already been done, and began digging through my mountains of graphs. Within the first 30 minutes of poring over my never-ending streams of data I realized that I had no clue where any of the data points should be for each metric at any given time. I had no reference point to distinguish good data points from bad ones. "No problem!" I thought to myself. I have months of this data just waiting for me to look at and determine what's gone wrong.

Now I don't know if you've ever tried to manually compare graphs to each other, but I can tell you that comparing 2 charts that represent 2 metrics on 2 different days is pretty easy. Comparing ~50 daily charts against multiple days or weeks of history is a nightmare that consumes a tremendous amount of time. This was the hell I had resigned myself to when I made that fateful statement in my head: "No problem!"

Skip ahead a few hours. I've been flipping between multiple workbooks in Excel to try and visually identify where the charts are different. I've been doing this for hours. Click-flip, click-flip, click-flip, click-flip… My eyes are strained and my head is throbbing. I want the pain to end but I'm a performance geek that doesn't give up. I've looked at so many charts by now that I can no longer remember why I was zeroing in on a particular metric in the first place. I'm starting to think my initial confidence was a bit misguided. I slowly start banging my head on my desk in frustration.

From Hours To Seconds

"What changed?" Isn't that one of the most commonly asked questions in any troubleshooting scenario? It's also one of the toughest questions to answer in a short amount of time. If you want to resolve problems in minutes, you need to know the answer to this question immediately. And that leads me to the most important lesson I ever learned about solving performance problems: I need something that will tell me exactly what has changed at any given moment in time.

I need a system that tracks my metrics, automatically baselines their normal behavior, and can tell me when those metrics have deviated from their baselines and by how much. Ideally I want all of this in the context of the problem that has been identified, either by an alert or by an end user calling in a trouble ticket (although I'd rather know about the problem before a customer calls).
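Conceptually, the core of such a system is simple: learn a baseline for each metric from its history, then measure how far each metric sits from that baseline at the moment of the problem and sort by that deviation. Here is a minimal sketch using a z-score; this is just an illustration with made-up JVM metric names, not how any particular APM product actually computes its baselines.

```python
# Minimal sketch of "what changed?": baseline each metric from its history,
# then rank metrics by how far their value at incident time deviates from
# that baseline. Metric names and values are invented for illustration.
import statistics


def deviations(history, at_incident):
    """history: {metric: [past values]}, at_incident: {metric: current value}."""
    results = []
    for metric, values in history.items():
        mean = statistics.mean(values)
        stdev = statistics.stdev(values) or 1e-9  # avoid divide-by-zero
        z = (at_incident[metric] - mean) / stdev
        results.append((metric, round(z, 1)))
    # Most deviated metrics first -- the answer to "what changed?"
    return sorted(results, key=lambda item: item[1], reverse=True)


history = {
    "heap_used_mb": [510, 495, 520, 505, 500],
    "gc_time_ms": [30, 28, 35, 31, 29],
    "threads": [80, 82, 79, 81, 80],
}
incident = {"heap_used_mb": 980, "gc_time_ms": 400, "threads": 83}
print(deviations(history, incident))
```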

Thankfully, this type of system exists today. Within AppDynamics Pro, every metric is automatically baselined and is a candidate for alerting based upon deviation from that baseline. By default, all business transactions are classified as slow or very slow based upon how much they deviate from their historic baselines, but this is only the tip of the iceberg. The really cool feature is available after you drill down into a business transaction. Take a look at the screen grab below, taken from a single "Product Search" business transaction that was slow. Notice we are in the "Node Problems" area. I've asked the software to automatically find any JVM metrics that deviated above their baselines during the time of this slow transaction. The charts on the right side of the screen are the resulting data set, ordered from most deviated to least deviated.

Node Problems view: JVM metrics sorted by deviation from their baselines.

Whoa… we just answered the "What changed?" question in 30 seconds instead of spending hours on manual analysis. I wish I had this functionality years ago; it would have saved me countless hours and countless forehead bruises. We veterans of the performance wars now have a bigger gun in the battle to restore performance faster. Leave the manual analysis and correlation to the rookies and click here to start your free trial of AppDynamics Pro right now so you can test this out for yourself.