Why is APM Important?

Application Performance Management, or APM, is the monitoring and management of the availability and performance of software applications. Different people interpret this definition differently, so this article attempts to qualify what APM is, what it includes, and why it is important to your business. If you are going to take control of the performance of your applications, then it is important that you understand what you want to measure and how you want to interpret it in the context of your business.

What is Application Performance Management (APM)?

As applications have evolved from stand-alone applications to client-server applications to distributed applications and ultimately to cloud-based elastic applications, application performance management has evolved to follow suit. When we refer to APM we refer to managing the performance of applications such that we can determine when they are behaving normally and when they are behaving abnormally. Furthermore, when something goes wrong and an application is behaving abnormally, we need to identify the root cause of the problem quickly so that we can remedy it.

We might observe things like:

  • The physical hardware upon which the application is running
  • The virtual machines in which the application is running
  • The JVM that is hosting the application environment
  • The container (application server or web container) in which the application is running
  • The behavior of the application itself
  • Supporting infrastructure, such as network communications, databases, caches, external web services, and legacy systems

Once we have captured performance metrics from all of these sources, we need to interpret and correlate them with respect to the impact on your business transactions. This is where the magic of APM really kicks in. APM vendors employ experts in different technologies so that they can understand, at a deep level, what performance metrics mean in each individual system and then aggregate those metrics into a holistic view of your application.

The next step is to analyze this holistic view of your application's performance against what constitutes normalcy. For example, if key business transactions typically respond in less than 4 seconds on Friday mornings at 9am but they are responding in 8 seconds on this particular Friday morning at 9am, then the question is: why? An APM solution needs to identify the paths through your application for those business transactions, including external dependencies and environmental infrastructure, to determine where they are deviating from normal. It then needs to bundle all of that information together into a digestible format and alert you to the problem. You can then view that information, identify the root cause of the performance anomaly, and respond accordingly.

Finally, depending on your application and deployment environment, there may be things that you can tell the APM solution to do to automatically remediate the problem. For example, if your application is running in a cloud-based environment and your application has been architected in an elastic manner, you can configure rules to add additional servers to your infrastructure under certain conditions.

Thus we can refine our definition of APM to include the following activities:

  • The collection of performance metrics across an entire application environment
  • The interpretation of those metrics in the light of your business applications
  • The analysis of those metrics against what constitutes normalcy
  • The capture of relevant contextual information when abnormalities are detected
  • Alerts informing you about abnormal behavior
  • Rules that define how to react and adapt your application environment to remediate performance problems

Why is APM Important?

It probably seems obvious to you that APM is important, but you will likely need to answer the question of APM importance to someone like your boss or the company CFO who wants to know why she must pay for it. In order to qualify the importance of APM, let’s consider the alternatives to adopting an APM solution and assess the impact in terms of resolution effort and elapsed downtime.

First let’s consider how we detect problems. An APM solution alerts you to the abnormal application behavior, but if you don’t have an APM solution then you have a few options:

  • Build synthetic transactions
  • Manual instrumentation
  • Wait for your users to call customer support!?

A synthetic transaction is a transaction that you execute against your application and with which you measure performance. Depending on the complexity of your application, it is not difficult to build a small program that calls a service and validates the response. But what do you do with that program? If it runs on your machine then what happens when you’re out of the office? Furthermore, if you do detect a functional or performance issue, what do you do with that information? Do you connect to an email server and send alerts? How do you know if this is a real problem or a normal slowdown for your application at this hour and day of the week? Finally, detecting the problem is one thing; finding the root cause of the problem is quite another.
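To make the discussion concrete, here is a minimal sketch of what such a synthetic check might look like in Java 11+ (the endpoint URL and the 4-second threshold are illustrative assumptions, not part of the original article). Note that everything the questions above raise, such as scheduling, alerting, and baselining, is still left unanswered:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class SyntheticCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; substitute a real health-check or business URL.
            URI target = URI.create("https://example.com/api/checkout/health");

            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(5))
                    .build();
            HttpRequest request = HttpRequest.newBuilder(target).GET().build();

            long start = System.nanoTime();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // Validate both the response and the response time (4-second threshold assumed).
            boolean healthy = response.statusCode() == 200 && elapsedMs < 4_000;
            System.out.printf("status=%d elapsed=%dms healthy=%b%n",
                    response.statusCode(), elapsedMs, healthy);
        }
    }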

The next option is manually instrumenting your application, which means that you add performance monitoring code directly to your application and record the measurements somewhere, such as in a database or on a file system. Some challenges in manual instrumentation include: what parts of my code do I instrument, how do I analyze the data, how do I determine normalcy, how do I propagate problems up to someone to analyze, what contextual information is important, and so forth. Plus you have introduced a new problem: you now have performance monitoring code in your application that you need to maintain. Furthermore, can you dynamically turn it on and off so that your performance monitoring code does not negatively affect the performance of your application? If you learn more about your application and identify additional metrics you want to capture, do you need to rebuild your application and redeploy it to production? What if your performance monitoring code has bugs?
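As a rough illustration of what you end up owning, here is a hedged sketch of hand-rolled instrumentation in Java; every name in it is hypothetical, and every concern listed above (storage, analysis, toggling, bugs) remains yours to solve:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.time.Instant;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.function.Supplier;

    public class ManualInstrumentation {
        // A runtime toggle so the monitoring code can be switched off without a redeploy.
        private static final AtomicBoolean ENABLED = new AtomicBoolean(true);

        public static <T> T timed(String operation, Supplier<T> work) {
            if (!ENABLED.get()) {
                return work.get();
            }
            long start = System.nanoTime();
            try {
                return work.get();
            } finally {
                record(operation, (System.nanoTime() - start) / 1_000_000);
            }
        }

        private static void record(String operation, long elapsedMs) {
            String line = Instant.now() + "," + operation + "," + elapsedMs + System.lineSeparator();
            try {
                // You now own this storage, its growth, and its analysis.
                Files.writeString(Paths.get("perf-metrics.csv"), line,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (Exception e) {
                // A bug here becomes a bug in your application.
            }
        }
    }

A call site would look something like ManualInstrumentation.timed("checkout", () -> checkoutService.placeOrder(cart)), and that wrapper now has to be sprinkled, and maintained, everywhere you care about.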

There are other technical options, but what I find most often is that companies are alerted to performance problems when their customer service organization receives complaints from users. I don’t think I need to go into details about why this is a bad idea!

Next let’s consider how we identify the root cause of a performance problem without an APM solution. Most often I have seen companies do one of two things:

  • Review runtime logs
  • Attempt to reproduce the problem in a development / test environment

Log files are great sources of information and many times they can identify functional defects in your application (by capturing exception stack traces), but when you are experiencing performance issues that do not raise exceptions, they typically only introduce additional confusion. You may have heard of, or been directly involved in, a production war room. These war rooms are characterized by finger pointing and attempts to exonerate one’s own components so that the pressure to resolve the issue falls on someone else. The bottom line is that these meetings are not fun and not productive.

Alternatively, and usually in parallel, the development team is tasked with reproducing the problem in a test environment. The challenge here is that you usually do not have enough context for these attempts to be fruitful. Furthermore, if you are able to reproduce the problem in a test environment, that is only the first step; you still need to identify the root cause of the problem and resolve it!

So to summarize, APM is important to you so that you can understand the behavior of your application, detect problems before your users are impacted, and rapidly resolve those issues. In business terms, an APM solution is important because it reduces your Mean Time To Resolution (MTTR), which means that performance issues are resolved more quickly and efficiently, so the impact on your business’s bottom line is reduced.

5 .NET Performance Issues Where Dev and Ops Play the Blame Game

I spend a large portion of my time working with customers to review performance issues in production environments. I was asked by some colleagues to share some insight into some of the more interesting findings.

Before launching into a list of offenders, I think it’s worth explaining how often I hear the phrase “We need to prove it’s not our tin but their code” and how this relates to the discoveries here. Typically, when a developer sits down to write some code, they do it in isolation: they’ll apply theory and best practise and throw in a bit of experience to come up with the best (hopefully) code they can produce for the task.

What happens, too often, is that when the code reaches a production environment it’s executed in a way not predicted by the developer, architect or designer. Quite often the dynamics of a live environment vary drastically from the test and pre-production environments in terms of data and patterns of use. Just recently a client told me how their eCommerce site goes from a normal state of “browsing products” to a Black Friday or major event that’s “buy-only”, and their application behaves very differently under each circumstance.

So all that aside, let’s take a look at my top 5 favourites from this year…

1. Object Relational Mapping (ORM)

ORM is a broad term, but I apply it to anything that abstracts a developer from writing raw SQL – this includes LINQ, Entity Framework and NHibernate.

In a nutshell, something like Entity Framework allows a developer to create a neat design of their object and use syntax within .NET to query and manipulate the objects, which then results in queries and inserts into an underlying datastore. Figure 1 shows the type of view and code a developer would use.

Figure 1 – Entity Framework and the underlying code

Very few developers get a chance to review what goes on under the hood. I can normally tell when Entity Framework is being used because it generates a lot of square brackets!

In fact, the best Entity Framework generated query I’ve encountered when pasted into Word covered 21 pages! To quote my customer at the time, “No sane human being would write that”. Most of the SQL was to cater for NULL cases.

In addition to creating lengthy SQL calls, ORMs can also create a lot of SQL calls — especially in production environments where data volumes can vary drastically from those in development environments and developers didn’t foresee the potential behaviour of their code. A client sent me this screenshot of their application, which spent close to 25 seconds executing about 2,000 queries against their database; this was all Entity Framework generated code.

BUT – before going on I need to point out – I like Entity Framework. It speeds up development time and improves collaboration between teams – just be careful with it and ensure you monitor what it does in the backend.

2. Data Volumes in Queries

Another client recently sent me the following screenshot to ask what a SNIReadSyncOverAsync call was. I’ve seen this a few times; essentially, this is the code that runs in .NET after a query executes in order to retrieve the data you’ve just queried.

I put together the following as a neat example of how this works (no EF here, I’m afraid).

The query we’re executing retrieves all 1.6 million postcodes in the UK! The resulting underlying code shows our query is fast. The DBA’s done a great job, but retrieving and reading this data is the time-consuming part, and that’s what SNIReadSyncOverAsync is. You’ll see this a lot in scenarios where you’re retrieving a lot of data or you’re reading a cursor over a slow network connection.

3. Web Service and WCF calls

In a similar manner to our ORMs making multiple database calls, I’ve seen production code behave in a similar manner with Web Service and WCF calls. In one case, a front-end UI component spent close to 8 seconds performing nearly 300 backend web service calls simply to render a grid control (just think of all that redundant SOAP information passed to and fro):

The same happens in WCF when you don’t consider how many calls your code may need to make in a production environment:

Another trap to be aware of in WCF is the Throttling settings. Early versions of WCF had very low Throttling settings which meant that in high volume sites you suddenly hit the buffers. Take a quick review of what your WCF applications are using, such as:

<serviceThrottling maxConcurrentCalls="500"
                   maxConcurrentInstances="100"
                   maxConcurrentSessions="200" />

Older implementations of the framework also hit issues with the maximum number of threads available, which will result in seeing a lot of “WaitOneNative” as your code is stuck in a “Holding Pattern”:

4. File and Disk IO

If you’re doing a lot of file manipulation, consider using some of the Win32 API calls instead of managed code. I was tasked a few years back with writing a quick utility to recursively work through files scattered amongst ~100,000 directories and update some text in them. When we ran the tests with my initial code, it took close to a day to execute! By swapping out to Win32 APIs, I shortened the time to around 30 mins!

5. Task.WaitAll and The Task Parallel Library

I’ll cover this in another post, but I’m starting to see huge adoption of the Task Parallel Library (TPL) in applications such as insurance aggregators: sites that compile information from multiple insurers to quote insurance premiums. The same applies to a number of travel sites retrieving flight or hotel details from multiple providers.

What happens from a DevOps perspective is that the production code you’re managing gets much harder to understand and maintain. Even visualising the complexity of what’s going on under the hood becomes far more challenging than with non-multithreaded applications. In short, you want to reduce the amount of time your application spends waiting on tasks to complete, so keep an eye on how much of its time is spent in that waiting state.

The Enter and ReliableEnter methods of the Monitor object are similar examples of code hanging around before it can continue. In these cases, code is “locking and blocking”, i.e. one thread holds a lock on an object and locks out other threads (refer to https://msdn.microsoft.com/en-us/library/hf5de04k(v=vs.110).aspx for more information on this one).

Interested in gaining end-to-end visibility over your complex application? Check out a FREE trial of AppDynamics today!

Monitoring ADO.NET Performance

One of the most overlooked parts of .NET is the glue that links your application to a database: ADO.NET (an evolution of ActiveX Data Objects). The lowly connection string is the only real access we get to the underlying technology.

I was with a client recently and noticed they had the following connection string:
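The screenshot itself is not reproduced here, but it looked something along these lines (the server, database and user names are illustrative placeholders; the Max Pool Size value is the point of interest):

    Data Source=SQLPROD01;Initial Catalog=Orders;User ID=appUser;Password=*****;Max Pool Size=1500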

I was interested in why they had a pool size setting of 1500. The answer lay in the fact that two years ago they had a pooling issue (much like the one we’ll explore shortly!) and had set the Max Pool Size to 15 times the default! We then set about seeing whether that was needed with their current application.

Using AppDynamics, we monitored the calls to their database from the application’s perspective; the calls are detected automatically:

Down to the methods where the connection and queries are run:

Extracting the Connection String (without the password) and seeing how long it takes to get a connection from the pool:

These snapshots provide deep visibility into an application’s health and performance, but in the case of performance issues – what’s the easiest way to tune production applications? What happens when you start to see the following?

ADO.NET pools connections for reuse, and there are a number of key rules for using these connections:

  •       “Open Late and Close Early” – Don’t hold on to connections in code, only use them when you need to.
  •       Beware of “Pool Fragmentation” – Connections are pooled according to the connection string plus the user identity. Therefore, if you use Integrated Security, you may get a Connection pool for every user.

To aid tuning, the following parameters can be used:

Using Windows performance counters you can monitor multiple ADO.NET metrics, for any .NET deployments I’d recommend monitoring:

  •       The Number of Pooled Connections
  •       Number of Active Connections

Within AppDynamics, you can monitor and baseline the health of your ADO.NET connection pooling strategy in your production application.

The following illustrates a custom dashboard created to simply monitor the database throughput rate and response time against the size of a connection pool. In this scenario by slowly ramping up the number of concurrent users on the application, you can visualise whether your application will scale.

As new users are added to the application, the Connection Pool size increases automatically. This metric can be monitored in conjunction with database response time, error rates, and overall load.

SQL Server, for example, has a default maximum connection pool of 100 and a default connection timeout of 30 seconds. I often see organisations dropping the Connection Timeout to 10-15 seconds: if you are going to fail to get a database connection, it’s better to fail sooner rather than later than to leave your user hanging for 30 seconds!

Utilize AppDynamics database view to monitor average and maximum pool times:

By registering a performance counter, AppDynamics will automatically baseline and then enable you to configure effective alerts for your application.

The following performance counters are off by default (they can be enabled in the Web.Config file; see ActivatingOffByDefault):

  •       NumberOfFreeConnections
  •       NumberOfActiveConnections
  •       SoftDisconnectsPerSecond
  •       SoftConnectsPerSecond

By enabling NumberOfActiveConnections, you can visualize how the connection pool size increases.

To add any performance counter into AppDynamics use the following tool:

http://community.appdynamics.com/t5/eXchange-Community-AppDynamics/Windows-Performance-Counter-Configuration-Extension/idi-p/9713

Start monitoring your .NET application, check out a FREE trial today!

Delivering Value with BizDevOps

Companies are rapidly adopting DevOps practices to create higher-quality software faster and more efficiently to improve customer experience, lower IT costs and enhance productivity.  But the key to success for software-driven businesses is delivering value with BizDevOps.

Using both lean and agile methodologies, DevOps brings software development, IT operations and quality assurance (QA) teams together to create a more collaborative process to deliver software and services in a faster and more continuous fashion.  DevOps aims to break down IT and organizational data silos by promoting a culture of shared resources that make change and state processes across the entire application delivery chain more transparent.  The end result is continuous delivery, an operational concept that is crucial to the software-defined business.

The more performance-oriented a company is the more successful the business outcomes of its DevOps initiative.  But DevOps is more about people than technology.  These organizations foster a culture based on communication, collaboration and trust.  They tend not to fear failure; instead they embrace it as learning experiences for continuous improvement.  Companies that are successful with DevOps are more stable, with higher-performing IT teams that work better together to drive stronger operating results.

More enterprises are demanding that their application development teams be more aligned with business objectives, allowing them to start realizing value from an early stage of the lifecycle.  To achieve business agility, enterprises need a strategy that spans the entire value chain – from business requirements to deployment.

BizDevOps and the Common Language of Business Transactions

BizDevOps underscores the need to more closely align IT and business groups with business performance.  It strengthens application governance by adding business perspective and accountability to the process.  BizDevOps bridges operational data with business data to provide a deeper understanding of how application performance and user experience directly impact business outcomes.

Business transactions are the common language that brings DevOps and business teams into productive collaboration.  A business transaction is the interaction between a business and its customers, vendors, partners or employees that provides a desired outcome of mutual benefit.

In a software-defined business, transactions are executed by applications.  Depending on the application, transactions can be as simple as users entering information into a database or highly complex, such as online trading that involves multiple systems and applications that are mutually dependent.  As a result, the successful completion of business transactions is the most critical metric for IT and business success.

BizDevOps not only brings effective collaboration between business and DevOps teams, but also provides automation and a tight feedback loop.  Sharing transactional data with business teams accelerates the feedback loop from all application stakeholders.

Application governance is essential for all constituents to make changes about what capabilities the application should have.  The DevOps team can then make adjustments faster and more continuously while maintaining focus on optimizing the user experience.  Better communication and collaboration between teams increases business agility, improves efficiency and raises predictability.

A Unified Approach to Transaction Monitoring Assures User Experience

As enterprise applications become more numerous, intertwined and com­plex, IT organizations are placing more emphasis than ever on finding new approaches to manage applications and optimize their availability and performance.  As a result, application performance management (APM) has become an essential part of enterprise IT framework since it directly involves all key stakeholder groups, including application owners, application developers and application users.

Shifting business transaction monitoring left in the development lifecycle alerts DevOps teams to dependencies and operational or quality issues in the pre-production phase, helping them identify potential bottlenecks.  Transaction metrics can be used to establish key performance indicators (KPIs) against which the production environment can be measured.  This allows the DevOps team, working in concert with the business owners, to build models in accordance with KPIs.

Every business transaction impacts user experience and operating performance.  Understanding why and where a transaction delay or failure resulted in a missed opportunity or operating loss can help prevent future service outages.  Depending on the industry sector, slow responsiveness or a complete outage (brownouts or downtime) of a company’s most business-critical application can cost between $100,000 and $1 million per hour.  The fallout from poor transaction performance can be a loss of customers, regulatory fines and damage to firm reputation.

Whether it’s improving customer satisfaction or operational performance to minimize costly downtime and drive business objectives, business transaction completion is the most relevant metric that all team members can understand.  Dynamically tagging and following a business transaction across the entire application delivery chain addresses the needs of BizDevOps. 

Unified transaction monitoring with big data scalability and management is the only way IT and business owners can ensure that end-to-end user experience and business objectives are being met.  It drives customer satisfaction and improves competitiveness, strengthening financial performance and market valuation. 

To learn more about the BizDevOps topic, read my free white paper or register for my upcoming webinar.

Prospects uses AppDynamics APM to see into their complex environment

Recently, I was able to meet and chat briefly with Kalpesh Vadera, Computer Services Manager for Prospects — an online seller of higher education study guides. Since Prospects is ultimately an e-commerce site, their performance is crucial to their bottom line. Any outages or stalls will hurt incoming revenue. Therefore, they put a tremendous amount of value on uptime and optimal performance. As their business grew, their application environment became more distributed and complex, furthering the need for an enterprise-grade application performance management (APM) solution. Enter AppDynamics.

Hannah Current: Please briefly explain your role at Prospects

Kalpesh Vadera: Graduate Prospects is the UK’s leading provider of information, advice and opportunities to all students and graduates. Our graduate employment and postgraduate study guides are available online, digitally and in printed form at all UK and Ireland career services, careers fairs and other campus outlets across the UK.

I am the Computer Services Manager, and my role, along with my team’s, is to look after the infrastructure to enable the efficient delivery of all our services for both internal and external clients.

HC: What challenges did you suffer before using APM? How did you troubleshoot those issues and what was frustrating before using an APM tool?

KV: We had an issue several years ago with a part of the website running particularly slow and found it extremely difficult to get to the root of the problem. We are Java based and with so many services running and dependent on each other it was like trying to find a needle in a haystack. We found the issue eventually (an out of date class file) but it was painful getting there.

HC: What was your APM selection process like?

KV: A team was created that included the IT Director, Senior Architects, the Development Manager and myself to list our requirements. A checklist of essential requirements was made; we then reviewed several options and shortlisted a few to take a more in-depth look at. After several conference calls to prioritize our needs, and a look at AppDynamics in action via a temporary license (so that we could actually see it with real production info), it ticked all the boxes for us.

HC: Why AppDynamics over the other solutions out there?

KV: AppDynamics ticked all the boxes and was the only solution that allowed us to see our entire environment, giving us end-to-end visibility. Most importantly, it works in production. This was a major advantage, as it gave us confidence that it would actually provide us with useful tools to help not just with diagnosis but also with proactive alerting before similar issues happen.

HC: How has AppDynamics helped solve some critical problems?

KV: We set up email alerting, which sends us immediate alerts when the services monitored exceed health thresholds, plus daily reports showing that day’s performance. The convenient alerts enable us to log in to our controller with links straight from the email. We can then drill down to see exactly where the bottleneck or issue is likely to be. With the amount of traffic we get, this is incredibly time saving. Having a snapshot of the environment at the exact time of the issue is vital to resolving the issue quickly.

Want to see how AppDynamics APM can help you stay ahead of your performance problems? Check out a FREE trial today!

Top 5 Java Performance Metrics to Capture in Enterprise Applications

The last couple of articles presented an introduction to Application Performance Management (APM) and identified the challenges in effectively implementing an APM strategy. This article builds on those topics by reviewing five of the top performance metrics to capture to assess the health of your enterprise Java application.

Specifically this article reviews the following:

  • Business Transactions
  • External Dependencies
  • Caching Strategy
  • Garbage Collection
  • Application Topology

1. Business Transactions

Business Transactions provide insight into real-user behavior: they capture real-time performance that real users are experiencing as they interact with your application. As mentioned in the previous article, measuring the performance of a business transaction involves capturing the response time of a business transaction holistically as well as measuring the response times of its constituent tiers. These response times can then be compared with the baseline that best meets your business needs to determine normalcy.

If you were to measure only a single aspect of your application I would encourage you to measure the behavior of your business transactions. While container metrics can provide a wealth of information and can help you determine when to auto-scale your environment, your business transactions determine the performance of your application. Instead of asking for the thread pool usage in your application server you should be asking whether or not your users are able to complete their business transactions and if those business transactions are behaving normally.

As a little background, business transactions are identified by their entry-point, which is the interaction with your application that starts the business transaction. A business transaction entry-point can be defined by interactions like a web request, a web service call, or a message on a message queue. Alternatively, you may choose to define multiple entry-points for the same web request based on a URL parameter or for a service call based on the contents of its body. The point is that the business transaction needs to be related to a function that means something to your business.

Once a business transaction is identified, its performance is measured across your entire application ecosystem. The performance of each individual business transaction is evaluated against its baseline to assess normalcy. For example, we might determine that if the response time of a business transaction is more than two standard deviations slower than the average response time for its baseline, it is behaving abnormally, as shown in figure 1.

 

Figure 1 Evaluating BT Response Time Against its Baseline

 

The baseline used to evaluate a business transaction is held constant for the hour in which the business transaction is running, but the baseline itself is refined by each business transaction execution. For example, if you have chosen a baseline that compares business transactions against the average response time for the hour of day and the day of the week, then after the current hour is over, all business transactions executed in that hour will be incorporated into the baseline for next week. Through this mechanism an application can evolve over time without requiring the original baseline to be thrown away and rebuilt; you can consider it as a window moving over time.
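To make the idea concrete, here is a minimal Java sketch of such a baseline, keyed by day of week and hour of day, that tracks a running mean and standard deviation and flags a response time more than two standard deviations above the mean. It is an illustration of the concept only; a real implementation would, as described above, fold an hour’s samples into the baseline only after the hour completes:

    import java.time.DayOfWeek;
    import java.util.HashMap;
    import java.util.Map;

    public class BusinessTransactionBaseline {
        // One baseline cell per (day of week, hour of day), holding a running
        // mean and variance computed with Welford's algorithm.
        private static final class Cell {
            long count;
            double mean;
            double m2; // sum of squared deviations from the mean

            void add(double x) {
                count++;
                double delta = x - mean;
                mean += delta / count;
                m2 += delta * (x - mean);
            }

            double stdDev() {
                return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
            }
        }

        private final Map<String, Cell> cells = new HashMap<>();

        private static String key(DayOfWeek day, int hour) {
            return day + ":" + hour;
        }

        public void record(DayOfWeek day, int hour, double responseTimeMs) {
            cells.computeIfAbsent(key(day, hour), k -> new Cell()).add(responseTimeMs);
        }

        public boolean isAbnormal(DayOfWeek day, int hour, double responseTimeMs) {
            Cell cell = cells.get(key(day, hour));
            if (cell == null || cell.count < 30) {
                return false; // not enough history to judge normalcy
            }
            return responseTimeMs > cell.mean + 2 * cell.stdDev();
        }
    }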

In summary, business transactions are the most reflective measurement of the user experience so they are the most important metric to capture.

2. External Dependencies

External dependencies can come in various forms: dependent web services, legacy systems, or databases; external dependencies are systems with which your application interacts. We do not necessarily have control over the code running inside external dependencies, but we often have control over the configuration of those external dependencies, so it is important to know when they are running well and when they are not. Furthermore, we need to be able to differentiate between problems in our application and problems in dependencies.

From a business transaction perspective, we can identify and measure external dependencies as being in their own tiers. Sometimes we need to configure the monitoring solution to identify methods that really wrap external service calls, but for common protocols, such as HTTP and JDBC, external dependencies can be automatically detected. For example, when I worked at an insurance company, we had an AS/400 and we used a proprietary protocol to communicate with it. We identified that method call as an external dependency and attributed its execution to the AS/400. But we also had web service calls that could be automatically identified for us. And similar to business transactions and their constituent application tiers, external dependency behavior should be baselined and response times evaluated against those baselines.

Business transactions provide you with the best holistic view of the performance of your application and can help you triage performance issues, but external dependencies can significantly affect your applications in unexpected ways unless you are watching them.

3. Caching Strategy

It is always faster to serve an object from memory than it is to make a network call to retrieve the object from a system like a database; caches provide a mechanism for storing object instances locally to avoid this network round trip. But caches can present their own performance challenges if they are not properly configured. Common caching problems include:

  • Loading too much data into the cache
  • Not properly sizing the cache

I work with a group of people that do not appreciate Object-Relational Mapping (ORM) tools in general and Level-2 caches in particular. The consensus is that ORM tools are too liberal in determining what data to load into memory: in order to retrieve a single object, the tool loads a huge graph of related data into memory. Their concern with these tools is mostly unfounded when the tools are configured properly, but the problem they have identified is real. In short, they dislike loading large amounts of interrelated data into memory when the application only needs a small subset of that data.

When measuring the performance of a cache, you need to identify the number of objects loaded into the cache and then track the percentage of those objects that are being used. The key metrics to look at are the cache hit ratio and the number of objects that are being ejected from the cache. The cache hit count, or hit ratio, reports the number of object requests that are served from cache rather than requiring a network trip to retrieve the object. If the cache is huge, the hit ratio is tiny (under 10% or 20%), and you are not seeing many objects ejected from the cache then this is an indicator that you are loading too much data into the cache. In other words, your cache is large enough that it is not thrashing (see below) and contains a lot of data that is not being used.

The other aspect to consider when measuring cache performance is the cache size. Is the cache too large, as in the previous example? Is the cache too small? Or is the cache sized appropriately?

A common problem when sizing a cache is not properly anticipating user behavior and how the cache will be used. Let’s consider a cache configured to hold 100 objects, but the application needs 300 objects at any given time. The first 100 calls will load the initial set of objects into the cache, but subsequent calls will fail to find the objects they are looking for. As a result, the cache will need to select an object to remove to make room for the newly requested object, such as by using a least-recently-used (LRU) algorithm. The request will need to execute a query across the network to retrieve the object and then store it in the cache. The result is that we’re spending more time managing the cache than serving objects from it: in this scenario the cache is actually getting in the way rather than improving performance. To further exacerbate problems, because of the nature of Java and how it manages garbage collection, this constant adding and removing of objects from the cache will actually increase the frequency of garbage collection (see below).
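The following small Java sketch recreates this scenario with a LinkedHashMap-based LRU cache: a 100-entry cache serving a working set of 300 objects. The numbers come from the example above; everything else (the random access pattern and iteration count) is an illustrative assumption:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Random;

    public class CacheThrashingDemo {
        public static void main(String[] args) {
            final int capacity = 100;
            final int[] evictions = {0};

            // LinkedHashMap in access order plus removeEldestEntry gives a simple LRU cache.
            Map<Integer, String> cache = new LinkedHashMap<Integer, String>(capacity, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                    if (size() > capacity) {
                        evictions[0]++;
                        return true;
                    }
                    return false;
                }
            };

            Random random = new Random(42);
            int hits = 0, misses = 0;
            for (int i = 0; i < 100_000; i++) {
                int key = random.nextInt(300); // working set of 300 objects
                if (cache.get(key) != null) {
                    hits++;
                } else {
                    misses++;
                    cache.put(key, "object-" + key); // stands in for the network round trip
                }
            }
            System.out.printf("hit ratio=%.1f%% evictions=%d%n",
                    100.0 * hits / (hits + misses), evictions[0]);
        }
    }

Run it and you will see the thrashing signature described here: a low hit ratio accompanied by a steady stream of evictions. A low hit ratio with very few evictions would instead suggest the earlier problem of an oversized cache full of data nobody asks for.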

When you size a cache too small and the aforementioned behavior occurs, we say that the cache is thrashing and in this scenario it is almost better to have no cache than a thrashing cache. Figure 2 attempts to show this graphically.

Figure 2 Cache Thrashing

 

In this situation, the application requests an object from the cache, but the object is not found. It then queries the external resource across the network for the object and adds it to the cache. Finally, the cache is full so it needs to choose an object to eject from the cache to make room for the new object and then add the new object to the cache.

4. Garbage Collection

One of the core features that Java provided, dating back to its initial release, was garbage collection, which has been both a blessing and a curse. Garbage collection relieves us from the responsibility of manually managing memory: when we finish using an object, we simply delete the reference to that object and garbage collection will automatically free it for us. If you come from a language that requires manual memory management, like C or C++, you’ll appreciate that this alleviates the headache of allocating and freeing memory. Furthermore, because the garbage collector automatically frees memory when there are no references to that memory, it eliminates traditional memory leaks that occur when memory is allocated and the reference to that memory is deleted before the memory is freed. Sounds like a panacea, doesn’t it?

While garbage collection accomplished its goal of removing manual memory management and freeing us from traditional memory leaks, it did so at the cost of sometimes-cumbersome garbage collection processes. There are several garbage collection strategies, based on the JVM you are using, and it is beyond the scope of this article to dive into each one, but it suffices to say that you need to understand how your garbage collector works and the best way to configure it.

The biggest enemy of garbage collection is known as the major, or full, garbage collection. With the exception of the Azul JVM, all JVMs suffer from major garbage collections. Garbage collections come in two general forms:

  • Minor
  • Major

Minor garbage collections occur relatively frequently with the goal of freeing short-lived objects. Their pauses are brief and they are not typically significantly impactful.

Major garbage collections, on the other hand, are sometimes referred to as “Stop The World” (STW) garbage collections because they freeze every thread in the JVM while they run. In order to illustrate how this happens, I’ve included a few figures from my book, Pro Java EE 5 Performance Management and Optimization.

Figure 3 Reachability Test

When garbage collection runs, it performs an activity called the reachability test, shown in figure 3. It constructs a “root set” of objects that include all objects directly visible by every running thread. It then walks across each object referenced by objects in the root set, and objects referenced by those objects, and so on, until all objects have been referenced. While it is doing this it “marks” memory locations that are being used by live objects and then it “sweeps” away all memory that is not being used. Stated more appropriately, it frees all memory to which there is not an object reference path from the root set. Finally, it compacts, or defragments, the memory so that new objects can be allocated.

Minor and major collections vary depending on your JVM, but figures 4 and 5 show how minor and major collections operate on a Sun JVM.

Figure 4 Minor Collection

In a minor collection, memory is allocated in the Eden space until the Eden space is full. The JVM then runs a “copy” collector that copies live objects (identified by the reachability test) from Eden to one of the two survivor spaces (the to space and the from space). Objects left in Eden can then be swept away. If the survivor space fills up and we still have live objects, then those live objects will be moved to the tenured space, where only a major collection can free them.

Figure 5 Major Collection

Eventually the tenured space will fill up and a minor collection will run, but it will not have any space in the tenured space to copy live objects that do not fit in the survivor space. When this occurs, the JVM freezes all threads in the JVM, performs the reachability test, clears out the young generation (Eden and the two survivor spaces), and compacts the tenured space. We call this a major collection.

As you might expect, the larger your heap, the less frequently major collections run, but when they do run they take much longer than they would with a smaller heap. Therefore it is important to tune your heap size and garbage collection strategy to match your application’s behavior.
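If you want to watch this behavior in your own application, the standard JVM management API exposes a collection count and a cumulative collection time per collector. Here is a minimal sketch; the collector names it prints vary by JVM and by the garbage collection algorithm in use:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcStats {
        public static void main(String[] args) {
            // Each bean typically corresponds to one collector (e.g. young vs tenured generation).
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%-25s collections=%d totalTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }

Sampling these counters periodically (or enabling verbose GC logging) gives you the frequency and duration data you need to decide whether your heap size and collector choice fit your application.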

5. Application Topology

The final performance component to measure in this top-5 list is your application topology. Because of the advent of the cloud, applications can now be elastic in nature: your application environment can grow and shrink to meet your user demand. Therefore, it is important to take an inventory of your application topology to determine whether or not your environment is sized optimally. If you have too many virtual server instances then your cloud-hosting cost is going to go up, but if you do not have enough then your business transactions are going to suffer.

It is important to measure two metrics during this assessment:

  • Business Transaction Load
  • Container Performance

Business transactions should be baselined and you should know at any given time the number of servers needed to satisfy your baseline.  If your business transaction load increases unexpectedly, such as to more than two times the standard deviation of normal load, then you may want to add additional servers to satisfy those users.

The other metric to measure is the performance of your containers. Specifically you want to determine if any tiers of servers are under duress and, if they are, you may want to add additional servers to that tier. It is important to look at the servers across a tier because an individual server may be under duress due to factors like garbage collection, but if a large percentage of servers in a tier are under duress then it may indicate that the tier cannot support the load it is receiving.
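A hedged sketch of that tier-level check, in Java, might look like the following; the 60% threshold is an illustrative assumption, not a recommendation from the article:

    import java.util.List;

    public class TierScalingCheck {
        // Recommend adding capacity only when a large fraction of the tier is under
        // duress, so that a single server's GC pause does not trigger scaling.
        public static boolean shouldScaleOut(List<Boolean> serverUnderDuress, double tierThreshold) {
            if (serverUnderDuress.isEmpty()) {
                return false;
            }
            long stressed = serverUnderDuress.stream().filter(b -> b).count();
            return (double) stressed / serverUnderDuress.size() >= tierThreshold;
        }

        public static void main(String[] args) {
            // 4 of 6 servers under duress; with a 60% threshold we recommend scaling out.
            System.out.println(shouldScaleOut(List.of(true, false, true, true, false, true), 0.6));
        }
    }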

Because your application components can scale individually, it is important to analyze the performance of each application component and adjust your topology accordingly.

Conclusion

This article presented a top-5 list of metrics that you might want to measure when assessing the health of your application. In summary, those top-5 items were:

  • Business Transactions
  • External Dependencies
  • Caching Strategy
  • Garbage Collection
  • Application Topology

In the next article we’re going to pull all of the topics in this series together to present the approach that AppDynamics took to implementing its APM strategy. This is not a marketing article, but rather an explanation of why certain decisions and optimizations were made and how they can provide you with a powerful view of the health of a virtual or cloud-based application.

Autotrader uses AppDynamics to Solve Performance Issues in Production

There is a common theme among the stories told by our new enterprise customers freshly adopting our AppDynamics platform. As their applications grew in scale to meet customer demand, their time to diagnose performance issues grew painfully, too. Operations and development teams slowly seemed to accept an untrue reality: that solving performance issues is an inevitable and painful part of large-scale software infrastructure.

Luckily, this belief could not be further from the truth, and we prove that time and time again with each new customer we onboard! I love hearing the excitement from our customers as they share with me their stories about how their jobs have become exponentially easier since adopting AppDynamics. Most recently, I had the pleasure to sit down and chat about the experiences of one of our newest customers: Autotrader.

I spoke with Morgan White, an Application Engineer at Autotrader, to pick his brain on what caused them to switch to AppDynamics. For starters, I was not surprised to learn about the scale of their environment growing in complexity and size in an attempt to meet the demands of their traffic. Specifically, they applied decoupling techniques to build up a true service-oriented architecture as they isolated responsibilities into various internal services.

Unfortunately, as Autotrader’s architecture became more distributed, the number of monitoring tools increased. Morgan filled me in on the frustration of having an influx of tools being used by various teams, including development and operations. This is a common issue among companies such as Autotrader—multiple tools, siloed logs and reports, and disconnect among different teams within an organization. Luckily, the AppDynamics platform was pivotal in helping Autotrader solve all their performance pain points.

Qualifying AppDynamics versus the competition was a clear choice for Morgan and his colleagues. AppDynamics provided Autotrader an APM tool that consolidates all the various aspects of application performance monitoring into a single, unified tool that was easier to use than the competition. Morgan shared with me a particular anecdote involving him and his team experiencing excessive 400 errors and failing to pinpoint their root cause. By using AppDynamics, Morgan was able to locate the specific business transaction throwing the excess errors, diagnose the problem, and immediately push a fix into production—all within the same day. I never grow tired of hearing use cases such as this from our customer testimonials.

Make no mistake, though: Using AppDynamics to solve performance issues is just half the battle. Brian Boatright, a Senior Integration Engineer at Autotrader, glowingly spoke about the advantage of avoiding potential issues by using AppDynamics’ alerting and release comparison features. By creating specific health rules, Brian and his team are immediately alerted when a problem arises and can fix it quickly before a large percentage of customers is exposed to it. Further, Brian loves being able to compare the KPIs among various deployments to ensure the customer base has an optimal experience as the development and operations teams continuously deploy new software upgrades.

I am proud to say that AppDynamics has helped multiple teams at Autotrader continue to focus on their core business and do what they do best: helping millions of customers find the vehicles of their dreams. I invite you to come and explore what AppDynamics can do for you so that you too can stop worrying about software performance issues and regain your focus on what you do best!

How ING Gains Visibility into their Complex Distributed Environment

One of the greatest benefits of my job is the first-hand insight I gain about our customers’ experiences with AppDynamics. I have the privilege of understanding the impact their technical challenges have on their business execution and how AppDynamics can help them reach their optimal application performance.

Most recently, Davide Franzino, ITS Service Technical Manager at ING Direct, shared with us their challenges and how AppDynamics has helped them meet their goals as a team. ING is a global financial institution offering banking, investments, life insurance, and retirement services to a broad customer base. Davide is responsible for managing Quality of Service and supporting Incident and Problem Management. While he has no team to manage, he works closely with the Operations and Application Support teams.

Initially, ING implemented Dell’s Foglight in 2011 as their system monitoring and APM solution. After the initial rollout of the Foglight product, they immediately began to experience challenges with the solution not meeting their complex needs. While Foglight met their basic requirements, it failed to provide value beyond simple use-cases. Specifically, ING required capabilities only offered by a complete APM solution.

Soon after that, ING adopted AppDynamics and began to experience positive results immediately. By 2014, they had AppDynamics deployed across their entire production environments, and soon switched off Foglight entirely. The Business Transactions paradigm provided by AppDynamics gave the ING team an entirely new perspective on how to view requests across their distributed system. Like most larger enterprise applications, ING’s architecture has evolved into a relatively large and complex environment to meet the demands of their customers.

Not only did AppDynamics help provide the insight necessary to diagnose performance issues quickly, but it also helped the ING team correlate software performance metrics to their business impact. Multiple teams can now collaborate internally using the AppDynamics platform to extract value and measure business impact. The flowmap image below is a glimpse at the scale of their SOA and network of distributed tiers.

In addition to providing value for the development and operations teams, AppDynamics allows the ING marketing team to extract valuable metrics pertaining to their visitors. This new custom dashboard allows the marketing department to understand how potential new users are converted into ING customers. Specifically, this custom dashboard highlights the End User Response time for specific key business transactions and provides insight on any deviation on critical transactions from their normal healthy baselines.

Figure: ING marketing custom dashboard

Because of AppDynamics, the development, operations, business, and marketing teams at ING now collaborate in a new era of sophisticated application intelligence.

The End of my Affair with Apdex

A decade ago, when I first learned of Apdex, it was thanks to a wonderful technology partner, Coradiant. At the time, I was running IT operations and web operations, and brought Coradiant into the fold. Coradiant was ahead of its time, providing end-user experience monitoring capabilities via packet analysis. The network-based approach was effective in a day when the web was less rich. Coradiant was one of the first companies to embed Apdex in its products.

As a user of APM tools, I was looking for the ultimate KPI, and the concept of Apdex resonated with me and my senior management. A single magical number gave us an idea of how well development, QA, and operations were doing in terms of user experience and performance. Had I found the metric to rule all metrics? I thought I had, and I was a fan of Apdex for many years leading up to 2012, when I started to dig into the true calculations behind this magical number.

As my colleague Jim Hirschauer pointed out in a 2013 blog post, the Apdex index is calculated by putting the number of satisfied versus tolerating requests into a formula. The definition of a user being “satisfied” or “tolerating” has to do with a lot more than just performance, but the applied use cases for Apdex are unfortunately focused on performance only. Performance is still a critical criterion, but the definition of satisfied or tolerating is situational.
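For reference, the standard Apdex calculation boils down to a single ratio, where T is the target response-time threshold you choose:

    Apdex(T) = (Satisfied + Tolerating / 2) / Total samples

A sample counts as Satisfied if its response time is at or below T, Tolerating if it falls between T and 4T, and Frustrated otherwise (frustrated samples only contribute to the total). Everything interesting is hidden inside the choice of T, which is exactly where the trouble described below begins.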

I’m currently writing this from 28,000 feet above northern Florida, over barely usable in-flight internet, which makes me wish I had a 56k modem. I am tolerating the latency and bandwidth, but not the $32 I paid for this horrible experience; but hey, at least Twitter and email work. I self-classify as an “un-tolerating” user, but I am happy with some connectivity. People who know me will tell you I have a bandwidth and network problem; hence, my threshold for a tolerable network connection is abnormal. My Apdex score would be far different from the average user’s due to my personal perspective, as would a business user’s differ from a consumer’s, based on their specific situation as they use an application. Other criteria that affect satisfaction include the type of device in use and that device’s connection type.

The thing that is missing from Apdex is the notion of a service level. There are two ways to manage service level agreements. First, a service level may be calculated, as we do at AppDynamics with our baselines. Second, it may be a static threshold that the customer expects; we support this use case in our analytics product. These two ways of calculating an SLA cover the right ways to measure and score performance.

This is AppDynamics’ Transaction Analytics Breakdown for users who had errors or poor user experience over the last week, and their SLA class:

 

 

Simplistic SLAs are in the core APM product. Here is a view showing requests that were below the calculated baseline, showing which were in SLA violation.

Combining an SLA with Apdex results in a more meaningful number. Unfortunately, I cannot take credit for this idea. Alain Cohen, one of the brightest minds in performance analysis, was the co-founder and CTO (almost co-CEO) of OPNET. Alain discussed with me his ideas around a new performance index concept called OpDex, which fixes many of the Apdex flaws by applying an SLA. Unfortunately, Alain is no longer solving performance problems for customers; he’s decided to take his skills and talents elsewhere after a nice payout.

Alain shared his OpDex plan with me in 2011; thankfully all of the details are outlined in this patent, which was granted in 2013. But OPNET’s great run of innovation has ended, and Riverbed has failed to pick up where OPNET left off; at least they have patents to show for these good ideas and concepts.

The other issue with Apdex is that users are being ignored by the formula. CoScale outlined this issue in a detailed blog post. They explain that histograms are a far better way to analyze a variant population. This is no different than looking at performance metrics coming from the infrastructure layer, where histograms and heat charts tend to provide much better visual analysis.

AppDynamics employs automated baselines for every metric collected, and measures deviations from those baselines out of the box. We also support static SLA thresholds as needed. Visually, AppDynamics has a lot of options, including viewing data in histograms, looking at percentiles, and providing an advanced analytics platform for whatever use cases our users come up with. We believe these are valid alternatives to relying extensively on Apdex, which has its own set of downsides.

 

 

The Evolution of APM

In my previous post, I discussed the importance of APM. The APM market has evolved substantially over the years, mostly in an attempt to adapt to changing application technologies and deployments. When we had very simple applications that directly accessed a database, APM was not much more than a performance analyzer for a database. But as applications moved to the web and we saw the first wave of application servers, APM solutions really came into their own. At the time we were very concerned with the performance and behavior of individual moving parts, such as:

  • Physical servers and the operating system hosting our applications
  • JVM
  • Application server behavior
  • Application response time

We captured metrics from all of these sources and stitched them together into a holistic story. We were deeply interested in garbage collection behavior, thread and connection pools, operating system reads and writes, and so forth. Not to mention, we raised fatal alerts whenever a server went down. Advanced implementations even introduced the ability to trace a request from the web server that received it across tiers to any backend system, such as a database. These were powerful solutions, but then something happened to rock our world: the cloud.

The cloud changed our view of the world because no longer did we take a system-level view of the behavior of our applications, but rather we took an application-centric view of the behavior of our applications. The infrastructure upon which an application runs is still important, but what is more important is whether or not an application is able to execute its business transactions in a normal fashion. If a server goes down, we do not need to worry as long as the application business transactions are still satisfied. As a matter of fact, cloud-based applications are elastic, which means that we should expect the deployment environment to expand and contract on a regular basis. For example, if you know that your business experiences significant load on Fridays from 5pm-10pm then you might want to start up additional virtual servers to support that additional load at 4pm and shut them down at 11pm. The former APM monitoring model of raising alerts when servers go down would drive you nuts.

Furthermore, by expanding and contracting your environment, you may find that single server instances only live for a matter of a few hours. I have heard of one large cloud-based application that uses a very large amount of RAM in its JVMs, but its recycling strategy ensures that those servers are shut down before garbage collection ever has a chance to run. This might be an extreme example, but it illustrates that what was once one of the most impactful performance issues has been rendered a non-issue by a creative deployment model.

You may still find some APM solutions from the old world, but the modern APM vendors have seen these changes in the industry and have designed APM solutions that focus on your application’s behavior, placing far greater importance on the performance and availability of business transactions than on the underlying systems that support them.

Buy versus Build

This article has covered a lot of ground and now you’re faced with a choice: do you evaluate APM solutions and choose the one that best fits your needs, or do you try to roll your own? I really think this comes down to the same questions that you need to ask yourself in any buy versus build decision: what is your core business and is it financially worth building your own solution? A previous post goes into the specific costs associated with buying an APM solution versus building your own in house; read it here.

If your core business is selling widgets then it probably does not make a lot of sense to build your own performance management system. If, on the other hand, your core business is building technology infrastructure and middleware for your clients then it might make sense (but see the answer to question two below). You also have to ask yourself where your expertise lies. If you are a rock star at building an eCommerce site but have not invested the years that APM vendors have in analyzing the underlying technologies to understand how to interpret performance metrics then you run the risk of leaving your domain of expertise and missing something vital.

The next question is: is it financially worth building your own solution? This depends on how complex your applications are and how downtime or performance problems affect your business. If your applications leverage a lot of different technologies (e.g. Java, .NET, PHP, web services, databases, NoSQL data stores) then it is going to be a large undertaking to develop performance management code for all of these environments. But if you have a simple servlet that calls a database then it might not be insurmountable.

Finally, ask yourself about the impact of downtime or performance issues on your business. If your company makes its livelihood by selling its products online then downtime can be disastrous. And in a modern competitive online sales world, performance issues can impact you more than you might expect. Consider how the average person completes a purchase: she typically researches the item online to choose the one she wants. She’ll have a set of trusted vendors (and hopefully you’re in that honored set) and she’ll choose the one with the lowest price. If the site is slow then she’ll just move on to the next vendor in her list, which means you just lost the sale. Additionally, customers place a lot of value on their impression of your web presence. This is a hard metric to quantify, but if your web site is slow then it may damage customer impressions of your company and hence lead to a loss in confidence and sales.

All of this is to say that if you have a complex environment and performance issues or downtime are costly to your business then you are far better off buying an APM solution that allows you to focus on your core business and not on building the infrastructure to support your core business.

Conclusion

Application Performance Management involves measuring the performance of your applications, capturing performance metrics from the individual systems that support your applications, and then correlating them into a holistic view. The APM solution observes your application to determine normalcy and, when it detects abnormal behavior, it captures contextual information about the abnormal behavior and notifies you of the problem. Advanced implementations even allow you to react to abnormal behavior by changing your deployment, such as by adding new virtual servers to an application tier that is under stress. An APM solution is important to your business because it can help you reduce your mean time to resolution (MTTR) and lessen the impact of performance issues on your bottom line. If you have a complex application and performance or downtime issues can negatively affect your business, then it is in your best interest to evaluate APM solutions and choose the best one for your applications.

This article reviewed APM and helped outline when you should adopt an APM solution. In the next article, we’ll review the challenges in implementing an APM strategy and dive much deeper into the features of APM solutions so that you can better understand what it means to capture, analyze, and react to performance problems as they arise.