Best Practices for Instrumenting Containers with AppDynamics Agents

In this blog I will show some best practices for instrumenting Docker containers, using docker-compose with a few popular AppDynamics application agent types. The goal here is to avoid rebuilding your application containers in the event of an agent upgrade, or having to hard-code AppDynamics configuration into your container images. In my role as a DevOps engineer working on AppDynamics’ production environments, I use these techniques to simplify our instrumented container deployments. I’ll cover the install of binary agents like the Java agent, as well as agents installed via a repository such as Node.js or Python.

Before getting into the best practices, let’s review the most common deployment pattern—which isn’t a best practice at all.

Common (but not best-practice) Pattern:  Install Agent During Container Image Build

The first approach we’ll cover is installing the agent via Dockerfile as part of the application container build. This has the advantage of following the conventional practice of easily copying in your source files and providing transparency of the build in your Dockerfile, making adoption simpler and more intuitive. AppDynamics does not recommend this approach, however, as it requires a fresh copy of your application image to be rebuilt every time an agent needs an upgrade. This is inefficient and unnecessary because the agent is not a central part of your application code. Additionally, hard-coding the agent install in this manner may prove more difficult when you automate your builds and deployments.

Java Example

In this Dockerfile example for installing the Java agent, we have the binary stored in AWS S3 and simply copy over the agent during build time of the application image.

Dockerfile snippet: Copy from S3

Here is a similar step where we copy the agent locally.

Dockerfile snippet: Copy locally

Node.js Example

In this example, we use npm to install a specific Node.js agent version during build time.

Dockerfile snippet

Python Example

In this example, we use pip to install a specific Python agent version during build time.

Dockerfile snippet

Best Practices Pattern: Install Agents at Runtime Using Environment Variables and Sidecar Container

The below examples cover two different patterns, depending on agent type. For Java and similarly packaged agents, we’ll use something called a “sidecar container” to install the agent at container runtime.  For repository-installed agents like Node.js and Python, we’ll use environment variables and a startup script that will install the agent at container runtime.

Java Example

For the sidecar container pattern, we build a container image with the agent binary that we want to install. We then volume-mount the directory that contains the agent, so our application container can copy the agent during container runtime, and then install. This can be simplified by unpackaging the agent in the sidecar container, volume-mounting the newly unpackaged agent directory, and then having the application container point to the volume-mounted directory and using it as its agent directory. We’ll cover both examples below, starting with how we create the sidecar container or “agent-repo.”

In the Dockerfile example for the Java agent, we store the binary in AWS S3 (in an agent version-specific bucket) and simply copy the agent during build-time. We then unzip the agent, allowing us to either copy the agent to the application container and then unzipping, or simply pointing to the unzipped agent directory. Notice we use a build ARG, which allows for a more automated build using a build script.

Agent Repo Dockerfile: Copy from S3

Here’s the same example as above, but one where we copy the agent locally without using a build ARG.

Agent Repo Dockerfile: Copy locally

The build script utilizes a build ARG. If you’re using the S3 pattern above, this allows you to pass in the agent version you like.

Now that we have built our sidecar container image, let’s cover how to build the Java agent container image to utilize this agent deployment pattern.

In the Docker snippet below, we copy in two new scripts, and The script copies and extracts the agent from the volume-mounted directory, /sharedFiles/, to the application container. The script is used as our ENTRYPOINT.  This script will call and start the application.

Java Dockerfile snippet

The script (below) calls, which copies and unzips the agent into the $CATALINA_HOME directory. We then pass in that directory as part of our Java options in the application-startup command. snippet

In the docker-compose.yml, we simply add the agent-repo container with volume mount. Our Tomcat container references the agent-repo container and volume, but also uses agent-dependent environment variables so that we don’t have to edit any configuration files. This makes the deployment much more automated and portable/reusable.


In the example below, we show another way to do this. We skip the entire process of adding the and scripts, electing instead to copy a customized script and using that as our CMD. This pattern still uses the agent-repo sidecar container, but points to the volume-mounted, unzipped agent directory as part of the $CATALINA_OPTS.

Java Dockerfile snippet snippet

OK, that covers the sidecar container agent deployment pattern. So what about agents that utilize a repository to install an agent? How do we automate that process so we don’t have to rebuild our application container image every time we want to upgrade our agents to a specific version? The answer is quite simple and similar to the examples above. We add a script, which is used as our ENTRYPOINT, and then use environment variables set in the docker-compose.yml to install the specific version of our agent.

Node.js Example

Dockerfile snippet

In our index.js that is copied in (not shown in the above Dockerfile snippet), we reference our agent-dependent environment variables, which are set in the docker-compose.yml.

index.js snippet

In the script, we use npm to install the agent. The version installed will depend on whether we specifically set the $AGENT_VERSION variable in the docker-compose.yml. If set, the version set in the variable will get installed. If not, the latest version will be installed.

In the docker-compose.yml, we set the $AGENT_VERSION to the agent version we want npm to install. We also set our agent-dependent environment variables, allowing us to avoid hard-coding these values. This makes the deployment much more automated and portable/reusable.


Python Example

This example is very similar to the Node.js example, except that we are using pip to install our agent.

Dockerfile snippet

In the script, we use pip to install the agent. The version installed will depend on whether we specifically set the $AGENT_VERSION variable in the docker-compose.yml. If set, the version set in the variable will get installed. If not, the latest version will be installed.

In the docker-compose.yml, we set the $AGENT_VERSION to the agent version we want pip to install. We also set our agent-dependent environment variables, allowing us to avoid hard-coding these values. This makes the deployment much more automated and portable/reusable.



Pick the Best Pattern

There are many ways to instrument your Docker containers with AppDynamics agents.  We have covered a few patterns and shown what works well for my team when managing a large Docker environment.

In the Common Pattern (but not best-practice) example, I showed how you must rebuild your application container every time you want to upgrade the agent version—not an ideal approach.

But with the Best Practices Pattern, you decouple the agent specifics from the application container images, and direct that responsibility to the sidecar container and the docker-compose environment variables.

Automation, whenever possible, is always a worthy goal. Following the Best Practices Pattern will allow you to improve script deployments, leverage version control and configuration management, and plug them all into CI/CD pipelines.

For in-depth information on related techniques, read these AppDynamics blogs:

Deploying AppDynamics Agents to OpenShift Using Init Containers

The AppD Approach: Composing Docker Containers for Monitoring

The AppD Approach: Leveraging Docker Store Images with Built-In AppDynamics



APM vs aaNPM – Cutting Through the Marketing BS

Marketing_BSMarketing; mixed feelings! I’m really looking forward to the new super bowl ads (some are pure marketing genius) in a few weeks but I really dislike all of the confusion that marketing tends to create in the technology world. In todays blog post I’m going to attempt to cut through all of the marketing BS and clearly articulate the differences between APM (Application Performance Management) and aaNPM (Application Aware Network Performance Management). I’m not going to try to convince you that you need one instead of the other. This isn’t a sales pitch, it’s an education session.


APM – Gartner has been hard at work trying to clearly define the APM space. According to Gartner, APM tools should contain the following 5 dimensions:

  • End-user experience monitoring (EUM) – (How is the application performing according to what the end user sees)
  • Runtime application architecture discovery modeling and display (A view of the logical application flow at any given point in time)
  • User-defined transaction profiling (Tracking user activity across the entire application)
  • Component deep-dive monitoring in application context (Application code execution, sql query execution, message queue behavior, etc…)
  • Analytics

aaNPM – Again, Gartner has created a definition for this market segment which you can read about here

“These solutions allow passive packet capture of network traffic and must include the following features, in addition to packet capture technology:

  • Receive and process one or more of these flow-based data sources: NetFlow, sFlow and Internet Protocol Flow Information Export (IPFIX).
  • Provide roll-ups and dashboards of collected data into business-relevant views, consisting of application-centric performance displays.
  • Monitor performance in an always-on state, and generate alarms based on manual or automatically generated thresholds.
  • Offer protocol analysis capabilities to decode and understand multiple applications, including voice, video, HTTP and database protocols. The tool must provide end-user experience information for these applications.
  • Have the ability to decrypt encrypted traffic if the proper keys are provided to the solution.”


So what do these definitions mean in real life?

APM tools are typically used by application support, operations, and development teams to rapidly identify, isolate, and repair application issues. These teams usually have a high level understanding of how networks operate but not nearly the detailed knowledge required to resolve network related issues. They live and breathe application code, integration points, and component (server, OS, VM, JVM, etc…) metrics. They call in the network team when they think there is a network issue (hopefully based upon some high level network indicators). These teams usually have no control over the network and must follow the network teams process to get a physical device connected to the network. APM tools do not help you solve network problems.

aaNPM tools are network packet sniffers by definition. The majority of these tools (NetScout, Cisco, Fluke Networks, Network Instruments, etc.) are hardware appliances that you connect to your network. They need to be connected to each segment of the network where you want to collect packets or they must be fed filtered and aggregated packet streams by NPB (Network Packet Broker) devices. aaNPM tools contain a wealth of network level details in addition to some valuable application data (EUM metrics, transaction details, and application flows). aaNPM tools help network engineers solve network problems that are manifesting themselves as application problems. aaNPM tools are not capable of solving code issues as they have no visibility into application code.

If I were on a network support team I would want to understand if applications were being impacted by network issues so I could prioritize remediation properly and have a point of reference if I was called out by an application support team.

Network and Application Team Convergence

I’ve been asked if I see network and application teams converging in a similar way that dev and ops teams are converging as a result of the DevOps movement. I have not seen this with any of the companies I get to interact with. Based on my experience working in operations at various companies in the past, network teams and application support teams think in very different ways. Is it impossible for these teams to work in unison? No, but I see it as unlikely over the next few years at least.

But what about SDN (Software Defined Networking)? I think most network engineers see SDN as a threat to their livelihood whether that is true or not. No matter the case, SDN will take a long time to make it’s way into operational use and in the mean time network and application teams will remain separate.

I hope this was helpful in cutting through the marketing spin that many vendors are using to expand their reach. When it comes right down to it, your use case may require APM, aaNPM, a combination of both, or some other technology not discussed today. Technology is made to solve problems. I highly recommend starting out by defining your problems and then exploring the best solutions available to help solve your problem.

If you’ve decided to explore your APM tool options you can try AppDynamics for free by clicking here.

DevOps Scares Me – Part 4: Dev and Ops Collaborate Across the Lifecycle

Today we are blogging something a little different than our normal. I’m Jim Hirschauer the Operations Guy, and this is my esteemed colleague Dustin Whittle the Developer. In this blog post we’re going to discuss how we would take an application from inception through development, testing, QA, and into production. We’ll each comment on the different stages and provide our perspective on the tools that we need to use at each stage and how they help with automation, testing, and monitoring. Along the way we’ll call out the potential collaboration points to identify the areas where the DevOps approach provides the most value.

The software development loop looks like this:

DevOps Cycle

Inception and working with a product team

From an operational perspective, my first instinct is to understand the application architecture so that I can start thinking about the proper deployment model for the infrastructure components. Here are some of my operational questions and considerations for this stage:

  • Are we using a public or private cloud?
  • What is the lead time for spinning up each component and ensuring they comply with my companies regulations?
  • When do I need to provide a development environment to my dev team or will they handle it themselves?
  • Does this application perform functions that other applications or services already handle? Operations should have high-level visibility into the application and service portfolio.

From a development perspective, my first milestone is to make sure the ops team fully understands the application and what it takes to deploy it to a pre-production environment. This is where we the developers sync with the product and ops team and make sure we are aligned.

Planning for the product team:

  • Is the project scope well defined? Is there a product requirements document?
  • Do we have a well defined product backlog?
  • Are there mocks of the user experience?

Planning for the ops team:

  • What tools will we use for deployment and configuration management?
  • How will we automate the deployment process and does the ops team understand the manual steps?
  • How will we integrate our builds with our continuous integration server?
  • How will we automate the provisioning of new environments?
  • Capacity Planning – Do we know the expected production load?

There’s not a ton of activity at this stage for the operations team. This is really where the devops synergy comes into play. DevOps is simply operations working together with engineers to get things done faster in an automated and repeatable way. When it comes to scaling, the more automation in place the easier things will be in the long run.

Development and scoping production

This should start with a conversation between the dev and ops teams to control domain ownership. Depending on your organization and peers strengths this is a good time to decide who will be responsible for automating the provisioning and deployment of the application. The ops questions for deploying complex web applications:

  • How do you provision virtual machines?
  • How do you configure network devices and servers?
  • How do you deploy applications?
  • How do you collect and aggregate logs?
  • How do you monitor services?
  • How do you monitor network performance?
  • How do you monitor application performance?
  • How do you alert and remediate when there are problems?

During the development phase the operations focused staff normally make sure the development environment is managed and are actively working to set up the test, QA and Prod environments. This can take a lot of time if automation tools aren’t used.

Here are some tools you can use to automate server build and configuration:

Meanwhile, the operations staff should also make sure that the developers have access to tools which will help them with release management and application monitoring and troubleshooting. Here are some of those tools:

Deployment Automation:

Infrastructure Monitoring:


Application + Network Performance Management:

Testing and Quality Assurance

Once developers have built unit and functional tests we need to ensure the tests are running after every commit and we don’t allow regressions in our promoted environments. In theory, developers should do this before they commit any code, but often times problems don’t show up until you have production traffic running under production infrastructure. The goal of this step is really to simulate as much as possible everything that can go wrong and find out what happens and how to remediate.

QA Problems

The next step is to do capacity planning and load testing to be confident the application doesn’t fall over when it is needed most. There are a variety of tools for load testing:

  • Apica Load Test – Cloud-based load testing for web and mobile applications
  • Soasta – Build, execute, and analyze performance tests on a single, powerful, intuitive platform.
  • Bees with Machine Guns – A utility for arming (creating) many bees (micro EC2 instances) to attack (load test) targets (web applications).
  • MultiMechanize – Multi-Mechanize is an open source framework for performance and load testing. It runs concurrent Python scripts to generate load (synthetic transactions) against a remote site or service. Multi-Mechanize is most commonly used for web performance and scalability testing, but can be used to generate workload against any remote API accessible from Python.
  • Google PageSpeed Insights – PageSpeed Insights analyzes the content of a web page, then generates suggestions to make that page faster. Reducing page load times can reduce bounce rates and increase conversion rates.

The last step of testing is discovering all of the possible failure scenarios and coming up with a disaster recovery plan. For example what happens if we lose a database or a data center or have a 100x surge in traffic.

During the test and QA stages operations needs to play a prominent role. This is often overlooked by ops teams but their participation in test and QA can make a meaningful difference in the quality of the release into production. Here’s how.

If the application is already in production (and monitored properly), operations has access to production usage and load patterns. These patterns are essential to the QA team for creating a load test that properly exercises the application. I once watched a functional test where 20+ business transactions were tested manually by the application support team. Directly after the functional test I watched the load test that ran the same 2 business transactions over and over again. Do you think the load test was an accurate representation of production load? No way! When I asked the QA team why there were only 2 transactions they said “Because that is what the application team told us to model.”

The development and application support teams usually don’t have time to sit with the QA team and give them an accurate assessment of what needs to be modeled for load testing. Operations teams should work as the middle man and provide business transaction information from production or from development if this is an application that has never seen production load.

Here are some of the operational tasks during testing and QA:

  • Ensure monitoring tools are in place.
  • Ensure environments are properly configured
  • Participate in functional, load, stress, leak, etc… tests and provide analysis and support
  • Providing guidance to the QA team


Production is traditionally the domain of the operations team. For as long as I can remember, the development teams have thrown applications over the production wall for the operations staff to deal with when there are problems. Sure, some problems like hardware issues, network issues, and cooling issues are purely on the shoulders of operations–but what about all of those application specific problems? For example, there are problems where the application is consuming way too many resources, or when the application has connection issues with the database due to a misconfiguration, or when the application just locks up and has to be restarted.

I recall getting paged in the middle of the night for application-related issues and thinking how much better each release would be if the developers had to support their applications once they made it to production. It was really difficult back in those days to say with any certainty that the problem was application related and that a developer needed to be involved. Today’s monitoring tools have changed that and allow for problem isolation in just minutes. Since developers in financial services organizations are not allowed access to production servers, it makes having the proper tools all the more important.

Production devops is all about:

  • deploying code in a fast, repeatable, scalable manner
  • rapidly identifying performance and stability problems
  • alerting the proper team when a problem is detected
  • rapidly isolating the root cause of problems
  • automatic remediation of known problems and rapid manual remediation of new problems (runbooks and runbook automation)

Your application must always be available and operating correctly during business hours (this may be 24×7 for your specific application).

Alerting Tools:

In case of failures alerting tools are crucial to notify the ops team of serious issues. The operations team will usually have a runbook to turn to when things go wrong. A best practice is to collaborate on incident response plans.


Finally we’ve made it to the last major category of the SDLC, maintenance. As an operations guy my mind focuses on the following tasks:

  • Capacity planning – Do we have enough resources available to the application? If we use dynamic scaling, this is not an issue but a task to ensure the scaling is working properly.
  • Patching – are we up to date with patches on the infrastructure and application components? This is supposed to help with performance and/or security and/or stability but it doesn’t always work out that way.
  • Support – are we current with our software support levels (aka, have we paid and are we on supported versions)?
  • New releases (application updates) – New releases always made me cringe since I assumed the release would have issues the first week. I learned this reaction from some very late nights immediately following those new releases.

As a developer the biggest issues during the maintenance phase is working with the operations team to deploy new versions and make critical bug fixes. The other primary concern is troubleshooting production problems. Even when no new code has been deployed, sometimes failures happen. If you have a great process, application performance monitoring, and a devops mentality collaborating with ops to resolve the root cause of failures becomes easy.

As you can see, the dev and ops perspectives are pretty different, but that’s exactly why those 2 sides of the house need to tear down the walls and work together. DevOps isn’t just a set of tools, but a philosophical shift that needs that requires buy-in from all folks involved to really succeed. It’s only through a high level of collaboration that things will change for the better. AppDynamics can’t change the mindset of your organization, but it is a great way to foster collaboration across all of your organizational silos. Sign up for your free trial today and make a difference for you organization.

UX – Monitor the Application or the Network?

Last week I flew into Las Vegas for #Interop fully suited and booted in my big blue costume (no joke). I’d been invited to speak in a vendor debate on User eXperience (UX): Monitor the Application or the Network? NetScout represented the Network, AppDynamics (and me) represented the Application, and “Compuware dynaTrace Gomez” sat on the fence representing both. Moderating was Jim Frey from EMA, who did a great job introducing the subject, asking the questions and keeping the debate flowing.

At the start each vendor gave their usual intro and company pitch, followed by their own definition on what User Experience is.

Defining User Experience

So at this point you’d probably expect me to blabber on about how application code and agents are critical for monitoring the UX? Wrong. For me, users experience “Business Transactions”–they don’t experience applications, infrastructure, or networks. When a user complains, they normally say something like “I can’t Login” or “My checkout timed out.” I can honestly say I’ve never heard them say –  “The CPU utilization on your machine is too high” or “I don’t think you have enough memory allocated.”

Now think about that from a monitoring perspective. Do most organizations today monitor business transactions? Or do they monitor application infrastructure and networks? The truth is the latter, normally with several toolsets. So the question “Monitor the Application or the Network?” is really the wrong question for me. Unless you monitor business transactions, you are never going to understand what your end users actually experience.

Monitoring Business Transactions

So how do you monitor business transactions? The reality is that both Application and Network monitoring tools are capable, but most solutions have been designed not to–just so they provide a more technical view for application developers and network engineers. This is wrong, very wrong and a primary reason why IT never sees what the end user sees or complains about. Today, SOA means applications are more complex and distributed, meaning a single business transaction could traverse multiple applications that potentially share services and infrastructure. If your monitoring solution doesn’t have business transaction context, you’re basically blind to how application infrastructure is impacting your UX.

The debate then switched to how monitoring the UX differs from an application and network perspective. Simply put, application monitoring relies on agents, while network monitoring relies on sniffing network traffic passively. My point here was that you can either monitor user experience with the network or you can manage it with the application. For example, with network monitoring you only see business transactions and the application infrastructure, because you’re monitoring at the network layer. In contrast, with application monitoring you see business transactions, application infrastructure, and the application logic (hence why it’s called application monitoring).

Monitor or Manage the UX?

Both application and network monitoring can identify and isolate UX degradation, because they see how a business transaction executes across the application infrastructure. However, you can only manage UX if you can understand what’s causing the degradation. To do this you need deep visibility into the application run-time and logic (code). Operations telling a Development team that their JVM is responsible for a user experience issue is a bit like Fedex telling a customer their package is lost somewhere in Alaska. Identifying and Isolating pain is useful, but one could argue it’s pointless without being able to manage and resolve the pain (through finding the root cause).

Netscout made the point that with network monitoring you can identify common bottlenecks in the network that are responsible for degrading the UX. I have no doubt you could, but if you look at the most common reason for UX issues, it’s related to change–and if you look at what changes the most, it’s application logic. Why? Because Development and Operations teams want to be agile, so their applications and business remains competitive in the marketplace. Agile release cycles means application logic (code) constantly changes. It’s therefore not unusual for an application to change several times a week, and that’s before you count hotfixes and patches. So if applications change more than the network, then one could argue it’s more effective for monitoring and managing the end user experience.

UX and Web Applications

We then debated which monitoring concept was better for web-based applications. Obviously, network monitoring is able to monitor the UX by sniffing HTTP packets passively, so it’s possible to get granular visibility on QoS in the network and application. However, the recent adoption of Web 2.0 technologies (ajax, GWT, Dojo) means application logic is now moving from the application server to the users browser. This means browser processing time becomes a critical part of the UX. Unfortunately, Network monitoring solutions can’t monitor browser processing latency (because they monitor the network), unlike application monitoring solutions that can use techniques like client-side instrumentation or web-page injection to obtain browser latency for the UX.

The C Word

We then got to the Cloud and which made more sense for monitoring UX. Well, network monitoring solutions are normally hardware appliances which plug direct into a network tap or span port. I’ve never asked, but I’d imagine the guys in Seattle (Amazon) and Redmond (Windows Azure) probably wouldn’t let you wheel a network monitoring appliance into their data-centre. More importantly, why would you need to if you’re already paying someone else to manage your infrastructure and network for you? Moving to the Cloud is about agility, and letting someone else deal with the hardware and pipes so you can focus on making your application and business competitive. It’s actually very easy for application monitoring solutions to monitor UX in the cloud. Agents can piggy back with application code libraries when they’re deployed to the cloud, or cloud providers can embed and provision vendor agents as part of their server builds and provisioning process.

What’s interesting also is that Cloud is highlighting a trend towards DevOps (or NoOps for a few organizations) where Operations become more focused on applications vs infrastructure. As the network and infrastructure becomes abstracted in the Public Cloud, then the focus naturally shifts to the application and deployment of code. For private clouds you’ll still have network Ops and Engineering teams that build and support the Cloud platform, but they wouldn’t be the people who care about user experience. Those people would be the Line of Business or application owners which the UX impacts.

In reality most organizations today already monitor the application infrastructure and network. However, if you want to start monitoring the true UX, you should monitor what your users experience, and that is business transactions. If you can’t see your users’ business transactions, you can’t manage their experience.

What are your thoughts on this?

AppDynamics is an application monitoring solution that helps you monitor business transactions and manage the true user experience. To get started sign-up for a 30-day free trial here.

I did have an hour spare at #Interop after my debate to meet and greet our competitors, before flying back to AppDynamics HQ. It was nice to see many of them meet and greet the APM Caped Crusader.

App Man.

APM vs NPM. 2nd Round K.O.

Round Two – Last time I wrote a blog comparing APM versus network-based APM tools, which I still consider NPM at it’s core regardless of what some critics and competitors claim. Let me make one thing clear though, NPM is great for equipping IT network administrators to see how fast or slow data is traveling through the pipes of their application. Unfortunately, network-based APM tools simply cannot provide App Ops granular visibility into the application runtime when isolating bottlenecks go beyond the system level and it’s final destination – the end user’s browser.

I find several of the blogs and YouTube clips from such NPM vendors quite comical as they try to throw punches at APM companies. Their arguments are centered primarily against agent-based approaches being an inadequate APM solution due to today’s fickle and distributed application architectures. It’s not like I haven’t heard it before.

The amusing thing about it…they’re completely right! In fact, we couldn’t agree more, and that’s why Jyoti Bansal founded AppDynamics to address these perennial shortcomings legacy APM vendors have been ignoring. Even the smallest businesses next to the largest enterprises have complex applications that have outpaced their App Ops teams’ current set of monitoring tools. That’s why AppDynamics is reinventing and reigniting the application performance management space by enabling IT operations to monitor complex, modern applications running in the cloud or the data center. So let me respond to those claims they’ve made.

The Claims

“Agents have high deployment and ongoing maintenance burden.”
Legacy APM: TRUE
AppDynamics: FALSE. No manual instrumentation required. It’s automatic.

“Agents are invasive which can perturb the systems being monitored.”
Legacy APM: TRUE
AppDynamics: FALSE. Our customers see less than 1-2% overhead in production. 

“Performance management vendors have over promised and under delivered for decades.”
Legacy APM: TRUE
AppDynamics: FALSE. Things are going well thanks. Check our customer list and 400% growth.

All AppDynamics. The next-gen of APM.

Example App with application performance issues

I drew a parallel in my previous post that using NPM concepts to monitor application performance is like inspecting UPS packages en-route to figure out why operations at a hub came to a screeching halt. Remember, even if the package contents is visible from afar, it doesn’t explain why the hub conveyors, which electronically guide packages to their appropriate destination chute is broken, nor can it identify why cargo operations have stalled. In other words, good luck trying to gather anything beyond the scope of the application’s infrastructure. Using network monitoring tools to collect even the most basic system health metrics such as CPU utilization, memory usage, thread pool consumption and thrashing? Time to throw in the towel.

And what about End User Monitoring?

What’s becoming just as important as being able to monitor server side processing and network time is the ability to monitor end user performance. When NPM tools are only able to see the last packet sent from the server, how does that help you understand the browser’s performance? It doesn’t since once again, this kind of analysis is only feasible higher up the stack at the Application Layer. And just to clarify when I say Application Layer, I mean application execution time, not “network process time to application” as defined by OSI Layer 7.

On the other hand, injected agents residing in that layer can insert JavaScript into the Web page to determine the execution time spent in the browser. This is becoming more of a concern for App Ops and Dev Ops now that 80-90% of the end-user response time is spent on the frontend executing JavaScript, rendering markup and stylesheets. As business logic continues it’s migration to the browser while increasing it’s processing burden, the client is looking more and more like the new server. Network monitoring tools must move to an agent-based approach if they are to truly deliver the monitoring visibility needed for the application and end user experience, otherwise their visibility will remain between a rock and a hard place.

On top of that, what about those customers running their applications in a public cloud? Are you going to convince your cloud provider to install a network appliance into their infrastructure? I highly doubt it. With AppDynamics, we have partnerships with cloud providers such as Amazon EC2, Azure, RightScale and Opsource allowing developers and operations to easily deploy AppDynamics with a flick of a switch and monitor their applications in production 24/7.

Once again, next-gen APM triumphs over NPM based application performance on not just the server side, but also the browser. AppDynamics is embracing this and fully aware of the technical and business significance of monitoring end user performance. We’re delighted to offer this kind of end-to-end visibility to our customers who will now be able to monitor application performance from the end users’ browser to the backend application tiers (databases, mainframes), all through a single pane of glass view.

APM vs NPM. Round One.

Another three-letter acronym I see frequently mixed in with APM is NPM which stands for Network Performance Management. At first glance they look very similar. The distinction appears very subtle with just a one letter difference, but it speaks volumes because their core technologies and approaches to monitoring application performance are fundamentally different.

Application Performance Management tools typically use agents that live in the application run-time which capture performance details of how application logic computes across the application infrastructure.

In contrast, Network Performance Monitoring tools are agent-less appliances that sit on the network analyzing content and traffic by capturing packets flowing through the network. They can measure response times and find errors for your applications by understanding the wire protocols across the application tiers. Like agent based APM solutions, they can automatically discover your application topology, but NPM tools lack deep code-level diagnostics which is paramount to helping App Ops and Dev teams solve problems. Here’s why.

Imagine your application’s architecture is FedEx providing package delivery services to it’s customers in a timely manner. If you were to apply the concept of network monitoring to this analogy, using an NPM tool would help track how long it takes for packages to travel in transit from one hub to the next, examining it’s contents and the expected arrival time to it’s final destination.

Now let’s say that a package was being sent from Paris to Los Angeles but hit a snag at the London hub along the way. Operations at London came to a screeching hault for some inexplicable reason and your task is to figure out why and how to fix it ASAP so operations are running smooth again. In this case, APM is the favorable solution over NPM here. APM tools not only empower you with the ability to see how long it takes for your package to travel to different locations, but also how the package is processed at the facility and why operations came to a standstill. At this juncture, you probably wouldn’t even care to analyze the package’s contents since you already know what’s inside and it’s still in London.

When applying this concept to the world of business applications, NPM tools won’t provide you with visibility into application logic inefficiencies, especially when experiencing a performance issue or service level breach to key business transactions. NPM performance metrics are derived from captured packets and the network protocols the servers support, but if you’re making inefficient, iterative business logic calls in your code, how will capturing and analyzing packets help you triage the problem? Its the same story for identifying the root cause of memory leaks or thread synchronization issues; execution and visibility like this occurs at the application layer, not at the network layer. Thus, I would argue that NPM compliments APM but doesn’t serve as a viable replacement.

Now before I go enumerating my theories as to why I see this trend, let me just say while I respect their hustle, nice try but not quite. While scouring the web I found many NPM companies trying to crash the popular APM party by marketing their network performance monitoring utilities as an application monitoring and management solution. It can be confusing for a first time buyer trying to evaluate APM tools when you see NPM companies using phrases like, “network-based APM” or “gather data already on your network” in their product messaging.

One of the reasons for this messaging is because NPM is an immensely crowded market. Here’s a list of NPM tools compiled by some IT folks at Stanford. There’s over 300 network monitoring tools, and that’s only going back five years to 2007!! Take your pick. Second, applications and the tools to monitor their performance and availability speaks to customers at a higher level from a technical and business perspective. If I’m in charge of some transaction intensive application, my customers are directly interfacing with the Application Layer in order to complete business transactions which takes place at the highest layer on the stack. Customers aren’t 5-6 levels deep at the Network Layer interacting with packets or the routers and switches they pass through.

By the way, have you seen Gartner’s APM spend analysis? It’s a $2 billion market and growing. This most certainly is a valid reason why NPM solutions are now positioning themselves as an APM solution to customers.

If you have absolutely nothing to monitor any part of your applications infrastructure, then something is better than nothing. However, at some point you’ll need better visibility at the application layer regardless how fast packets are traveling through the pipes. NPM won’t necessarily provide the range of visibility your App Ops and Dev teams need to diagnose, troubleshoot, and idenfity problem root causes.

Those who live day in and day out the networking world will naturally gravitate to what they’re familiar with, but if you haven’t ventured out into the evolving APM market, I highly recommend it. Even though you can achieve some level of performance monitoring with NPM, you’ll find yourself running into limitations in application visibility pretty quick. But as the old saying goes, “If all you have is a hammer, then everything looks like a nail.”

I’m curious to know what are your thoughts are. Which class of tools would you choose? Why?