How to Identify Impactful Business Transactions in AppDynamics

New users of APM software often believe their company has hundreds of critical business transactions that must be monitored. But that’s not the case. In my role as Professional Services Consultant (EMEA) at AppDynamics, I’ve worked at dozens of customer sites, and the question of “What to monitor?” is always foremost in new users’ minds.

AppDynamics’ Business Transactions (BTs) reflect the core value of your applications. Since our inception a decade ago, we’ve built our APM solution around this concept. Given the critical importance of Business Transactions, you’ll want to configure them the right way. While AppDynamics will automatically create BTs for you, you’ll benefit greatly by taking a few extra steps to optimize your monitoring environment.

APM users often think of a BT as a technical transaction in their system, but it’s much more than that. The BT is a key component of effective application monitoring. It consists of all the required services within your environment—things like login, search and checkout—that are used to fulfill and respond to a user-initiated request. These transactions reflect the logical way users interact with your applications. Activities such as adding an item to a shopping cart or checking out will invoke various applications, databases, third-party APIs and web services.

If you’re new to APM, you may find yourself asking “Where should I begin?” By applying essential best practices, BT configuration can be a smooth and orderly process.

Start by asking yourself two key questions:

  1. What are my business goals for monitoring?
  2. What pain points am I trying to address by using APM?

You may already know the answers. Perhaps you want to resolve major problems that consume a lot of your time and resources, or ensure that your most critical business operations are performing optimally. From there, you can drill down to more specific goals and operations to focus on. A retail website, for instance, may choose to focus on its checkout or catalog operation. Financial services firms may focus on the most-used APIs provided for their mobile clients. By prioritizing your business goals early in the process, you’ll find BTs much easier to configure.

AppDynamics automatically discovers and maps Business Transactions for you. Actions like Add to Cart are tagged and traced across every component of your application and visualized on a topology map, helping you to better understand performance across an entire application.

It’s tempting to think configuration is complete once you’ve instrumented with an agent and start seeing traffic coming in. But that’s just the technical side of things. You’ll also need to align with the business, asking questions like, “Do we have SLAs on this?” and “What’s the performance requirement?” You’ll also need to establish health rules and work with the business to determine, for instance, what action to take if a particular rule is violated.

Choose Your BTs Wisely

At a high level, a Business Transaction is more like a use case, even though users often think of it as a technical transaction. Sometimes I must remind users: “No, this activity you want to monitor is not a business transaction. It’s just a technical function of the system; it isn’t being used by a customer or an API user.” These cross-cutting concerns are often better monitored through views like Service Endpoints or specific technical metrics.

Be very selective when choosing your Business Transactions. Here’s a rule of thumb: Configure up to 20 to 30 BTs per business application. This may not seem like a lot, but really it is. One of AppDynamics’ largest banking customers identified that 90% of its business activity was reflected in just 25 or so business transactions.

It’s not uncommon for new users to balk at this. They may say, “But we have many more important processes to track!” Fear not: the recommended number of BTs isn’t set in stone, although our 20-to-30 guideline is a good starting point. You may have 20 key Business Transactions and another 20 that are less critical, but you really want to monitor all 40. You can do this, of course, but you’ll need to prioritize these transactions. Capturing too many BTs can lead users to miss the transactions that are truly important to the business.

Best Practices

During APM setup, you’ll have many questions. Should you work exclusively with your own technical team? With the application owner? The business that’s using the application?

Start with these three key steps:

  1. Get to know your business.
  2. Identify the major flows.
  3. Talk to the application owner.


Whenever I’m onsite with a customer, the first thing I advise is that we log in as an end user to see how they use the system. For example, we’ll order a product or renew a subscription, and then track these transactions end-to-end through the system. This very important step will help you identify the transactions you want to monitor.

It’s also critical to check the current major incidents you have, or at least the P1s and P2s. Find out what problems you’re experiencing right now. What are the major complaints involving the application?

Focus on the low-hanging fruit—your most troublesome applications—which you’ll find by instrumenting systems and talking to application owners. This will deliver value in the early setup stage, providing information you can take to the business to make them more receptive to working with you.

Prioritize Your Operations

Business Transactions are key to configuring APM. Before starting configuration, ask yourself these critical questions:

  1. What are my business goals for monitoring?
  2. What pain points am I trying to solve with AppDynamics?
  3. What are the typical problems that take up my time and resources?
  4. What are the most critical business operations that need to perform optimally?


Then take a closer look at your application. Decide which operations you must focus on to achieve your goals.

These key steps will help you prioritize operations and make it easier to configure them as Business Transactions. Go here to learn more!

Monitoring Kubernetes and OpenShift with AppDynamics

Here at AppDynamics, we build applications for both external and internal consumption. We’re always innovating to make our development and deployment process more efficient. We refactor apps to get the benefits of a microservices architecture, to develop and test faster without stepping on each other, and to fully leverage containerization.

Like many other organizations, we are embracing Kubernetes as a deployment platform. We use both upstream Kubernetes and OpenShift, an enterprise Kubernetes distribution on steroids. The Kubernetes framework is very powerful. It allows massive deployments at scale, simplifies new version rollouts and multi-variant testing, and offers many levers to fine-tune the development and deployment process.

At the same time, this flexibility makes Kubernetes complex in terms of setup, monitoring and maintenance at scale. Each of the Kubernetes core components (api-server, kube-controller-manager, kubelet, kube-scheduler) has quite a few flags that govern how the cluster behaves and performs. The default values may be OK initially for smaller clusters, but as deployments scale up, some adjustments must be made. We have learned to keep these values in mind when monitoring OpenShift clusters—both from our own pain and from published accounts of other community members who have experienced their own hair-pulling discoveries.

It should come as no surprise that we use our own tools to monitor our apps, including those deployed to OpenShift clusters. Kubernetes is just another layer of infrastructure. Along with the server and network visibility data, we are now incorporating Kubernetes and OpenShift metrics into the bigger monitoring picture.

In this blog, we will share what we monitor in OpenShift clusters and give suggestions as to how our strategy might be relevant to your own environments. (For more hands-on advice, read my blog Deploying AppDynamics Agents to OpenShift Using Init Containers.)

OpenShift Cluster Monitoring

For OpenShift cluster monitoring, we use two plug-ins that can be deployed with our standalone machine agent. AppDynamics’ Kubernetes Events Extension, described in our blog on monitoring Kubernetes events, tracks every event in the cluster. Kubernetes Snapshot Extension captures attributes of various cluster resources and publishes them to the AppDynamics Events API. The snapshot extension collects data on all deployments, pods, replica sets, daemon sets and service endpoints. It captures the full extent of the available attributes, including metadata, spec details, metrics and state. Both extensions use the Kubernetes API to retrieve the data, and can be configured to run at desired intervals.
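
To give a feel for what the snapshot extension does under the hood, here is a rough sketch of a single snapshot pass written in Python with the official Kubernetes client. It is an illustration rather than the extension’s actual code; the record fields shown are only examples, and the real extension publishes a much richer attribute set to the AppDynamics Events API:

# Minimal sketch of a cluster "snapshot" pass, assuming the official
# kubernetes Python client and in-cluster credentials. Publishing to the
# AppDynamics Events API is left out; the records are simply returned.
from kubernetes import client, config

def snapshot():
    config.load_incluster_config()  # use config.load_kube_config() off-cluster
    core, apps = client.CoreV1Api(), client.AppsV1Api()
    records = []
    for d in apps.list_deployment_for_all_namespaces().items:
        records.append({
            "kind": "Deployment",
            "namespace": d.metadata.namespace,
            "name": d.metadata.name,
            "desired": d.spec.replicas,
            "available": d.status.available_replicas,
        })
    for p in core.list_pod_for_all_namespaces().items:
        records.append({
            "kind": "Pod",
            "namespace": p.metadata.namespace,
            "name": p.metadata.name,
            "node": p.spec.node_name,
            "phase": p.status.phase,
        })
    return records

if __name__ == "__main__":
    for record in snapshot():
        print(record)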

The data these plug-ins provide ends up in our analytics data repository and instantly becomes available for mining, reporting, baselining and visualization. The data retention period is at least 90 days, which offers ample time to go back and perform an exhaustive root cause analysis (RCA). It also allows you to reduce the retention interval of events in the cluster itself. (By default, this is set to one hour.)

We use the collected data to build dynamic baselines, set up health rules and create alerts. The health rules, baselines and aggregate data points can then be displayed on custom dashboards where operators can see the norms and easily spot any deviations.

An example of a customizable Kubernetes dashboard.

What We Monitor and Why

Cluster Nodes

At the foundational level, we want monitoring operators to keep an eye on the health of the nodes where the cluster is deployed. Typically, you would have a cluster of masters, where core Kubernetes components (api-server, controller-manager, kube-scheduler, etc.) are deployed, as well as a highly available etcd cluster and a number of worker nodes for guest applications. To paint a complete picture, we combine infrastructure health metrics with the relevant cluster data gathered by our Kubernetes data collectors.

From an infrastructure point of view, we track CPU, memory and disk utilization on all the nodes, and also zoom into the network traffic on etcd. In order to spot bottlenecks, we look at various aspects of the traffic at a granular level (e.g., reads/writes and throughput). Kubernetes and OpenShift clusters may suffer from memory starvation, disks overfilled with logs or spikes in consumption of the API server and, consequently, the etcd. Ironically, it is often monitoring solutions that are known for bringing clusters down by pulling excessive amounts of information from the Kubernetes APIs. It is always a good idea to establish how much monitoring is enough and dial it up when necessary to diagnose issues further. If a high level of monitoring is warranted, you may need to add more masters and etcd nodes. Another useful technique, especially with large-scale implementations, is to have a separate etcd cluster just for storing Kubernetes events. This way, the spikes in event creation and event retrieval for monitoring purposes won’t affect performance of the main etcd instances. This can be accomplished by setting the --etcd-servers-overrides flag of the api-server, for example:

--etcd-servers-overrides=/events#https://etcd1.cluster.com:2379;https://etcd2.cluster.com:2379;https://etcd3.cluster.com:2379

From the cluster perspective we monitor resource utilization across the nodes that allow pod scheduling. We also keep track of the pod counts and visualize how many pods are deployed to each node and how many of them are bad (failed/evicted).

A dashboard widget with infrastructure and cluster metrics combined.

Why is this important? Kubelet, the component responsible for managing pods on a given node, has a setting, --max-pods, which determines the maximum number of pods that can be orchestrated. In Kubernetes the default is 110. In OpenShift it is 250. The value can be changed up or down depending on need. We like to visualize the remaining headroom on each node, which helps with proactive resource planning and prevents sudden overflows (which could mean an outage). Another data point we add there is the number of evicted pods per node.
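
If you want to reproduce the headroom view outside of a dashboard, a minimal sketch along the following lines will do, assuming the official Kubernetes Python client; the allocatable pod count reported by each node reflects the kubelet’s --max-pods value:

# Sketch: per-node pod counts, bad (failed/evicted) pods and remaining headroom.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

scheduled, bad = Counter(), Counter()
for pod in v1.list_pod_for_all_namespaces().items:
    node = pod.spec.node_name
    if not node:
        continue  # pod not scheduled yet
    scheduled[node] += 1
    if pod.status.phase == "Failed" or pod.status.reason == "Evicted":
        bad[node] += 1

for node in v1.list_node().items:
    name = node.metadata.name
    capacity = int(node.status.allocatable["pods"])  # mirrors the kubelet's --max-pods
    used = scheduled[name]
    print(f"{name}: {used}/{capacity} pods, {bad[name]} bad, headroom {capacity - used}")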

Pod Evictions

Evictions are caused by disk-space or memory starvation. We recently had an issue with the disk space on one of our worker nodes due to a runaway log. As a result, the kubelet produced massive evictions of pods from that node. Evictions are bad for many reasons. They will typically affect the quality of service or may even cause an outage. If the evicted pods have an exclusive affinity with the node experiencing disk pressure, and as a result cannot be re-orchestrated elsewhere in the cluster, the evictions will result in an outage. Evictions of core component pods may lead to the meltdown of the cluster.

Long after the incident where pods were evicted, we saw that the evicted pods were still lingering. Why was that? Garbage collection of evictions is controlled by a setting in kube-controller-manager called --terminated-pod-gc-threshold. The default value is set to 12,500, which means that garbage collection won’t occur until you have that many evicted pods. Even in a large implementation it may be a good idea to dial this threshold down to a smaller number.
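
A quick way to check whether evicted pods are piling up ahead of garbage collection is to list them directly. The sketch below uses the official Kubernetes Python client and is not part of our tooling; the delete call is commented out because cleanup is optional:

# Sketch: find evicted pods that are still lingering on the cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

evicted = [
    p for p in v1.list_pod_for_all_namespaces().items
    if p.status.phase == "Failed" and p.status.reason == "Evicted"
]
print(f"{len(evicted)} evicted pods still present")
for p in evicted:
    print(f"  {p.metadata.namespace}/{p.metadata.name} on node {p.spec.node_name}")
    # v1.delete_namespaced_pod(p.metadata.name, p.metadata.namespace)  # manual cleanup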

If you experience a lot of evictions, you may also want to check if kube-scheduler has a custom --policy-config-file defined with no CheckNodeMemoryPressure or CheckNodeDiskPressure predicates.

Following our recent incident, we set up a new dashboard widget that tracks a metric of any threats that may cause a cluster meltdown (e.g., massive evictions). We also associated a health rule with this metric and set up an alert. Specifically, we’re now looking for warning events that tell us when a node is about to experience memory or disk pressure, or when a pod cannot be reallocated (e.g., NodeHasDiskPressure, NodeHasMemoryPressure, ErrorReconciliationRetryTimeout, ExceededGracePeriod, EvictionThresholdMet).

We also look for daemon pod failures (FailedDaemonPod), as they are often associated with cluster health rather than issues with the daemon set app itself.
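
The logic behind that health rule is simple: filter warning events down to the reasons that signal cluster-level trouble. Here is a rough sketch, assuming the official Kubernetes Python client and a reason list you would tune to your own environment:

# Sketch: surface the warning events we alert on.
from kubernetes import client, config

WATCHED_REASONS = {
    "NodeHasDiskPressure", "NodeHasMemoryPressure", "EvictionThresholdMet",
    "ErrorReconciliationRetryTimeout", "ExceededGracePeriod", "FailedDaemonPod",
}

config.load_kube_config()
v1 = client.CoreV1Api()

for e in v1.list_event_for_all_namespaces().items:
    if e.type == "Warning" and e.reason in WATCHED_REASONS:
        obj = e.involved_object
        print(f"{e.last_timestamp} {e.reason}: {obj.kind}/{obj.name} - {e.message}")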

Pod Issues

Pod crashes are an obvious target for monitoring, but we are also interested in tracking pod kills. Why would someone be killing a pod? There may be good reasons for it, but it may also signal a problem with the application. For similar reasons, we track deployment scale-downs, which we do by inspecting ScalingReplicaSet events. We also like to visualize the scale-down trend along with the app health state. Scale-downs, for example, may happen by design through auto-scaling when the app load subsides. They may also be issued manually or in error, and can expose the application to an excessive load.

Pending state is supposed to be a relatively short stage in the lifecycle of a pod, but sometimes it isn’t. It may be a good idea to track pods with a pending time that exceeds a certain, reasonable threshold—one minute, for example. In AppDynamics, we also have the luxury of baselining any metric and then tracking any configurable deviation from the baseline. If you catch a spike in pending state duration, the first thing to check is the size of your images and the speed of image download. One big image may clog the pipe and affect other containers. Kubelet has a flag, --serialize-image-pulls, which is set to “true” by default, meaning images are loaded one at a time. Change the flag to “false” if you want to load images in parallel and avoid the potential clogging by a monster-sized image. Keep in mind, however, that you have to use Docker’s overlay2 storage driver to make this work. In newer Docker versions this setting is the default. In addition to the Kubelet setting, you may also need to tweak the max-concurrent-downloads flag of the Docker daemon to ensure the desired parallelism.
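
As an illustration of the pending-time check, the following sketch flags pods stuck in Pending beyond a threshold. It assumes the official Kubernetes Python client, and the one-minute threshold is just the example value mentioned above:

# Sketch: flag pods that have been Pending for longer than a threshold.
from datetime import datetime, timezone
from kubernetes import client, config

PENDING_THRESHOLD_SECONDS = 60  # example threshold; tune to your environment

config.load_kube_config()
v1 = client.CoreV1Api()
now = datetime.now(timezone.utc)

for p in v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending").items:
    age = (now - p.metadata.creation_timestamp).total_seconds()
    if age > PENDING_THRESHOLD_SECONDS:
        print(f"{p.metadata.namespace}/{p.metadata.name} pending for {int(age)}s")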

Large images that take a long time to download may also cause a different type of issue that results in a failed deployment. The Kubelet flag --image-pull-progress-deadline determines the point in time when the image will be deemed “too long to pull or extract.” If you deal with big images, make sure you dial up the value of the flag to fit your needs.

User Errors

Many big issues in the cluster stem from small user errors (human mistakes). A typo in a spec—for example, in the image name—may bring down the entire deployment. Similar effects may occur due to a missing image or insufficient rights to the registry. With that in mind, we track image errors closely and pay attention to excessive image-pulling. Unless it is truly needed, image-pulling is something you want to avoid in order to conserve bandwidth and speed up deployments.
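
One simple way to surface these image problems is to look at container waiting reasons. Here is a sketch, again assuming the official Kubernetes Python client; the set of reasons is illustrative rather than exhaustive:

# Sketch: report containers stuck on image-related errors.
from kubernetes import client, config

IMAGE_ERRORS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}

config.load_kube_config()
v1 = client.CoreV1Api()

for p in v1.list_pod_for_all_namespaces().items:
    for cs in (p.status.container_statuses or []):
        waiting = cs.state.waiting
        if waiting and waiting.reason in IMAGE_ERRORS:
            print(f"{p.metadata.namespace}/{p.metadata.name}/{cs.name}: "
                  f"{waiting.reason} ({cs.image})")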

Storage issues also tend to arise due to spec errors, lack of permissions or policy conflicts. We monitor storage issues (e.g., mounting problems) because they may cause crashes. We also pay close attention to resource quota violations because they do not trigger pod failures. They will, however, prevent new deployments from starting and existing deployments from scaling up.

Speaking of quota violations, are you setting resource limits in your deployment specs?

Policing the Cluster

On our OpenShift dashboards, we display a list of potential red flags that are not necessarily a problem yet but may cause serious issues down the road. Among these are pods without resource limits or health probes in the deployment specs.

Resource limits can be enforced by resource quotas across the entire cluster or at a more granular level. Violation of these limits will prevent the deployment. In the absence of a quota, pods can be deployed without defined resource limits. Having no resource limits is bad for multiple reasons. It makes cluster capacity planning challenging. It may also cause an outage. If you create or change a resource quota when there are active pods without limits, any subsequent scale-up or redeployment of these pods will result in failures.

The health probes, readiness and liveness, are not enforceable, but it is a best practice to have them defined in the specs. They are the primary mechanism for the pods to tell the kubelet whether the application is ready to accept traffic and is still functioning. If the readiness probe is not defined and the pod takes a long time to initialize (based on the kubelet’s defaults), the pod will be restarted. This loop may continue for some time, taking up cluster resources for no reason and effectively causing a poor user experience or an outage.

The absence of the liveness probe may cause a similar effect if the application is performing a lengthy operation and the pod appears to Kubelet as unresponsive.

We provide easy access to the list of pods with incomplete specs, allowing cluster admins to have a targeted conversation with development teams about corrective action.
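
For illustration, here is a minimal sketch of such a red-flag check, assuming the official Kubernetes Python client. It walks deployment specs and reports containers that are missing resource limits or probes:

# Sketch: flag deployment containers with no resource limits or health probes.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

for d in apps.list_deployment_for_all_namespaces().items:
    for c in d.spec.template.spec.containers:
        problems = []
        if not (c.resources and c.resources.limits):
            problems.append("no resource limits")
        if not c.readiness_probe:
            problems.append("no readiness probe")
        if not c.liveness_probe:
            problems.append("no liveness probe")
        if problems:
            print(f"{d.metadata.namespace}/{d.metadata.name} [{c.name}]: " + ", ".join(problems))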

Routing and Endpoint Tracking

As part of our OpenShift monitoring, we provide visibility into potential routing and service endpoint issues. We track unused services, including those created by someone in error and those without any pods behind them because the pods failed or were removed.

We also monitor bad endpoints pointing at old (deleted) pods, which effectively cause downtime. This issue may occur during rolling updates when the cluster is under increased load and API request-throttling is lower than it needs to be. To resolve the issue, you may need to increase the --kube-api-burst and --kube-api-qps config values of kube-controller-manager.
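
The “service without pods behind it” check boils down to looking for Endpoints objects with no ready addresses. A rough sketch, assuming the official Kubernetes Python client:

# Sketch: find services whose endpoints have no ready addresses.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for ep in v1.list_endpoints_for_all_namespaces().items:
    ready = [
        addr
        for subset in (ep.subsets or [])
        for addr in (subset.addresses or [])
    ]
    if not ready:
        print(f"{ep.metadata.namespace}/{ep.metadata.name}: no ready endpoints")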

Every metric we expose on the dashboard can be viewed and analyzed in the list and further refined with ADQL, the AppDynamics query language. After spotting an anomaly on the dashboard, the operator can drill into the raw data to get to the root cause of the problem.

Application Monitoring

Context plays a significant role in our monitoring philosophy. We always look at application performance through the lens of the end-user experience and desired business outcomes. Unlike specialized cluster-monitoring tools, we are not only interested in cluster health and uptime per se. We’re equally concerned with the impact the cluster may have on application health and, subsequently, on the business objectives of the app.

In addition to having a cluster-level dashboard, we also build specialized dashboards with a more application-centric point of view. There we correlate cluster events and anomalies with application or component availability, end-user experience as reported by real-user monitoring, and business metrics (e.g., conversion of specific user segments).

Leveraging K8s Metadata

Kubernetes makes it super easy to run canary deployments, blue-green deployments, and A/B or multivariate testing. We leverage these conveniences by pulling deployment metadata and using labels to analyze performance of different versions side by side.
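
As a simple illustration of the idea, the sketch below groups the pods of one application by a version label so two rollouts (say, canary and stable) can be compared side by side. It assumes the official Kubernetes Python client, and the "app" and "version" label keys are hypothetical; substitute whatever your deployment specs actually use:

# Sketch: group pods by a version label for side-by-side comparison.
# The label selector and keys are placeholders.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

by_version = defaultdict(list)
for p in v1.list_pod_for_all_namespaces(label_selector="app=checkout").items:
    version = (p.metadata.labels or {}).get("version", "unlabeled")
    by_version[version].append(p.metadata.name)

for version, pods in by_version.items():
    print(f"version={version}: {len(pods)} pods")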

Monitoring Kubernetes or OpenShift is just a part of what AppDynamics does for our internal needs and for our clients. AppDynamics covers the entire spectrum of end-to-end monitoring, from the foundational infrastructure to business intelligence. Inherently, AppDynamics is used by many different groups of operators who may have very different skill sets. That’s why we look at the platform as a collaboration tool that helps translate the language of APM into the language of Kubernetes and vice versa.

By bringing these different datasets together under one umbrella, AppDynamics establishes a common ground for diverse groups of operators. On the one hand you have cluster admins, who are experts in Kubernetes but may not know the guest applications in detail. On the other hand, you have DevOps in charge of APM or managers looking at business metrics, both of whom may not be intimately familiar with Kubernetes. These groups can now have a productive monitoring conversation, using terms that are well understood by everyone and a single tool to examine data points on a shared dashboard.

Learn more about how AppDynamics can help you monitor your applications on Kubernetes and OpenShift.

Diving Into What’s New in Java & .NET Monitoring

In the AppDynamics Spring 2014 release, we added quite a few features to our Java and .NET APM solutions. With the addition of service endpoints, an improved JMX console, JVM crash detection and crash reports, additional support for many popular frameworks, and async support, we have the best APM solution in the marketplace for Java and .NET applications.

Added support for frameworks:

  • TypeSafe Play/Akka

  • Google Web Toolkit

  • JAX-RS 2.0

  • Apache Synapse

  • Apple WebObjects

Service Endpoints

With the addition of service endpoints, customers with large SOA environments can define specific service points to track metrics and get associated business transaction information. Service endpoints help service owners monitor and troubleshoot their own services within a large set of services.

JMX Console

The JMX console has been greatly improved, adding the ability to manage complex attributes, execute MBean methods and update MBean attributes.

JVM Crash Detector

The JVM crash detector has been improved to provide crash reports with dump files that allow tracing the root cause of JVM crashes.

Async Support

We added improved support for asynchronous calls, along with a waterfall timeline that shows more clearly where time is spent during requests.


AppDynamics for .NET applications has been greatly improved with better integration and support for Windows Azure, ASP.NET MVC 5, improved Windows Communication Foundation support, and RabbitMQ support.


Take five minutes to get complete visibility into the performance of your production applications with AppDynamics today.