Monitoring Kubernetes and OpenShift with AppDynamics

Here at AppDynamics, we build applications for both external and internal consumption. We’re always innovating to make our development and deployment process more efficient. We refactor apps to get the benefits of a microservices architecture, to develop and test faster without stepping on each other, and to fully leverage containerization.

Like many other organizations, we are embracing Kubernetes as a deployment platform. We use both upstream Kubernetes and OpenShift, an enterprise Kubernetes distribution on steroids. The Kubernetes framework is very powerful. It allows massive deployments at scale, simplifies new version rollouts and multi-variant testing, and offers many levers to fine-tune the development and deployment process.

At the same time, this flexibility makes Kubernetes complex in terms of setup, monitoring and maintenance at scale. Each of the Kubernetes core components (api-server, kube-controller-manager, kubelet, kube-scheduler) has quite a few flags that govern how the cluster behaves and performs. The default values may be OK initially for smaller clusters, but as deployments scale up, some adjustments must be made. We have learned to keep these values in mind when monitoring OpenShift clusters—both from our own pain and from published accounts of other community members who have experienced their own hair-pulling discoveries.

It should come as no surprise that we use our own tools to monitor our apps, including those deployed to OpenShift clusters. Kubernetes is just another layer of infrastructure. Along with the server and network visibility data, we are now incorporating Kubernetes and OpenShift metrics into the bigger monitoring picture.

In this blog, we will share what we monitor in OpenShift clusters and give suggestions as to how our strategy might be relevant to your own environments. (For more hands-on advice, read my blog Deploying AppDynamics Agents to OpenShift Using Init Containers.)

OpenShift Cluster Monitoring

For OpenShift cluster monitoring, we use two plug-ins that can be deployed with our standalone machine agent. AppDynamics’ Kubernetes Events Extension, described in our blog on monitoring Kubernetes events, tracks every event in the cluster. The Kubernetes Snapshot Extension captures attributes of various cluster resources and publishes them to the AppDynamics Events API. The snapshot extension collects data on all deployments, pods, replica sets, daemon sets and service endpoints. It captures the full extent of the available attributes, including metadata, spec details, metrics and state. Both extensions use the Kubernetes API to retrieve the data, and can be configured to run at desired intervals.

The data these plug-ins provide ends up in our analytics data repository and instantly becomes available for mining, reporting, baselining and visualization. The data retention period is at least 90 days, which offers ample time to go back and perform an exhaustive root cause analysis (RCA). It also allows you to reduce the retention interval of events in the cluster itself (by default, the api-server keeps events for one hour, as set by its --event-ttl flag).

We use the collected data to build dynamic baselines, set up health rules and create alerts. The health rules, baselines and aggregate data points can then be displayed on custom dashboards where operators can see the norms and easily spot any deviations.
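AppDynamics computes its dynamic baselines internally, and the sketch below is not that algorithm. It is a minimal stand-in, assuming a baseline is simply the mean and standard deviation of recent samples, that shows how a health rule can flag a deviation from the norm. The function name and the three-standard-deviation default are hypothetical.

```python
from statistics import mean, stdev


def violates_baseline(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Return True if `current` falls outside mean +/- `sigmas` standard deviations of `history`."""
    if len(history) < 2:
        # stdev() needs at least two samples; without them there is no baseline yet.
        return False
    baseline = mean(history)
    spread = stdev(history)
    return abs(current - baseline) > sigmas * spread


# Hypothetical per-minute counts of evicted pods during a normal period.
history = [0, 1, 0, 0, 2, 1, 0, 1]
print(violates_baseline(history, 1))   # False: within the norm
print(violates_baseline(history, 12))  # True: a spike worth an alert
```

A real baseline would also account for seasonality (time of day, day of week), which this sketch ignores.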

An example of a customizable Kubernetes dashboard.

What We Monitor and Why

Cluster Nodes

At the foundational level, we want monitoring operators to keep an eye on the health of the nodes where the cluster is deployed. Typically, you would have a cluster of masters, where the core Kubernetes components (api-server, controller-manager, kube-scheduler, etc.) are deployed, as well as a highly available etcd cluster and a number of worker nodes for guest applications. To paint a complete picture, we combine infrastructure health metrics with the relevant cluster data gathered by our Kubernetes data collectors.

From an infrastructure point of view, we track CPU, memory and disk utilization on all the nodes, and also zoom into the network traffic on etcd. To spot bottlenecks, we look at various aspects of the traffic at a granular level (e.g., reads/writes and throughput). Kubernetes and OpenShift clusters may suffer from memory starvation, disks overfilled with logs, or spikes in load on the API server and, consequently, on etcd. Ironically, monitoring solutions themselves are known to bring clusters down by pulling excessive amounts of information from the Kubernetes APIs. It is always a good idea to establish how much monitoring is enough and dial it up only when necessary to diagnose issues further. If a high level of monitoring is warranted, you may need to add more masters and etcd nodes.

Another useful technique, especially with large-scale implementations, is to have a separate etcd cluster just for storing Kubernetes events. This way, spikes in event creation and event retrieval for monitoring purposes won’t affect the performance of the main etcd instances. This can be accomplished by setting the --etcd-servers-overrides flag of the api-server, for example:

--etcd-servers-overrides=/events#https://etcd1.cluster.com:2379;https://etcd2.cluster.com:2379;https://etcd3.cluster.com:2379

From the cluster perspective we monitor resource utilization across the nodes that allow pod scheduling. We also keep track of the pod counts and visualize how many pods are deployed to each node and how many of them are bad (failed/evicted).

A dashboard widget with infrastructure and cluster metrics combined.

Why is this important? The kubelet, the component responsible for managing pods on a given node, has a setting, --max-pods, which determines the maximum number of pods that can run on that node. In Kubernetes the default is 110; in OpenShift it is 250. The value can be raised or lowered as needed. We like to visualize the remaining headroom on each node, which helps with proactive resource planning and prevents sudden overflows (which could mean an outage). Another data point we add there is the number of evicted pods per node.
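As a rough illustration of the headroom and eviction figures described above, here is a minimal Python sketch that works on node and pod objects in the JSON shape returned by `kubectl get ... -o json`. It assumes that every non-terminated pod bound to a node counts against that node's limit, and that a node's `status.allocatable.pods` mirrors the kubelet's --max-pods value. The function name is hypothetical.

```python
from collections import Counter


def node_pod_report(nodes: list[dict], pods: list[dict]) -> dict[str, dict[str, int]]:
    """Per-node pod headroom (max pods minus active pods) and evicted-pod count.

    `nodes` and `pods` are items as returned by `kubectl get nodes -o json`
    and `kubectl get pods --all-namespaces -o json`.
    """
    active: Counter = Counter()
    evicted: Counter = Counter()
    for pod in pods:
        node = pod.get("spec", {}).get("nodeName")
        if not node:
            continue  # not scheduled to a node yet
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Failed" and status.get("reason") == "Evicted":
            evicted[node] += 1
        elif phase not in ("Succeeded", "Failed"):
            active[node] += 1  # assumption: Pending, Running or Unknown pods occupy a slot

    report = {}
    for node in nodes:
        name = node["metadata"]["name"]
        # status.allocatable.pods normally mirrors the kubelet's --max-pods value.
        max_pods = int(node["status"]["allocatable"]["pods"])
        report[name] = {
            "max_pods": max_pods,
            "active": active[name],
            "headroom": max_pods - active[name],
            "evicted": evicted[name],
        }
    return report


# Hypothetical sample data.
nodes = [{"metadata": {"name": "worker-1"}, "status": {"allocatable": {"pods": "110"}}}]
pods = [
    {"spec": {"nodeName": "worker-1"}, "status": {"phase": "Running"}},
    {"spec": {"nodeName": "worker-1"}, "status": {"phase": "Failed", "reason": "Evicted"}},
    {"spec": {}, "status": {"phase": "Pending"}},
]
print(node_pod_report(nodes, pods))
```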

Pod Evictions

Evictions are caused by disk space or memory starvation on a node. We recently had an issue with disk space on one of our worker nodes due to a runaway log. As a result, the kubelet evicted pods from that node en masse. Evictions are bad for many reasons. They typically degrade the quality of service and may even cause an outage: if the evicted pods have an exclusive affinity with the node experiencing disk pressure, and therefore cannot be rescheduled elsewhere in the cluster, the evictions will result in an outage. Evictions of core component pods may lead to a meltdown of the cluster.

Long after the incident, we saw that the evicted pods were still lingering. Why was that? Garbage collection of terminated pods, including evicted ones, is controlled by a kube-controller-manager setting called --terminated-pod-gc-threshold. Its default value is 12,500, which means that garbage collection won’t occur until you have that many terminated pods. Even in a large implementation, it may be a good idea to dial this threshold down to a smaller number.

If you experience a lot of evictions, you may also want to check whether kube-scheduler has a custom --policy-config-file defined without the CheckNodeMemoryPressure or CheckNodeDiskPressure predicates.

Following our recent incident, we set up a new dashboard widget that tracks a metric for threats that may cause a cluster meltdown (e.g., massive evictions). We also associated a health rule with this metric and set up an alert. Specifically, we now look for warning events that tell us when a node is about to experience memory or disk pressure, or when a pod cannot be reallocated (e.g., NodeHasDiskPressure, NodeHasMemoryPressure, ErrorReconciliationRetryTimeout, ExceededGracePeriod, EvictionThresholdMet).

We also look for daemon pod failures (FailedDaemonPod), as they are often associated with cluster health rather than issues with the daemon set app itself.
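One way to turn the watch list above into a metric is to count matching events per involved object and attach a health rule to that count. Below is a minimal Python sketch, assuming events in the JSON shape returned by `kubectl get events -o json`; the constant and function names are hypothetical, and the reasons are the ones listed above.

```python
from collections import Counter

# Event reasons named above as early signs of a cluster meltdown.
THREAT_REASONS = {
    "NodeHasDiskPressure",
    "NodeHasMemoryPressure",
    "ErrorReconciliationRetryTimeout",
    "ExceededGracePeriod",
    "EvictionThresholdMet",
    "FailedDaemonPod",
}


def threat_counts(events: list[dict]) -> Counter:
    """Count threat events per (reason, kind, name) of the involved object.

    Matching is on `reason` only, so the rule does not depend on the event `type`.
    """
    counts: Counter = Counter()
    for ev in events:
        reason = ev.get("reason")
        if reason not in THREAT_REASONS:
            continue
        obj = ev.get("involvedObject", {})
        # `count` is how many times the event was repeated; it may be absent or null.
        counts[(reason, obj.get("kind"), obj.get("name"))] += ev.get("count") or 1
    return counts


# Hypothetical sample events.
events = [
    {"type": "Warning", "reason": "EvictionThresholdMet", "count": 3,
     "involvedObject": {"kind": "Node", "name": "worker-1"}},
    {"type": "Normal", "reason": "Scheduled",
     "involvedObject": {"kind": "Pod", "name": "web-1"}},
]
print(threat_counts(events))
```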

Pod Issues

Pod crashes are an obvious target for monitoring, but we are also interested in tracking pod kills. Why would someone be killing a pod? There may be good reasons for it, but it may also signal a problem with the application. For similar reasons, we track deployment scale-downs, which we do by inspecting ScalingReplicaSet events. We also like to visualize the scale-down trend along with the app health state. Scale-downs, for example, may happen by design through auto-scaling when the app load subsides. They may also be issued manually or in error, and can expose the application to an excessive load.
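To track scale-downs from ScalingReplicaSet events, the event message has to be parsed. The Python sketch below assumes the deployment controller's message looks like "Scaled down replica set web-5d8f7c9b6 to 2" (some versions append "from 3"); that format is an assumption worth verifying against your cluster version, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed message format of ScalingReplicaSet events emitted by the deployment controller.
SCALE_DOWN = re.compile(r"Scaled down replica set (\S+) to (\d+)")


def scale_downs(events: list[dict]):
    """Yield (deployment, replica_set, new_replicas) for each scale-down event."""
    for ev in events:
        if ev.get("reason") != "ScalingReplicaSet":
            continue
        match = SCALE_DOWN.search(ev.get("message", ""))
        if match:
            deployment = ev.get("involvedObject", {}).get("name")
            yield deployment, match.group(1), int(match.group(2))


# Hypothetical sample events.
events = [
    {"reason": "ScalingReplicaSet", "involvedObject": {"kind": "Deployment", "name": "web"},
     "message": "Scaled down replica set web-5d8f7c9b6 to 2 from 3"},
    {"reason": "ScalingReplicaSet", "involvedObject": {"kind": "Deployment", "name": "web"},
     "message": "Scaled up replica set web-7c4d9f8b5 to 3"},
]
# Scale-downs per deployment, to plot next to the app's health state.
print(Counter(dep for dep, _, _ in scale_downs(events)))
```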

Pending state is supposed to be a relatively short stage in the lifecycle of a pod, but sometimes it isn’t. It may be a good idea to track pods whose pending time exceeds a certain reasonable threshold, such as one minute. In AppDynamics, we also have the luxury of baselining any metric and then tracking any configurable deviation from the baseline.

If you catch a spike in pending-state duration, the first thing to check is the size of your images and the speed of image downloads. One big image may clog the pipe and affect other containers. The kubelet has a flag, --serialize-image-pulls, which is set to “true” by default, meaning that images are pulled one at a time. Set the flag to “false” if you want to pull images in parallel and avoid potential clogging by a monster-sized image. Keep in mind, however, that you have to use Docker’s overlay2 storage driver to make this work; in newer Docker versions, overlay2 is the default. In addition to the kubelet setting, you may also need to tweak the --max-concurrent-downloads flag of the Docker daemon to ensure the desired parallelism.
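A pending-time check like the one described above can be sketched in a few lines of Python. This assumes pod objects in `kubectl get pods -o json` shape and uses time since `metadata.creationTimestamp` as an approximation of time spent pending; the one-minute default mirrors the example threshold in the text, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp as serialized by Kubernetes, e.g. "2018-06-01T12:00:00Z"."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def long_pending_pods(pods, now=None, threshold=timedelta(minutes=1)):
    """Return (namespace, name, pending_for) for pods stuck in Pending longer than `threshold`.

    Assumes time since creation approximates time spent pending.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        meta = pod["metadata"]
        pending_for = now - parse_k8s_time(meta["creationTimestamp"])
        if pending_for > threshold:
            stuck.append((meta.get("namespace"), meta["name"], pending_for))
    return stuck


# Hypothetical sample pods, evaluated at a fixed "now".
pods = [
    {"metadata": {"namespace": "shop", "name": "api-1", "creationTimestamp": "2018-06-01T12:00:00Z"},
     "status": {"phase": "Pending"}},
    {"metadata": {"namespace": "shop", "name": "api-2", "creationTimestamp": "2018-06-01T12:04:30Z"},
     "status": {"phase": "Pending"}},
]
now = datetime(2018, 6, 1, 12, 5, tzinfo=timezone.utc)
print(long_pending_pods(pods, now=now))
```

Only api-1 is reported: it has been pending for five minutes, while api-2 has been pending for 30 seconds.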

Large images that take a long time to download may also cause a different type of issue that results in a failed deployment. The kubelet flag --image-pull-progress-deadline determines when an image pull is deemed “too long to pull or extract.” If you deal with big images, make sure you dial up the value of the flag to fit your needs.

User Errors

Many big issues in the cluster stem from small user errors (human mistakes). A typo in a spec—for example, in the image name—may bring down the entire deployment. Similar effects may occur due to a missing image or insufficient rights to the registry. With that in mind, we track image errors closely and pay attention to excessive image-pulling. Unless it is truly needed, image-pulling is something you want to avoid in order to conserve bandwidth and speed up deployments.
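Image problems such as a typo in the image name, a missing image or insufficient registry rights all show up in the container status as a waiting reason. A minimal Python sketch that collects them, assuming pod objects in `kubectl get pods -o json` shape; the function name and sample data are hypothetical.

```python
# Waiting reasons the kubelet reports for containers whose image cannot be pulled.
IMAGE_ERROR_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def image_errors(pods: list[dict]) -> list[tuple]:
    """Return (namespace, pod, container, image, reason) for containers stuck on image errors."""
    found = []
    for pod in pods:
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        statuses = (status.get("initContainerStatuses") or []) + (status.get("containerStatuses") or [])
        for cs in statuses:
            waiting = (cs.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") in IMAGE_ERROR_REASONS:
                found.append((meta.get("namespace"), meta.get("name"),
                              cs.get("name"), cs.get("image"), waiting["reason"]))
    return found


# Hypothetical pod whose image tag contains a typo.
pods = [{
    "metadata": {"namespace": "shop", "name": "web-1"},
    "status": {"containerStatuses": [{
        "name": "web", "image": "registry.example.com/web:1.2.3-typo",
        "state": {"waiting": {"reason": "ImagePullBackOff"}},
    }]},
}]
print(image_errors(pods))
```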

Storage issues also tend to arise due to spec errors, lack of permissions or policy conflicts. We monitor storage issues (e.g., mounting problems) because they may cause crashes. We also pay close attention to resource quota violations: they do not trigger pod failures, but they prevent new deployments from starting and existing deployments from scaling up.
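Both kinds of issue can be picked out of the cluster's event stream. The Python sketch below assumes that mount problems surface as FailedMount or FailedAttachVolume events, and that quota rejections surface as FailedCreate events on the owning controller (for example a ReplicaSet) with "exceeded quota" in the message. Treat those reason and message patterns as assumptions to check against your cluster; the function name and sample events are hypothetical.

```python
from collections import Counter

STORAGE_REASONS = {"FailedMount", "FailedAttachVolume"}


def classify_event(ev: dict):
    """Return "storage", "quota" or None for a Kubernetes event."""
    reason = ev.get("reason")
    if reason in STORAGE_REASONS:
        return "storage"
    # Quota rejections are reported on the controller, not the pod, as FailedCreate,
    # with the admission error (containing "exceeded quota") in the message.
    if reason == "FailedCreate" and "exceeded quota" in (ev.get("message") or ""):
        return "quota"
    return None


# Hypothetical sample events.
events = [
    {"reason": "FailedMount", "message": "(hypothetical) volume mount failed"},
    {"reason": "FailedCreate", "message": 'Error creating: pods "web-1" is forbidden: exceeded quota: team-a'},
    {"reason": "Pulled", "message": "(hypothetical) image pulled"},
]
summary = Counter(filter(None, map(classify_event, events)))
print(summary)
```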

Speaking of quota violations, are you setting resource limits in your deployment specs?

Policing the Cluster

On our OpenShift dashboards, we display a list of potential red flags that are not necessarily a problem yet but may cause serious issues down the road. Among these are pods without resource limits or health probes in the deployment specs.

Resource limits can be enforced by resource quotas across the entire cluster or at a more granular level. Violation of these limits will prevent the deployment. In the absence of a quota, pods can be deployed without defined resource limits. Having no resource limits is bad for multiple reasons. It makes cluster capacity planning challenging. It may also cause an outage. If you create or change a resource quota when there are active pods without limits, any subsequent scale-up or redeployment of these pods will result in failures.

The health probes (readiness and liveness) are not enforced, but it is a best practice to define them in the specs. They are the primary mechanism by which the kubelet learns whether the application is ready to accept traffic and is still functioning. Without a readiness probe, a pod that takes a long time to initialize is treated as ready as soon as its containers start, so it receives traffic it cannot yet serve. The result can be failed requests, a poor user experience or even an outage.

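The absence of a liveness probe can have a similar effect: if the application hangs or gets stuck in a lengthy operation, the kubelet has no way to detect that it is unresponsive and will not restart it, so the pod keeps taking up cluster resources while serving nothing.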

We provide easy access to the list of pods with incomplete specs, allowing cluster admins to have a targeted conversation with development teams about corrective action.
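A list like this can be built from pod specs alone. Below is a minimal Python sketch, assuming pod objects in `kubectl get pods -o json` shape and treating a missing CPU or memory limit, or a missing readiness or liveness probe, as a red flag; the function names are hypothetical.

```python
def spec_red_flags(pod: dict) -> list[str]:
    """List containers in a pod that lack CPU/memory limits or readiness/liveness probes."""
    flags = []
    for c in pod.get("spec", {}).get("containers", []):
        name = c.get("name")
        limits = (c.get("resources") or {}).get("limits") or {}
        for resource in ("cpu", "memory"):
            if resource not in limits:
                flags.append(f"{name}: no {resource} limit")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in c:
                flags.append(f"{name}: no {probe}")
    return flags


def incomplete_specs(pods: list[dict]) -> dict[str, list[str]]:
    """Map "namespace/name" to its red flags, skipping pods whose specs are complete."""
    report = {}
    for pod in pods:
        flags = spec_red_flags(pod)
        if flags:
            meta = pod.get("metadata", {})
            report[f"{meta.get('namespace')}/{meta.get('name')}"] = flags
    return report


# Hypothetical pod with limits and a readiness probe, but no liveness probe.
pods = [{
    "metadata": {"namespace": "shop", "name": "web-1"},
    "spec": {"containers": [{
        "name": "web",
        "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
        "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
    }]},
}]
print(incomplete_specs(pods))
```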

Routing and Endpoint Tracking

As part of our OpenShift monitoring, we provide visibility into potential routing and service endpoint issues. We track unused services, including those created by someone in error and those without any pods behind them because the pods failed or were removed.
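Services with no pods behind them can be found by joining each Service with its Endpoints object of the same name. A minimal Python sketch, assuming objects in `kubectl get services/endpoints -o json` shape; services without a selector are skipped because their endpoints are not managed by the endpoints controller. The function name and sample data are hypothetical.

```python
def services_without_backends(services: list[dict], endpoints: list[dict]) -> list[str]:
    """Return "namespace/name" of selector-based services that have no ready endpoint addresses."""
    ready = {}
    for ep in endpoints:
        meta = ep["metadata"]
        key = f"{meta.get('namespace')}/{meta['name']}"
        # Ready addresses are listed under subsets[].addresses (not-ready ones under notReadyAddresses).
        ready[key] = sum(len(s.get("addresses") or []) for s in ep.get("subsets") or [])
    unused = []
    for svc in services:
        if not (svc.get("spec") or {}).get("selector"):
            continue  # services without a selector (e.g. ExternalName) get no endpoints from the controller
        meta = svc["metadata"]
        key = f"{meta.get('namespace')}/{meta['name']}"
        if ready.get(key, 0) == 0:
            unused.append(key)
    return unused


# Hypothetical sample objects: "legacy" has no pods behind it.
services = [
    {"metadata": {"namespace": "shop", "name": "web"}, "spec": {"selector": {"app": "web"}}},
    {"metadata": {"namespace": "shop", "name": "legacy"}, "spec": {"selector": {"app": "legacy"}}},
]
endpoints = [
    {"metadata": {"namespace": "shop", "name": "web"},
     "subsets": [{"addresses": [{"ip": "10.1.2.3"}]}]},
    {"metadata": {"namespace": "shop", "name": "legacy"}},
]
print(services_without_backends(services, endpoints))
```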

We also monitor bad endpoints pointing at old (deleted) pods, which effectively cause downtime. This issue may occur during rolling updates when the cluster is under increased load and the rate limits on kube-controller-manager’s requests to the API server are lower than they need to be. To resolve the issue, you may need to increase the --kube-api-burst and --kube-api-qps config values of kube-controller-manager.
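Stale endpoints can be detected by checking each endpoint address's `targetRef` against the pods that still exist. The Python sketch below assumes objects in `kubectl get endpoints/pods -o json` shape and compares pod UIDs rather than names, so a new pod that reuses an old name is not mistaken for the deleted one. The function name and sample data are hypothetical.

```python
def stale_endpoint_addresses(endpoints: list[dict], pods: list[dict]) -> list[tuple]:
    """Return (service, ip, pod_name) for endpoint addresses whose target pod no longer exists."""
    live_uids = {p["metadata"].get("uid") for p in pods}
    stale = []
    for ep in endpoints:
        meta = ep["metadata"]
        service = f"{meta.get('namespace')}/{meta['name']}"
        for subset in ep.get("subsets") or []:
            for addr in subset.get("addresses") or []:
                ref = addr.get("targetRef") or {}
                # Compare by UID so that a pod recreated under the same name is not a false match.
                if ref.get("kind") == "Pod" and ref.get("uid") not in live_uids:
                    stale.append((service, addr.get("ip"), ref.get("name")))
    return stale


# Hypothetical sample: web-1 was deleted, but its address is still listed.
pods = [{"metadata": {"namespace": "shop", "name": "web-2", "uid": "b2"}}]
endpoints = [{"metadata": {"namespace": "shop", "name": "web"}, "subsets": [{"addresses": [
    {"ip": "10.1.2.3", "targetRef": {"kind": "Pod", "name": "web-1", "uid": "a1"}},
    {"ip": "10.1.2.4", "targetRef": {"kind": "Pod", "name": "web-2", "uid": "b2"}},
]}]}]
print(stale_endpoint_addresses(endpoints, pods))
```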

Every metric we expose on the dashboard can be viewed and analyzed in the list and further refined with ADQL, the AppDynamics query language. After spotting an anomaly on the dashboard, the operator can drill into the raw data to get to the root cause of the problem.

Application Monitoring

Context plays a significant role in our monitoring philosophy. We always look at application performance through the lens of the end-user experience and desired business outcomes. Unlike specialized cluster-monitoring tools, we are not only interested in cluster health and uptime per se. We’re equally concerned with the impact the cluster may have on application health and, subsequently, on the business objectives of the app.

In addition to having a cluster-level dashboard, we also build specialized dashboards with a more application-centric point of view. There we correlate cluster events and anomalies with application or component availability, end-user experience as reported by real-user monitoring, and business metrics (e.g., conversion of specific user segments).

Leveraging K8s Metadata

Kubernetes makes it super easy to run canary deployments, blue-green deployments, and A/B or multivariate testing. We leverage these conveniences by pulling deployment metadata and using labels to analyze performance of different versions side by side.

Monitoring Kubernetes or OpenShift is just a part of what AppDynamics does for our internal needs and for our clients. AppDynamics covers the entire spectrum of end-to-end monitoring, from the foundational infrastructure to business intelligence. Inherently, AppDynamics is used by many different groups of operators who may have very different skills. For example, we look at the platform as a collaboration tool that helps translate the language of APM to the language of Kubernetes and vice versa.

By bringing these different datasets together under one umbrella, AppDynamics establishes a common ground for diverse groups of operators. On the one hand you have cluster admins, who are experts in Kubernetes but may not know the guest applications in detail. On the other hand, you have DevOps in charge of APM or managers looking at business metrics, both of whom may not be intimately familiar with Kubernetes. These groups can now have a productive monitoring conversation, using terms that are well understood by everyone and a single tool to examine data points on a shared dashboard.

Learn more about how AppDynamics can help you monitor your applications on Kubernetes and OpenShift.

Business iQ Enhancements in 4.5: Connecting the dots between applications and business

Digital transformation has brought applications into the limelight. The importance of application performance monitoring has grown manyfold, and correlating application performance with business outcomes is more critical than ever. Applications are the touchpoints between businesses and end-users and, therefore, influence business strategy.

Business iQ, the business performance monitoring solution from AppDynamics, helps our customers correlate application performance with critical business metrics in real time. Providing these real-time insights in an uncomplicated, easily operable manner is a theme that persists across our product releases. For the 4.5 release, in July 2018, our team again focused on delivering solutions that are easy to use and solve key business performance issues.

Improved Business Correlation

Business Journeys, released in 4.4 in November 2017, provides end-to-end visibility into complex multi-step business workflows by stitching them together through unique identifiers. In 4.5, we provide out-of-the-box dashboards and metrics with no manual configuration required. These dashboards provide an aggregate view of key data points such as average total time, average wait time between milestones, conversion, and number of events per milestone for all business journeys. An example of a loan application approval business journey is shown below.

Out-of-the-box metrics on business journeys allow users to create health rules and alerts to track deviations from the normal business workflow. These metrics can be compared with past data using machine-learning-based dynamic baselines. The screenshot below compares the average total time for loan approval in the last hour with the baseline from the last 15 days of data.

We also added support for custom data sources to the configuration, and one-click access from the aggregate view to the underlying event data. All these enhancements focus on providing more business insights with just a few clicks.

Experience level management (XLM) allows users to monitor and report on key service levels and end-user experience levels critical to delivering high-performing applications. Since these experience levels and service levels differ by geography, time zone support for such reports is a must-have. XLM configuration now supports different time zones. For example, United Airlines can now track login response time separately for the East Coast and West Coast, or for North America and Europe.

Business iQ’s analytics and reporting capabilities are powered by AppDynamics Query Language (ADQL) and UI widgets.

Additional ADQL Operators

The new ADQL operators HAVING and SINCE..UNTIL enable more sophisticated aggregation and filtering. The HAVING clause filters the groups produced by aggregate functions such as SUM or AVG, because the WHERE clause cannot be used with aggregate functions. For example, in a financial application, HAVING lets you list the business transactions whose average response time is greater than 5000 ms.
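The difference between WHERE and HAVING is easiest to see on a tiny data set. The Python sketch below is not ADQL; it only mimics the semantics of grouping by transaction, averaging, and then filtering the groups, as in the 5000 ms example above. The field names and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def slow_transactions(rows: list[dict], limit_ms: float = 5000) -> dict[str, float]:
    """Average response time per business transaction, keeping only groups above `limit_ms`."""
    times = defaultdict(list)
    for row in rows:  # GROUP BY transaction
        times[row["transaction"]].append(row["response_time_ms"])
    averages = {name: mean(values) for name, values in times.items()}  # AVG(...)
    return {name: avg for name, avg in averages.items() if avg > limit_ms}  # HAVING AVG(...) > limit


# Hypothetical transaction events.
rows = [
    {"transaction": "Transfer Funds", "response_time_ms": 7000},
    {"transaction": "Transfer Funds", "response_time_ms": 4000},
    {"transaction": "View Balance", "response_time_ms": 6000},
    {"transaction": "View Balance", "response_time_ms": 1000},
]
print(slow_transactions(rows))
```

Only "Transfer Funds" (average 5500 ms) is returned. A row-level WHERE filter on response time above 5000 ms would also have matched one "View Balance" event, even though that transaction's average is only 3500 ms.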

SINCE..UNTIL makes it easier to select a specific time window in the ADQL search query without being tied to the UI time picker. For example, the following query returns all events from Black Friday, Nov 24, 2017, using Unix epoch times for midnight to midnight US Pacific time:

SELECT * FROM transactions SINCE 1511510400 UNTIL 1511596800

Or use the following query to search for events from the last hour:

SELECT * FROM transactions SINCE 1 hour
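Epoch values like the ones in the Black Friday query are easy to get wrong by a few hours. As a sketch in Python (not part of ADQL), here is one way to derive them, assuming the intended window is midnight to midnight US Pacific time, which is what 1511510400 and 1511596800 correspond to. A fixed UTC-8 offset is used because daylight saving time had already ended on Nov 5, 2017; a `zoneinfo` time zone would be more robust for arbitrary dates.

```python
from datetime import datetime, timedelta, timezone

# Pacific Standard Time (UTC-8) was in effect on Nov 24, 2017.
PST = timezone(timedelta(hours=-8))

since = int(datetime(2017, 11, 24, tzinfo=PST).timestamp())
until = int(datetime(2017, 11, 25, tzinfo=PST).timestamp())
print(f"SELECT * FROM transactions SINCE {since} UNTIL {until}")
# prints: SELECT * FROM transactions SINCE 1511510400 UNTIL 1511596800
```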

Widget Enhancements

Enhancements to our widgets allow more precise customization and make it easier to interpret trends in your event data. A log axis for time series widgets, level of significance and trailing-period comparison for numeric widgets, and other enhancements let users compare and highlight the metrics that matter most.

Business Metrics

Business Metrics are used to monitor the values of recurring ADQL searches, such as per-minute data on the number of customers affected by slowness in a login application. The Metric Listing page has been updated to provide a more intuitive experience: click a metric name to open a pre-populated Metric Browser, or select multiple metrics to view them together in the Metric Browser for comparison. The Metric page now displays the underlying ADQL query for each metric, making it easy to see what a particular metric represents and how it is calculated.

Greater Scale

On our underlying platform, called the Event Service, we continue to improve our existing architecture to ensure maximum uptime, real-time availability of data, and blazing-fast query response times. This will allow the platform to scale to even greater heights, ingest more events, and respond to queries with the performance AppDynamics users expect.

Agentless Analytics

We are also excited to launch a Beta program for using Transaction Analytics without installing an additional Analytics Agent, enabling analytics data collection with the snap of a finger.

The focus of product teams at AppDynamics is to deliver easy-to-use solutions that provide key application and business performance insights. This cannot be achieved without valuable feedback from our customers. Feel free to reach out to your AppDynamics account team to share your thoughts and to learn more about what’s new in 4.5 or what’s coming in the near future.

Louis Huard and Stefan Hermanek also contributed to this blog post. 

Improve the Productivity of Relationship Managers and Financial Advisors with Business iQ

Every job has its mundane administrative tasks, and we all hate them. In the world of wealth management, relationship managers are pressured to serve as many existing customers and prospects as they can with the ultimate goal of increasing the assets under management (AUM)—one of the key metrics used to measure their productivity. Similarly, in the insurance industry, financial advisors are driven to maximize their time with clients. Administrative tasks are not only irritating, they also reduce a salesperson’s paycheck by cutting into his or her time with customers.

But organizational forces in both the wealth management and insurance industries are conspiring against their top revenue generators. According to Seismic, a staggering 65% of a relationship manager’s time is spent on business processes like account opening, accessing collateral, and creating customized portfolio reviews with customers.

In the last few years, financial institutions and insurance companies have sought to free up their salespeople by investing in productivity tools. Mobile apps, in particular, hold the promise of speeding up processes like filling out client forms, creating proposals, and building portfolios. They are also, in theory, a great way to deliver real-time market insights.

But mobile apps are only effective when relationship managers and advisors use them.

AppDynamics Business iQ allows organizations to measure the effectiveness of their mobile apps by providing a window into user behavior. In the example below, I show how a financial institution can instantly see how many relationship managers have clicked on a market insight to access AI-driven financial advice—a killer feature for increasing AUM. The dashboard, which I created in the AppDynamics demo environment, also shows how many relationship managers proceeded to “Add to Cart” and re-balanced their clients’ portfolios. We see that as relationship managers moved through the funnel, they increasingly abandoned the app. The overall conversion rate is just 5.62%. Slightly over one in twenty relationship managers used the application to send a proposal to their clients.
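The overall conversion rate of a funnel is the share of users entering the first step who complete the last one, which equals the product of the step-to-step rates. A minimal Python sketch of that arithmetic; the step names and counts are hypothetical and are not the data behind the dashboard above.

```python
def funnel(steps: list[tuple[str, int]]) -> tuple[list[tuple[str, float]], float]:
    """Step-to-step and overall conversion (in %) for ordered (step, users) pairs."""
    step_rates = [
        (f"{prev} -> {name}", 100 * n / prev_n)
        for (prev, prev_n), (name, n) in zip(steps, steps[1:])
    ]
    overall = 100 * steps[-1][1] / steps[0][1]
    return step_rates, overall


# Hypothetical counts of relationship managers reaching each step.
steps = [("Market Insight", 400), ("Add to Cart", 120), ("Rebalance Portfolio", 60), ("Send Proposal", 20)]
step_rates, overall = funnel(steps)
for label, rate in step_rates:
    print(f"{label}: {rate:.1f}%")
print(f"overall: {overall:.2f}%")
```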

Below, I show how to create a conversion funnel using a built-in widget. It is as simple as going to the Add Widget tab and selecting Analytics and Funnel Analysis.

You then select the business transactions that you’d like to include in the conversion funnel.

You can also quickly design a custom widget to highlight information such as the relationship managers who are generating the most new business.

Figure: RMs with the highest new AUM

Or see at a glance the relationship managers who are sending the highest number of proposals. You can also break down the proposals by customer type (Silver, Gold, Platinum, Diamond) to see which customers the relationship managers are creating proposals for. In the example below, relationship manager “aleftik” is sending all 960 proposals to “Silver” tier customers only. Since the previous graph shows that “aleftik” also generated the highest new AUM, this focus on “Silver” tier customers appears to be a desirable strategy that the business should share with other relationship managers.

Figure: RMs with the highest “Send Proposal” Transactions

Moving beyond the performance metrics of individual relationship managers and financial advisors, you can combine technical and performance metrics in order to see if updates to an application are negatively affecting sales performance.

You can see from the above conversion graph that version 2 of the code significantly reduced the slowness (yellow and orange within the bar) on the Portfolios Summary page, raising the conversion rate from 5.53% to 14.52%.

The business may also want to identify relationship managers who are not using the new productivity tool enough. Below is a way to create such a list of managers with the least number of page hits.

Figure: Number of page visits on “Market Insights” by RMs, in ascending order

You can even put all of this together to have a customizable dashboard combining both technical and business performance metrics. At a glance you’re able to see the new AUM achieved by the wealth management group using the iPad application, transaction health of each key business process, top performing relationship managers and the products sold, as well as relationship managers who have yet to adopt the new application as a productivity tool.

With AppDynamics Business iQ, institutions do not need to wait for a month or a week to see business insights in relation to application performance and user behaviour. All the information is available at a glance in real time.

Monitor End-User Experience Levels and Service Availability with eXperience Level Management (XLM)

Digitization has transformed the way customers buy and use products. There has been a tectonic shift in customer expectations regarding product availability (measured by service level management) and product performance (measured by experience level management).

According to a 2017 State of Online Retail Performance report by SOASTA (now part of the content delivery network provider, Akamai), 53% of mobile site visitors leave a page that takes longer than three seconds to load (based on data equating to approximately 10 billion user visits). Another study ties customer satisfaction to website performance by highlighting the fact that only 38% of users stated website availability as an issue, whereas around 73% of users complained about slow website experience.

The difference between the percentage of customers affected by service levels and those affected by experience levels is enormous, and it raises a critical question: can enterprises rely on service availability alone to deliver the best customer experience? The answer is no.

Customers now have a high bar for technical performance and certain service levels, making Service Level Monitoring insufficient for enterprises looking to offer best-in-class end-user experience. Companies now need to consider end-user experience levels (to measure the efficiency and effectiveness of the service) as the key metric, and service levels (for availability and resolution) as a contributing factor to the end-user experience.

Challenges with eXperience Level Management

In the past, instrumenting end-user eXperience Level Management (XLM) has not been straightforward for any business.

One huge challenge in implementing XLM is ascertaining the data sources for compliance calculations. Businesses spend hours and days gathering all the data in spreadsheets and other tools in an attempt to feed the right data into their policy management applications. But the myriad data collection mechanisms, with different data formats and user workflow definitions, result in inaccurate XLM policy implementations, often based on erroneous data.

And all this trouble is for a single SLA policy addressing a single product and user type. An enterprise with multiple products can’t even consider identifying the right experience level for its different products and customer segments because of this extremely complex and tedious process. Without being able to segment experiences, an important customer’s experience might get rolled up with everyone else’s, and that customer’s problems might have a disproportionate impact on your business. For example, a delay in delivering a product to an Amazon Prime member who has indicated shipping speed as a priority could result in a loss of future business.

Another challenge in deriving a proper XLM solution in an enterprise is establishing a “single source of truth” among all parties involved in an application. End-users may have one set of expectations for the services they engage with, while third-party service providers might have another, and some of these expectations may also be expressed contractually. For an enterprise, straightforward communication throughout the business is key to establishing trust between all stakeholders and ensuring all agreements are being met.

AppDynamics launches XLM in Nov 2017

We at AppDynamics are excited to address these challenges for our customers with the introduction of eXperience Level Management (XLM) as part of our Business iQ product. XLM provides the ability to measure metrics that matter to businesses and their end-users, along with the ability to measure service availability.

Automated data collection and reporting, a single source of data, experience levels for different product types and customer segments, and an immutable audit trail to build trust among all parties: with these, AppDynamics’ XLM solves the main challenges that enterprises face in implementing their business-critical experience and service-level policies.

Data Selection with Exclusion Periods

AppDynamics Business iQ collects every bit of information flowing through an end-to-end application workflow, and can ingest data from multiple data sources. These could be events generated by AppDynamics agents, such as business transaction events, log events, or end-user events. They could also be events sent to Business iQ using REST APIs, or other custom events such as Business Journeys (released in 4.4), which define complex business workflows. An XLM report can be created on any of these event types for end-user experience management and service availability calculations.

What’s more, for planned upgrades or maintenance schedules that can degrade end-user experience, XLM lets you explicitly define exclusion periods that are disregarded in compliance calculations, so that data collected during those intervals does not skew the results.

Compliance Target and Daily Thresholds

Once the data set is defined, users can set compliance targets on any business or application metric that is key to their business or end-users. These metrics can be anything from login time for gold member airline customers, to the checkout time for platinum customers on an e-commerce website.

XLM provides users the ability to define different threshold levels (Normal, Warning, and Critical) to monitor their reports. By providing multiple thresholds, XLM enables users to visualize slight degradations within their metrics and take prompt corrective actions.

XLM Configuration Settings.

Users can also specify a reporting period (weekly or monthly) that defines the aggregation interval for viewing compliance. XLM also offers drill-down functionality, allowing users to go from weekly or monthly data down to daily data and even to individual events.

XLM Dashboard – with aggregate view of compliance for the last five periods

Car Loan Login Response Time – Daily Compliance Data for a weekly aggregate with drill-down to event level information.

Audit Trail

Lack of trust is one of the challenges for all parties involved in monitoring, implementing, and enforcing compliance. With fully automated data collection and reporting for XLM, and an immutable audit trail of any changes made to the configuration, AppDynamics can be that “single source of truth” for our customers and their partners.

Audit trail for “Car Loan Login Response Time” XLM report.

While consistent service availability is vital, businesses also need to provide the best-quality experience tailored to each product and customer segment. eXperience Level Management (XLM) is the first step toward helping our customers achieve this. We look forward to your comments and feedback.

Learn more about Business iQ or schedule a demo to learn more about AppDynamics.

Accelerate Your Digital Business with AppDynamics Winter ’17 Product Release

Last month at AppD Summit New York, we unveiled the latest innovations in our Business iQ and App iQ platforms, paving the way for a new era of the CIO and digital business. Delivering on this vision, we’re excited to announce the general availability of AppDynamics’ Winter ’17 Release for our customers.

As application and business success become indistinguishable, enterprises are increasing their investment in digital initiatives. According to Gartner, 71% of enterprises are actively implementing digital strategies, and IDC predicts that companies will spend $1.2 trillion on their digital transformation in 2017 alone.

But without effective tools to correlate application and business performance, and without end-to-end visibility across customer touchpoints, application code, infrastructure, and network, customer experiences and employee productivity suffer, and executives can’t analyze or justify technology investments. In fact, according to McKinsey, the digital promise still seems more hope than reality: only 12% of technology and C-level executives are confident that their IT organizations have been effective in this shift.

Winter ’17 Release is Here

Business iQ just got better. Bridging the gap between the app and the business, BiQ capabilities have expanded to include:

Business Journeys

With AppDynamics Business Journeys, application teams can link multiple, distributed business events into a single process that reflects the way customers interact with the business. Business events can include transaction, log, mobile, browser, synthetics, or custom events, and a journey can be long-running, spanning hours to days.

Application teams can create performance thresholds and quickly visualize where performance issues are impacting the customer experience. KPIs for each Business Journey inform technology investments and help prioritize code development and releases.

In the two figures below, you can see how easy it is to set up a new Business Journey for loan approvals and visualize the impact of delays through the lens of the business.

Fig 1: Author an end-to-end Business Journey by joining multiple distributed events.

Fig 2: Quickly and easily create custom dashboards visualizing business performance.

Experience Level Management (XLM)

With XLM, enterprises can establish custom service-level thresholds by customer segment, location, or device. For example, the CIO of a major retailer may deliver tailored experiences to its top customers by setting performance thresholds across its customer channels, including website, mobile apps, in-store wireless, and in-store checkout. XLM also provides an immutable audit trail for service-level agreements with your customers or internal business units. The product image below shows the service levels set up for a connected streaming device, giving an instant view of how services are performing against their SLAs.

Fig 3: Service levels setup for a connected streaming device.

Network Visibility

Application developers, IT Ops, and network teams often work in silos using a myriad of different monitoring tools. To troubleshoot application performance issues, war rooms are created, and the lack of a common language and shared visibility across tools results in finger pointing, endless debates, and slower Mean Time to Resolution (MTTR).

With the introduction of AppDynamics Network Visibility, a capability AppDynamics is now uniquely positioned to deliver as part of Cisco, enterprises can understand the impact the network is having on application and business performance. Network performance measurements are automatically correlated with application performance in the context of the Business Transaction. IT teams can triage network issues from a single pane of glass and give network teams the right information before the end-user experience is affected. Finally, end-to-end visibility from customer, to code, to network is here.

AppDynamics automatically discovers network devices such as reverse proxy load balancers deployed on-premises and in cloud environments and eliminates the need to use expensive network tools such as SPAN/TAP to capture and analyze network traffic.

The animation below shows out-of-box visibility into network flow maps, network metrics such as latency, throughput, retransmission rates, and critical errors, enabling IT Ops to quickly identify and isolate root cause without the need to engage network teams.

Fig 4: Correlated and out-of-box view of network performance in context of application performance.

AppDynamics IoT

IoT devices create another channel to engage with customers and, if properly measured and optimized, can deliver game-changing business benefits. With new IoT visibility, businesses can turn device data into rich insights on consumer behavior, buying patterns, and business impact. IoT visibility includes:

Device analytics: Together with Business iQ, IoT visibility provides unprecedented insight into how IoT devices drive business impact. And because these insights are delivered through a single platform, IoT visibility is the first and only solution that maps and correlates entire customer journeys, from the device to the customer touchpoint to business conversions.

Device application visibility and troubleshooting: AppDynamics’ new IoT visibility provides an aggregated view of device uptime, version status, and performance, with drill-down views into individual devices to simplify troubleshooting of IoT applications. The screenshot below shows a list view of all active devices; double-clicking a specific device takes you to its details.

Custom dashboards: Every company measures success differently. With custom dashboards in IoT visibility, companies in any vertical can quickly build new visualizations to measure the business impact of IoT devices, from the revenue impact of a slow checkout at a brick-and-mortar retailer to the customer impact of a software change in a connected car.

Fig 5: Consolidated list view of all active smart-shelf IoT devices and key KPIs.

Synthetic Private Agent

AppDynamics Winter ’17 Release brings Browser Synthetic Monitoring to your internal network. By running Synthetic Private Agent on-premises, you can monitor the availability and performance of internal websites and services that aren’t accessible from the public Internet. You can also test from specific locations within your company and set alerts for performance issues, so you can fix them before the end-user experience is impacted.

Cross-Controller Federation

As application teams adopt microservices architectures, scalability requirements have exploded, and APM has to scale with them. With Cross-Controller Federation, AppDynamics is taking unified monitoring to the next level. Our customers can achieve limitless scalability and the flexibility to deploy application components across multiple public and private clouds.

Only with AppDynamics do customers get complete, correlated visibility and quick drill-down to the line of code, regardless of where application components and controllers are deployed, because controllers can participate in a federation. Another important use case is keeping APM data isolated across multiple controllers, for compliance, architecture, or business reasons, while still maintaining correlated visibility.

KPI Analyzer

KPI Analyzer applies machine learning to automate root cause analysis. With KPI Analyzer, customers can automatically isolate the metrics most likely contributing to poor performance and identify each metric’s likely degree of impact on the KPI. KPI Analyzer makes root cause troubleshooting as simple as clicking a prompt to surface the underlying issue most likely responsible for degraded performance.

The following figure shows KPI Analyzer in action. KPIs such as average response time are displayed with metrics that are automatically identified as the root cause and scored in ranked order for quick resolution.

Fig 6: Key application KPIs and automatically-detected root causes in ranked order.

Learn More

AppDynamics’ Winter ’17 Release is rich with other important features, such as Universal Agent to simplify agent installation and configuration, Enterprise Console for streamlined controller lifecycle management, and Node.js flame graphs for deeper visibility, among several others.

Join us for a webinar on November 16th to get an in-depth look at the latest innovations and features in our Winter ’17 Release. You can also get started with the free trial of AppDynamics Winter ’17 today!

Four Use Cases for Leveraging Business iQ

The application and the business have converged.

In fact, the performance of your business is now inseparable from the performance of your apps. Customers who are connected to the code you write or the applications and infrastructure you manage demand a flawless experience, and they are loyal to the brand that delivers it. The challenge, however, is that the traditional ways of managing services and operations are falling short, jeopardizing business success.

Many BI tools today help analyze business data, but they are mostly for historical analysis and trends. Web analytics tools, on the other hand, help analyze customer conversion rates and how end users are using your website, but don’t tell you why it’s happening. And traditional APM tools tell you whether your applications are healthy or performing poorly, but offer little visibility into the impact on your business. As a result, many organizations are struggling to connect data to business outcomes.

Business iQ offers real-time context from customer to code – connecting application performance data to business outcomes to enable both the business and IT.

To see how you can leverage AppDynamics Business iQ, check out these four common use cases.

Business Health

Let’s take a scenario where there is a major event, like Black Friday or a launch day for an important product, or simply any day you need to win customers. If you use AppDynamics as your APM solution, your DevOps team may receive alerts about anomalies in your application’s performance. When this happens, business owners may want to understand the impact on key business drivers, as well as any revenue and customer experience implications. With Business iQ, you can monitor critical business KPIs to get real-time insight into the health of your business.

Figure 1

Using Figure 1 (above), let’s say you are an e-commerce site and manage a number of different brands. Overall, you care about your site’s conversion rate, number of orders processed, total sales, and the percentage of customers moving to your loyalty program.

From the dashboard, you can infer that in the last hour ‘Total Sales’ may be declining because not many of your loyal customers are shopping today. The dashboard also shows a clear correlation between the drop in sales revenue and the error percentage in APDY Electronics. With only a 1% error rate in APDY Books, however, the sales decline there may be caused by a business problem rather than an IT issue.

With this level of insight, DevOps teams can investigate the root cause of the performance issue, or declare that it’s not an IT problem. Business health monitoring converges business data with application and infrastructure data to give you visibility into business KPIs, so you can diagnose and fix problems in real time.

User Journey

Now that we’ve covered business health monitoring, let’s move on to our second use case: user journey monitoring, which measures how business components and customer experience come together to drive top-level KPIs. Figure 2 (below) illustrates this use case:

Figure 2

Let’s say you’re a bank that would like to understand how your users are moving along the loan processing journey – from viewing rates online, submitting an application on your website, running a credit check, and finally getting the loan approval.

Business teams might want to know how many customers are in the loan journey, where drop-offs occur, and how KPIs are impacted. Your operations team, however, may want to know whether customer drop-offs are a result of slow application performance. And from a developer standpoint, you may want to know how your code affects a larger process and causes real business impact.

In the above dashboard, you can see that a 7s response time during the “Submit Application” step is probably causing a higher drop-off and therefore impacting the loan amount processed ($10M). Furthermore, a 15% error rate at “Credit Check” is compounding this problem at the “Loan Approval” stage.

User Journey Monitoring allows you to visualize different parts of a process, driving a common language between business and IT, whether you are a bank looking to optimize your processes or a retailer trying to visualize how your customers shop online.

Customer Segment

Interested in understanding the end-to-end experience of your critical customer segments? With Business iQ, you can. Check out the use-case below for customer segment monitoring.

Figure 3

Let’s say you’re a travel company that connects back-end travel inventory (think flights, hotels, etc.) to multiple front-end buying channels (think travel booking websites like Expedia, Orbitz, and Priceline). With Business iQ, you can compare customer experience across these buying channels and segment customers based on error codes, slow transactions, and more. You can also analyze customer experience across various travel agents or hotel brands.

This deep-level end-user monitoring is critical for businesses, as it allows you to monitor customer issues so you can proactively avoid them in the future.

Your DevOps team might also be interested in customer segment monitoring to understand where to prioritize troubleshooting efforts. For example, the above dashboard makes it clear that Priceline customers are experiencing issues with reservation confirmation and availability search, and that the “5 Star Luxury” hotel class segment has the most reservation errors. This information gives DevOps teams visibility into how their critical customer segments are performing so they can prioritize their work.

This capability allows your business to get visibility into how application performance impacts customer interaction with your features.

Release Validation

You can also use AppDynamics in real time as you release new versions of your application, or migrate from legacy infrastructure to new infrastructure. Business iQ also allows you to compare KPIs between your previous and current versions.

Figure 4

Let’s consider this example using Figure 4 (above): APDY media has an entertainment site where customers sign up for subscriptions. The journey they go through starts with creating a profile, selecting their favorite content, setting alerts on what content they want to be notified about, and finally, subscription confirmation.

Looking at the above dashboard, it is evident that in Version 1.0 of APDY media, the subscription rate starts to decline significantly, and it appears to be linked to a performance issue when selecting “Favorite Content.”

You can use Business iQ to help identify the problem and understand where to fix the issue. You can also use Business iQ to compare releases and identify if the problems you wanted to fix in your latest release are indeed fixed. You can do this by seeing if performance in the “Favorite Content” step improved post-release and how it correlates to an increase in conversion.

You can also see from this dashboard that although you are sending less traffic to Version 2, it still triples your conversion rate and drives much higher subscriptions.

The above use cases are just some of the common scenarios we are seeing in our customer environments. As you try our product, you’ll learn that there’s a lot more you can achieve with Business iQ.

Harish Doddala is leading product growth and adoption initiatives of Business iQ at AppDynamics. He has over 10 years of Product Management and Software Engineering experience delivering results for Cisco, VMware, and Oracle.

Is Your Intelligence Failing You in One Critical Area?

How’s your Business Intelligence software working out for you?

If your experience has been like that of many BI users, your answer is probably a bit of a mixed bag. That’s because most BI users have experienced a combo of great insights and extreme frustration from their BI software.

Change is in the air. I want to discuss this today because, as with many things in IT, intelligence is up for disruption as well.

A Brief History of BI

In the early days, BI revolved around the following process:

  •      A business user would define a need, and submit a ticket to the IT department
  •      The IT department would gather the relevant data, often from data warehouses and cubes (Cognos, Business Objects, etc.), and deliver it to a business analyst
  •      The business analyst would then analyze the data using spreadsheets or some form of dashboards, and then create reports for the business user

Sounds like a cumbersome process, doesn’t it?

Even worse, it’s a very s-l-o-w process. A lag time of weeks or even months between the initial request and final delivery was common. Or perhaps I should say is common: many companies are still stuck in this “early days” process.

And data was often stale before it was even loaded into the data warehouses, since data was often collected from production databases on a weekly basis. So by the time business analysts finally got their hands on the reports, they weren’t exactly looking at up-to-date information.

Eventually, companies like Tableau and Qlik (maker of QlikView) arose to fill the growing demand for visualization and dashboarding. Business analysts finally had the ability to slice and dice data to suit their own needs. That was great progress. But business users were still working with stale data.

And early on, BI processes revolved strictly around structured data. Data warehouses, a 1990s technology, could not capture unstructured data.

All data captures were from relational-based data storage. Only data that conformed to a conventional relational schema was captured; all other data was untouched.

Unstructured Data Gets Unlocked

Large companies like Yahoo and Google were crawling trillions of web pages, amassing massive amounts of data, and indexing the unstructured information for rapid searchability. They built Hadoop-like technologies to capture and analyze large volumes of unstructured data. And so the open-source technology Hadoop became quite popular for storing vast quantities of unstructured information – though not nearly as efficiently as structured data storage.

To facilitate storing unstructured data, Hadoop has its own file system: HDFS (Hadoop Distributed File System). And Hadoop provided MapReduce, a programming model that allowed analytics to run on top of all that unstructured data.
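For readers new to the model, the classic MapReduce illustration is counting words. The toy Python sketch below runs the map, shuffle, and reduce phases in a single process purely to show the data flow; it is not Hadoop code, and the function names are illustrative. Real Hadoop jobs read their input from HDFS and spread each phase across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key (here, by summing counts)."""
    return {key: sum(values) for key, values in groups.items()}

documents = [
    "Hadoop stores unstructured data",
    "MapReduce runs analytics on unstructured data",
]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts["unstructured"], word_counts["data"], word_counts["hadoop"])
```

In a cluster, the shuffle step moves intermediate pairs over the network between mappers and reducers, which is one reason batch jobs over very large datasets take a long time to finish.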

The advent of Hadoop inspired companies to begin capturing ever-increasing quantities of unstructured data.

But the MapReduce technology has its flaws. It’s slow. It takes lots of time to run jobs that must sift through massive amounts of data. The time lag between asking a question and getting an answer can be substantial – and in many cases, entirely unacceptable.

Currently, many companies are using Hadoop to store huge quantities of unstructured data. And they’re combining the text data analytics with structured data analytics from data warehouses, and using applications such as Tableau to analyze the resulting amalgamation of data.

Traditional Business Intelligence solutions, based on both structured and unstructured data, have evolved to be of great value. They provide companies with a wealth of decision-making support that simply wasn’t available until recently.

But there’s a problem…

A Gap Between Capabilities and Needs

More and more, the business world is running on software. In many cases, software-based business models have even toppled long-entrenched business dynamos.

Netflix vs. Blockbuster is a classic example.

Netflix, enjoying the benefits and economies of operating a software-based business model, contributed greatly to putting Blockbuster and its huge empire of physical stores essentially out of business.

But as the business world becomes more software-oriented, companies increasingly need a way to gather insights into software operations. And traditional BI tools are failing companies in a very critical way. Software gives businesses operational flexibility: code changes can easily alter how a business operates. The DevOps culture is resulting in multiple application updates per day, and BI tools, with their huge latencies, are simply not getting the job done.

Let’s Go Shopping

To illustrate the problem with traditional BI tools, let’s imagine that you’re logged on to one of your favorite eCommerce websites to do some shopping, something you likely do very frequently.

There’s a particular product you want to buy. But during the process, you’ll probably do some browsing around. Read some customer reviews. Consider alternatives.

And then once you’ve fulfilled your mission, and added your must-have item to your shopping cart, you’re likely to do some more browsing. Just some fun shopping. Some wish-listing.

Then, finally, you go through the checkout process and leave the site. The classic BI tool has only captured the end result of your interaction with the site – your purchase. What has the merchant company learned about you, their customer? Probably not as much as they could have or should have.

Opportunity Lost…

If the company is using only traditional BI tools, they’ve learned far less about you than the opportunity offered. Sure, they’ve collected some data relating to your purchase.

But they could have learned so much more about you than what the mere transaction records offer.

They could have learned more about your interests. They could have learned how to serve you better. They could have learned ways to engage you far beyond your single purchase.

All invaluable data – and right there for the taking. But many companies don’t take it. Intentionally or not, many companies are turning up their noses at this unprecedented opportunity.

Opportunity Maximized…

Our Application Intelligence Platform offers companies a means of turning all of this disregarded or neglected data into golden opportunity. It provides:

  •      Real-time information about every interaction flowing through the software system
  •      Business context for every transaction type – login transactions; add-to-cart transactions; checkout transactions; etc.
  •      End-to-end visibility of all transaction streams, front-end to back-end
  •      All information presented in a single dashboard

Application Intelligence fills the gap that other BI tools ignore. As more companies adopt a software-defined model, customer experience becomes one of the most valuable commodities. Understanding your customers so you can give them a seamless experience is vital to long-term success. With Application Intelligence, you can understand your customers better and serve them better.

It helps to maximize the benefits of the hard-earned relationships you’ve established with your customers. And in the end, isn’t that what Business Intelligence is all about?

Start understanding your customer better and gaining insightful metrics. Try out AppDynamics for FREE today!