Achieving Rapid Time-to-Value with AppDynamics

At AppDynamics, we are firm believers in demonstrating the value of our platform as quickly as possible. Many of our customers are able to address critical performance issues within minutes of getting their application instrumented. Below is an example of how we use AppDynamics with a fictional customer, AD-Betting, to analyze and troubleshoot the company’s business environment soon after AppD is up and running.

Getting Started

After allowing AppDynamics to watch the AD-Betting environment for a few hours, we were able to see the company’s environment, how its components were interacting with one another, and user interactions. Here is AD-Betting’s architecture diagram as represented by a close-up view of our Application Flow Map:

First we examined the last hour of AD-Betting’s performance data:

Even without having detailed knowledge of the application, we knew for certain the average response time increase at 3:00 p.m. wasn’t related to a scalability issue, as the load hadn’t increased at the same time.

To investigate the performance issue further, we reduced the time range to cover only the problematic peak. The resulting flow map showed the HTTP call between the tier (aggregator) and the external service (ad-soccer-games.com) took four seconds on average:

With a single click on the aggregator tier, it was easy to identify the impacted business transaction (/quotes):

With a single drill-down click on View Business Transaction Dashboard, we could see how “/quotes” was impacted by this issue. The scorecard for this transaction (below) uncovered a direct correlation between the spike in application response time and an increase in “Slow” and “Very Slow” transactions:

To find the root cause, we switched to Transaction Snapshots view and filtered for slow and very slow snapshots, which AppDynamics had collected automatically. We then picked one snapshot as an example and looked at the Call Graph, which showed the call for method aggregateWithoutCaching in line 217 took five seconds. The connected HTTP exit call terminated with SocketTimeoutException: Read timed out. We also verified this issue with other snapshots at the same time and—voila!—AppDynamics had uncovered the root cause of the performance issue:

To better understand the issue’s impact, we used AppDynamics’ End User Monitoring to analyze AD-Betting’s most important page requests sorted by end-user response time. We immediately spotted the slow end-user experience for FIFA 2018 – All Games, which was an extremely critical component of the company’s business, particularly with the World Cup underway. This was one issue we needed to analyze further.

Looking at the waterfall diagram of all views for this particular page, we discovered that most of the transaction time was spent on the backend, not on the network or browser side:

We then picked a single browser snapshot for closer examination. Since AppDynamics correlates frontend (browser) and backend (Java) automatically, we were able to get an associated backend snapshot as well. In this case, we diagnosed the Call graph and again found the same performance-impacting issue. As you can see, the problem was impacting one of the company’s most important pages:

We hope this simple walk-through of AppDynamics’ powerful monitoring and diagnostic capabilities helps you analyze and troubleshoot your own business-critical environments. Take the AppDynamics Guided Tour to learn more!

Managing Software Reliability Metrics: How to Build SRE Dashboards That Drive Positive Business Outcomes

Customers expect your business application to perform consistently and reliably at all times—and for good reason. Many have built their own business systems based on the reliability of your application. This reliability target is your service level objective (SLO), the measurable characteristics of a service level agreement (SLA) between a service provider and its customer.

The SLO sets target values and expectations on how your service(s) will perform over time. It includes service level indicators (SLIs)—quantitative measures of key aspects of the level of service—which may include measurements of availability, frequency, response time, quality, throughput and so on.

If your application goes down for longer than the SLO dictates, fair warning: All hell may break loose, and you may experience frantic pages from customers trying to figure out what’s going on. Furthermore, a breach to your SLO error budget—the rate at which service level objectives can be missed—could have serious financial implications as defined in the SLA.

Why an Error Budget?

Developers are always eager to release new features and functionality, but these upgrades don’t always turn out as expected and can result in an SLO violation. Your SRE team should be able to deploy and upgrade systems as needed; anytime you change an application, though, you introduce the potential for instability.

An error budget states the numeric expectations of SLA availability. Without one, your customer may expect 100% reliability at all times. The benefit of an error budget is that it allows your product development and site reliability engineering (SRE) teams to strike a balance between innovation and reliability. If you frequently violate your SLO, the teams will need to decide whether it’s best to pull back on deployments and spend more time investigating the cause of the SLO breach.

For example, imagine that an SLO requires a service to successfully serve 99.999% of all queries per quarter. This means the service’s error budget has a failure rate of 0.001% for a given quarter. If a problem causes a 0.0002% failure rate, it will consume 20% of the service’s quarterly error budget.
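The arithmetic above can be captured in a few lines. This is a minimal sketch, not AppDynamics code; the function name is ours:

```python
# Sketch: computing error-budget consumption from the example figures above.

def error_budget_consumed(slo_target: float, failure_rate: float) -> float:
    """Fraction of the error budget used by a given failure rate.

    slo_target: required success rate, e.g. 0.99999 for 99.999%
    failure_rate: observed fraction of failed queries, e.g. 0.000002
    """
    budget = 1.0 - slo_target    # allowed failure rate
    return failure_rate / budget  # fraction of the budget consumed

# A 99.999% SLO leaves a 0.001% budget; a 0.0002% failure rate uses 20% of it.
consumed = error_budget_consumed(0.99999, 0.000002)
print(f"{consumed:.0%}")  # 20%
```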

Don’t Aim for Perfection

Developing a workable SLO isn’t easy. You need to set realistic goals, as aiming for perfection (e.g., 100% availability) can prove very expensive and nearly impossible to achieve. Your SRE team, which is responsible for the daily operation of an application in production, must work with interested parties (e.g., product owners) to find the correct transactions to monitor for your SLO.

To begin, you must define your SLIs to determine healthy levels of service, and then use metrics that expose a negative user experience. Your engineering and application teams must decide which metric(s) to monitor, since they know the application best. A typical approach is to find a key metric that represents your SLO. For instance, Netflix uses its starts-per-second metric as an indicator of overall system health, because its baselining has led the company to expect X number of starts within any given timeframe.

Once you’ve found the right metrics, make them visible on a dashboard. Of course, not all metrics are useful. Some won’t need alerts or dashboard visibility, and you’ll want to avoid cluttering your dashboard with too many widgets. Treat this as an iterative process. Start with just a few metrics as you gain a better understanding of your system’s performance. You also can implement alerting—email, Slack, ticketing and so on—to encourage a quick response to outages and other problems.

People often ask, “What happens when SLOs aren’t met?”

Because an SLA establishes that service availability will meet certain thresholds over time, there may be serious consequences for your business—including the risk of harming your reputation and, of course, financial loss resulting from an SLO breach and a depletion of your error budget. Since the penalty for an SLA violation can be severe, your SRE team should be empowered to fix problems within the application stack. Depending on the team’s composition, it’s possible they’ll either release a fix to the feature code, make changes to the underlying platform architecture or, in a severe case, ask the feature team to halt all new development until your service returns to an acceptable level of stability as defined by the error budget.

How AppDynamics Helps You

AppDynamics enables you to track numerous metrics for your SLI.

But you may be wondering, “Which metrics should I use?”

AppD users are often excited—maybe even a bit overwhelmed—by all the data collected, and they assume everything is important. But your team shouldn’t constantly monitor every metric on a dashboard. While our core APM product provides many valuable metrics, AppDynamics includes many additional tools that deliver deep insights as well, including End User Monitoring (EUM), Business iQ and Browser Synthetic Monitoring.

Let’s break down which AppDynamics components your SRE team should use to achieve faster MTTR:

  • APM: Say your application relies heavily on APIs and automation. Start with a few APIs you want to monitor and ask, “Which of these APIs, if it fails, will impact my application or affect revenue?” These calls usually have a very demanding SLO.

  • End User Monitoring: EUM is the best way to truly understand the customer experience because it automatically captures key metrics, including end-user response time, network requests, crashes, errors, page load details and so on.

  • Business iQ: Monitoring your application is not just about reviewing performance data. Business iQ helps expose application performance from a business perspective: whether your app is generating revenue as forecasted or experiencing a high abandonment rate due to degraded performance.

  • Browser Synthetic Monitoring: While EUM shows the full user experience, sometimes it’s hard to know if an issue is caused by the application or the user. Generating synthetic traffic will allow you to differentiate between the two.

So how does AppDynamics help monitor your error budget?

After determining the SLI, SLO and error budget for your application, you can display your error budget on a dashboard. First, convert your SLA to minutes. For example, a 99.9% SLO allows a 0.1% error budget, or only 8.77 hours (526 minutes) of downtime per year. You can create a custom metric to count the duration of SLO violations and display it in a graph. Of course, you’ll need to take maintenance and planned downtime into consideration as well.
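Converting an availability target into a downtime allowance is simple arithmetic; this sketch (illustrative, not AppDynamics code) shows the conversion for two common targets:

```python
# Sketch: converting an availability target into an annual downtime allowance.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR

print(f"99.9%  -> {allowed_downtime_minutes(0.999):.0f} min/year")   # ~526 (8.77 hours)
print(f"99.99% -> {allowed_downtime_minutes(0.9999):.1f} min/year")  # ~52.6
```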

With AppDynamics you can use key metrics such as response time, HTTP error count, and timeout errors. Try to avoid using system metrics like CPU and memory because they tell you very little about the user experience. In addition, you can configure Slow Transaction Percentile to show which transactions are healthy.
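Percentile-based response-time figures, like the 99th-percentile widgets in the KPI list that follows, can be computed from raw samples with a simple nearest-rank method. This is a minimal sketch of the idea, not AppDynamics code:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of response times (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# One outlier among ten samples dominates the p99 reading:
times_ms = [120, 95, 110, 2300, 105, 98, 130, 115, 102, 99]
print(percentile(times_ms, 99))  # 2300
```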

Availability is another great metric to measure, but keep in mind that even if your application availability is 100%, that doesn’t mean it’s healthy. It’s best to start building your dashboard in the pre-prod environment, as you’ll need time to tweak thresholds and determine which metric to use with each business transaction. The sooner AppDynamics is introduced to your application SDLC, the more time your developers and engineers will have to get acclimated to it.

What does the ideal SRE dashboard look like? Make sure it has these KPIs:

  • SLO violation duration graph, response time (99th percentile) and load for your critical API calls

  • Error rate

  • Database response time

  • End-user response time (99th percentile)

  • Requests per minute

  • Availability

  • Session duration

Providing Value to Customers with Software Reliability Metric Monitoring

SLI, SLO, SLA and error budget aren’t just fancy terms. They’re critical to determining if your system is reliable, available or even useful to your users. You should be able to measure these metrics and tie them to your business objectives, as the ultimate goal of your application is to provide value to your customers.

Learn how AppDynamics can help measure your business success.

The AppD Approach: Performance Monitoring Single-Page Applications

The Web has evolved dramatically since its inception, embracing new and innovative technologies that have enhanced both business efficiency and the user experience. Single-Page Applications (SPA)—web apps that load a single HTML page and dynamically update the page as the user interacts with it—are among the most revolutionary of these innovations, having enabled dynamic and responsive web apps that resemble their native mobile and desktop counterparts. Dozens of frameworks are available for building SPAs, and the market is currently ruled by popular options such as React, AngularJS, and Backbone.js.

While SPAs generally make the end-user experience better, they have made it harder for developers to truly understand how their applications are performing. With 8 in 10 users abandoning an application that doesn’t meet their performance expectations, it’s crucial that developers understand the performance of every important page, regardless of how it was rendered.

As a developer you should care about a few questions for your SPA:

  1. How long did a page take to be visible and usable by your end user?

  2. How did any backend requests related to the page perform?

  3. How quickly were we able to download any resources such as CSS and images for a particular page?

  4. If any errors are occurring, what page are they associated with?

Here at AppDynamics, we carefully studied the inner workings of Single-Page Applications, and have developed an out-of-the-box monitoring solution for SPA frameworks. In fact, we’re now able to monitor websites built on all of the popular frameworks, such as those mentioned above.

Introducing End User Response Time

At AppDynamics we think of two types of “pages”:

  • When a user navigates to an SPA, the initial page download is considered the “base page.” The base page includes the HTML skeleton, the core CSS, and the JavaScript framework for fetching and constructing new content. Requisite resources such as images and fonts may also be loaded by the base page.

  • SPA frameworks trick the browser into not reloading the page on every navigation; rather, they change the URL and fetch new content via XMLHttpRequests (XHRs). A “virtual page” is one that isn’t loaded from the server, but is updated via XHRs.

An SPA is therefore composed of one base page (the initial load) and a virtual page for each subsequent navigation.

End User Response Time (EURT) is defined as the amount of time a page (either base or virtual) takes to load completely. EURT is calculated by determining the start and end times for each page:

EURT = end time – start time

What Does Page ‘Start Time’ Mean?

A virtual page transition begins with a trigger. For example, a user clicking a button will cause a virtual page transition. This click marks the start time for the virtual page load.

The trigger—in this case, a user click—fetches the data necessary for the virtual page. Similarly, this action can execute logic required for the virtual page transition. Hence, the accurate start time of a virtual page is the time when it was triggered. The trigger may vary, of course—a click, a timer, and so on.

When Does a Virtual Page End?

A virtual page ends when the entire page is rendered completely and there is no more network activity. There is no explicit browser trigger to indicate this, so AppDynamics waits for a period of inactivity (both user actions and browser) to declare a page load has ended. See the documentation for more details.
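The inactivity heuristic can be sketched in a few lines. This is a simplification for illustration; the agent’s actual quiet-period logic and thresholds differ, and the one-second threshold here is an assumed value:

```python
QUIET_PERIOD = 1.0  # seconds of inactivity that ends the page (assumed value)

def page_end_time(trigger: float, events: list[float]) -> float:
    """Given the trigger time and timestamps of subsequent activity
    (XHRs, renders, user actions), return the time the page load ended:
    the last event before the first quiet gap."""
    end = trigger
    for t in sorted(events):
        if t - end > QUIET_PERIOD:
            break  # quiet period found; the page ended at `end`
        end = t
    return end

# Activity at 0.2s, 0.5s and 0.9s after the trigger, then nothing until 3.0s:
end = page_end_time(0.0, [0.2, 0.5, 0.9, 3.0])
print(end)  # 0.9 -> EURT = end - trigger = 0.9s
```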

AppDynamics captures both the start and end times of a virtual page, and uses that data to compute the End User Response Time. The EURT metric tells us the entire time spent fetching data for the virtual page, including the rendering time and network time.

XHR Correlation

A page is nothing without the data it’s meant to display, so you will want to know how quickly that data made it from the server back to the browser. XMLHttpRequests are the main backbone for fetching data without reloading the page. Therefore AppDynamics correlates the performance metrics for XHRs with the virtual page responsible for executing them, so that you can see all contributors to the perceived performance for any given page.

We also correlate XHRs launched between the trigger of a virtual page and the URL change. Why? Again, because the XHRs are responsible for bringing data to the page.

In the example above, XHR1 launches before the URL change but is responsible for fetching data for the virtual-page transition, while XHR2 launches on the virtual page itself and does not fetch data for a new virtual page. Hence, when the user clicks, both XHR1 and XHR2 are correlated with the virtual page.
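The correlation rule amounts to a time-window check. This sketch is a simplification of what the agent actually does, and the field names are hypothetical:

```python
def correlate_xhrs(page_trigger: float, page_end: float,
                   xhrs: list[tuple[str, float]]) -> list[str]:
    """Return names of XHRs whose launch time falls inside the
    virtual page's lifetime [trigger, end]."""
    return [name for name, launched in xhrs
            if page_trigger <= launched <= page_end]

# XHR1 fires right after the click (before the URL change);
# XHR2 fires later, on the virtual page itself; the third is out of range.
xhrs = [("XHR1", 0.05), ("XHR2", 0.60), ("unrelated", 2.5)]
print(correlate_xhrs(0.0, 1.0, xhrs))  # ['XHR1', 'XHR2']
```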

Resource Correlation

Resources such as images, CSS and scripts, which are loaded on a virtual page, are correlated with that page. And resources loaded between a virtual page’s start and end times are correlated with the virtual page as well. Again, any element that can contribute to a user’s perception of how a page performed needs to be associated with the page, otherwise it’s difficult to understand each element’s impact on user experience.

Error Correlation

As a developer you have almost certainly opened your browser’s JavaScript console and seen errors. It’s important to know when those errors occurred and what page those errors are associated with, so that you can quickly begin investigating and working towards a clean console. AppDynamics ensures that errors are properly associated with the base page or virtual page during which they were thrown.

By accurately capturing the metrics that matter most, AppDynamics makes it easy for you to see the performance of your Single-Page Applications, identify the exact causes of any app slowdowns, and pinpoint and debug errors.

To learn more about Single-Page Application support in AppDynamics please review our updated documentation.

Keeping App-Centric Consumers Happy

Poor app performance is a serious business problem.

Customer loyalty depends on good app performance, so organizations have to deliver the best experience for every user, every time. In fact, The App Attention Index research reveals that 80 percent of consumers have deleted apps because they don’t perform correctly, and 53 percent have abandoned a website after a single disappointing experience. Businesses simply can’t afford not to deliver a good app experience.

Check out our new interactive infographic below for additional stats and highlights around the business impact of poor app performance.

The Importance of Business and Performance KPIs for IoT Applications

Over the last few years, the Internet of Things (IoT) has become a trending phrase for consumers and a top priority for businesses embarking on their digital transformation. Even with the growth and interest in IoT, however, the term can still confuse people.

So, what is IoT? IoT is a network of things connected to the internet, each uniquely identifiable through its embedded computing system. These “things” may include a variety of devices such as home appliances, commercial vending machines, fitness trackers, industrial gateways, connected cars, and smart factories.

And worldwide spending on IoT devices is on the rise: IDC’s Worldwide Semiannual Internet of Things Spending Guide predicts that global spending on IoT will leap from over $800 billion in 2017 to $1.4 trillion by 2021. This increase is attributed to continued investments made by organizations in the hardware, software, services, and connectivity that enable IoT. The goal of these IoT investments? To drive operational efficiency and increase revenue through improved consumer experience.

For example, the transportation industry is using sensors to improve fuel usage in planes and trucks, while the industrial sector is using IoT to reduce gas leaks. Environmental sensors for humidity, CO2, and electricity usage also help reduce energy costs in buildings.

On the other hand, IoT in sectors like retail, automotive, and media is more focused on providing consumers with a rich experience by enabling new device interactions and avenues to consume data. For example, the retail industry is using devices such as smart shelves, point of sale, and digital signage to significantly improve the consumer experience in brick-and-mortar stores and drive more sales. There are also voice-controlled devices like the Amazon Echo and Google Home, which offer a premium experience by allowing consumers to play music, stream podcasts, get weather updates, control their smart homes, and more.

And it’s these types of consumer experiences that drive sales and customer loyalty. In fact, IDC reports that consumer IoT spending will be the fourth largest market segment in 2017 at $62 billion, and will jump to the third largest segment come 2021.

Monitoring IoT Performance

As the number of IoT devices in the consumer and business space increases, so will the complexity of the infrastructures needed to support the new services and touch points. With this increasing software complexity comes a correlated demand from users for highly responsive, real-time digital services.

As a result, a toolset to measure and deliver an exceptional end-user experience is imperative for making an IoT application successful. And that’s precisely where AppDynamics can help. AppDynamics IoT monitoring provides visibility into your connected device applications for real-time performance diagnostics and usage analytics so you can quickly understand and resolve performance issues.

 

Fig 1: AppDynamics End-to-End Performance Monitoring

In Figure 1 above, you can see how AppDynamics’ end-to-end unified monitoring solution provides visibility into a complex software infrastructure. AppDynamics follows the transaction at each hop, starting from a connected device to a data center, network equipment, and all the way to the database.

AppDynamics End-User Monitoring provides great visibility into browser and mobile applications and now – with our Winter Release – we are extending it to monitor all connected devices.

IoT Monitoring Requirements

Before we built our IoT Monitoring Platform to help operations teams manage IoT applications efficiently, it was important for us to understand monitoring requirements from both the technical and business end. We built our platform with the below technical and business requirements in mind.

Technical Requirements

– Ability to monitor IoT applications that run on devices with different processor architectures (e.g., ARM7, x86, Cortex-M series) and a multitude of operating systems (e.g., embedded Linux, QNX, mbed OS, VxWorks).

– Ability to monitor IoT applications written in multiple languages (e.g., C, C++, Java, Python, JavaScript, Node.js).

– Monitoring overhead for IoT applications should be minimal and stay within device constraints such as memory, computing resources, and network connectivity.

– Ability to ingest data generated by IoT applications that can vary significantly based on application type. For example, an industrial gateway device might generate gigabytes of sensor data whereas a point-of-sale device may trigger thousands of user transactions per day.

Business Requirements

– Ability to manage the complexity of software and services offered on the new IoT device types and applications. IT needs to detect issues proactively and keep MTTR low.

– Ability to provide the same user experience, independent of device type.

– IoT devices generate tremendous amounts of data and it’s important to be able to get insights into the business performance quickly.

– Ability to correlate business performance with IoT application performance. For example, when a business is losing money, it should be easy to quickly identify the root cause of a performance issue.

– Ability to react to real-time alerts on application or business performance issues.

Stay tuned for the next blog post in this series, where we’ll dive into the technical details of AppDynamics’ IoT product offering, how we solved design challenges, and how we’re helping businesses tackle IoT proliferation.

Learn more about IoT Monitoring or schedule a demo of our product today.

Proactively manage your user experience with EUM

As companies embark on digital transformations to better serve their customers, managing the performance of, and satisfaction with, each user interaction becomes ever more critical to the success of the business. Look at breakout companies like Uber, Airbnb, and Slack, and it’s evident that software is at the core of their success. Consumers have gone from visiting a bank branch to make a transfer, to fulfilling it instantaneously on a desktop or mobile device. If the web application is not responding, the effect can be the equivalent of walking into a long line at the branch and walking out, creating a negative impression of the brand. In a digital business, making each customer’s interaction with your digital storefront successful should be a core business objective.

So, how do we ensure that every digital interaction is successful and performs quickly? The answer lies in using multiple approaches to manage the end-user experience. In our tool arsenal we have real-user monitoring and synthetic monitoring tools which, combined with APM capabilities, help us quickly identify poorly performing transactions and triage the root cause to minimize end-user impact. Each tool covers a core area of the web application, and together they give visibility into the whole experience. For real-user monitoring, understanding the performance and sequence of every step in the customer path is critical to identifying areas of opportunity in the funnel and increasing conversions. Real-user monitoring provides a measurement breadth impossible to achieve with synthetic tools alone, and puts backend performance into a user context. On the synthetic side, a repeatable, reproducible set of measurements focused on visual timings is of great value for baselining the user experience between releases, proactively capturing errors before users are impacted, benchmarking against competitors, and holding third-party content providers accountable for the performance of their content. Synthetic measurements also allow script assertions that validate the expected content on a page is delivered in a timely way, and alert accurately when there are deviations from the baseline or errors occur.

In a recent survey sponsored by AppDynamics, over 50% of people who manage web and mobile web applications identified third-party content as a factor in delivering a good end-user experience. Modern web and mobile sites most often contain some kind of third-party resource, from analytics tracking to social media integration to authentication functionality. Understanding how each of these components affects the end-user experience is critical to maintaining a healthy, well-performing site. Using a real-user monitoring tool like the AppDynamics Browser EUM solution, you can visualize slow content that may be affecting a page load and identify the provider. The challenge then is seeing whether that provider is living up to its performance claims, and holding it accountable.

Third-party benchmarking is a capability that a synthetic monitoring solution provides best. With a synthetic transaction you can control many variables that are impossible to control in a real-user measurement: a synthetic measurement always uses the same browser and browser version, connectivity profile, and hardware configuration, and is free from spyware, viruses, and adware. Using this clean-room environment, you can see the consistent performance of a page, and manage and track each element downloaded, from multiple synthetic locations worldwide. When your monitoring system picks up an unusually high number of slow transactions, you can drill down and isolate the cause to either a core site slowdown or a third-party slowdown, and compare performance across synthetic locations to determine whether it’s a user- or geography-specific issue or something happening across the board.
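Alerting on deviations from a synthetic baseline can be sketched as a simple statistical check. This is a minimal illustration, not AppDynamics code, and the 3-sigma threshold is an assumed policy:

```python
import statistics

def deviates(baseline: list[float], measurement: float, n_sigma: float = 3.0) -> bool:
    """Flag a measurement that exceeds the baseline mean by n_sigma stdevs."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return measurement > mean + n_sigma * stdev

# Page-load times (ms) from repeated synthetic runs form the baseline:
baseline_ms = [820, 790, 805, 812, 798, 825, 810]
print(deviates(baseline_ms, 835))   # False: within normal variation
print(deviates(baseline_ms, 1400))  # True: alert-worthy slowdown
```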

In managing the user experience, having all pertinent data in real time on a consolidated system can be the difference between a 5-minute performance degradation and a 5-hour site outage while multiple sources of discordant information are compiled and rationalized. The intersection of real-user and synthetic monitoring data can bring context to performance events by correlating user session information, like engagement and conversion, with changes in the performance of third-party content or end-user error rates. A 360-degree view of the customer experience will help ensure a positive experience for your customers.

Interested in learning more? Make sure to watch our Winter ’16 Release webinar.

 

Why web analytics aren’t enough!

Every production web application should use web analytics. There are many great free tools for web analytics, the most popular of which is Google Analytics. Google Analytics helps you analyze visitor traffic and paint a complete picture of your audience and their needs. Web analytics solutions provide insight into how people discover your site, what content is most popular, and who your users are. Modern web analytics also provide insight into user behavior, social engagement, client-side page speed, and the effectiveness of ad campaigns. Any responsible business owner is data-driven and should leverage web analytics to learn more about their end users.

Web Analytics Landscape


While Google Analytics is the most popular and the de facto standard in the industry, there are quite a few quality web analytics solutions available in the marketplace:

The Forrester Wave Report provides a good guide to choosing an analytics solution.


There are also many solutions focused on specialized web analytics that I think are worth mentioning. They are either geared towards mobile applications or getting better analytics on your customers’ interactions:

Once you understand your user demographics, it’s great to be able to get additional information about how performance affects your users. Web analytics only tells you one side of the story, the client-side. If you are integrating web analytics, check out Segment.io which provides analytics.js for easy integration of multiple analytics providers.

It’s all good – until it isn’t

Using Google Analytics on its own is fine and dandy – until you’re having performance problems in production and need visibility into what’s going on. This is where application performance management solutions come in. APM tools like AppDynamics provide the added benefit of understanding both the server side and the client side. Not only can you understand application performance and user demographics in real time, but when problems arise you can use code-level visibility to find their root cause. Application performance management is the perfect complement to web analytics: you understand not only your user demographics, but also how performance affects your customers and business. It’s important to be able to see, from a business perspective, how well your application is performing in production:

 


Since AppDynamics is built on an extensible platform, it’s easy to track custom metrics directly from Google Analytics via the machine agent.

The end user experience dashboard in AppDynamics Pro gives you real time visibility where your users are suffering the most:


Capturing web analytics is a good start, but it’s not enough to get an end-to-end perspective on the performance of your web and mobile applications. The reality is that understanding user demographics and application experience are two completely separate problems that require two complementary solutions. O’Reilly has a stellar article on why real user monitoring is essential for production applications.

Get started with AppDynamics Pro today for in-depth application performance management.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

Scaling our End User Monitoring Cloud

Why End User Monitoring?

In a previous post, my colleague Tom Levey explained the value of Monitoring the Real End User Experience. In this post, we will dive into how we built a service to scale to billions of users.

The “new normal” for enterprise web applications includes multiple application tiers communicating via a service-oriented architecture that interacts with several databases and third-party web services. The modern application has multiple clients, from browser-based desktops to native applications on mobile. At AppDynamics, we believe that application performance monitoring should cover all aspects of your application, from the client-side to the server-side all the way back to the database. The goal of end user monitoring is to provide insight into client-side performance and capture errors from modern JavaScript-intensive applications. The challenge of building an end user monitoring service is that every single request needs to be instrumented: for every request your application processes, we process a beacon. With clients like FamilySearch, Fox News, BackCountry, ManPower, and Wowcher, we have to handle millions of concurrent requests.


AppDynamics End User Monitoring enables application owners to:

  • Monitor their global audience and track end user experience across the world to pinpoint which geo-locations may be impacted by poor application performance
  • Capture end-to-end performance metrics for all business transactions – including page rendering time in the browser, network time, and processing time in the application infrastructure
  • Identify bottlenecks anywhere in the end-to-end business transaction flow to help operations and development teams triage problems and troubleshoot quickly
  • Compare performance across all browser types – such as Internet Explorer, Firefox, Google Chrome, Safari, iOS and Android
  • Track JavaScript errors

“Fox News already depends upon AppDynamics for ease-of-use and rapid troubleshooting capability in our production environment,” said Ryan Jairam, Internet Operations Lead at Fox News. “What we’ve seen with AppDynamics’ End-User Monitoring release is an even greater ability to understand application performance, from what’s happening on the browser level to the network all the way down to the code in the application. Getting this level of insight and visibility for an application as complex and agile as ours has been a tremendous benefit, and we’re extremely happy with this powerful new addition to the AppDynamics Pro solution.”

EUM Cloud Service

The End User Monitoring cloud is our super-scalable platform for analyzing and processing end user requests. In this post we will discuss some of the design challenges of building a cloud service capable of supporting billions of requests, along with its underlying architecture. Once End User Experience monitoring is enabled in the controller, your application’s pages are automatically instrumented with a very small piece of JavaScript that allows AppDynamics to capture critical performance metrics.


The JavaScript agent leverages the Web Episodes JavaScript timing library and the W3C Navigation Timing specification to capture end user experience metrics. Once the metrics are collected, they are pushed to the End User Monitoring cloud via a beacon for processing.
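
Conceptually, the client side of that pipeline is small: read the Navigation Timing fields after the page has loaded and ship them off as a beacon. The sketch below is our own simplification, not the actual agent, and the /eum/beacon endpoint is hypothetical.

```javascript
// Pick out the Navigation Timing fields we care about and serialize them
// into a beacon payload.
function serializeTiming(t) {
  const fields = ['navigationStart', 'requestStart', 'responseStart',
                  'responseEnd', 'domContentLoadedEventEnd', 'loadEventEnd'];
  const out = {};
  for (let i = 0; i < fields.length; i++) out[fields[i]] = t[fields[i]];
  return JSON.stringify(out);
}

// Browser-only: send the beacon once the load event has finished, falling
// back to an image GET where navigator.sendBeacon is unavailable.
if (typeof window !== 'undefined') {
  window.addEventListener('load', function () {
    setTimeout(function () { // defer one tick so loadEventEnd is populated
      const payload = serializeTiming(window.performance.timing);
      if (navigator.sendBeacon) {
        navigator.sendBeacon('/eum/beacon', payload);
      } else {
        new Image().src = '/eum/beacon?d=' + encodeURIComponent(payload);
      }
    }, 0);
  });
}
```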

The EUM (End User Monitoring) Cloud Service is our on-demand, cloud-based, multi-tenant SaaS infrastructure that acts as an aggregator for all EUM metrics traffic. EUM metrics from end user browsers across different customers are reported to the EUM Cloud Service, where the raw browser information is verified, aggregated, and rolled up. All AppDynamics Controllers (SaaS or on-premise) connect to the EUM Cloud Service to download metrics every minute, for each application.

Design Challenges

On-Demand High Availability

End users access customer web applications from anywhere in the world, at any time of day, whenever an AppDynamics-instrumented web page is accessed. From the browser, EUM metrics are reported to the EUM Cloud Service, which therefore has to be a highly available, on-demand system accessible from different geo-locations and time zones.

Extremely Concurrent Usage

All end users of all AppDynamics customers using the EUM solution continuously report browser information to the same EUM Cloud Service, which processes all of it concurrently, generating metrics and collecting snapshot samples continuously.

High Scalability

Usage patterns differ across applications throughout the day, so the number of records to be processed by the EUM Cloud varies by application and by time of day. The EUM Cloud Service automatically scales up to handle any surge in incoming records and scales back down when load drops.

Multi-Tenancy Support

The EUM Cloud Service processes EUM metrics reported from different applications for different customers, so the service must provide multi-tenancy. The reported browser information is partitioned by customer and application, and the service provides a mechanism for each customer’s controller to download aggregated metrics and snapshots based on customer and application identification.

Cost

The EUM Cloud Service needs to scale dynamically based on demand. The problem with supporting massive scale on fixed infrastructure is that you have to pay for hardware up front and over-provision to handle huge spikes. One of the motivating factors in choosing Amazon Web Services was that costs scale linearly with demand.

Architecture

The EUM Cloud Service is hosted on Amazon Web Services infrastructure for horizontal scaling. The service has two functional components, a collector and an aggregator, and multiple instances of each work in parallel to collect and aggregate the EUM metrics received from the end user browser or device. Transient metric data is stored in Amazon S3 buckets, while all metadata related to applications and other configuration is stored in Amazon DynamoDB tables.

A single page load will send one or more beacons: one for the base page, one for every iframe onload, and one per Ajax request. JavaScript errors occurring after page load are also sent as error beacons.
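
Per-Ajax beacons imply that the agent wraps the browser’s XMLHttpRequest so each request reports its own timing. A minimal sketch of that wrapping is below; `report` stands in for the agent’s real beacon sender, and `now` is injectable purely so the logic can be tested.

```javascript
// Sketch: wrap an XMLHttpRequest-like prototype so that every Ajax request
// reports its own timing beacon when it completes.
function instrumentXhr(proto, report, now) {
  now = now || Date.now;
  const origSend = proto.send;
  proto.send = function () {
    const start = now();
    this.addEventListener('loadend', () => {
      report({ type: 'ajax', ms: now() - start });
    });
    return origSend.apply(this, arguments);
  };
}

// In a browser you would wire it up roughly like this:
//   instrumentXhr(XMLHttpRequest.prototype, function (p) {
//     navigator.sendBeacon('/eum/beacon', JSON.stringify(p));
//   });
```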

The collector nodes receive the metric data from the browser and process it for the controller:

  • Resolve the geo information (which country/region/city the request comes from) and add it to the metric using an in-process MaxMind geo-resolver
  • Parse the User-Agent header and add browser, device, and OS information to the metrics
  • Validate the incoming browser-reported metrics and discard invalid ones
  • Mark metrics/snapshots as SLOW or VERY SLOW based on a dynamic standard-deviation algorithm or a static threshold
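
That last step can be pictured as follows. The exact algorithm and thresholds AppDynamics uses are not spelled out here, so the 3- and 4-standard-deviation cutoffs and the 100-sample minimum below are purely illustrative:

```javascript
// Classify one response time against a baseline mean/standard deviation,
// falling back to a static threshold when the baseline has too few samples.
function classify(responseTimeMs, baseline, staticThresholdMs) {
  if (baseline.count < 100) {
    // Not enough history for a stable baseline: use the static threshold.
    return responseTimeMs > staticThresholdMs ? 'SLOW' : 'NORMAL';
  }
  if (responseTimeMs > baseline.mean + 4 * baseline.stdDev) return 'VERY_SLOW';
  if (responseTimeMs > baseline.mean + 3 * baseline.stdDev) return 'SLOW';
  return 'NORMAL';
}
```

A dynamic baseline like this adapts to each application’s normal behavior, which is why a 2-second page can be “normal” for one app and “very slow” for another.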

Load Testing

For maximum scalability, we leverage Amazon Web Services’ global presence for optimal performance in every region (Virginia, Oregon, Ireland, Tokyo, Singapore, Sao Paulo). In our most recent load test, we drove about 6.5 billion requests per day through the system without breaking a sweat, and the design lets us scale out further as load grows.

Check out your end user experience data in AppDynamics


Find out more about AppDynamics Pro and get started monitoring your application with a free 15-day trial.


Monitoring the Real End User Experience

In the mind of the end user, web application performance is fundamentally associated with a brand’s reliability, stability, and credibility. Slow or unstable performance is simply not an option when your users are only a click away from taking their business elsewhere. Understanding your users, the locations they are coming from, and the devices and browsers they are using is crucial to ensuring customer loyalty and growth.

Today’s modern web applications are architected to be highly interactive, often executing complex client side logic in order to provide a rich and engaging experience to the user. This added complexity means it is no longer good enough to simply measure the effects users have on the back-end. It is also necessary to measure and optimize the client-side performance to ensure the best possible experience for your users.

Determining the root cause of poor user experience is a costly and time-consuming activity that requires visibility into page composition, JavaScript error diagnostics, network latency metrics, and Ajax/iframe performance.

Let’s take a look at a few of the key features available in AppDynamics 3.7 which simplify troubleshooting these problems.

End User Experience dashboard:

The first view we will look at reports EUM data by geographic location showing which regions have the highest loads, the longest response times, the most errors, etc.

The dashboard is split into three main panels:

  • A main panel in the upper left that displays geographic distribution on a map or a grid
  • A panel on the right displaying summary information: total end user response time, page render time, network time and server time
  • Trend graphs in the lower part of the dashboard that dynamically display data based on the level of information displayed in the other two panels

The geographic region for which data is displayed throughout the dashboard is based on the region currently shown on the map or in the summary panel. For example, if you zoom in from the global view to France on the map, the summary panel and the graphs display data only for France.

This view is key to understanding the geographic impact of any network or CDN latency issues. You can also see which geographies are the busiest, driving the highest throughput for your application.

Browsers and Devices:

From the Browsers and Devices tabs you can see the distribution of devices, browsers, and browser versions, giving you an understanding of the most popular access mechanisms for your application’s users and how they split by geographic region. From here you can isolate whether a particular browser or device is delivering a reduced experience to the end user and plan the best areas for optimisation.


Troubleshooting End User Experience:

The user response breakdown shown below is the first place we look to troubleshoot why a user is experiencing slow response times. It provides a full breakdown of where the overall time is spent during the various stages of a page render, highlighting issues such as network latency, poor page design, or too much time parsing HTML or downloading and executing JavaScript.


Response time metric breakdown

First Byte Time is the interval between the time that a user initiates a request and the time that the browser receives the first response byte.

Server Connection Time is the interval between the time that a user initiates a request and the start of fetching the response document from the server. This includes time spent on redirects, domain lookups, TCP connects and SSL handshakes.

Response Available Time is the interval from the beginning of the processing of the request on the browser to the time that the browser receives the response. This includes time in the network from the user’s browser to the server.

Front End Time is the interval between the arrival of the first byte of text response and the completion of the response page rendering by the browser.

Document Ready Time is the time to make the complete HTML document (DOM) available.

Document Download Time is the time for the browser to download the complete HTML document content.

Document Processing Time is the time for the browser to build the Document Object Model (DOM).

Page Render Time is the time for the browser to complete the download of remaining resources, including images, and finish rendering the page.
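
The definitions above map naturally onto the W3C Navigation Timing fields. The mapping below is our reading of those definitions, not necessarily the exact formulas AppDynamics uses:

```javascript
// Compute a response time breakdown from a Navigation Timing object
// (performance.timing in the browser). Formulas are illustrative.
function responseBreakdown(t) {
  return {
    firstByteTime: t.responseStart - t.navigationStart,
    serverConnectionTime: t.requestStart - t.navigationStart, // redirects + DNS + TCP + SSL
    responseAvailableTime: t.responseEnd - t.navigationStart,
    documentDownloadTime: t.responseEnd - t.responseStart,
    documentProcessingTime: t.domContentLoadedEventStart - t.responseEnd, // DOM build
    documentReadyTime: t.domContentLoadedEventEnd - t.navigationStart,
    pageRenderTime: t.loadEventEnd - t.domContentLoadedEventEnd, // remaining resources
    frontEndTime: t.loadEventEnd - t.responseStart,
  };
}
```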

AppDynamics EUM reports on three different kinds of pages:

  • A base page represents what an end user sees in a single browser window. 
  • An iframe is a page embedded in another page.
  • An Ajax request is a request for data sent from a page asynchronously.

Notifications can be configured to trigger on any of these.

JavaScript error detection

JavaScript error detection provides alerting and identification of the root cause of JavaScript errors in minutes, highlighting the JavaScript file, line number, and exception message for every error seen by your real users.
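
Capturing that file/line/message detail is exactly what the browser’s global error event provides. Below is a hedged sketch of what an error beacon might contain; the /eum/errors endpoint and payload shape are hypothetical, not the actual agent protocol.

```javascript
// Build an error beacon payload from the fields the browser's global
// "error" event exposes: message, source file, line and column numbers.
function buildErrorBeacon(message, file, line, col) {
  return { type: 'js-error', message: message, file: file, line: line, col: col, ts: Date.now() };
}

// Browser-only: report every uncaught error as its own beacon.
if (typeof window !== 'undefined') {
  window.addEventListener('error', function (e) {
    const beacon = buildErrorBeacon(e.message, e.filename, e.lineno, e.colno);
    if (navigator.sendBeacon) {
      navigator.sendBeacon('/eum/errors', JSON.stringify(beacon));
    }
  });
}
```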


Server-side correlation

If the above isn’t enough and you want to look into the execution within the datacentre, you can drill in context from any of the detailed browser snapshots directly into the corresponding call stack trace in the application servers behind them. This provides end-to-end visibility of a user’s interaction with your web application, from the browser all the way through the datacentre and deep into the database.


Deployment and scalability:

Deployment is simple – all you have to do is add a few lines of JavaScript to the web pages you want to monitor. We’ll even auto-inject this JavaScript on certain platforms at runtime. With its elastic public cloud architecture, AppDynamics EUM is designed to support billions of devices and user sessions per day, making it a perfect fit for enterprise web applications.

See Everything:

With AppDynamics you’ll get visibility into the performance of pages, AJAX requests and iframes, and you can see how performance varies by geographic region, device and browser type. In addition, you’ll get a highly granular browser response time breakdown (using the Navigation Timing API) for each snapshot, allowing you to see exactly how much time is spent in the network and in rendering the page. And if that’s not enough, you’ll see all JavaScript errors occurring at the end user’s browser down to the line number.

If you don’t currently know exactly what experience your users are getting when they access your applications, or your users are complaining and you don’t know why, then check out AppDynamics End User Monitoring for free at appdynamics.com.

Manpower Group Sees Real Results from End User Monitoring

Some companies talk about monitoring their end user experience, and other companies take the bull by the horns and get it done. For those who have successfully implemented EUM (RUM, EUEM, or whatever your favorite acronym is), the technology is rewarding for company and end user alike. I recently had the opportunity to discuss AppDynamics EUM with one of our customers, and the information they shared was exciting and gratifying.

The Environment

ManpowerGroup monitors their intranet and internet applications with AppDynamics. These applications are used for internal operations as well as customer-facing websites in support of their global business, and are accessed from around the world, 24×7. We’re talking about business-critical, revenue-generating applications!

I asked Fred Graichen, Manager of Enterprise Application Support, why he thought ManpowerGroup needed EUM.

“One of the key components for EUM is to shed light on what is happening in the “last mile”. Our business involves supporting branch locations. Having an EUM tool allows us to compare performance across all of our branches. This also helps us determine whether any performance issues are localized. Having the insight into the difference in performance by location allows us to make more targeted investments in local hardware and network infrastructure.”

Meaningful Results

Turning on a monitoring tool doesn’t mean you’ll automagically get the results you want. You also need to make sure your tool is integrated with your people, processes, and technologies. That’s exactly what ManpowerGroup has done with AppDynamics EUM. They have alerts based upon EUM metrics that get routed to the proper people. They are then able to correlate the EUM information with data from other (Network) monitoring tools in their root cause analysis. Below is an EUM screen shot from ManpowerGroup’s environment.


By implementing AppDynamics EUM, ManpowerGroup has been able to:

  • Identify locations that are experiencing the worst performance.
  • Successfully illustrate the difference in performance globally as well. (This is key when studying the impact of latency on an application that is accessed from other countries but located in a central datacenter.)
  • Quickly identify when a certain location is seeing performance issues and correlate that with data from other monitoring solutions.

But what does all of this mean to the business? It means that ManpowerGroup has been able to find and resolve problems faster for their customers and employees. Faster application response time combined with happier customers and more productive employees all contribute to a healthier bottom line for ManpowerGroup.

ManpowerGroup is using AppDynamics EUM to bring a higher level of performance to its employees, customers, and shareholders. Sign up for a free trial today and begin your journey to a healthier bottom line.