Automation Framework in Analytics – Part 1

This blog series highlights how we use our own products to test our events service which currently ingests more than three trillion events per month.

With fast iterations and frequent deliverables, testing has always been a continuously evolving machine, and it is one reason AppDynamics is aligning toward microservices-based architectures. There are many prudent ways to approach the testing problem; we'd like to share some of the lessons and key requirements that shaped our elastic testing framework, powered by Docker and AWS.

Applying this framework helped us deliver stellar results:

  • The ability to bring up complex test environments on the fly, based on testing needs.
  • An 80% increase in test execution speed, which helps us find bugs earlier in the release cycle.
  • The flexibility to simulate environment instabilities that can occur in any production (or production-like) environment.
  • Support for our plans to move toward continuous integration (CI).
  • Predictable testing time.
  • A robust environment for running pre-check-in as well as nightly build tests.
  • The ability to run tests more frequently for small changes instead of a full cycle.

Below we share some of the challenges we faced while end-to-end testing the AppDynamics Events Service, the data store for on-premises Application Analytics, End User Monitoring (EUM), and Database Monitoring deployments. We'll describe our approach to solving these challenges, discuss best practices for integrating with a continuous development cycle, and share ways to reduce testing-infrastructure costs.

By sharing our experience, we hope to provide a case study that will help you and your team avoid similar challenges.

What is Application Analytics?

Application Analytics refers to the real-time analysis and visualization of automatically collected and correlated data. In our case, analytics reveals insights into IT operations, customer experience, and business outcomes. With this next-generation IT operations analytics platform, IT and business users are empowered to quickly answer more meaningful questions than ever before, all in real time. Analytics is backed by a very powerful events service that stores the ingested events so the data can be queried back. This service is highly scalable, handling more than three trillion events per month.

Deployment Background

Our Unified Analytics product can be deployed in two ways:

  • on-premises deployment
  • SaaS deployment

Events Service

The AppDynamics events service is architected to suit the chosen deployment. For on-premises deployments it offers a lightweight footprint with minimal components, which eases the work of operating the service. The SaaS deployment, by contrast, includes the components needed to handle the scalability and data volume typical of any SaaS-based service.

The SaaS events service has:

  1. API Layer: Entry point service
  2. Kafka queue
  3. Indexer Layer, which consumes data from the Kafka queue and writes it to the event store
  4. Event Store – Elasticsearch

The on-premises events service has:

  1. API Interface / REST Endpoint for the service
  2. Event Store – Elasticsearch

Figure: Architecture of the events platform

Operation/Environment Matrix

The operation bypasses a few layers in on-premises deployments. In SaaS, the ingestion layer prevents data loss by buffering events in a Kafka layer that helps coordinate the ingestion. In an on-premises environment, however, ingestion happens directly to Elasticsearch through the API interface.
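To make the distinction concrete, below is a minimal sketch of how a test harness might map the deployment type to the components it needs to bring up. The component names and helper function are illustrative assumptions, not the actual AppDynamics test code.

```python
# Illustrative only: map each deployment type to the components a test
# environment must start, mirroring the architecture described above.
SAAS_COMPONENTS = ["api-layer", "kafka", "indexer", "elasticsearch"]
ON_PREM_COMPONENTS = ["api-layer", "elasticsearch"]

def components_for(deployment_type):
    """Return the ordered list of components to start for a deployment type."""
    if deployment_type == "saas":
        return SAAS_COMPONENTS
    if deployment_type == "on-prem":
        return ON_PREM_COMPONENTS
    raise ValueError("unknown deployment type: %s" % deployment_type)
```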

Objectives for testing the Events Service:

  • CI tests can run consistently in build systems.
  • The tests are easily pluggable and can run based on the deployment type.
  • Tests can run in different environment types (locally or in the cloud), both to save time and to ensure they are environment agnostic.
  • The framework should be scalable and usable for functionality, performance, and scalability tests.

These objectives are mandatory to take us toward continuous deployment, where a production deployment is just one click away from committing the code.

Building the Test Framework

To build our testing framework, we analyzed the various solutions available. Below are the options we went through:

  1. Bring the whole SaaS environment into a local environment by running the individual processes (Elasticsearch, Kafka, and web servers) and test on a local box.
  2. Allocate dedicated VMs or bare-metal hosts for these tests, deploy the components there, and run.
  3. Deploy these components in AWS and use them for testing.
  4. Use Docker containers to create an isolated environment, deploy, and test (sketched below).

We reviewed each option, conducted a detailed analysis of its pros and cons, and used the outcome of this exercise to pick the right choice for the testing environment.
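As a rough illustration of what option 4 can look like, here is a minimal sketch that brings up an isolated, SaaS-like test environment using the Docker SDK for Python. The images, versions, and settings shown are assumptions made for illustration, not the framework's real configuration.

```python
# Sketch of option 4: an isolated test environment built from Docker containers.
# Assumes the `docker` Python SDK and publicly available images; the real
# framework's images, versions, and wiring are not shown here.
import docker

def start_saas_like_environment():
    client = docker.from_env()
    client.networks.create("events-test", driver="bridge")

    event_store = client.containers.run(
        "elasticsearch:7.17.0",                        # event store
        environment={"discovery.type": "single-node"},
        network="events-test",
        name="event-store",
        detach=True,
    )
    queue = client.containers.run(
        "bitnami/kafka:latest",                        # ingestion queue
        environment={"ALLOW_PLAINTEXT_LISTENER": "yes"},
        network="events-test",
        name="kafka",
        detach=True,
    )
    # The API layer and indexer would be started the same way from internally
    # built images; tests then run against the API layer's endpoint.
    return [event_store, queue]
```

Tearing such an environment down is just as cheap as bringing it up, which is what makes a fresh, isolated environment per test run practical.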

Stay Tuned

We will publish a follow-up blog to shed more light on:

  1. The pros and cons of each option we considered
  2. Which option we chose, and why
  3. The architecture of our framework
  4. The test flow
  5. How long our infrastructure setup takes and how long tests take to run on it

Swamy Sambamurthy works as a Principal Engineer at AppDynamics and has 11+ years of experience building scalable automation frameworks. Both before and at AppDynamics, Swamy has helped build automation frameworks for distributed systems and big-data environments that scale to huge numbers of ingestion and query requests.

The new battleground in Analytics Part 2: Transforming APM into Application Intelligence

In Part 1 of this series, I talked about my transition from the world of Analytics into the APM space and my assertion that APM is simply another form of Analytics. It’s turning large amounts of data into information you can use to take quick action. Like other forms of Analytics it can help you drive revenue, reduce costs and mitigate risks. The impact to the business can be truly compelling — but that’s only the beginning of the story.

As powerful as traditional APM is, it’s very quickly evolving into what we call Application Intelligence. Our Application Intelligence Platform allows you to do three key things:  See, Act and Know.

  • See what’s happening in real-time:  We enable an integrated view of real-time user experience, application performance and infrastructure capacity.
  • Act fast to resolve issues:  We provide the ability to automate resolution of application or infrastructure bottlenecks – in production.
  • Know that you’re running at your best:  We enable deep, real-time analytics to help businesses make better decisions and create bigger impact – all with certainty and confidence.

It gives IT the operational visibility and control they need. It gives end users the great experience they demand. It gives the business real-time insight into business performance that most have never had.

In this post I'm going to focus on real-time analytics, but before I dive into how it works let me start with the context of traditional analytics. In the past, people have tried to directly query the underlying databases of an application. Unfortunately, these databases are primarily designed for inserting small amounts of data. When complex queries run against them there can be a significant performance hit to the application, which is a very bad thing. The workaround was an Operational Data Store (ODS) that would replicate the data in its current structure so you could run queries without impacting the application's performance. However, since that structure was designed to support transactional applications with short reads and writes, query performance and the ability to ask complicated questions easily weren't good enough. This meant extracting the data from the ODS, transforming it, and loading it into a data mart or data warehouse in a structure that was much easier to query. Given ever-growing data volumes, query performance was still an issue, so the next step was adding indexes and aggregates. These workarounds put the end business user further and further away from real-time information. Often the delay means you're looking at information that's hours, days, or even weeks old. In the last few years some disruptive technologies have been introduced that can simplify this. Though they're actually quite effective, they come with an exorbitant price tag.

The Application Intelligence approach is considerably different. To get the benefits of Application Intelligence, the complete path the transaction takes through the end user device, the application server tiers, middleware, third party API calls, databases etc. is fully instrumented and the transaction context is propagated in real time. That’s how it sees everything and identifies performance problems and bottlenecks, often before end users are impacted. It also provides the opportunity to extract payload information such as the quantity of items in a shopping cart, revenue in executed transactions, dollar amount of funds transferred and so much more. AppDynamics introduced this concept with Real-Time Business Metrics. It’s allowed customers access to metrics like revenue, number of new accounts, support ticket volume, items shipped and campaign effectiveness in real-time. You can even correlate them with the performance of the application to see how changes in response times for end users impact business performance. It plucks all of this information out as it passes through the application and stores it in a data platform that’s easy to query. AppDynamics is taking this concept to the next level in the recently announced Transaction Analytics, which will extend the scale of the underlying data platform, enabling larger data volumes, more complex queries and impressive visualizations. In real time you can answer questions like:

  • What was the revenue impact by product category associated with the two marketing campaigns we ran on Black Friday?
  • Which Android users experienced failed checkout transactions in the last 24 hours, and what was in their cart at the time?
  • How many customers experienced a slow ‘submit reservation’ transaction in the last hour from a Chrome browser in New York and what was the total dollar amount of those reservations?
  • How many transactions originated from Tier 1 partners over the last 90 days and what was the resource utilization and revenue associated with those requests?
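For illustration only, here is a sketch of how the first question above might be expressed against an event store. The endpoint, index, field names, and dates are hypothetical, and this is an Elasticsearch-style aggregation rather than the actual AppDynamics query interface.

```python
# Hypothetical sketch: "revenue impact by product category for two Black Friday
# campaigns", expressed as an Elasticsearch-style aggregation over transaction
# events. Endpoint, index, field names, and dates are illustrative assumptions.
import json
import requests

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"terms": {"campaign_id": ["bf_email_blast", "bf_social_promo"]}},
                {"range": {"timestamp": {"gte": "2013-11-29", "lt": "2013-11-30"}}},
            ]
        }
    },
    "aggs": {
        "by_category": {
            "terms": {"field": "product_category"},
            "aggs": {"revenue": {"sum": {"field": "order_total"}}},
        }
    },
}

resp = requests.post(
    "http://localhost:9200/transaction-events/_search",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
for bucket in resp.json()["aggregations"]["by_category"]["buckets"]:
    print(bucket["key"], bucket["revenue"]["value"])
```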

This is the transformation from APM to Application Intelligence. IT has leveraged APM to drive business success through application performance and efficiency. Application Intelligence extends beyond those benefits, allowing the business to use information to make better decisions and differentiate against the competition. In an era where IT and the CIO are expected to lead the business into the future, Application Intelligence is an incredible opportunity for leaders to drive success.

If you’d like to get started using AppDynamics to try business & operational analytics out for yourself, you can create a free account and start monitoring and analyzing your applications today.  If you are interested in being a part of our upcoming Transaction Analytics beta program, you can find more information here.


The new battleground in Analytics Part 1: Transforming APM into Application Intelligence

As a relative newcomer to the world of Application Performance Management (APM) and the larger category of Application Intelligence, I knew I had a lot to learn. I came into this market having spent the last 15 years in the Analytics space. The idea of taking large volumes of disparate data and turning it into actionable insight is something I've always found compelling. Though people generally think of a flashy dashboard, Analytics is so much more. At Business Objects I got to be part of the evolution from standardized reporting through ad hoc analysis, data exploration, and predictive analysis. There was still a significant barrier, though, to making it work in the real world, where data volume, variety and velocity were too much to handle at the application level. Data had to be extracted and transformed before being loaded into a data warehouse or mart, where it became available to end users. That's an expensive and time-consuming process that only delivered the insight you were looking for hours, days or even weeks later. However, at SAP our customers were finally able to realize the promise of these new capabilities with the introduction of SAP HANA (High-performance Analytic Appliance). That in-memory platform delivered amazing computing power capable of simplifying the infrastructure of application environments while giving the business real-time insight.

The business impact of having the right information at the right time is staggering, to the point that customers have been willing to pay SAP handsomely for these capabilities. Companies can drive huge revenue increases, save millions in cost, and reduce their risk substantially by better leveraging their data assets. I have great passion for this space and for helping customers use insights to change their business.

Given my background, getting into the APM space seemed like a huge leap. My impression was I’d be working with IT organizations more than the business owners. I’d be in a much more technical space, solving technology problems more than business issues. Clearly, I had a lot to learn. I started the learning process by focusing on two things: learning the product and understanding how it drives value for our customers.

First, I began to familiarize myself with the product and learn how the features provide benefit. As an APM tool, AppDynamics can help people quickly find bad code, inefficient code, database problems and a host of other things that negatively impact the experience for end users of applications. I don’t want to discount the value of deep knowledge and experience in the APM space, but after watching a demo or two and playing with the software a bit I felt like I could use AppDynamics to find problems and performance bottlenecks in an application. It was simple and instantly compelling.

The second phase of learning was digging into customer use cases and how they realize value with AppDynamics. There are some amazing stories of real business impact. Here are a few of my favorites:

FamilySearch: Using AppDynamics, FamilySearch saved $4.8M in infrastructure and related costs over two years by making their applications more efficient. With help from AppDynamics they scaled the use of their application 10x without growing their infrastructure!

Fox News: After deploying AppDynamics, they reduced their MTTR (mean time to resolution) from weeks to just a few hours. They decreased the number of support tickets by 90%, and they stabilized new releases in hours compared to the full week it used to take them.

Edmunds.com:  With AppDynamics they went from having 10 people working on a single problem for several days to fixing things in a few hours. That’s a huge productivity improvement and a dramatic reduction in lost revenue.

During a proof-of-value implementation with a large prospect, we helped them immediately identify a bug they knew existed but couldn't find. Thirty minutes later the problem was resolved. Later, over dinner and drinks, they were noticeably excited and acknowledged they had worked around this bug by adding 2,000 additional servers to their environment. Those 2,000 servers were no longer needed. That's an unbelievable cost savings, and it took less than an hour!

You can learn more about how these and other companies are getting impressive value out of AppDynamics at http://www.appdynamics.com/case-studies.

After immersing myself in these learning exercises, it suddenly occurred to me that Application Performance Management is simply another form of Analytics. It turns large amounts of data into information you can use to take quick action. It can help you drive revenue, reduce costs and mitigate risks. The results can be incredibly exciting. Our customers are seeing it every day!

With that realization, it was time for me to take the next step and dig into Real-Time Business Metrics, Transaction Analytics and the corresponding transformation of APM into Application Intelligence.  By being plugged directly into an application’s code you can get real-time analytics without the challenges and costs associated with traditional analytics. It’s a game changer for business and IT, and this will be the focus of the upcoming Part 2 in this series.

AppDynamics Transaction Analytics Raises the Bar for Business Metrics

Last year, we announced our Real-Time Business Metrics (RTBM) capabilities to the world. The response from our existing and new customers has been amazing. We’ve seen business and IT dashboards created and used by business and technology executives alike.

We’ve seen major hotel chains collect metrics like searches and bookings per brand, incoming partner search and booking volumes, and detailed revenue reports by property location and booking channel. We’ve seen some of the largest car rental agencies in the world track revenue per region in real time to proactively identify business impact. We’ve also seen major footwear retailers track revenue per shoe type, checkout total and discount percentages to make better business decisions. These are just a small sample of the industries and use cases where RTBM has been successfully used.

When we introduced RTBM, it was a meaningful first step on the path to create a real-time analytics platform to serve both business and technology stakeholders equally well. After all, the most successful businesses today are transforming themselves into digital enterprises to deliver products and services through the channels consumers use most often. Mobile and internet-based consumption have taken center stage and most enterprises are investing heavily to meet that consumption model.

However, RTBM isn’t the only way we allow customers to better understand  their business problems.  AppDynamics’ overall analytics capabilities encompass three main focus areas:

Business Analytics harnesses the power of RTBM to provide meaningful business information and alerts.


Business Analytics is used to answer the following types of questions:

  • What is our total revenue for the day?
  • How many checkout transactions failed today?
  • How many trades completed successfully today?
  • How many car rental reservations were made today?

The second type of analytics we offer is Operational Analytics, which collects all of the infrastructure data (OS metrics, JMX metrics, WMI or perfmon metrics, log data, etc.) from your devices and application components.


Operational Analytics is used to answer the following types of questions:

  • What is the breakdown of failed transactions between 3G and 4G mobile networks?
  • How many connection pool errors led to failed transactions today?
  • Which application components are throwing errors that caused failed transactions?

Today, we announced the future availability of AppDynamics’ third type of analytics, Transaction Analytics, an integral component of the AppDynamics Application Intelligence Platform, which was also announced today.

Transaction Analytics captures raw business data from all layers of an application and across every single node without altering any code.  Gathering this data, correlating it with operational intelligence, and presenting it in a way that makes it easy for customers to answer important business questions is a capability we are extremely excited about.


For example, capturing the specific details of each transaction enables businesses to answer the following questions:

  • Which Android users experienced failed checkout transactions in the last 24 hours and what was in their cart at the time?
  • How many customers experienced a slow ‘submit reservation’ transaction in the last hour from a Chrome browser in New York and what was the total dollar amount of those reservations?
  • What was the revenue impact by product category associated with the two marketing campaigns we ran on Black Friday?
  • How many transactions originated from Tier 1 partners over the last 90 days and what was the resource utilization and revenue associated with those requests?
  • How many car rental reservations were made for each category of vehicle and what is the associated breakdown of revenue by region?
  • What are the total bookings from California broken down between iOS and Android, and what is the associated breakdown between wireless carriers?

The answers to the questions above help make better business decisions like:

  • Immediately fixing a business mistake that is negatively impacting revenue
  • Investing in a “mobile first” strategy
  • Choosing which marketing campaigns to run again and which to stop
  • Selecting which product types to buy more of for increased revenue growth, and determining which regions to market them in
  • Deciding which mobile platform deserves more time and effort
  • Determining whether the mobile application needs to be more resilient to poor network performance
  • Identifying technology problems related to specific products, regions, or consumption models

One of the unique differentiators of all of our analytics capabilities is that they do not require any change to existing code. AppDynamics will gather all of the requisite data from your servers without special API calls compiled into your code or extra logging statements.

If you’d like to get started using AppDynamics to try business & operational analytics out for yourself, you can create a free account and start monitoring and analyzing your applications today.  If you are interested in being a part of our upcoming Transaction Analytics beta program, you can find more information here.

How Monitoring Analytics can make DevOps more Agile

The word “analytics” is an interesting and often abused term in the world of application monitoring. For the sake of correctness, I’m going to reference Wikipedia in how I define analytics:

Analytics is the discovery and communication of meaningful patterns in data.

Simply put, analytics should make IT’s life easier. Analytics should point out the bleeding obvious from all the monitoring data available, and guide IT so they can effectively manage the performance and availability of their application(s). Think of analytics as “doing the hard work” or “making sense” of the data being collected, so IT doesn’t have to spend hours figuring out for themselves what is being impacted and why.

Discovery
This is about how effectively a monitoring solution can self-learn the environment it's deployed in, so it's able to baseline what is normal and abnormal for that environment. This is really important, as every application and business transaction is different. A key reason why many monitoring solutions fail today is that they rely on users to manually define what is normal and abnormal using static or simplistic global thresholds. The classic examples are "alert me if server CPU > 90%" and "alert me if response times are > 2 seconds," both of which normally result in a full inbox (which everyone loves) or an alert storm for IT to manage.

Communication
The communication part of analytics is just as important as the discovery part. How well can IT interpret and understand what the monitoring solution is telling them? Is the data shown actionable, or does it require manual analysis, knowledge or expertise to arrive at a conclusion? Does the user have to look for problems on their own, or does the monitoring solution present problems by itself? A monitoring solution should provide answers rather than questions.

One thing we did at AppDynamics was make analytics central to our product architecture. We’re about delivering maximum visibility through minimal effort, which means our product has to do the hard work for our users. Our customers today are solving issues in minutes versus days thanks to the way we collect, analyze and present monitoring data. If your applications are agile, complex, distributed and virtual then you probably don’t want to spend time telling a monitoring solution what is normal, abnormal, relevant or interesting. Let’s take a look at a few ways AppDynamics Pro is leveraging analytics:

Seeing The Big Picture
Seeing the bigger picture of application performance allows IT to quickly prioritize whether a problem is impacting an entire application or just a few users or transactions. For example, in the screenshot to the right we can see that in the last day the application processed 19.2 million business transactions (user requests), of which 0.1% experienced an error. 0.4% of transactions were classified as slow (> 2 SD), 0.3% were classified as very slow (> 3 SD) and 94 transactions stalled. The interesting thing here is that AppDynamics used analytics to automatically discover, learn and baseline what normal performance is for the application. No static, global or user-defined thresholds were used; the performance baselines are dynamic and relative to each type of business transaction and user request. So if a credit card payment transaction normally takes 7 seconds, then this shouldn't be classified as slow relative to other transactions that may only take 1 or 2 seconds.

The big picture here is that application performance generally looks OK, with 99.3% of business transactions having a normal end user experience with an average response time of 123 milliseconds. However, if you look at the data shown, 0.7% of user requests were either slow or very slow, which is almost 140,000 transactions. This is not good! The application in this example is an e-commerce website, so it’s important we understand exactly what business transactions were impacted out of those 140,000 that were classified as slow or very slow. For example, a slow search transaction isn’t the same as a slow checkout or order transaction – different transactions, different business impact.
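To illustrate the kind of dynamic baselining described above, here is a minimal sketch that classifies each response time against a per-transaction baseline using the > 2 SD and > 3 SD buckets mentioned earlier. It is a simplified illustration, not AppDynamics' actual baselining algorithm.

```python
# Illustration of dynamic baselining: classify each response time against the
# moving average and standard deviation of its own business transaction type,
# rather than against a static global threshold.
from collections import defaultdict
from statistics import mean, stdev

history = defaultdict(list)  # business transaction name -> recent response times (ms)

def classify(txn_name, response_ms, window=1000, min_samples=30):
    samples = history[txn_name][-window:]
    history[txn_name].append(response_ms)
    if len(samples) < min_samples:   # not enough data to establish a baseline yet
        return "normal"
    baseline, sd = mean(samples), stdev(samples)
    if response_ms > baseline + 3 * sd:
        return "very slow"
    if response_ms > baseline + 2 * sd:
        return "slow"
    return "normal"
```

Because the baseline is kept per transaction type, a credit card payment that normally takes 7 seconds is not flagged as slow simply because other transactions finish in 1 or 2 seconds.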

Understanding the real Business Impact
The below screenshot shows business transaction health for the e-commerce application sorted by number of very slow requests. Analytics is used in this view by AppDynamics so it can automatically classify and present to the user which business transactions are erroneous, slow, very slow and stalling relative to their individual performance baseline (which is self-learned). At a quick glance, you can see two business transactions–“Order Calculate” and “OrderItemDisplayView”–are breaching their performance baseline.

This information helps IT determine the true business impact of a performance issue so they can prioritize where and what to troubleshoot. You can also see that the “Order Calculate” transaction had 15,717 errors. Clicking on this number would reveal the stack traces of those errors, thus allowing the APM user to easily find the root cause. In addition, we can see the average response time of the “Order Calculate” transaction was 576 milliseconds and the maximum response time was just over 64 seconds, along with 10,393 very slow requests. If AppDynamics didn't show how many requests were erroneous, slow or very slow, then the user could spend hours figuring out the true business impact of such an incident. Let's take a look at those very slow requests by clicking on the 10,393 link in the user interface.

Seeing individual slow user business transactions
As you can probably imagine, using average response times to troubleshoot business impact is like putting a blindfold over your eyes. If your end users are experiencing slow transactions, then you need to see those transactions to effectively troubleshoot them. For example, AppDynamics uses real-time analytics to detect when business transactions breach their performance baseline, so it’s able to collect a complete blueprint of how those transactions executed across and inside the application infrastructure. This enables IT to identify root cause rapidly.

In the screenshot above you can see all “OrderCalculate” transactions have been sorted in descending order by response time, making it easy for the user to drill into any of the slow user requests. You can also see, looking at the summary column, that AppDynamics continuously monitors the response time of business transactions using moving averages and standard deviations to identify real business impact. Given the results our customers are seeing, we'd say this is a pretty proven way to troubleshoot business impact and application performance. Let's drill into one of those slow transactions…

Visualizing the flow of a slow transaction
Sometimes a picture says a thousand words, and that's exactly what visualizing the flow of a business transaction can do for IT. IT shouldn't have to look through pages of metrics or GBs of log files to correlate and guess why a transaction may be slow. AppDynamics does all of that for you! Look at the screenshot below, which shows the flow of an “OrderCalculate” transaction that takes 63 seconds to execute across 3 different application tiers. You can see the majority of time is spent calling the DB2 database and an external 3rd-party HTTP web service. Let's drill down to see what is causing that high amount of latency.

Automating Root Cause Analysis
Finding the root cause of a slow transaction isn't trivial, because a single transaction can invoke several thousand lines of code – kind of like finding a needle in a haystack. Call graphs of transaction code execution are useful, but it's much faster and easier if the user can shortcut to hotspots. AppDynamics uses analytics to do just that by presenting code hotspots to the user automatically so they can pinpoint the root cause in seconds. You can see in the screenshot below that almost 30 seconds (18.8+6.4+4.1+0.6) was spent in a web service call “calculateTaxes” (which was called 4 times), with another 13 seconds spent in a single JDBC database call (the user can click to view the SQL query). Root cause analysis with analytics can be a powerful asset for any IT team.
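As a simplified illustration of hotspot detection, the sketch below aggregates the time spent per call across a transaction's call graph and surfaces the biggest contributors first. The segment data is hypothetical, loosely based on the example above.

```python
# Aggregate time spent per call across a transaction's call graph segments and
# report the biggest contributors first. The segment data here is hypothetical.
from collections import Counter

segments = [
    ("calculateTaxes (web service call)", 18.8),
    ("calculateTaxes (web service call)", 6.4),
    ("calculateTaxes (web service call)", 4.1),
    ("calculateTaxes (web service call)", 0.6),
    ("JDBC database call", 13.0),
]

def hotspots(call_segments, top_n=5):
    totals = Counter()
    for name, seconds in call_segments:
        totals[name] += seconds
    return totals.most_common(top_n)

print(hotspots(segments))
# The web service call dominates at roughly 30 seconds, followed by the
# 13-second JDBC call, matching the hotspots called out above.
```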

Verifying Server Resource or Capacity
It’s true that application performance can be impacted by server capacity or resource constraints. When a transaction or user request is slow, it’s always a good idea to check what impact OS and JVM resource is having. For example, was the server maxed out on CPU? Was Garbage Collection (GC) running? If so, how long did GC run for? Was the database connection pool maxed out? All these questions require a user to manually look at different OS and JVM metrics to understand whether resource spikes or exhaustion was occurring during the slowdown. This is pretty much what most sysadmins do today to triage and troubleshoot servers that underpin a slow running application. Wouldn’t it be great if a monitoring solution could answer these questions in a single view, showing IT which OS and JVM resource was deviating from its baseline during the slowdown? With analytics it can.

AppDynamics introduced a new set of analytics in version 3.4.2 called “Node Problems” to do just this. The above screenshot shows this view whereby node metrics (e.g. OS, JVM and JMX metrics) are analyzed to determine if any were breaching their baseline and contributing to the slow performance of the “OrderCalculate” transaction. The screenshot above shows that % CPU idle, % memory used and MB memory used have deviated slightly from their baseline (denoted by blue dotted lines in the charts). Server capacity on this occasion was therefore not a contributing factor to the slow application performance. Hardware metrics that did not deviate from their baseline are not shown, thus reducing the amount of data and noise the user has to look at in this view.

Analytics makes IT more Agile
If a monitoring solution is able to discover abnormal patterns and communicate these effectively to a user, then this significantly reduces the amount of time IT has to spend managing application performance, thus making IT more agile and productive. Without analytics, IT can become a slave to data overload, big data, alert storming and silos of information that must be manually stitched together and analyzed by teams of people. In today’s world, “manually” isn’t cool or clever. If you want to be agile then you need to automate the way you manage application performance, or you’ll end up with the monitoring solution managing you.

If your current monitoring solution requires you to manually tell it what to monitor, then maybe you should be evaluating a next generation monitoring solution like AppDynamics.

App Man.


Self Tuning Applications in the Cloud: It’s about time!

In my previous blog I wrote about the hard work needed to successfully migrate applications to the cloud. But why go through all that work to get to the cloud? It's to take advantage of the dynamic nature of the cloud, with the ability (and agility) to quickly scale applications. Your application's load probably changes all day, all week, and all year. Now your application can use more or fewer resources based on changes in load. Just ask the cloud for as much computing resource as you need at any given time; unlike in a data center, the resources are available at the push of a button.

But that only works in the marketing video. Back in the real world, no one can find that magic button to push. Instead, scaling in the cloud involves pushing many buttons, running many scripts, configuring various pieces of software, and then fixing whatever didn't quite work. And of course even that is the easy part compared to actually knowing when to scale, how much to scale, and even which parts of your application to scale. And this repeats all day, every day, at least until everyone gets discouraged.