The AppD Approach: Capturing 5M metrics/minute with AWS and Aurora

AppDynamics customers have always benefited from flexibility when it comes to where the AppDynamics Controller is hosted, be it in our SaaS environment managed by AppDynamics, or hosted “on-premises” under the complete control of the enterprise. While it has always been possible to host an AppDynamics Controller in AWS, achieving large scale required significant effort. In current benchmarks, an AWS-based Controller with an Aurora backend easily reached the 5M metrics/minute mark with minimal tuning.

This solution is a game-changer for on-premises customers who are open to hosting the AppD platform in AWS, as it allows them to run a large-scale Controller without procuring physical hardware. (Update: as of June 2018, new AppDynamics SaaS customers also use this solution.)

So Why Aurora?

High Performance and Scalability

Aurora provides higher performance than MySQL, allowing AppDynamics to scale the Controller to handle more metrics than was previously possible in a cloud environment.
Before Aurora, building a Controller at scale in AWS took effort and expense: setting up higher-performance storage behind your database, tweaking file systems, and so on. With AWS and Aurora, we easily start to collect 1M metrics/minute and, with minimal effort, scale to over 5M metrics/minute and 10K agents. Our Professional Services team can provide guidance for deployments larger than 10K agents.

With AWS, you can start small and scale up as needed, thereby avoiding the risk of underutilizing expensive hardware.

Fully Managed

With Aurora, there is no need to worry about database backups, as the Amazon Relational Database Service (RDS) manages them for you. It also handles software patching and minor upgrades, thus lowering the total cost of ownership.

High Availability and Durability

With the multi-AZ deployment option, Amazon Aurora offers 99.99% availability. Aurora automatically replicates data across multiple availability zones, and can typically fail over to a new instance in less than 30 seconds. The failover process is automatic and seamless, requiring no changes to the Controller itself. It’s possible to fail over not only to another Availability Zone, but also to another Region, if needed.

AppD does support High Availability configurations for on-premises deployments, which require two physical machines, each with its own storage. If one machine has an issue, you can quickly switch to the second system.

In AWS with Aurora, a single machine is utilized to host the Controller application server, which is separate from the Aurora backend. If the application server has an issue, you can spin up another one based on an image as fast as AWS can start the EC2 instance (usually less than 90 seconds).

The benefit: There is no longer a need for a second, barely utilized machine waiting to do something. You can spin-up a new machine on demand, as needed.

Solid Security

Amazon Aurora is highly secure. Network isolation is achieved through the use of an Amazon Virtual Private Cloud (VPC). Aurora also supports encryption for data at rest, with no measurable performance impact.

How Does It Work?

One Click Deployment with CloudFormation

The AppDynamics CloudFormation template provides a comprehensive solution for automating the process of creating the virtual infrastructure required to host a Controller in AWS. With a single button-click, you can create all the security groups, elastic network interfaces, EC2 instances, and virtual infrastructure required.

Installation Using Enterprise Console

Installation of the Controller is done via the AppD Enterprise Console command-line interface (CLI); the installer includes a new option to specify Aurora as the database type.


./submit-job --platform-name myplatform --service controller --job install \
  --args controllerProfile=large \
        controllerPrimaryHost=<hostname> \
        controllerTenancyMode=single \
        controllerRootUserPassword="<password>" \
        mysqlRootPassword="<password>" \
        controllerAdminUsername="admin" \
        controllerAdminPassword="<password>" \
        databaseType=Aurora \
        controllerDBPort=3388 \
        controllerDBHost="<aurora hostname>"


AppDynamics version 4.4.3 makes it easy to deploy a Controller in AWS with Aurora. This solution offers a host of benefits for AppD customers, including vastly improved scalability, reliability, availability, security and performance. For these reasons, AppDynamics SaaS now leverages Aurora as well.

And with support for large-scale Controllers, the entire AppD platform can now be deployed at scale in AWS, including the Events Service Cluster, EUM Server and other components.

Want to learn more? The product documentation has additional info on this great new capability from AppD.

Himanshu Sharma, co-author, is Senior DevOps Engineer on the AppDynamics Platform team, and lead on the Aurora-backed controller.

Derek Mitchell is part of AppDynamics Global Services team, which is dedicated to helping enterprises realize the value of business and application performance monitoring. AppDynamics’ Global Services’ consultants, architects, and project managers are experts in unlocking the cross-stack intelligence needed to improve business outcomes and increase organizational efficiency.

The AppD Approach: How to Monitor the AWS Cloud and Save Money

The AppDynamics platform is highly extensible, allowing our customers to monitor a variety of key Amazon Web Services (AWS) metrics. We have 20 unique AWS extensions that capture stats for everything from Auto Scaling, which optimizes app performance and cost, to Storage Gateway, which enables on-prem apps to use AWS cloud storage. We’re always fine-tuning these cloud-monitoring extensions and making improvements where necessary. In some cases, we’ll integrate these features into our core APM product to keep it best in class.

What can AppD’s extensions do for you? Each efficiently gathers metrics from all Regions and Availability Zones in the AWS Global Infrastructure. Using Amazon CloudWatch APIs, our extensions pull metrics from specific AWS components and pass them to the AppDynamics Controller for tracking, and for creating health rules and dashboard visualizations. These cloud-monitoring extensions give our customers greater insight into how their apps and businesses are running on AWS.

The AWS EC2 Monitoring Extension, for instance, retrieves data from Amazon CloudWatch on EC2 instances—including CPU, network and IO utilization—and displays this information in the AppDynamics Metric Browser. Similarly, the Billing Monitoring Extension captures billing statistics, while the S3 Monitoring Extension pulls in data on S3 bucket size, the number of objects in a bucket, HTTP metrics, and more.

AppDynamics users can also leverage the AWS cloud connector extension to automatically scale up or down in the cloud based on a variety of rules-based policies, such as the health of business transactions, the end-user experience, database and remote services, error rates, and overall app performance.

This cloud-monitoring extension helped one AppD customer avoid a Black Friday ecommerce meltdown by implementing health rules to automatically scale up EC2 resources when certain load and response time metrics were breached, and scale down when those metrics returned to normal. Another plus: By adding an authorization step to these workflows—one that asked permission before spinning instances up or down—the customer paid only for the EC2 resources they needed.

Simple Setup

It’s easy to edit an AppDynamics AWS extension’s config file, extract the performance metrics you need, and show the data on your dashboard. Once you provide the necessary information in config.yml, the extension will do the rest. You can select time ranges for data gathering, include/exclude specific metrics, and specify how you’d like your stats aggregated: average, maximum, minimum, sum, or sample count.
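
As an illustration, a config.yml for the EC2 extension might look roughly like the following. The field names here are representative, not authoritative — consult the sample config bundled with the extension for the exact schema:

```yaml
# Representative config.yml for an AppDynamics AWS extension.
# Field names are illustrative; check the extension's bundled
# sample config for the exact schema.
accounts:
  - awsAccessKey: "<access key>"
    awsSecretKey: "<secret key>"
    displayAccountName: "MyAccount"
    regions: ["us-east-1"]

# Collect only the metrics you need, and choose how each is
# aggregated: Average, Maximum, Minimum, Sum, or SampleCount.
metricsConfig:
  includeMetrics:
    - name: "CPUUtilization"
      statType: "Average"
    - name: "NetworkIn"
      statType: "Sum"
  metricsTimeRange:
    startTimeInMinsBeforeNow: 10
    endTimeInMinsBeforeNow: 0
```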

How AWS Extensions Save You Money

Many of our customers are gathering AWS metrics in the AppD platform. Collecting these metrics requires the use of Amazon API calls, which can get expensive when used excessively. As reported by AWS, a recent study by migration analytics firm TSO Logic found that most organizations are overpaying for cloud services, and that 35% of an average company’s cloud computing bill is wasted cost.

AWS regularly releases new products, all of which use CloudWatch APIs. The good news is that AppDynamics offers a special extension that monitors products using those APIs. By helping you collect only the metrics you need, AppD can help you manage your AWS bill.

EC2 Example

CloudWatch monitoring has two levels of pricing: Basic and Detailed. In Basic monitoring, metrics are updated in five-minute intervals; in Detailed, metrics are updated every minute.

Let’s say you have an EC2 instance that returns seven metrics, and you want to monitor all EC2 instances from one AWS region. To do so, you install the EC2 Monitoring Extension on your machine agent, and add your access key and secret key to the config.yml file. Then add the AWS region you’d like to monitor.

The extension makes a call to AWS to get the list of instances for each region listed in the config file, as well as the metrics associated with them. To get a metric value, AppDynamics calls CloudWatch to get the value associated with each instance.

Simply put, API calls add up in a hurry. Let’s look at an example:

1 list call + 7 metrics x 1 every minute   = 8 calls
x 60 minutes/hour = 480 calls
x 24 hours/day = 11,520 calls
x 30 days = 345,600 calls
x 20 instances = 6,912,000 calls

CloudWatch pricing gives you one million free API requests each month. This means you’ll be charged for the remaining ~6 million calls.
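
The arithmetic above is easy to sanity-check, and it also shows how quickly a lower polling frequency pays off:

```python
# Back-of-the-envelope CloudWatch API call count for the EC2 example:
# 1 list call + 7 metric calls, once per minute, per instance.
calls_per_minute = 1 + 7
calls_per_month = calls_per_minute * 60 * 24 * 30   # one instance
total_calls = calls_per_month * 20                  # 20 instances

free_tier = 1_000_000
billable = total_calls - free_tier

print(calls_per_month)   # 345600 per instance
print(total_calls)       # 6912000 across 20 instances
print(billable)          # 5912000 billable calls (~5.9M)

# Polling every 5 minutes instead of every minute cuts the total 5x:
print(total_calls // 5)  # 1382400
```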

How can AppD help? By letting you specify how frequently your AWS extensions should make API calls. This feature allows you to dramatically reduce the number of AWS calls, while still monitoring all of your instances. It has been very effective for several AppD clients and is being added over time to all of our AWS extensions.

Select Your Cloud-Monitoring Metrics

Metric selection is another key feature of our AWS extensions. This is important because CloudWatch, by default, provides several metrics that may not be necessary for monitoring your environment.

AppDynamics’ extensions let you select only the metrics you’d like to monitor. In addition to better managing your CloudWatch bill, this also allows you to better manage which data is important to your business, and which isn’t.

You can monitor individual instances as well: Choose the instance you’d like to monitor, and the extension will only make calls regarding that instance. Since this feature allows you to monitor just those instances you use regularly, it saves money.

AppDynamics is always refining its AWS extensions, making improvements where necessary and integrating some of these features into our core APM product. Again, if there’s an AWS metric you need, we collect it. If you have specific AWS needs, contact your account manager. Want to learn more about AppDynamics? Learn more here or schedule a demo today.

Understanding the Momentum Behind .NET Core

Three years ago Satya Nadella took over as CEO of Microsoft, determined to spearhead a renewal of the iconic software maker. He laid out his vision in a famous July 10, 2014 memo to employees in which he declared that “nothing was off the table” and proclaimed his intention to “obsess over reinventing productivity and platforms.”

How serious was Nadella? In the summer of 2016, Microsoft took the bold step of releasing .NET Core, a free, cross-platform, open-source version of its globally popular .NET development platform. With .NET Core, .NET apps could run natively on Linux and macOS as well as Windows.

For customers .NET Core solved a huge problem of portability. .NET shops could now easily modernize monolithic on-premises enterprise applications by breaking them up into microservices and moving them to cloud platforms like Microsoft Azure, Amazon Web Services, or Google Cloud Platform. They had been hearing about the benefits of containerization: speed, scale and, most importantly, the ability to create an application and run it anywhere. Their developers loved Docker’s ease of use and installation, as well as the automation it brought to repetitive tasks. But just moving a large .NET application to the cloud had presented daunting obstacles. The task of lifting and shifting the large system-wide installations that supported existing applications consumed massive amounts of engineering manpower and often did not deliver the expected benefits, such as cost savings. Meanwhile, the dependency on the Windows operating system limited cloud options, and microservices remained a distant dream.

.NET Core not only addressed these challenges, it was also ideal for containers. In addition to starting a container with an image based on Windows Server, engineers could also use much smaller Windows Nano Server images or Linux images. This meant engineers had the freedom to work across platforms; they were no longer required to deploy server apps solely on Windows Server images.

Typically, the adoption of a new developer platform takes time, but .NET Core experienced a large wave of early adoption. Then, in August 2017, .NET Core 2.0 was released, and adoption increased exponentially. The number of .NET Core users reached half a million by January 2018. By achieving almost full feature parity with .NET Framework 4.6.1, .NET Core 2.0 took away the pain that had previously existed in shifting from the traditional .NET Framework to .NET Core. Libraries that hadn’t existed in .NET Core 1.0 were added in 2.0, and because .NET Core implemented all 32,000 APIs in .NET Standard 2.0, most applications could reuse their existing code.

Engineering teams that had struggled with DevOps initiatives found that .NET Core allowed them to accelerate their move to microservices architectures and to put in place a more streamlined path from development to testing and deployment. Lately, hiring managers have started telling their recruiters to be sure to mention the opportunity to work with .NET Core as an enticement to prospective hires—something that never would have happened with .NET.

At AppDynamics, we’re so excited about the potential of .NET Core that we’ve tripled the size of the engineering team working on .NET. And, just last month, we announced a beta release of support for .NET Core 2.0 on Windows using the new .NET micro agent released in our Winter ‘17 product release. This agent provides improved microservices support as more customers choose .NET Core to implement multicloud strategies. Reach out to your account team to participate in this beta.

Stay tuned for my next blog posts on how to achieve end-to-end visibility across all your .NET apps, whether they run on-premises, in the cloud, or in multi-cloud and hybrid environments.

The AppD Approach: IoT and AWS Greengrass

As both data and processing power rise on the edge of the network, monitoring the performance of edge devices becomes increasingly important. In addition to deploying the AppDynamics IoT monitoring platform to monitor C/C++ and Java apps, you can extend end-to-end visibility to applications running in an AWS Greengrass core by using the AppDynamics IoT RESTful APIs. The easiest way to do this today is with a Lambda function. We recently demonstrated this at AWS re:Invent using Cisco IOx and Cisco Kinetic together with AWS Greengrass on a Cisco Industrial Integrated Services router.

The best thing about this approach is that it opens up a new ecosystem of edge applications to the benefits of unified application monitoring. It ensures customers will resolve incidents faster, reduce downtime, and lower operations costs. Meanwhile, the combined strengths of AWS Greengrass and AppDynamics’ IoT Monitoring Platform allow the very large volumes of data generated by the Internet of Things to be mined for business insights and harnessed to achieve business objectives.

AWS Greengrass is designed to simplify the implementation of local processing on edge devices. A software runtime, it lets companies execute compute, messaging, data caching, sync, and machine learning (ML) inference instructions even when connectivity to the cloud is temporarily unavailable. Since its release, it has helped accelerate adoption of IoT by making it easier for developers to create and test applications in the cloud using their programming language of choice and then deploy the apps to the edge.

Once the apps are deployed, AppDynamics’ IoT Monitoring Platform provides deep visibility, in real-time, by letting developers capture application performance data, errors and exceptions, and business data. Since the AppDynamics solution is designed for flexible integration at the edge, Lambda functions can be individually instrumented, or a dedicated Lambda function can be written to provide insight into all the Lambdas running. This allows for a wide range of edge applications to monitor any key metric that makes sense to the business.
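
As a rough sketch of what a dedicated instrumentation Lambda might look like, the function below assembles a beacon payload of the general kind the AppDynamics IoT REST API accepts and would POST it to the collector. The endpoint URL, device fields, and payload schema here are illustrative assumptions, not the documented API — consult the AppDynamics IoT API reference for the real schema:

```python
import json
import time

# Hypothetical collector endpoint -- the real URL and app key come
# from your AppDynamics IoT application configuration.
COLLECTOR_URL = "https://iot-collector.example.com/iot/v1/application/<appKey>/beacons"

def build_beacon(device_id, metric_name, value):
    """Package a custom business metric as an IoT beacon payload."""
    return {
        "deviceInfo": {"deviceId": device_id, "deviceType": "GreengrassCore"},
        "versionInfo": {"firmwareVersion": "1.0"},
        "customEvents": [{
            "timestamp": int(time.time() * 1000),
            "eventType": metric_name,
            "doubleProperties": {metric_name: value},
        }],
    }

def handler(event, context):
    """Greengrass Lambda handler: report bytes processed at the edge."""
    beacon = build_beacon("plc-gateway-01", "bytesIngested", event.get("bytes", 0))
    # In a real deployment you would POST the beacon list to
    # COLLECTOR_URL here, e.g. with urllib.request.
    return json.dumps([beacon])
```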

In the demo at AWS re:Invent, we instrumented an edge application running on a manufacturing floor that was reading sensor data from a programmable logic controller (PLC) over a Modbus interface and reporting it back to the cloud. A key success metric was how much edge computing reduced the large volume of inbound data to the much smaller, meaningful volume pushed to the cloud. AppDynamics provided real-time verification by keeping track of the volume of data being ingested into the Lambda functions, and of the data being processed and sent to the various cloud applications, including AWS Cloud.

Learn more about AppDynamics IoT monitoring and please send us any feedback or questions.

This Is How Amazon’s Servers Rarely Go Down [Infographic]

Amazon Web Services (AWS), Amazon’s best-in-class cloud services offering, had downtime of only 2.5 hours in 2015. You may think their uptime of 99.97 percent had something to do with an engineering team of hundreds, a budget of billions, or dozens of data centers across the globe—but you’d be wrong. Amazon’s website, video, and music offerings, and even AWS itself, all leverage multiple AWS products to achieve that availability, and those are the same products we get to use as consumers. With some clever engineering and good service decisions, anyone can get uptime numbers close to Amazon’s for only a fraction of the cost.
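
The relationship between downtime and an availability percentage is simple to compute:

```python
# Availability as a percentage, given total downtime over a year.
HOURS_PER_YEAR = 365 * 24  # 8,760

def availability(downtime_hours):
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)

print(round(availability(2.5), 2))  # 2.5 hours down -> 99.97%

# For comparison, "five nines" (99.999%) allows only ~5 minutes/year:
print(round(HOURS_PER_YEAR * 60 * (1 - 0.99999), 1))  # ~5.3 minutes
```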

But before we discuss specific techniques to keep your site constantly available, we need to accept a difficult reality: Downtime is inevitable. Even Google was offline in 2015, and if the world’s largest websites can’t achieve 100 percent uptime, you can be sure your company can’t either. Instead of trying to prevent downtime entirely, reframe your thinking: do everything you can to keep your service as usable as possible while failure occurs, then recover from it as quickly as possible.

Here’s how to architect an application to isolate failure, recover rapidly from downtime, and scale in the face of heavy load. (Though this is only a brief overview: there are plenty of great resources online for more detailed descriptions. For example, don’t be afraid to dive into your cloud provider’s documentation. It’s the single best source for discovering all the amazing things they can do for you.)

Architecture and Failure Mitigation

Let’s begin by considering your current web application. If your primary database were to go down, how many services would be affected? Would your site be usable at all? How quickly would customers notice?

If your answers are “everything,” “not at all,” and “immediately,” you may want to consider a more distributed, failure-resistant application architecture. Microservices—that is, many different, small applications that work together to act like a larger app—are extremely popular as an engineering paradigm, and the failure of an individual service is far less noticeable to clients.

For example, consider a basic shop application. If it were all one big service, failure of the database would take the entire site offline; no one could use it at all, even just to browse products or plan purchases. But now let’s say you have microservices instead of a monolith. Instead of a single shop application, perhaps you have an authentication service to log in users, a product service to browse the shop, and an order fulfillment service to charge customers and ship goods. A failure in the order fulfillment database means that only people who try to place an order see errors.

Losing an element of your operation isn’t ideal, but it’s not anywhere near as bad as having your entire site unavailable. Only a small fraction of customers will be affected, while everyone else can happily browse your store as if nothing was going wrong. And with proper logging, you can note the prospects that had failed requests and reach out to them personally afterward, apologizing for the downtime and hopefully still converting them into paying customers.
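
The isolation described above can be sketched in a few lines, with each storefront feature calling its own backing service and degrading independently (the service names and error messages here are made up for illustration):

```python
# Minimal sketch of per-service failure isolation in a shop app.
# Each feature calls its own backing service; a failure in one
# degrades only that feature instead of taking the whole site down.

class ServiceDown(Exception):
    pass

def browse_products(catalog_service):
    try:
        return catalog_service()
    except ServiceDown:
        return {"error": "catalog temporarily unavailable"}

def place_order(order_service):
    try:
        return order_service()
    except ServiceDown:
        # With proper logging, these failed prospects can be
        # contacted and converted later.
        return {"error": "checkout is down, please try again soon"}

# The order-fulfillment database is down, but browsing still works:
def healthy_catalog():
    return {"products": ["widget", "gadget"]}

def broken_orders():
    raise ServiceDown()

print(browse_products(healthy_catalog))  # browsing unaffected
print(place_order(broken_orders))        # only checkout degrades
```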

This is all possible with a monolithic app, but microservices distribute failure and better isolate it to specific parts of a system. You won’t prevent downtime; instead, you’ll make it affect fewer people, which is a much more achievable goal.

Databases, Automatic Failover, and Preventing Data Loss

It’s 2 a.m. and a database stops working. What happens to your website? What happens to the data in your database? How long will you be offline?

This used to be the sysadmin nightmare scenario: pray that the last backup was usable and recent, that downtime would only be a few hours, and that only a day’s worth of data had perished. But nowadays the story is very different, thanks in part to Amazon but also to the power and flexibility of most database software.

If you use the AWS Relational Database Service (RDS), you get daily backups for free, and restoration of a backup is just a click away. Better yet, with a multi-availability zone database, you’re likely to have no downtime at all and the entire database failure will be invisible.

With a multi-AZ database, Amazon keeps an up-to-date copy of your database in another availability zone: a logically separate datacenter from wherever your primary database is. An internet outage, a power blip, or even a comet can take out the primary availability zone, and Amazon will detect the downtime and automatically promote the database copy to be your main database. The process is seamless and happens immediately—chances are, you won’t even experience any data loss.

But availability zones are geographically close together. All of Amazon’s us-east-1 datacenters are in Virginia, only a few miles from each other. Let’s say you also want to protect against the complete failure of all systems in the United States and keep a current copy of your data in Europe or Asia. Here, RDS offers cross-region read replicas that leverage the underlying database technology to create consistent database copies that can be promoted to full-fledged primaries at the touch of a button.

Both MySQL and PostgreSQL, the two most popular relational database systems on the market and both available as RDS database engines, offer native capabilities to ship database events to external follower databases as they occur. Here, RDS takes advantage of a feature anyone can use, though with Amazon’s strong consumer focus it’s significantly easier to set up in RDS than to configure manually. Typically, data is shipped to followers at the same time it is committed to the primary. Unfortunately, across a continent you’re looking at a data-loss window of about 200 to 500 milliseconds, since an event must be sent from your primary database and read by the follower.

Still, for a consistent cross-continental backup, a 500-millisecond loss window is much better than hours. So next time your database fails in the middle of the night, your monitoring service won’t even wake you. Instead, you can read about it in the morning—if you can even detect that it occurred. And that means no downtime and no unhappy customers.

Auto Scaling, Repeatability, and Consistency

Amazon’s software-as-a-service (SaaS) offerings, such as RDS, are extremely convenient and very powerful. But they’re far from perfect. Generally, AWS products are much slower to provision compared to running the software directly yourself. Plus, they tend to be several software versions behind the most recent releases.

In databases, this is a fine tradeoff. You create databases so rarely that slow provisioning doesn’t matter, and you want extremely stable, well-tested, slightly older software. If you try to stay on the bleeding edge, you’ll just end up bloody. But for other services, being locked into Amazon’s product offerings makes less sense.

Once you have an RDS instance, you need some way for customers to get their data into it and for you to interact with that data once it’s there. Specifically, you need web servers. And while Amazon’s Elastic Beanstalk (AWS’ platform to deploy and scale web applications) is conceptually good, in practice it is extremely slow, has middling software support, and makes problems painfully difficult to debug.

But AWS’ primary offering has always been the Elastic Compute Cloud (EC2). Running EC2 nodes is fast and easy, and supports any kind of software your application needs. And, unsurprisingly, EC2 offers exceptional tools to mitigate downtime and failure, including auto scaling groups (ASGs). With an ASG, Amazon keeps as many servers up as you specify, even across availability zones. If a server becomes unresponsive or passes other thresholds defined by you (such as amount of incoming traffic or CPU usage), new nodes will automatically spin up.
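
The threshold logic an ASG applies is conceptually simple. The toy model below mimics a scale-out/scale-in decision; real ASGs express this through scaling policies and CloudWatch alarms, and the thresholds and sizes here are invented for illustration:

```python
# Toy model of an auto scaling group's capacity decision.
# Real ASGs use scaling policies + CloudWatch alarms; the numbers
# below are illustrative only.

def desired_capacity(current, cpu_percent, min_size=2, max_size=10,
                     scale_out_at=70, scale_in_at=30):
    if cpu_percent > scale_out_at:
        return min(current + 1, max_size)   # add a node, capped at max
    if cpu_percent < scale_in_at:
        return max(current - 1, min_size)   # remove a node, floored at min
    return current                          # within band: hold steady

print(desired_capacity(4, 85))  # 5: CPU high, scale out
print(desired_capacity(4, 20))  # 3: CPU low, scale in
print(desired_capacity(2, 10))  # 2: never drops below min_size
```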

New servers by themselves do you no good. You need a process to make sure new nodes are provisioned correctly and consistently, so that a new server joining your auto scaling group also has your web software and the credentials to access your database. Here, you can take advantage of another Amazon tool, the Amazon Machine Image (AMI). An AMI is a saved copy of an EC2 instance. Using an AMI, AWS can spin up a new node that is an exact copy of the machine that generated the AMI.

Packer, by HashiCorp, makes it easy to create and save AMIs; it’s free and open source, and there are plenty of other tools that can simplify AMI creation. AMIs are the fundamental building blocks of EC2: with clever AMI use, you’ll be able to create new, functional servers in less than five minutes.
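
A minimal Packer template for baking such an AMI might look like this. The region, base AMI, instance type, and provisioning commands are placeholders to show the shape of a template, not a recommended build:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "<base ami id>",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "webapp-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get update",
      "sudo apt-get install -y nginx"
    ]
  }]
}
```

Running `packer build` against a template like this launches a temporary EC2 instance, applies the provisioners, snapshots the result as a new AMI, and tears the instance down.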

It’s common to need additional provisioning and configuration even after an AMI is started—perhaps you want to make sure the latest version of your application is downloaded onto your servers from GitHub, or that the most recent security patches have been applied to your installed packages. In cases such as these, a provisioning system is a necessity. Chef and Puppet are the two biggest players in this space, and both offer excellent integrations with AWS. The ideal use case here is an AMI with credentials to automatically connect to your Chef or Puppet provisioning system, which then ensures the newly created node is as up to date as possible.


Final Thoughts

By relying on auto scaling groups, AMIs, and a sensible provisioning system, you can create a system that is completely repeatable and consistent. Any server could go down and be replaced, or 10 more servers could enter your load balancer, and the process would be seamless, automatic, and almost invisible to you.

And that’s the secret of why Amazon’s services rarely go down. It’s not the hundreds of engineers, or dozens of datacenters, or even the clever products: It’s the automation. Failure happens, but if you detect it early, isolate it as much as possible, and recover from it seamlessly—all without requiring human intervention—you’ll be back on your feet before you even know a problem occurred.

There are plenty of potential concerns with powerful automated systems like this. How do you ensure new servers are ones provisioned by you, and not an attacker trying to join nodes to your cluster? How do you make sure transmitted copies of your databases aren’t compromised? How do you prevent a thousand nodes from accidentally starting up and dropping a massive AWS bill into your lap? This overview of the techniques AWS leverages to prevent downtime and isolate failure should serve as a good jumping-off point to those more complicated concepts. Ultimately, downtime is impossible to prevent, but you can keep it from broadly affecting your customers. Working to keep failure contained and recovery as rapid as possible leads to a better experience both for you and your users.


A Guide to Performance Challenges with AWS EC2: Part 4

If you’re just starting, check out Part 1, Part 2, and Part 3 to get up to speed on your guide to the top five performance challenges you might come across managing AWS Elastic Compute Cloud (EC2) instances, and how best to address them. We kicked off with the ins and outs of running your virtual machines in Amazon’s cloud, and how to navigate a multi-tenancy environment while managing different storage options. Last week, we went over how to identify the right applications to run on EC2 instances for your unique workloads. This week, we’ll wrap up by covering Amazon Elastic Load Balancer (ELB) performance and overall AWS outages.

Poor ELB Performance

Amazon’s Elastic Load Balancer (ELB) is Amazon’s answer to load balancing, and it integrates seamlessly into the AWS ecosystem. Rather than sending calls directly to individual EC2 instances, we can insert an ELB in front of our EC2 instances, send load to the ELB, and let the ELB distribute that load across the instances. This makes it easier to add and remove EC2 instances from our environment, and lets us leverage auto-scaling groups that grow and shrink the EC2 environment based on our rules or on performance metrics. This relationship is shown in figure 3.


Figure 3. ELB-to-EC2 Relationship

While we may think of an ELB as a stand-alone appliance, like a Cisco LocalDirector or F5 BIG-IP, under the hood an ELB is a proprietary load-balancing application running on EC2 instances. As such, it benefits from the same elastic capabilities your own EC2 instances do, but it also suffers from the same constraint as any load balancer running in a virtual environment: it must be sized appropriately. If your application receives substantial load, you need enough ELB capacity to handle and distribute that load across your EC2 instances. Unfortunately (or fortunately, depending on how you look at it), you do not have visibility into the number of ELB instances or their configurations; you must rely on Amazon to manage that complexity for you.

So how do you handle that scale-up requirement for applications that receive substantial load? There are two things that you need to keep in mind:

  • Pre-warming: if you know that your application will receive substantial load or you are expecting flash traffic then you should contact Amazon and ask them to “pre-warm” the load balancer for you. Amazon will then configure the load balancer to have the appropriate capacity to handle the load that you expect.

  • Recycling ELBs: for most applications it is advisable to recycle your EC2 instances regularly to clean up memory or other clutter that might appear on a machine, but in the case of ELBs, you do not want to recycle them if at all possible. Because an ELB consists of EC2 instances that have grown, over time, to facilitate your user load, recycling them would effectively reset their capacity back to zero and force them to start over.

To detect whether you are suffering from a poor ELB configuration and need to pre-warm your environment, measure the response times of synthetic business transactions (simulated load) or response times from the client perspective (such as via JavaScript in the browser or instrumented mobile applications), and then compare the total response time with the response time reported by the application itself. The difference between the client’s response time and your application’s response time consists of network latency plus the wait/queue time of your requests in the load balancer. Historical data should help you understand network latency, so if you see consistently poor latency or, even worse, your clients receive HTTP 503 errors reporting that the server cannot handle any more load, then you should contact Amazon and ask them to pre-warm your ELB.
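This detection logic can be sketched in a few lines: subtract the application’s server-side response time from the client-observed response time; what remains is network latency plus load-balancer wait time, and a jump in that remainder beyond the historical network-latency baseline suggests queuing at the ELB. The function names, thresholds, and sample numbers below are hypothetical illustrations, not AppDynamics or AWS APIs:

```python
def elb_overhead_ms(client_ms, app_ms):
    """Portion of the client response time spent outside the application:
    network latency plus wait/queue time in the load balancer."""
    return client_ms - app_ms

def suspect_elb_queueing(client_ms, app_ms, baseline_network_ms, tolerance_ms=50):
    """Flag when the outside-the-app overhead exceeds the historical
    network-latency baseline by more than a tolerance."""
    return elb_overhead_ms(client_ms, app_ms) > baseline_network_ms + tolerance_ms

# Hypothetical measurements: 900 ms at the client, 200 ms in the app,
# against a ~120 ms historical network-latency baseline
print(suspect_elb_queueing(900, 200, 120))  # True -> consider asking Amazon to pre-warm
print(suspect_elb_queueing(330, 200, 120))  # False -> overhead within normal latency
```

In practice the client-side and server-side timings would come from your APM tooling; the point is that only the *difference* between them, tracked against its own baseline, isolates the load balancer’s contribution.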

Handling AWS Failures and Outages

Amazon failures are infrequent, but they can and do occur. For example, AWS had an outage in its Northern Virginia data center (US-EAST-1) for 6-8 hours on September 20, 2015 that affected more than 20 of its services, including EC2. Many big-name clients were affected by the outage, but one notable company that managed to avoid any “significant impact” was Netflix. Netflix has created what it calls its Simian Army: a set of processes with colorful names like Chaos Monkey, Latency Monkey, and Chaos Gorilla (get the simian reference?) that regularly wreak havoc on its application and its Amazon deployment. As a result, Netflix has built its application to handle failure, so the loss of an entire data center had no significant impact.

AWS runs across more than a dozen data centers around the world, grouped into regions.



Amazon divides the world into regions, and each region contains multiple availability zones (AZs), where each AZ represents a data center. When a data center fails, redundant data centers in the same region can take over. For services like S3 and RDS, data in a region is safely replicated between AZs, so if one AZ fails the data is still available. With EC2, however, you choose the AZs to which you deploy, so it is advisable to deploy multiple instances of your applications and services across multiple AZs in your region. You are in control of how redundant your application is, but redundancy means running more instances (in different AZs), which equates to more cost.

Netflix’s Chaos Gorilla is a tool that simulates a full region outage, and Netflix has tested its application to ensure it can sustain one. Cross-AZ data replication is available to you for free in AWS, but replicating data across regions is a far more complicated problem. S3 supports cross-region replication, at the cost of transferring data out of one region and into another. RDS cross-region replication varies by database engine and can come at much higher cost. Overall, Adrian Cockcroft, the former cloud architect of Netflix, tweeted that maintaining active-active data replication across regions accounts for about 25% of the total cost.

All of this is to say that resiliency and high availability are at odds with both financial costs as well as the performance overhead of data replication, but are all available to the diligent. In order to be successful at handling Amazon failures (and scheduled outages for that matter), you need to architect your application to protect against failure.


Amazon Web Services may have revolutionized computing in the cloud, but it also introduced new concerns and challenges that we need to be able to identify and respond to. This series presented five challenges that we face when managing an EC2 environment:

  • Running in a Multi-Tenancy Environment: how do we determine when our virtual machines are running on hardware shared with other virtual machines and those other virtual machines are noisy?

  • Poor Disk I/O Performance: how do we properly interpret AWS’s IOPS metric and determine when we need to opt for a higher IOPS EBS volume?

  • The Wrong Tool for the Job: how do we align our application workload with Amazon optimized EC2 instance types?

  • Poor ELB Performance: how do ELBs work under-the-hood and how do we plan for and manage our ELBs to match expected user load?

  • Handling AWS Failures and Outages: what do we need to consider when building a resilient application that can sustain AZ or even full region failures?

Hopefully this series gave you some starting points for your own performance management exercises and helped you identify some key challenges in your own environment that may be contributing to performance issues.


A Guide to Performance Challenges with AWS EC2: Part 3

If you’re just starting, check out Part 1 and Part 2 to get up to speed on your guide to the top 5 performance challenges you might come across managing AWS Elastic Compute Cloud (EC2), and how to best address them. We kicked off with the ins and outs of running your virtual machines in Amazon’s cloud, how to navigate your way through a multi-tenancy environment, and how to manage the different storage options. This week, we’ll discuss identifying the right EC2 instance types for your unique workloads.

The Wrong Tool for the Job

It is quite common to see EC2 deployments that start simple and eventually evolve into mission-critical components of the business. While companies often want to move into the cloud, they tend to do so cautiously, initially creating sandbox deployments and moving non-critical applications to the cloud. The danger, however, is that as this sandbox environment grows into something more substantial, initial decisions for things like base AMIs and EC2 instance types are not re-evaluated over time. As a result, applications may end up running on EC2 instances that are not best suited to their workload.

Amazon has defined a host of different EC2 instance types, and it is important to choose the right one for your application’s use case. These instance types provide different combinations of CPU, memory, disk, and network capability and are categorized as follows:

  • General Purpose
  • Compute Optimized
  • Memory Optimized
  • GPU
  • Storage Optimized

General purpose instances are good starting points that provide a balance of compute, memory, and network resources. They come in two flavors: fixed performance and burstable performance. Fixed performance instances (M3 and M4) guarantee you specific performance capacity and are good for applications and services that require consistent capacity, such as small and mid-sized databases, data processing tasks, and backend services. Burstable performance instances (T2) provide a baseline level of CPU performance with the ability to burst above that baseline for a period of time. These instances are good for applications that vary in their compute requirements, such as development environments, build servers, code repositories, low traffic websites and web applications, microservices, early production experiments, and small databases.

Compute optimized instances (C3 and C4) provide high-powered CPUs and favor compute capacity over memory and network resources; they provide the best compute/cost value. They are best for applications that require a lot of computational power, such as high-performance front-end applications, web servers, batch processes, distributed analytics, high-performance science and engineering applications, ad serving, MMO gaming, and video encoding.

Memory optimized instances (R3) are optimized for memory-intensive applications and provide the best memory (GB) / cost value. The best use cases are high-performance databases, distributed memory caches, in-memory analytics, genome assembly and analysis, larger deployments of SAP, Microsoft SharePoint, and other enterprise applications.

GPU instances (G2) are optimized for graphics and general purpose GPU computing applications and best for 3D streaming, machine learning, video encoding, and other server-side graphics or GPU compute workloads.

Storage optimized instances come in two flavors: High I/O instances (I2) and Dense-storage instances (D2). High I/O instances provide very fast SSD-backed instance storage and are optimized for very high random I/O performance; they are best for applications like NoSQL databases (Cassandra and MongoDB), scale-out transactional databases, data warehouses, Hadoop, and cluster file systems. Dense-storage instances deliver high disk throughput and provide the best disk throughput performance; they are best for Massively Parallel Processing (MPP) data warehousing, MapReduce and Hadoop distributed computing, distributed file systems, network file systems, log, or data-processing applications.

The best strategy for selecting an instance type is to choose the closest matching category (instance type family) from the list above, select an instance from that family, and then load test your application. Monitoring the performance of your application under load will reveal whether it is compute-bound, memory-bound, or network-bound; if you have selected the wrong instance type family, adjust accordingly. Finally, the load test results will help you choose the right sized instance within that family.
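The classification step above can be sketched as a simple rule over load-test metrics. The function and the 80% saturation threshold are hypothetical illustrations of the idea, not a prescribed methodology:

```python
def dominant_bottleneck(cpu_pct, mem_pct, net_pct, threshold=80):
    """Classify a load-test run by its most saturated resource.
    Returns 'balanced' when nothing crosses the saturation threshold."""
    usage = {"compute": cpu_pct, "memory": mem_pct, "network": net_pct}
    name, value = max(usage.items(), key=lambda kv: kv[1])
    return name if value >= threshold else "balanced"

# e.g. a run pegging CPU at 95% while memory and network stay moderate
print(dominant_bottleneck(95, 40, 30))  # 'compute' -> consider a compute-optimized family
print(dominant_bottleneck(50, 60, 40))  # 'balanced' -> a general purpose family may fit
```

A real load test would of course look at peak and sustained values over time rather than a single sample, but the mapping from the dominant bottleneck to an instance-type family is the same.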

A Guide to Performance Challenges with AWS EC2: Part 2

Amazon Web Services (AWS) revolutionizes production application deployments through elastic scalability and an hourly payment plan. Companies can pay for the infrastructure they need for any given hour of the day, scaling up to meet high user demand during peak hours and scaling down during off-peak hours. The AWS Elastic Compute Cloud (EC2) is one of its core components. It has many of the same characteristics as traditional virtual machines, but is tightly integrated into the AWS ecosystem, so many of its capabilities differ from a traditional virtual machine as well.

Last week, we kicked off our series on your guide to the top 5 performance challenges you might come across managing an AWS EC2 environment, and how to best address them. We started off with the ins and outs of running your virtual machines in Amazon’s cloud, and how to navigate your way through a multi-tenancy environment.  

Poor Disk I/O Performance

AWS supports several different types of storage options, the core of which include the following: 

  • EC2 Instance Store
  • Elastic Block Store (EBS)
  • Simple Storage Service (S3) 

EC2 instances can access the physical disks attached to the machine hosting the instance and use them for temporary storage. The important thing to note about this type of storage is that it is ephemeral: it persists only for the lifetime of the EC2 instance and is destroyed when the instance stops. Data that needs to outlive the instance, therefore, should not be kept in the instance store.

For more common storage needs we’ll opt for either EBS or S3. From the perspective of how they are accessed, the main difference between the two is that EBS can be accessed through disk operations whereas S3 provides a RESTful API to store and retrieve objects. With respect to use cases, S3 is designed to store web-scale amounts of data whereas EBS is more akin to a hard drive. Therefore, when you need to access a block device from an application running on an EC2 instance and you need that data to persist between EC2 restarts, such as storage to support a database, you’ll typically leverage an EBS volume.

EBS volumes come in three flavors:

  • Magnetic Volumes: magnetic volumes can range in size from 1GB to 1TB and support 40-90 MB/sec throughput. They are good for workloads where data is accessed infrequently.
  • General Purpose SSD: general purpose SSDs can range in size from 1GB to 16TB and support 160 MB/sec throughput. They are good for use cases such as system boot volumes, virtual desktops, small to medium sized databases, and development and test environments.
  • Provisioned IOPS SSD: provisioned IOPS (Input/Output Operations Per Second) SSDs can range from 4GB to 16TB in size and support 320 MB/sec throughput. They are good for critical business applications that require sustained IOPS performance, or more than 10,000 IOPS (160 MB/sec), such as MongoDB, SQL Server, MySQL, PostgreSQL, and Oracle.

Choosing the correct EBS volume type is important, but it is also important to understand what these metrics mean and how they impact your EC2 instance and disk I/O operations. 

  • First, IOPS are measured in terms of a 16K I/O block size, so if your application is writing 64K then it will use 4 IOPS
  • Next, in order to realize the IOPS capacity, you need to send enough requests to the EBS volume to match its queue length, or number of pending operations supported by the volume
  • You must use an EC2 instance type that is EBS-optimized; the supported instance types are listed in the AWS documentation
  • The first time that you access a block from EBS, there will be approximately 50% IOPS overhead. The IOPS measurement assumes that you’ve already accessed the block at least once
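The first constraint can be made concrete with a little arithmetic: IOPS consumption is roughly the I/O size divided by the 16K accounting block, rounded up. This is a simplified sketch of the accounting model; real EBS behavior has additional subtleties:

```python
import math

IO_BLOCK_SIZE = 16 * 1024  # EBS counts IOPS in 16K blocks

def iops_consumed(io_size_bytes):
    """Estimate how many IOPS a single I/O request of the given size consumes."""
    return math.ceil(io_size_bytes / IO_BLOCK_SIZE)

print(iops_consumed(64 * 1024))  # 4 -> a 64K write uses 4 IOPS, as in the text
print(iops_consumed(4 * 1024))   # 1 -> small writes still consume a full IOP
```

This is why an application doing many small writes and one doing few large writes can show very different IOPS profiles at the same MB/sec throughput.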

With these constraints in mind you can better understand the CloudWatch EBS metrics, such as VolumeReadOps and VolumeWriteOps, and how IOPS are computed. Review these metrics in light of the EBS volume type that you are using to see whether you are approaching a limit. If you are, you will want to opt for a volume type that supports higher IOPS.

Figure 1 shows the VolumeReadOps and VolumeWriteOps for an EBS volume. From this example we can see that this particular EBS volume is experiencing about 2400 IOPS.

Figure 1. Measuring EBS IOPS
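CloudWatch reports VolumeReadOps and VolumeWriteOps as Sums of operations over each sampling period, so the average IOPS figure comes from dividing the combined Sum by the period length. A minimal sketch with hypothetical datapoints (in a real setup these Sums would come from a CloudWatch query with a 300-second period and Statistics of Sum):

```python
def average_iops(read_ops_sum, write_ops_sum, period_seconds):
    """Convert CloudWatch VolumeReadOps/VolumeWriteOps period Sums into average IOPS."""
    return (read_ops_sum + write_ops_sum) / period_seconds

# Hypothetical 5-minute (300 s) datapoints yielding roughly the 2400 IOPS discussed above
print(average_iops(450_000, 270_000, 300))  # 2400.0
```

Compare the resulting figure against your volume type’s IOPS ceiling to decide whether a higher-IOPS volume is warranted.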

A Guide to Performance Challenges with AWS EC2: Part 1

Amazon Web Services (AWS) revolutionized production application deployments through elastic scalability and an hourly payment plan. Companies can pay for the infrastructure that they need for any given hour of the day, scaling up to meet high user demand during peak hours and scaling down during off peak hours. One of the core components of AWS is the Elastic Compute Cloud (EC2), which is Amazon’s abstraction of a virtual machine running in the cloud. It has many of the same characteristics as traditional virtual machines, but it is tightly integrated into the AWS ecosystem, which means that it has many characteristics that are new or different than traditional virtual machines.

This blog kicks off a series on your guide to the top 5 performance challenges you might come across managing an AWS EC2 environment, and how to best address them.

Running in a Multi-Tenancy Environment

Amazon EC2 instances are virtual machines that run in Amazon’s cloud and, like all virtual machines, they ultimately run on physical hardware. One side effect of running virtual machines in an environment that you do not own is that you cannot control what other virtual machines run next to you on the same physical hardware and some of your neighbors may be noisier than others. Basically, the performance of EC2 instances can sometimes be spotty and you need to be able to effectively identify when your virtual machines are running on spotty hardware and react accordingly.

So, how do you know if you have noisy neighbors?

Answer: You need to review both the physical runtime characteristics of your virtual machines and Amazon’s CloudWatch metrics. You can monitor the CloudWatch metrics alongside OS- and application-level metrics and correlate them using an APM tool, such as AppDynamics. When examining the runtime behavior of a virtual machine using tools like top or vmstat, you’ll observe that, in addition to the CPU usage and idle percentages, the operating system reports the amount of “stolen” CPU. Stolen CPU is not as diabolical as it sounds: it is a relative measure of the CPU cycles that should have been available to run your processes but were not, because the hypervisor diverted them away from your instance. This may happen because you have met your allotted quota of CPU usage or because another instance running on the same physical hardware is occupying the available CPU capacity. A little investigation will help you discern between the two.

First, you need to know the underlying CPU powering the hardware running your virtual machine. You can connect into your machine and execute the following command to view the CPU information:

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2651 v2 @ 1.80GHz
stepping        : 4
cpu MHz         : 1800.083
cache size      : 30720 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de tsc msr pae cx8 apic cmov pat clflush acpi mmx fxsr sse sse2 ss ht nx constant_tsc up pni ssse3 popcnt
bogomips        : 3647.67
clflush size    : 64

In this example, the EC2 instance is running on a 1.8 GHz Xeon processor. Because an ECU is equivalent to a 1.0-1.2 GHz Xeon processor, 100% ECU utilization would be between 55% and 66% of the physical CPU usage. Next, we need to look at the runtime behavior of our virtual machine, which we can do with the top or vmstat command:

# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 1101516 162504 383016    0    0     0     6    1    1  0  0 100  0  0

The “us” metric reports the CPU usage, the “id” metric reports the idle percentage, and the “st” metric reports the amount of stolen CPU. The key to identifying noisy neighbors is to observe these metrics over time, paying particular attention to the stolen CPU time when your virtual machine is idle. Discrepancies in the stolen time while your CPU is idle indicate that you are sharing hardware with other customers. For example, if the stolen CPU percentage is 10% at one idle moment and 40% at another, chances are that you are not simply observing hypervisor activity but rather the behavior of another virtual machine.
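A small parsing sketch against the vmstat layout shown above: pull the us, id, and st columns and flag idle samples with high stolen time. The column indices match the vmstat output format above; the thresholds are hypothetical choices, not fixed rules:

```python
def parse_vmstat_row(row):
    """Extract us (user), id (idle), and st (stolen) percentages from a vmstat data row."""
    f = row.split()
    return {"us": int(f[12]), "id": int(f[14]), "st": int(f[16])}

def stolen_while_idle(samples, max_us=10, min_st=20):
    """Samples where the workload is essentially idle yet stolen time is high --
    candidates for noisy-neighbor activity rather than your own load."""
    return [s for s in samples if s["us"] <= max_us and s["st"] >= min_st]

rows = [  # two hypothetical idle samples with very different stolen time
    "0 0 0 1101516 162504 383016 0 0 0 6 1 1 0 0 90 0 10",
    "0 0 0 1101516 162504 383016 0 0 0 6 1 1 0 0 60 0 40",
]
flagged = stolen_while_idle([parse_vmstat_row(r) for r in rows])
print(flagged)  # only the 40%-stolen sample is flagged
```

In practice you would feed this a stream of samples (e.g. from `vmstat 5` output) and look for the flagged pattern recurring over time before concluding anything.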

In addition to reviewing the operating system CPU utilization and stolen percentage, you need to cross reference this with Amazon’s CloudWatch metrics. The CloudWatch CPUUtilization is defined by Amazon as follows:

The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing power required to run an application upon a selected instance. Depending on the instance type, tools in your operating system may show a lower percentage than CloudWatch when the instance is not allocated a full processor core.

The CloudWatch CPUUtilization metric is your source of truth for understanding your virtual machine’s processing capability and, combined with the operating system’s CPU metrics, provides enough information to discern the presence of a noisy neighbor. If the CloudWatch CPUUtilization metric is at 100% for your EC2 instance, but your operating system is not reporting that you’ve reached 100% of your ECU capacity (one core of your physical CPU type at 1.0-1.2 GHz) and the CPU stolen percentage is high, then you have a noisy neighbor draining compute capacity from your virtual machine. For example, with 1 ECU on a machine with a 2.4 GHz physical CPU, we would expect 100% ECU usage to be about 50% of the physical CPU. If the operating system reports that we’re at 30% CPU utilization with high stolen CPU usage and CloudWatch reports that we’re at 100% usage, then we have identified a noisy neighbor.
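The arithmetic and decision logic in this example can be sketched as a simple predicate. The 1.0-1.2 GHz ECU definition comes from the text; the saturation and stolen-time thresholds are hypothetical illustrations:

```python
ECU_GHZ = (1.0, 1.2)  # one ECU ~= a 1.0-1.2 GHz Xeon core

def ecu_ceiling_pct(physical_ghz):
    """Range of physical-CPU percentage that 100% of one ECU represents."""
    return (100 * ECU_GHZ[0] / physical_ghz, 100 * ECU_GHZ[1] / physical_ghz)

def noisy_neighbor(cloudwatch_cpu_pct, os_cpu_pct, os_stolen_pct, physical_ghz):
    """True when CloudWatch reports the instance saturated while the OS shows
    CPU usage well below the ECU ceiling alongside high stolen time."""
    _, ceiling = ecu_ceiling_pct(physical_ghz)
    return (cloudwatch_cpu_pct >= 99
            and os_cpu_pct < ceiling - 10
            and os_stolen_pct >= 20)

# The example from the text: 2.4 GHz host, CloudWatch at 100%, OS at 30% with high stolen time
print(ecu_ceiling_pct(2.4))              # upper bound 50.0 -> ~50% of the physical CPU
print(noisy_neighbor(100, 30, 35, 2.4))  # True -> a neighbor is draining capacity
print(noisy_neighbor(100, 48, 5, 2.4))   # False -> we have simply hit our ECU quota
```

The key design point is that neither signal alone is conclusive: CloudWatch at 100% could mean a legitimately busy instance, and stolen CPU alone could be normal hypervisor overhead; it is the combination that identifies the noisy neighbor.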

Figure 1 presents an example of the CloudWatch CPU Utilization as compared to the operating system CPU utilization and stolen percentages, captured graphically over time. This illustrates the discrepancy between how CloudWatch sees your instance and how your instance sees itself.


Figure 1. CloudWatch vs OS CPU Metrics


Once you’ve identified that you have a noisy neighbor, how should you react? You have a few options, but here’s what we recommend:

  • Observe the behavior to determine if it is a systemic problem (is the neighbor constantly noisy?)
  • If it is a systemic problem then move. While this might not be the best choice if you live in an apartment building, in Amazon it is simple: start a new AMI instance, register it with your ELB, and decommission the old one. If you get into the habit of viewing EC2 instances as disposable it will help you resolve these types of issues quickly
  • Consider implementing a rolling EC2 instance strategy. Again, building on the idea that EC2 instances are disposable, it is a good practice to keep your EC2 instances around only for a short period of time, such as a few hours. Just as your laptop benefits from being restarted regularly, so will your server instances. Over time your instances may accumulate clutter in memory and, rather than work overly hard to clean them up, it is far easier to replace them with new ones. The best strategy that I have seen in this space is to manage your rolling instances through your elasticity strategy. The cloud allows you to scale up when you need more capacity (peak hours) and scale down when you need less (off-peak hours). Getting into the habit of decommissioning the oldest instances when you scale down shortens the average lifespan of your EC2 instances and achieves a rolling EC2 instance strategy.
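The decommission-oldest rule from the last bullet can be sketched as follows. The instance records here are hypothetical stand-ins; in a real deployment they would come from an EC2 describe-instances call or your auto-scaling tooling:

```python
from datetime import datetime

def oldest_instances(instances, count):
    """Pick the longest-running instances to terminate when scaling down,
    so the fleet's average instance age keeps rolling over."""
    return sorted(instances, key=lambda i: i["launch_time"])[:count]

fleet = [  # hypothetical fleet records
    {"id": "i-aaa", "launch_time": datetime(2016, 5, 1, 8, 0)},
    {"id": "i-bbb", "launch_time": datetime(2016, 5, 1, 14, 0)},
    {"id": "i-ccc", "launch_time": datetime(2016, 5, 1, 11, 0)},
]
print([i["id"] for i in oldest_instances(fleet, 1)])  # ['i-aaa']
```

Tying this selection into your scale-down path, rather than terminating instances at random, is what turns ordinary elasticity into a rolling-instance strategy.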

Expanding Amazon Web Services Monitoring with AppDynamics

Enterprises are increasingly moving their applications to the cloud and Amazon Web Services (AWS) is the leading cloud provider. We announced expanded AWS monitoring with the AppDynamics Winter ‘16 release earlier this year. In this blog, I will provide some additional details on the expanded support of AWS monitoring.

Native Support of AWS components

Before the Winter ’16 release, only Amazon Simple Queue Service (SQS) was automatically discovered by AppDynamics Java APM agents and shown in the application flow map with key performance metrics. For other AWS components, customers had to configure backend discovery manually in AppDynamics or use the old Amazon CloudWatch monitoring extension to pull AWS metrics and track them in the metric browser or dashboards.

In the AppDynamics Winter ’16 release, the following AWS components are natively supported by Java APM agents:

  • Amazon DynamoDB: A fast and flexible NoSQL database service
  • Amazon Simple Storage Service (S3): Secure, durable, highly scalable object storage
  • Amazon Simple Notification Service (SNS): A pub-sub service for mobile and enterprise messaging

By native support, I mean automatic discovery and display of the Application flow map with key performance metrics without any manual configuration. The screenshot of AppDynamics application flow map in Figure 1 below shows an application that uses Amazon DynamoDB, S3, and SNS.


Figure 1 – Application using Amazon DynamoDB, S3, and SNS

19 New AWS Monitoring Extensions

The AppDynamics platform is highly extensible and can monitor technology solutions that are not discovered natively. In the past, we had a single Amazon CloudWatch monitoring extension that collected all AWS metrics via the Amazon CloudWatch APIs and passed them to the AppDynamics Controller, where they could be tracked in the metric browser or dashboards. This extension was not very efficient because it collected and passed data for all AWS components, even if a customer needed to monitor only one or two of them.

The AppDynamics Winter ’16 release announced 19 new AWS monitoring extensions for individual AWS components. These extensions still use the Amazon CloudWatch APIs, but each collects metrics for a specific AWS component and passes them to the AppDynamics Controller efficiently, for tracking, creating health rules, and visualizing in dashboards.

Here is the list of all the new AWS monitoring extensions:

  1. AWS Custom Namespace Monitoring Extension
  2. AWS SQS Monitoring Extension
  3. AWS S3 Monitoring Extension
  4. AWS Lambda Monitoring Extension
  5. AWS CloudSearch Monitoring Extension
  6. AWS StorageGateway Monitoring Extension
  7. AWS SNS Monitoring Extension
  8. AWS Route53 Monitoring Extension
  9. AWS Redshift Monitoring Extension
  10. AWS ElasticMapReduce Monitoring Extension
  11. AWS RDS Monitoring Extension


Figure 2 – Amazon RDS Metrics in AppDynamics Metric Browser

  12. AWS ELB Monitoring Extension
  13. AWS OpsWorks Monitoring Extension
  14. AWS EBS Monitoring Extension


Figure 3 – Amazon EBS Metrics in AppDynamics Metric Browser

  15. AWS Billing Monitoring Extension
  16. AWS AutoScaling Monitoring Extension
  17. AWS DynamoDB Monitoring Extension
  18. AWS ElastiCache Monitoring Extension
  19. AWS EC2 Monitoring Extension

Customers can leverage all the core functionality of AppDynamics (e.g., dynamic baselining, health rules, policies, actions) for all of these AWS metrics while correlating them with the performance of the applications using these AWS services.

AppDynamics Customers using AWS Monitoring

With AppDynamics, many of our customers have accelerated their application migration to the cloud, while others continue to monitor cloud applications as application complexity explodes with the move toward microservices and dynamic web services. As the workload on these cloud applications grew, AppDynamics came to the rescue by elastically scaling AWS resources to meet the exponential demand.

For example, Nasdaq accelerated its application migration to AWS and gained complete visibility into its complex application ecosystem. Heather Abbott, Senior Vice President of Corporate Solutions Technology at Nasdaq, summarized the experience with AppDynamics: “The ability to trace a transaction visually and intuitively through the interface was a major benefit AppDynamics delivered. This visibility was especially valuable when Nasdaq was migrating a platform from its internal infrastructure to the AWS Cloud. We used AppDynamics extensively to understand how our system was functioning on AWS, a completely new platform for us.”