This Is How Amazon’s Servers Rarely Go Down [Infographic]

Amazon Web Services (AWS), Amazon’s best-in-class cloud services offering, had downtime of only 2.5 hours in 2015. You may think their uptime of 99.9997 percent had something to do with an engineering team of hundreds, a budget of billions, or dozens of data centers across the globe—but you’d be wrong. Amazon’s website, video, and music offerings, and even AWS itself, all leverage multiple AWS products to get five nines of availability, and those are the same products we get to use as consumers. With some clever engineering and good service decisions, anyone can get uptime numbers close to Amazon’s for only a fraction of the cost.

But before we discuss specific techniques to keep your site constantly available, we need to accept a difficult reality: Downtime is inevitable. Even Google was offline in 2015, and if the single largest website can’t get 100 percent uptime, you can be sure it’s impossible for your company to do so too. Instead of trying to prevent downtime, reframe your thinking and do everything you can to make sure your service is as usable as possible even while failure occurs, and then recover from it as quickly as possible.

Here’s how to architect an application to isolate failure, recover rapidly from downtime, and scale in the face of heavy load. (Though this is only a brief overview: there are plenty of great resources online for more detailed descriptions. For example, don’t be afraid to dive into your cloud provider’s documentation. It’s the single best source for discovering all the amazing things they can do for you.)

Architecture and Failure Mitigation

Let’s begin by considering your current web application. If your primary database were to go down, how many services would be affected? Would your site be usable at all? How quickly would customers notice?

If your answers are “everything,” “not at all,” and “immediately” you may want to consider a more distributed, failure-resistant application architecture. Microservices—that is, many different, small applications that work together to act like a larger app—are extremely popular as an engineering paradigm. The failure of an individual service is less noticeable to all clients.

For example, consider a basic shop application. If it were all one big service, failure of the database takes the entire site offline; no one can use it at all, even just to browse products or plan purchases. But now let’s say you have microservices instead of a monolith. Instead of a single shop application, perhaps you have an authentication service to login users, a product service to browse the shop, and an order fulfillment service to charge customers and ship goods. A failure in the order fulfillment database means that only people who try to ship see errors.

Losing an element of your operation isn’t ideal, but it’s not anywhere near as bad as having your entire site unavailable. Only a small fraction of customers will be affected, while everyone else can happily browse your store as if nothing was going wrong. And with proper logging, you can note the prospects that had failed requests and reach out to them personally afterward, apologizing for the downtime and hopefully still converting them into paying customers.

This is all possible with a monolithic app, but microservices distribute failure and better isolate it to specific parts of a system. You won’t prevent downtime; instead, you’ll make it affect less people, which is a much more achievable goal.

Databases, Automatic Failover, and Preventing Data Loss

It’s 2 a.m. and a database stops working. What happens to your website? What happens to the data in your database? How long will you be offline?

This used to be the sysadmin nightmare scenario: pray the last backup was usable and recent, downtime would only be a few hours, only a day’s worth of data perished. But nowadays the story is very different, thanks in part to Amazon but also to the power and flexibility of most database software.

If you use the AWS Relational Database Service (RDS), you get daily backups for free, and restoration of a backup is just a click away. Better yet, with a multi-availability zone database, you’re likely to have no downtime at all and the entire database failure will be invisible.

With a multi-AZ database, Amazon keeps an up-to-date copy of your database in another availability zone: a logically separate datacenter from wherever your primary database is. An internet outage, a power blip, or even a comet can take out the primary availability zone, and Amazon will detect the downtime and automatically promote the database copy to be your main database. The process is seamless and happens immediately—chances are, you won’t even experience any data loss.

But availability zones are geographically close together. All of Amazon’s us-east-1 datacenters are in Virginia, only a few miles from each other. Let’s say you also want to protect against the complete failure of all systems in the United States and keep a current copy of your data in Europe or Asia. Here, RDS offers cross-region read replicas that leverage the underlying database technology to create consistent database copies that can be promoted to full-fledged primaries at the touch of a button.

Both MySQL and PostgreSQL, the two most popular relational database systems on the market and available as RDS database drivers, offer native capabilities to ship database events to external follower databases as they occur. Here, RDS takes advantage of a feature that anyone can use, though with Amazon’s strong consumer focus, it’s significantly easier to set up in RDS than to do it manually. Typically, data is shipped to followers simultaneously to data being committed to the primary. Unfortunately, across a continent, you’re looking at a data loss window of about 200 to 500 milliseconds, because an event must be sent from your primary database and be read by the follower.

Still, for recovering a cross-continental consistent backup system, 500 milliseconds is much better than hours. So next time your database fails in the middle of the night, your monitoring service won’t even wake you. Instead you can read about it in the morning—if you can even detect that it occurred. And that means no downtime and no unhappy custom.

Auto Scaling, Repeatability, and Consistency

Amazon’s software-as-a-service (SaaS) offerings, such as RDS, are extremely convenient and very powerful. But they’re far from perfect. Generally, AWS products are much slower to provision compared to running the software directly yourself. Plus, they tend to be several software versions behind the most recent releases.

In databases, this is a fine tradeoff. You almost never create databases so slow that startup doesn’t matter, and you want extremely stable, well-tested, slightly older software. If you try to stay on the bleeding edge, you’ll just end up bloody. But for other services, being locked into Amazon’s product offerings makes less sense.

Once you have an RDS instance, you need some way for customers to get their data into it and for you to interact with that data once it’s there. Specifically, you need web servers. And while Amazon’s Elastic Beanstalk (AWS’ platform to deploy and scale web applications) is conceptually good, in practice it is extremely slow with middling software support, and can be painfully difficult to debug problems.

But AWS’ primary offering has always been the Elastic Compute Cloud (EC2). Running EC2 nodes is fast and easy, and supports any kind of software your application needs. And, unsurprisingly, EC2 offers exceptional tools to mitigate downtime and failure, including auto scaling groups (ASGs). With an ASG, Amazon keeps as many servers up as you specify, even across availability zones. If a server becomes unresponsive or passes other thresholds defined by you (such as amount of incoming traffic or CPU usage), new nodes will automatically spin up.

New servers by themselves do you no good. You need a process to make sure  new nodes are provisioned correctly and consistently so a new server joining  your auto scaling group also has your web  software and credentials to access  your database. Here, you can take  advantage of another Amazon tool, the  Amazon  Machine Image (or AMI). An AMI is a saved copy of an EC2 instance.  Using an AMI, AWS can spin up a new node  that is  an exact copy of the machine that generated the AMI.

Packer, by Hashicorp, makes it easy to create and save AMIs, and is also free and open-source. But there are lots of  amazing tools that can simplify AMI creation. They are the fundamental building blocks of EC2. With clever AMI use you’ll  be able to create new, functional servers in less than 5 minutes.

It’s common to need additional provisioning and configuration even after an AMI is started—perhaps you want to make  sure  the latest version of your application is downloaded onto your servers from GitHub, or that the most recent security  patches  have been applied to your installed packages. In cases such as these a provisioning system is a necessity. Chef  and  Puppet  are the two biggest players in this space, and both offer excellent integrations with AWS. The ideal use case  here i  is an AMI  with credentials to automatically connect to your Chef or Puppet provisioning system, which then ensures  the  newly  created node is as up to date as possible.

 

Final Thoughts

By relying on auto scaling groups, AMIs, and a sensible provisioning system, you can create a system that is completely  repeatable and consistent. Any server could go down and be replaced, or 10 more servers could enter your load balancer,  and the process would be seamless, automatic, and almost invisible to you.

And that’s the secret why Amazon’s services rarely go down. Not the hundreds of engineers, or dozens of datacenters, or  even the clever products: It’s the automation. Failure happens, but if you detect it early, isolate it as much as possible, and  recover from it seamlessly—all without requiring human intervention—you’ll be back on your feet before you even knew a  problem occurred.

There are plenty of potential concerns with powerful automated systems like this. How do you ensure new servers are  ones  provisioned by you, and not an attacker trying to join nodes to your cluster? How do you make sure transmitted  copies of  your databases aren’t compromised? How do you prevent a thousand nodes from accidentally starting up and dropping a  massive AWS bill into your lap? This overview of the techniques AWS leverages to prevent downtime and isolate failure  should serve as a good jumping-off point to those more complicated concepts. Ultimately, downtime is  impossible to  prevent, but you can keep it from broadly affecting your customers. Working to keep failure contained and recovery as  rapid as possible leads to a better experience both for you and your users.

Share this Image On Your Site

Expanding Amazon Web Services Monitoring with AppDynamics

Enterprises are increasingly moving their applications to the cloud and Amazon Web Services (AWS) is the leading cloud provider. We announced expanded AWS monitoring with the AppDynamics Winter ‘16 release earlier this year. In this blog, I will provide some additional details on the expanded support of AWS monitoring.

Native Support of AWS components

Before the Winter ’16 release, only Amazon Simple Queue Service (SQS) was automatically discovered by AppDynamics Java APM agents and shown in the Application flow map with key performance metrics. For other AWS components, customers had to configure the discovery of the backends manually in AppDynamics or use the old Amazon Cloudwatch Monitoring extension to get the AWS metrics and track them in metric browser or dashboards.

In the AppDynamics Winter ’16 release, the following AWS components are natively supported by Java APM agents:

  • Amazon DynamoDB: A fast and flexible NoSQL database service
  • Amazon Simple Storage Service (S3): It provides secure, durable, highly scalable object storage
  • Amazon Simple Notification Service (SNS): It is Pub-sub Service for Mobile and Enterprise Messaging

By native support, I mean automatic discovery and display of the Application flow map with key performance metrics without any manual configuration. The screenshot of AppDynamics application flow map in Figure 1 below shows an application that uses Amazon DynamoDB, S3, and SNS.

Screen Shot 2016-05-18 at 4.57.37 PM.png

Figure 1 – Application using Amazon DynamoDB, S3, and SNS

19 New AWS Monitoring Extensions

The AppDynamics platform is highly extensible to monitor various technology solutions that are not discovered natively. In the past, we had Amazon Cloudwatch Monitoring extension that collected all the AWS metrics via Amazon Cloudwatch APIs and passed them to the AppDynamics controller where they were tracked in metric browsers or dashboards. This extension was not very efficient because it collected and passed a lot of data for all the AWS components even if a customer just needed to monitor one or two different AWS components they were using.

The AppDynamics Winter ’16 release announced 19 new AWS monitoring extensions to monitor different AWS components. These extensions still use Amazon Cloudwatch APIs, but collect metrics for a specific AWS component and pass it to AppDynamics controller for tracking, creating health rules and visualizing in dashboards efficiently.

Here is the list of all the new AWS monitoring extensions:

  1. AWS Custom Namespace Monitoring Extension
  2. AWS SQS Monitoring Extension
  3. AWS S3 Monitoring Extension
  4. AWS Lambda Monitoring Extension
  5. AWS CloudSearch Monitoring Extension
  6. AWS StorageGateway Monitoring Extension
  7. AWS SNS Monitoring Extension
  8. AWS Route53 Monitoring Extension
  9. AWS Redshift Monitoring Extension
  10. AWS ElasticMapReduce Monitoring Extension
  11. AWS RDS Monitoring Extension

Amazon_RDS_Metrics.png

Figure 2 – Amazon RDS Metrics in AppDynamics Metric Browser

  1. AWS ELB Monitoring Extension
  2. AWS OpsWorks Monitoring Extension
  3. AWS EBS Monitoring Extension

Amazon_EBS_Metrics.png

Figure 3 – Amazon EBS Metrics in AppDynamics Metric Browser

  1. AWS Billing Monitoring Extension
  2. AWS AutoScaling Monitoring Extension
  3. AWS DynamoDB Monitoring Extension
  4. AWS ElastiCache Monitoring Extension
  5. AWS EC2 Monitoring Extension

Customers can leverage all the core functionalities of AppDynamics (e.g. dynamic baselining, health rules, policies, actions, etc.) for all of these AWS metrics while correlating them with the performance of the application using these AWS environment.

AppDynamics Customers using AWS Monitoring

With AppDynamics, many of our customers have accelerated their application migration to the cloud while others continue to monitor cloud applications as the application complexity explodes with the move towards microservices and dynamic web services. As the workload on the these cloud applications grew, AppDynamics came to rescue by elastically scaling the AWS resources to meet the exponential demand for applications.

For example, Nasdaq accelerated their application migration to AWS and gained complete visibility into complex application ecosystem. Heather Abbott, Senior Vice President of Corporate Solutions Technology at Nasdaq, summarized their experience with AppDynamics. “The ability to trace a transaction visually and intuitively through the interface was a major benefit AppDynamics delivered. This visibility was especially valuable when Nasdaq was migrating a platform from its internal infrastructure to the AWS Cloud. We used AppDynamics extensively to understand how our system was functioning on AWS, a completely new platform for us.”

Get Into the Cloud: AppDynamics at Amazon Web Services re:Invent

Screen Shot 2015-10-01 at 12.22.58 PM.png

The AppDynamics and Amazon Web Services (AWS) relationship grows stronger each year, with marquee joint customers such as Nasdaq. Our Application Intelligence Platform continues to evolve enabling enterprises to manage their cloud applications more efficiently and gain complete visibility and control into an expanded set of AWS services, with an exclusive 60 day free trial for AWS customers.

AppDynamics at AWS re:Invent

October 6 – 9, 2015

AWS re:Invent, the Amazon Web Services annual user conference, is the Mecca for the AWS community. The AWS team has pulled together another great event in Las Vegas featuring keynote announcements, training and certification opportunities, over 250 technical sessions, a partner expo, after hours activities, and more.

AppDynamics will have expanded presence at the sold-out event.

Screen Shot 2015-09-30 at 3.42.20 PM.png

Join us at Booth #550, during the expo hours from October 6 to 9, where we’ll showcase how we can help our customers:

  • Accelerate application migration to the cloud

  • Manage cloud applications as complexity explodes

  • Scale cloud applications as workloads grow.

We’ll be handing out AWS exclusive t-shirts highlighting AppDynamics’ cloud story, with daily surprise social media giveaways.  Be sure to follow us on Twitter @AppDynamics for the latest developments on AppDynamics and AWS.

AppDynamics at AWS re:Invent

Global Partner Summit

October 6, 2015

AWS leadership will discuss about the future of the business and the AWS Partner Network (APN) at the Global Partner Summit at AWS re:Invent. Join us during the following session:

Topic: Realizing the Benefits of Cloud: Cloud Migration & Application Optimization

Speaker: Tom Laszewski, Global Partner Solution Architect Senior Manager at Amazon Web Services.

Tom will discuss application optimization and highlight the value AppDynamics brings to our joint customers.

Wait.  It’s Just the Beginning.  

The AppDynamics ADVANTAGE at AppSphere 2015

November 30 – December 4, 2015

Las Vegas

We’re happy to exclusively announce that Stephen Orban, Global Head of Amazon Web Services Enterprise Strategy, will serve as AppSphere Keynote on Tuesday, December 1. Stephen will discuss the Future of Enterprise IT.  

AppSphere is AppDynamics’ annual user conference — we’re focusing on enabling businesses to bring a competitive advantage to market and driving success to their organizations. Attendees will the future landscape of the software-defined business and gain insights from topics around SaaS at scale, mobile performance, business data analytics, working towards a DevOps structure and much more.

For AWS customers, we’re offering an exclusive $100 discount on AppSphere tickets if you register here with promo code AWSCUST.  Customers who register by 11/6 with promo code AWSCUST will enter to win a free trip to Vegas.

To learn more, AppDynamics solution for AWS ecosystem, please visit appdynamics.com/aws and sign up for a 60-day free trial of AppDynamics designed exclusively for AWS customers.

The AppDynamics Team and I are looking forward to sharing these customer stories, best practices and demonstrating the AppDynamics solution at re:Invent 2015 in Las Vegas.

Scaling our End User Monitoring Cloud

Why End User Monitoring?

In a previous post, my colleague Tom Levey explained the value of Monitoring the Real End User Experience. In this post, we will dive into how we built a service to scale to billions of users.

The “new normal” for enterprise web applications includes multiple application tiers communicating via a service-oriented architecture that interacts with several databases and third-party web services. The modern application has multiples clients from browser-based desktops to native applications on mobile. At AppDynamics, we believe that application performance monitoring should cover all aspects of your application from the client-side to the server-side all the way back to the database. The goal of end user monitoring is to provide insight into client-side performance and capture errors from modern javascript-intensive applications. The challenge of building an end user monitoring service is that every single request needs to be instrumented. This means that for every request your application processes, we will process a beacon. With clients like FamilySearch, Fox News, BackCountry, ManPower, and Wowcher, we have to handle millions of concurrent requests.

1geo

AppDynamics End User Monitoring enables application owners to:

  • Monitor Their Global Audience and track End User Experience across the World to pinpoint which geo-locations may be impacted by poor Application Performance
  • Capture end-to-end performance metrics for all business transactions – including page rendering time in the Browser, Network time, and processing time in the Application Infrastructure
  • Identify bottlenecks anywhere in the end-to-end business transaction flow to help Operations and Development teams triage problems and troubleshoot quickly
  • Compare performance across all browsers types – such as Internet Explorer, FireFox, Google Chrome, Safari, iOS and Android
  • Track javascript errors

“Fox News already depends upon AppDynamics for ease-of-use and rapid troubleshooting capability in our production environment,” said Ryan Jairam, Internet Operations Lead at Fox News. “What we’ve seen with AppDynamics’ End-User Monitoring release is an even greater ability to understand application performance, from what’s happening on the browser level to the network all the way down to the code in the application. Getting this level of insight and visibility for an application as complex and agile as ours has been a tremendous benefit, and we’re extremely happy with this powerful new addition to the AppDynamics Pro solution.”

EUM Cloud Service

The End User Monitoring cloud is our super-scalable platform for data analysis and processing end user requests. In this post we will discuss some of the design challenges of building a cloud service capable of supporting billions of requests and the underlying architecture. Once End User Experience monitoring is enabled in the controller, your application’s requests are automatically instrumented with a very small piece of javascript that allows AppDynamics to capture critical performance metrics.

Screen Shot 2013-07-25 at 9.47.14 AM

The javascript agent leverages Web Episodes javascript timing library and the W3C Navigation Timing Specification to capture the end user experience metrics. Once the metrics are collected, they are pushed to the End User Monitoring cloud via a beacon for processing.

EUM (End User Monitoring) Cloud Service is our on-demand, cloud based, multi-tenant SaaS infrastructure that acts as an aggregator for the entire EUM metrics traffic. All the EUM metrics from the end user browsers from different customers are reported to EUM Cloud service. The raw browser information received from the browser is verified, aggregated, and rolled up at the EUM Cloud Service. All the AppDynamics Controllers (SaaS or on-premise) connect to the EUM Cloud service to download metrics every minute, for each application.

Design Challenges

On-Demand highly available

End users access customer web applications anywhere in the world and any time of the day in different time zones, whenever an AppDynamics instrumented web page is accessed. From the browser, EUM metrics are reported to the EUM Cloud Service. This requires a highly available on-demand system accessed from different geo locations and different time zones.

Extremely Concurrent usage

All end users of all AppDynamics customers using EUM solution continuously report browser information on the same EUM Cloud Service. EUM Cloud Service processes all the reported browser information concurrently and generate metrics and collect snapshot samples continuously.

High Scalability

The usage pattern for different applications throughout the day is different; the number of records to be processed at EUM Cloud vary with different applications at different times. The EUM Cloud Service automatically scale up to handle any surge in the incoming records and accordingly scale down with lower load.

Multi Tenancy support

The EUM Cloud Service process EUM metrics reported from different applications for different customers; the cloud service provides multi-tenancy. The reported browser information is partitioned based on customers and their different applications. EUM Cloud Service provides a mechanism for different customer controllers to download aggregated metrics and snapshots based on customer and application identification.

Cost

The EUM Cloud Service needs to be able to dynamically scale based on demand. The problem with supporting massive scale is that we have to pay for hardware upfront and over provision to handle huge spikes. One of the motivating factors when choosing to use Amazon Web Services is that costs scale linearly with demand.

Architecture

The EUM Cloud Service is hosted on Amazon Web Services infrastructure for horizontal scaling. The service has two functional components – collector and aggregator. Multiple instances of these components work in parallel to collect and aggregate the EUM metrics received from the end user browser/device. The transient metric data be transient is stored in Amazon S3 buckets. All the meta data information related to applications and other configuration is stored in the Amazon DynamoDB tables.

A single page load will send one or more beacon–one per base page and every iframe onload and one per ajax request. Javascript errors occurring post page load are also sent as error beacons.

The functionality of the nodes is to receive the metric data from the browser and process it for the controller:

  • Resolve the GEO information (request coming from the country/region/city) and add it to the metric using a in-process maxmind Geo-resolver.
  • Parse the User-Agent information and add browser information, device information and OS information to the metrics.
  • Validate the incoming browser reported metrics and discard invalid metrics
  • Mark the metrics/snapshots SLOW/VERY SLOW categories based on a dynamic standard deviation algorithm or using static threshold

Load Testing

For maximum scalability, we leverage Amazon Web Services global presence for optimal performance in every region (Virginia, Oregon, Ireland, Tokyo, Singapore, Sao Paulo). In our most recent load test, we tested the system as a collective to about 6.5 B requests per day. The system is designed to easily scale up as needed to support infinite load. We’ve tested the system running at many billions of requests per day without breaking a sweat.

Check out your end user experience data in AppDynamics

4breakdown

Find out more about AppDynamics Pro and get started monitoring your application with a free 15 day trial.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.