AWS re:Invent—What do Black Friday and Cyber Monday Have in Common?

With the genesis of Amazon Web Services, enterprises of all sizes can now take advantage of the public cloud to deliver significantly more agility and control. With AWS, elastic infrastructure is easier to attain, and usage spikes are an afterthought.

Only days apart, Black Friday and Cyber Monday are arguably the two biggest days in retail. They’re what make “web scale” a requirement for leading eCommerce organizations throughout the world. For months in advance, IT operations teams map, plan and prepare for the impending shopping crush. It’s capacity planning at its finest. And beyond eCommerce, a move to the cloud is prudent for many businesses that may encounter major traffic events like a security attack or a runaway product success.

History of the AWS Cloud

From the humble beginnings of Amazon Elastic Compute Cloud (EC2) in 2006, Amazon Web Services (AWS) put public cloud on the map. AWS continues to invest heavily in innovation, helping both the public and private sector harness the power of cloud computing. Ironically, during Cyber Monday, the largest eCommerce day of the year, AWS is kicking off its largest event of the year—AWS re:Invent—which showcases the latest and greatest AWS has to offer, as well as the transformational journeys of its clients.

AWS re:Invent 2018

For the past several years, AWS re:Invent has been the showcase of large web scale in the public cloud, and 2018 will be no different. AppDynamics is excited to return to showcase our strong partnership with AWS and tell the stories of our joint customers. Plenty of very impressive sessions will take place across the Las Vegas Strip from a wide ecosystem of clients and vendors sharing challenges, triumphs, and best practices.

Join AppDynamics at re:Invent

During the show, you’ll have an opportunity to network with 50,000 of your closest friends, as well as attend workshops, sessions, parties and everything in between. It is Vegas, after all.

As part of your AWS re:Invent experience, please stop by and learn from experts and customers at the AppDynamics Theatre, located in booth #810. No matter where you are in your cloud journey, we’re confident you’ll learn something new, be better prepared to migrate to the cloud with confidence, and monitor your workloads once you’ve arrived.

We will be running continuous sessions throughout the show (every half hour) that will cover the shift of cloud workloads to serverless and containers, maturing DevOps capabilities and processes, and the impending shift to AIOps.

AIOps: Fix Before Failure

Here’s one more term for your enterprise software buzzword bingo: AIOps—the adoption of artificial intelligence in IT operations. Continuous improvement of your platforms without administrator intervention is closer than you think. With advancements in visibility, insight and action, the AppDynamics Platform can now take action for our customers—making AIOps a reality, not a buzzword.

The Cloud’s Best Friends

Serverless and container technologies are not new, but thanks to advancements in the AWS stack around Lambda, ECS/EKS and Fargate, the adoption of these underlying concepts is exploding. In Gartner’s bimodal IT model, both mode-one and mode-two organizations can benefit from AWS services. Join our experts to make sure you’re maximizing your investment in the latest and greatest AWS has to offer.

See AppDynamics on Stage

AppDynamics Senior Solutions Architect Subarno Mukherjee is leading an AWS Session, “Five Ways Application Insights Impact Migration Success,” on Tuesday at 10 AM PT, November 27th at The Venetian. Come learn how application insights impact migration success. No matter where you are in your cloud journey, you’ll have a great opportunity to learn from our experts and customers.

Time to Party?

AppDynamics and AWS are throwing a fantastic Happy Hour after Thursday’s sessions close and before the re:Play party. If you’d like to get in on the action, contact your AWS or AppDynamics account manager and we’ll add you to the list. It’s not a party you’ll want to miss!

Looking to re:Invent

We’re really excited to see you at AWS re:Invent. Be sure to sign up for our AWS Session and stop by booth #810. We’ll be active on our social channels (Twitter / Instagram) during the event as well. Hopefully you’ll make a guest appearance. See you there!

How DevOps Teams Prepare for Cyber Monday

As we head into the Thanksgiving weekend with thoughts of relaxing with family and friends, there is a group of folks who will still be working or on call the whole time. The Dev and Operations teams of major online stores will have been preparing for this period for many months. Cyber Monday (Nov. 28) is the largest shopping day of not only the Thanksgiving weekend, but the entire year. What’s more, according to Adobe Digital Insights (ADI), it is anticipated to be the largest online shopping day in history. So, no pressure then.

It’s a time when every minute of downtime costs. According to the Aberdeen Group, large companies lose an average of $686,250 per hour of downtime. While most if not all major online retailers are unlikely to experience widespread outages, the more likely scenario is a slow responding site. Under 100ms is perceived as reacting instantly, while a 100ms to 300ms delay is perceptible. 40% of mobile visitors will abandon a site after a three-second delay, so speed of response or perceived speed are critical in determining whether an online purchase takes place, or is abandoned out of frustration. Perceived speed can be addressed by the web team using techniques such as progress bars or content sliding in and out to distract the visitor for the second or so needed for a site to update.

Actual speed of response is much harder to address. In the period leading up to Cyber Monday, enterprises that have adopted DevOps will ideally have both a mindset and a set of practices that drives their preparation for such an important period in the retail calendar. These are likely to include:

Collaboration already in place: The Dev, Ops and Test teams will already have been working with each other for some time and will be measured on optimizing the customer experience above individual and departmental goals.

A continuous delivery model: A high velocity of small, incremental releases will have been deployed with little if any negative impact, supported by automated configuration, deployment and release management technologies, and processes.

Knowledge captured from the same period in 2015: Estimates suggest that 2016 will see an 11% increase on last year’s trading, but there will undoubtedly be regional, device, and time variations. Metrics captured during the previous year’s Cyber Monday will not be foolproof in indicating likely site demands this year, but they will still provide a good starting point.

End-to-end visibility of business transactions: Teams will have a full understanding of the software functions and components that make up the purchase process from the initial page view all the way through to database calls and shipping confirmation notifications.

Synthetic and real user monitoring: By combining an understanding of actual user engagement with the site and how it will likely behave under heavy loads at different times from different locations, potential vulnerabilities and bottlenecks can be identified and remediated ahead of time.

Understanding of 3rd party dependencies: When online stores have a major external dependency such as a payment platform, fulfillment agent or loyalty card provider, latency that originates from these must also be identified and addressed.

Scaling up beforehand, with failovers available: Performance engineering teams and site reliability engineers will take particular responsibility for ensuring that the site is robust enough to withstand vast traffic volumes from multiple logins. This includes topics such as net new account creation and database access speeds, with peak traffic rather than average traffic as the primary consideration.

Full view of the user experience: Shoppers will be accessing the site from notebooks, tablets, and smartphones from a variety of manufacturers in different locations and using a number of network providers, each with their own bandwidth speeds. DevOps teams will have data on how the site will appear to each of these groups and on any variances that need to be investigated.

Recent technology adoptions are not a black box: It’s been an amazing last 18 months for concepts such as microservices and technologies such as Docker, as they move up the maturity curve and become a staple part of many an enterprise’s stack. However, it’s essential that granular insights into how microservices are performing are available, such as automatic discovery of the entry and exit points of a microservice as service endpoints. Equally, DevOps teams should be able to correlate Docker metrics with the metrics from the applications running in the container.

So once this intensive period from Black Friday starts, what will optimized DevOps teams be doing?

Remedial action: Should any health alert trigger a status switch from green to yellow, there should be a plan of pre-agreed corrective measures to address delays wherever they exist. These delays should not kick-off debate as to whose team is or is not responsible and how to address the pain.

Laser-focus on where an issue resides: Sometimes the cause of response time delays sits in one tiny part of the overall stack. Using the right monitoring solutions, the best DevOps teams will know exactly how to pinpoint the bottleneck and fix it, ahead of the customer sensing any slowdown in site responsiveness. If the full end-to-end business transaction view is obtained, enterprises can identify where online visitors are at any moment in time and if they are at risk of abandoning a site due to poor responsiveness.

Dynamically review performance: While Cyber Monday is most likely to see the peak volume of consumer traffic over the Thanksgiving weekend, Black Friday and the day afterwards will also witness high volumes, giving DevOps teams insights into performance and potential issues ahead of time. Perhaps it’s better to think of it as a particularly heavy traffic volume period with a spike at the end of it than a big bang launch. Rather than setting fixed parameters, dynamic baselining of how servers, networks, databases and so forth are performing provides a more insightful picture of what is working well, and what isn’t.
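To make the idea of dynamic baselining a little more concrete, here is a minimal sketch in Python. It is not how any particular monitoring product works; the class name, the window size, the warm-up count and the "three standard deviations" tolerance are all assumptions chosen for illustration. The point is simply that the threshold moves with recent behavior instead of being a fixed number.

```python
from collections import deque
from statistics import mean, stdev


class DynamicBaseline:
    """Rolling baseline that flags samples far outside recent behavior (illustrative only)."""

    def __init__(self, window=60, tolerance=3.0, warmup=5):
        self.samples = deque(maxlen=window)  # keep only the most recent observations
        self.tolerance = tolerance           # allowed deviation, in standard deviations
        self.warmup = warmup                 # samples needed before judging anything

    def observe(self, value):
        """Return True if value deviates from the current baseline, then record it."""
        anomalous = False
        if len(self.samples) >= self.warmup:
            avg = mean(self.samples)
            spread = stdev(self.samples)
            # max() guards against a zero spread when recent samples are identical.
            anomalous = abs(value - avg) > self.tolerance * max(spread, 1e-9)
        self.samples.append(value)
        return anomalous


baseline = DynamicBaseline(window=30)
normal = [120, 125, 118, 122, 121, 119, 124, 120, 123, 121]  # made-up response times in ms
flags = [baseline.observe(ms) for ms in normal]
spike_flagged = baseline.observe(900)  # a sudden slow response
```

In this toy run none of the ordinary samples are flagged, while the 900 ms response is. A real baseline would usually also account for time of day and day of week, which this sketch ignores.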

Business insights: IT-related metrics are great, but in an ideal world the DevOps team should also be able to share KPIs that reflect questions such as:

  • What is the ratio of sales between existing and new customers?
  • Are existing customer details being populated when they login or is there a database access bottleneck?
  • Are new customers onboarded without delay?
  • Which parts of the site are generating greatest revenue (e.g. electrical vs kitchenware)?
  • Is there a delay in the final stage of the purchase cycle?
  • Where do visitors sit in the purchase cycle at any given time?

These questions tie back to what we at AppDynamics call Mean Time to Business Awareness (MTBA): how quickly can essential business-relevant site performance data reach those who need to know and can make key investment and strategic decisions with this information?

Capture essential metrics in preparation for 2017: When Cyber Monday is over, it can be easy to forget to analyze and store the major performance behaviors that occurred. Investment here will pay off in 12 months’ time, as it will help create a foundation for expected site traffic.

Fail to prepare, prepare to fail

Yes, it’s a well-worn phrase, but it’s especially apt when applied to Cyber Monday. DevOps teams who have done their homework will be attentive during this time, but they will also feel confident that, despite application complexity, they know the online experience inside and out, know where potential risks may occur, and have an agreed response should an issue arise.

If leading retailers get Cyber Monday right, they lay the foundation for an ongoing customer relationship based on the ability to deliver a consistent, quality experience. Failure to prepare could have a highly detrimental impact through customer attrition, lost revenue opportunities, damaged brand reputation and social media naming and shaming.

For more ways to prepare for Cyber Monday, please download our free trial.

Cyber Monday: Past, Present, and Future

Cyber Monday is the name that marketers give to the Monday after Thanksgiving. It is a vital date in the retail calendar, with millions of consumers around the world logging on to the web each year to find great deals on holiday gifts for their friends and family. Traditionally, many retailers view Cyber Monday as the online equivalent of Black Friday, which occurs three days earlier — although today the entire weekend is a hot spot for online sales. Whereas Black Friday can pose logistical challenges for brick-and-mortar store owners, Cyber Monday challenges e-commerce sites to handle much larger amounts of traffic than usual. Before we can learn how to manage its challenges, it is important that we understand what Cyber Monday is and how it originated. In short, the term Cyber Monday originated in 2005. A marketing team at the National Retail Federation came up with the name as part of an effort to create an online equivalent to Black Friday. Since the term was introduced, Cyber Monday has become an increasingly popular day for online shopping, with many consumers using the discounts on offer as opportunities to start their holiday shopping from the comfort of their homes and workplaces.

The Evolution of Cyber Monday

Focusing on work is a struggle for most people returning from a Holiday weekend, particularly on the first day back after Thanksgiving. This underlying trend in human nature, coupled with the fact that most people did not have a high-speed internet connection at home in 2005, probably explains why the Monday after Thanksgiving was the day when so many Americans chose to start their online holiday shopping. Now that so many people can access high-speed Internet through their laptops, smartphones, and tablets, online sales have spread over the entire Thanksgiving weekend and beyond to create a busy period that stretches from Halloween to January.

Cyber Monday is not the only high-traffic day in the retail calendar. Other significant e-commerce events include Singles’ Day, which was pioneered by Alibaba in 2009. Falling on November 11 each year, Singles’ Day is a major gifting occasion in China, a huge retail market. If your business has global reach, awareness of international online shopping days is vital for maximizing your sales and revenue.

The Value of Cyber Monday for Online Retailers

In 2015, Cyber Monday was the biggest online shopping day of the holiday season in the United States and Europe. Online sales on Cyber Monday outpaced those of Black Friday by 25.5 percent, with each customer spending an average of $123.43. Cyber Monday is an incredible opportunity for online retailers to make sales. Some of the largest Cyber Monday sites include Groupon, which offered discounts of up to 69.6 percent in 2015. Amazon and eBay also offered huge discounts on electronics, apparel, and accessories.

Challenges Posed By Cyber Monday

The trend of online shopping spreading over several days is good news for retailers, as it spreads the load on their servers. On the other hand, overall visits to e-commerce sites are growing year on year, so there are still challenges that retailers need to overcome to cope with so much traffic.

Experts predict that Cyber Monday is likely to remain important over the next few years, presenting an opportunity for retailers to make money from online shoppers and surfers – but it is also a big problem. The huge volumes of traffic surging to sites on Cyber Monday can cause them to crash, leading site owners to miss out on thousands or even hundreds of thousands of dollars in sales. In 2011, many of the top 55 retail websites were down for at least part of the day, disappointing customers who were looking for good deals from their favorite retailers.

The Downside of Cyber Monday: Failures and Flops

Even as recently as 2015, some retailers struggled to cope with the challenges posed by Cyber Monday. Although the major retailer Target managed to prevent a complete crash on the big day, many customers experienced the frustration of being placed in the online equivalent of a long checkout line when the message “Please hold tight” appeared on the site. Target claims that forcing some customers to wait to access the site helped to manage the demands on the server, preventing a complete crash. This unique solution allowed shoppers who managed to access the site to enjoy it functioning at an acceptable speed, rather than giving everyone a miserably slow shopping experience.

Failing to meet the technical challenges posed by Cyber Monday can cause companies much embarrassment. Consumers feel disappointed, frustrated and sometimes even angry when they hear about a great deal but aren’t able to access the site that is offering it. The moral of the story? If you promise your customers time-limited discounts and other special offers, make sure your site is equipped to deliver on those promises.

How to Handle the Technical Challenges of Cyber Monday

If you want to make the most of Cyber Monday, it is vital to ensure that the technology backing up your e-commerce site is robust enough to cope with sudden surges in traffic. Many businesses spend time planning and advertising attractive promotions for Cyber Monday, but if you fail to equip your server with the resources it needs to handle the customers that come pouring in, you will likely fail to capitalize on your investment. Use these tips to help you handle the technical challenges that your business may face next Cyber Monday:

1. Check How Much Bandwidth You Can Support

As Cyber Monday approaches, check how much bandwidth your hosting provider allows you to use. This is particularly important if you have a cloud hosting plan, where you share a server with dozens or even hundreds of other sites. Some of these plans have a burstable connectivity limit, which means that more server resources can be allocated to your site in the event of a traffic surge, which is exactly what retailers hope for on Cyber Monday.

2. Load Test Your Site

Before Cyber Monday arrives, it is a good idea to load test your site with extremely high amounts of traffic in a pre-production environment and record everything that triggers a failure. This will let you see where your site is failing, so you can address it before the big day.
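As a rough illustration of what a load test measures, the Python sketch below fires concurrent calls at a function and reports the failure count and 95th-percentile latency. Everything here is an assumption made for the example: `run_load_test` and `fake_checkout` are hypothetical names, and `fake_checkout` merely simulates a request rather than calling a real pre-production endpoint. Dedicated load testing tools do far more than this.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(request_fn, total_requests=200, concurrency=20):
    """Call request_fn concurrently and summarize failures and p95 latency."""

    def timed_call(i):
        start = time.perf_counter()
        try:
            request_fn(i)
            ok = True
        except Exception:
            ok = False  # count any exception as a failed request
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for elapsed, _ in results)
    failures = sum(1 for _, ok in results if not ok)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {"requests": total_requests, "failures": failures, "p95_seconds": p95}


def fake_checkout(i):
    """Hypothetical stand-in for an HTTP call to a pre-production checkout page."""
    time.sleep(0.001)
    if i % 50 == 0:
        raise RuntimeError("simulated failure")


report = run_load_test(fake_checkout, total_requests=100, concurrency=10)
```

Swapping `fake_checkout` for a function that issues a real request against a test environment turns this into a very basic load generator; the failure count and tail latency are the numbers to watch as concurrency rises.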

3. Spread the Load

Distribute the load on your server through various data centers. Using a single data center can result in a bottleneck, which can lead to site failure. Geographically distributing your servers allows you to serve traffic to local regions with little latency.

4. Remember Mobile Users

If you have a separate mobile site for your smartphone and tablet users, do not forget to ensure that it is just as well-supported as your main site when Cyber Monday rolls around. The percentage of holiday sales completed on mobile devices grows every year. Mobile users often have slower connections than desktop users, so it is even more important to minimize lag for these users and keep page loading times as short as possible.

5. Use an APM Solution

An APM (Application Performance Management) solution, such as AppDynamics, can intelligently monitor your environment and reveal errors and slowdowns before they affect your customers. Having your production environment monitored 24/7 can help you avoid costly technical mishaps that can result in loss of both revenue and brand equity. Not only will your company lose sales opportunities but the technical liabilities may damage your reputation among consumers.  

The Future of Cyber Monday

Retail experts predict that Cyber Week will become even more relevant to consumers over the next five to 10 years. They predict that retailers will get even better at using customer data collected over many years to create targeted offers for their buyers. However, these enhanced marketing efforts can only result in more sales and more profit if companies invest in the server resources to handle big spikes in traffic.

All the data shows that Cyber Monday sales are growing year on year, and this trend is expected to continue. Businesses must respond to the incredible demands that this special shopping day places on their servers by investing in better infrastructure to help keep their sites online.

How Black Friday and Cyber Monday Were Saved with Performance and Monitoring Tools

Ah, the holidays. A great occasion to spend quality time with your family, eat a little too much, and indulge in the fantastic shopping deals. Online shopping during Black Friday and Cyber Monday accounts for a critical chunk of a retailer’s yearly revenue. Any downtime or abnormally slow performance will directly impact the bottom line of the business. So how should these e-commerce companies protect this increased income by preparing their site and applications properly to handle the drastic increase in load?

We partnered with to conduct a survey of senior retail executives to see how e-commerce sites prepared for the holiday season, and how these preparations were crucial in surpassing their revenue goals. We went into the survey looking for specific answers, such as: If these sites invested in performance monitoring tools, did they fare better through the holidays with less downtime? Were sites with a performance troubleshooting process in place more likely to exceed their goals? How has mobile e-commerce traffic grown year over year?

Download the full report here

The results were quite surprising. We knew performance mattered, as these e-commerce sites lose money for every minute of downtime. However, we didn’t realize exactly how crucial having monitoring tools and a troubleshooting process was.

Some interesting facts:

  • 92% of retailers that had a site performance troubleshooting process were more likely to have met or exceeded their revenue numbers
  • 70% of retailers who exceeded their revenue goals added new site performance and monitoring tools prior to the holidays
  • 24% reported their holiday mobile traffic increased by over 41% from the year before

The mobile piece is quite interesting. Though we know the mobile experience is a crucial stage in the buying cycle — notably for discovering and browsing — it still isn’t typically the final device used for the purchase stage. However, as the mobile traffic increases, the user experience still remains an absolutely vital cog in the purchase decision. In a previous study we conducted with the University of London, we found 86% of customers who have a bad first experience on a mobile app will never return to the app for a second chance.

The report doesn’t only highlight interesting facts derived from the survey, but also insight from some of the top e-commerce companies in the world. Charles Hunsinger, CIO at Harry and David, details their troubleshooting process: “We have several different tools in place, and they give us a lot of information to understand, if there is an issue, where it is coming from.”

On the mobile angle, Heather Dettmann, Digital Business Manager at Finish Line, states, “We know that customers will continue to explore and purchase products through a variety of touch-points, and we’ll continue to be everywhere that they are …. and deliver the best possible customer experience.”

Interested in checking out the full report? Click here for your FREE copy!

Top 10 Reasons Why eCommerce Apps Will Fail This Black Friday

My wife is a shopaholic and serial checkout killer. Every week she spends several hours browsing and carefully adding items to shopping carts. Every now and then she’ll interrupt my sports program with an important announcement: “My Checkout just failed.” Take this example Mrs Appman sent me during the month of September:

Checkout Fail

The fact I work in the APM industry means it is my responsibility to fix these problems immediately (for my wife). As Black Friday is coming up, I thought I’d share with you what typically goes wrong under the covers when our customers’ e-commerce applications go bang during a critical time.

It is worth mentioning that nearly all our customers perform some form of performance or load testing prior to the Black Friday period. Many actually simulate the load from the previous year on test environments designed to reproduce Black Friday activity. While 75% of bottlenecks can be avoided in testing, unfortunately a few surface in production because applications are too big and complex to test under real-world scenarios. For example, most e-commerce applications these days span more than 500 servers distributed across many networks and service providers.

Here are the top 10 reasons why eCommerce applications will fail this Black Friday:

1. Database Connection Pool

Nearly every checkout transaction will interact with one or more databases. Connections to this resource are therefore sacred and can often be deadly when transaction concurrency is high. Most application servers come with default connection pool configurations of between 10 and 20 connections. When you consider that transaction throughput for e-commerce applications can easily exceed 100,000 trx/min, you soon realize that default pool configurations aren’t going to cut it. When a database connection pool becomes exhausted, incoming checkout requests simply wait until a connection becomes available, or time out. Take this screenshot for example:

Connection Pool Issue
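A back-of-the-envelope way to see why a default pool falls short is Little’s law: the number of connections in use is roughly the arrival rate multiplied by how long each transaction holds a connection. The sketch below applies that to the 100,000 trx/min figure above. The 20 ms and 200 ms hold times and the 25% headroom are assumed values for illustration, not measurements, and the result is a total across however many servers share the load.

```python
import math


def required_connections(trx_per_minute, hold_seconds, headroom=1.25):
    """Estimate connections needed: in-flight = arrival rate x time held (Little's law)."""
    arrivals_per_second = trx_per_minute / 60.0
    in_flight = arrivals_per_second * hold_seconds
    return math.ceil(in_flight * headroom)  # round up and leave some spare capacity


# 100,000 trx/min comes from the text; the hold times are assumed.
fast_queries = required_connections(100_000, hold_seconds=0.020)  # 20 ms per transaction
slow_queries = required_connections(100_000, hold_seconds=0.200)  # 200 ms per transaction
```

With these assumptions the estimate is 42 connections when queries are fast and 417 when they are ten times slower, which is why slow SQL and exhausted pools so often show up together.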

2. Missing Database Indexes

This root cause is somewhat related to the exhausted connection pools. Simply put, slow-running SQL statements hold onto a database connection for longer, so connections are returned to the pool less often than they should be. The number 1 root cause of slow SQL statements is missing indexes on database tables, which is often caused by miscommunication between the developers who write SQL and the DBAs who configure and maintain the database schemas that hold the data. The result is the classic “full table scan” query execution, where a transaction and its database operation must scan through all the data in a table before a result is returned. Here is an example of what this looks like in AppDynamics:

Missing Index
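The same effect can be reproduced on a small scale with Python’s built-in sqlite3 module. The sketch below is only an illustration: the `orders` table, its columns and the index name are invented for the example, and production databases have their own tools for viewing query plans. It asks SQLite how it intends to run a lookup before and after an index is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(10_000)],
)

query = "SELECT * FROM orders WHERE customer_id = ?"


def query_plan(sql, params):
    """Return SQLite's description of how it would execute the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan text


plan_before = query_plan(query, (42,))  # no index yet, so SQLite scans the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(query, (42,))   # the planner can now use the new index
```

Printing `plan_before` and `plan_after` shows the plan changing from a scan of `orders` to a search that uses `idx_orders_customer`; the exact wording varies between SQLite versions.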

3. Code Deadlock

High transaction concurrency often means application server threads have to contend more for application resources and objects. Most e-commerce applications have some form of atomicity built in to their transactions, so that order and stock volumes are kept in check as thousands of users fight over special offers and low prices. If access to application resources is not properly managed, some threads can end up in deadlock, which can often cause an application server and all its user transactions to hang and time out. One example of this was last year, when an e-commerce customer was using a non-thread-safe cache. Three threads tried to perform a get, set and remove on the same cache at the same time, causing code deadlock to occur and impacting over 2,500 checkout transactions, as the below screenshot shows.
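One common way to avoid this class of problem is to route every cache operation through a single lock, so that get, set and remove can never interleave halfway through. The Python sketch below illustrates that idea only; it is not the customer’s code, and `LockedCache` and the stock-counter scenario are hypothetical. A single lock is simple and cannot deadlock with itself, at the cost of some contention under heavy load.

```python
import threading


class LockedCache:
    """Tiny in-memory cache whose operations all share one lock (illustrative only)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def remove(self, key):
        with self._lock:
            self._data.pop(key, None)

    def decrement(self, key, amount=1):
        """Atomically reduce a counter such as remaining stock; False if not enough left."""
        with self._lock:
            current = self._data.get(key, 0)
            if current < amount:
                return False
            self._data[key] = current - amount
            return True


cache = LockedCache()
cache.set("stock:tv", 100)
successes = []


def shopper():
    # Each simulated shopper tries to buy 50 units, one at a time.
    for _ in range(50):
        if cache.decrement("stock:tv"):
            successes.append(1)


threads = [threading.Thread(target=shopper) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight shoppers make 400 purchase attempts against 100 units of stock; because the check and the update happen under the same lock, exactly 100 succeed and the stock never goes negative.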


4. Socket Timeout Exceptions

Server connectivity is an obvious root cause. If you check your server logs using a tool like Sumologic or Splunk, you’ll probably see hundreds of these events. They represent network problems or routing failures where a checkout transaction is attempting to contact one or more servers in the application infrastructure. Most of the time the services you are connecting to aren’t your own, for example a shipping provider, credit card processor, or fraud detector. On high traffic days like Black Friday it isn’t just your site experiencing a surge in traffic – oftentimes entire networks are saturated due to intense demand. After a period of time (often 30-45 secs) the checkout transaction will just give up, time out and return an error to the user. No Connectivity = No Revenue. Here is an example of what it looks like:

socket timeout exception

5. Garbage Collection

Caches are an easy way to speed up applications. The closer data is to application logic (in memory) the faster it executes. It is therefore no surprise that as memory has gotten bigger and cheaper most companies have adopted some form of in-memory caching to eliminate database access for frequently used results. The days of 64GB and 128GB heaps are now upon us, which means the impact of things like Garbage Collection is more deadly to end users. Maintaining cache data and efficiently creating/persisting user objects in memory becomes paramount for eliminating frequent garbage collection cycles. Just because you have GBs of memory to play with doesn’t mean you can be lazy in how you create, maintain and destroy objects. Here are a few screenshots that show how garbage collection can kill your e-commerce application:

Garbage Collection


6. Transactions with High CPU Burn

It’s no secret that inefficient application logic will require more CPU cycles than efficient logic. Unfortunately the number 1 solution to slow performance in the past was for eCommerce vendors to buy more servers. More servers = More Capacity = More Transaction Throughput. While this calculation sounds good, the reality is that not all e-commerce transactions are CPU bound. Adding more capacity just masks inefficient code in the short term, and can waste you significant amounts of money in the long term. If you have specific transactions in your eCommerce application that hog or burn CPU then you might want to consider tuning those before you whip out your check book with Oracle or Dell. For example:

High CPU Burn

7. 3rd Party Web Services

If your e-commerce application is built around a distributed service-oriented architecture (SOA), then you’ll have multiple points of failure, especially if several of those services are provided by a 3rd party where you have no visibility. For example, most payment and credit card authorization services are provided by 3rd party vendors like PayPal, Stripe, or Braintree. If these services slow down or fail, then it’s impossible for checkout transactions to complete. You therefore need to monitor these services religiously so when problems occur you can rapidly identify whether it is your code or connectivity or someone else’s outage. Here is an example of how AppDynamics can help you monitor your 3rd party web services:

Transaction Flow


8. Crap Recursive Code

This is similar to #6 but burns time instead of resources. For example, many e-commerce transactions will request data from multiple sources (caches, databases, web services) at the same time. Every one of these round trips could be expensive and may involve network time along the way. I’ve seen a single eCommerce search transaction call the same database multiple times instead of performing a single operation using a stored procedure on the database. Recursive remote calls may only take 10-50 milliseconds each, but if they are invoked multiple times per transaction they can add seconds to your end user experience. For example, here is that search transaction that took x seconds and made 13,000 database calls.
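The difference between a chatty transaction and a batched one is easy to sketch with Python’s built-in sqlite3 module. The example below is an illustration under assumptions: the `products` table and both function names are made up, and it uses a single `IN (...)` query rather than a stored procedure, since SQLite has none. The first function makes one database call per product; the second fetches the same rows in one call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (id, name, price) VALUES (?, ?, ?)",
    [(i, f"item-{i}", i * 1.0) for i in range(1, 501)],
)
wanted_ids = list(range(1, 201))  # ids a search might have matched


def fetch_one_by_one(ids):
    """Chatty pattern: one database round trip per product."""
    rows, calls = [], 0
    for product_id in ids:
        cursor = conn.execute(
            "SELECT id, name, price FROM products WHERE id = ?", (product_id,)
        )
        rows.append(cursor.fetchone())
        calls += 1
    return rows, calls


def fetch_in_batch(ids):
    """Batched pattern: one round trip for the whole result set."""
    placeholders = ",".join("?" for _ in ids)
    cursor = conn.execute(
        f"SELECT id, name, price FROM products WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return cursor.fetchall(), 1


slow_rows, slow_calls = fetch_one_by_one(wanted_ids)
fast_rows, fast_calls = fetch_in_batch(wanted_ids)
```

Both functions return identical rows, but the first issues 200 calls and the second issues one. Against a remote database at 10 ms per round trip, 200 sequential calls would spend about 2 seconds on network time alone.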


9. Configuration Change

As much as we’d like to think that production environments are “locked down” with change control process, they are not. Accidents happen, humans make mistakes and hotfixes occasionally get applied in a hurry at 2am 😉 Application server configuration can be sensitive just like networks, or any other pieces of the infrastructure. Being able to audit, report and compare configuration change across your application gives you instant intelligence that a change may have caused your eCommerce application to break. For example, AppDynamics can record any application server change and show you the time and values that were updated to help you correlate change with slowdowns and outages, see below screenshot.
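The underlying idea, comparing a configuration snapshot taken before an incident with one taken after, can be sketched in a few lines of Python. This is not how AppDynamics implements change tracking; `diff_config` and the setting names below are hypothetical and exist only to show the comparison.

```python
def diff_config(before, after):
    """Return {setting: (old, new)} for every setting that was added, removed or changed."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)  # None means the setting is absent
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical application server settings captured at two points in time.
snapshot_monday = {"maxThreads": 200, "dbPoolSize": 50, "gcPolicy": "G1"}
snapshot_friday = {"maxThreads": 200, "dbPoolSize": 10, "cacheTtl": 300}

changes = diff_config(snapshot_monday, snapshot_friday)
```

Here the diff surfaces a pool size that dropped from 50 to 10, a removed `gcPolicy` and a new `cacheTtl`, which is the kind of evidence that helps tie a slowdown to a specific change.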

Screen Shot 2013-10-14 at 3.01.02 PM
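The underlying idea of comparing configuration over time can be illustrated with a small Python sketch that diffs two snapshots. It is only an illustration of the technique, not AppDynamics’ implementation, and the setting names and values (`maxThreads`, `jdbc.pool.size`, and so on) are made up.

```python
ABSENT = "<absent>"  # marker for a key missing from one snapshot


def diff_config(before: dict, after: dict) -> dict:
    """Return {key: (old_value, new_value)} for every setting that changed."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old = before.get(key, ABSENT)
        new = after.get(key, ABSENT)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical application server settings captured at two points in time.
snapshot_monday = {"maxThreads": 200, "jdbc.pool.size": 50, "session.timeout": 30}
snapshot_friday = {"maxThreads": 50, "jdbc.pool.size": 50, "gc.logging": True}

changes = diff_config(snapshot_monday, snapshot_friday)

if __name__ == "__main__":
    for key, (old, new) in changes.items():
        print(f"{key}: {old} -> {new}")
```

If you store a timestamp with each snapshot, a diff like this lets you line up a change (here, a thread pool cut from 200 to 50) against the moment response times started to degrade.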

10. Out of Stock Exception

“I’m sorry, the product you requested is no longer in stock”. This basically means you were too slow and you’ll need to wait until 2014 for the same offer. Remember to set an alarm next year for Black Friday 😉


AppDynamics can also monitor the revenue and performance of your checkout transactions over time, which helps Dev and Ops teams monitor and alert on the health of the business:

 Correlating revenue and performance

The good news is that AppDynamics Pro can identify all of the above defects in minutes. You can take a free trial here and be deployed in production in under 30 minutes! If you send us a few screenshots of your findings in production like the ones above, we’ll send you a $250 Amazon gift certificate for your hard work!


Why did your Checkout Fail? AppDynamics knows

The reason for this blog is purely down to a real-life incident which one of our e-commerce customers shared with us this week. It’s based around a use case that pretty much anyone can relate to – the moment your checkout transaction spectacularly fails. You sit there, looking at a big fat error message and think “WTF – did my transaction complete or did the company steal my money?” A minute later you’re walking a support team through exactly what happened: “I just clicked Checkout and got an error…honestly…I waited and never got a response.”

What’s different in this story is that the support team had access to AppDynamics as they were talking to a customer on the phone…and the customer got to find out the real reason their checkout failed. How often does that happen? Never, until now. Here is the story as documented by the customer.

Black Friday and Cyber Monday thru the eyes of an APM solution

A week has passed since Black Friday, so I thought it would be a good idea to summarise what we saw at AppDynamics from monitoring one of several e-commerce applications in production.

Firstly, things went pretty well for our customers, who experienced between a 300 and 500% increase in transaction volume over the holiday period on their applications. That’s a pretty big spike in traffic for any application, so it’s always good to look at those spikes and see what impact they had on application performance.

Here’s a screenshot which shows the load (top) and response time (bottom) of a major e-commerce production application during the Thanksgiving period. The dotted line in both charts represents the dynamic baseline of normal activity. You can see on Black Friday (23rd) and Cyber Monday (26th) that transaction throughput was peaking between 24,000 and 31,000 tpm on the application, spiking between 150 and 200% over the normal load the application experiences throughout the rest of the year.

Application response time during the period had one blip during the first minutes of Black Friday (9pm PST/Midnight EST), with no major performance issues following through into Cyber Monday. The blip was related to the web container thread pool becoming exhausted during peak load when the Black Friday promotions went live. Below you can see throughput was hitting 23,000 tpm.

Two business transactions, “Product Display” and “Checkout”, were breaching their performance baselines during that period. Looking at the average response times of 516ms and 733ms tells one story; looking at the maximum response time and the number of slow/very slow transactions (calculated using standard deviation, SD) tells a completely different story.
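To see why averages can mislead, here is a minimal Python sketch that classifies response times against a baseline using standard deviations. The thresholds of 3 SD for “slow” and 4 SD for “very slow” are assumptions chosen for illustration, and the baseline and observed samples are invented rather than taken from the customer’s data.

```python
import statistics
from collections import Counter


def make_classifier(baseline_ms, slow_sd=3.0, very_slow_sd=4.0):
    """Build a classifier from response times seen during normal operation."""
    mean = statistics.mean(baseline_ms)
    sd = statistics.pstdev(baseline_ms)

    def classify(response_ms):
        if response_ms > mean + very_slow_sd * sd:
            return "very slow"
        if response_ms > mean + slow_sd * sd:
            return "slow"
        return "normal"

    return classify


# Invented baseline: response times (ms) from a normal period.
baseline = [480, 500, 520, 510, 490, 530, 470, 500]
classify = make_classifier(baseline)

# Invented peak-period sample: mostly healthy, plus two outliers.
observed = [500] * 200 + [560, 66000]

counts = Counter(classify(ms) for ms in observed)
average_ms = statistics.mean(observed)
worst_ms = max(observed)

if __name__ == "__main__":
    print(f"average: {average_ms:.0f} ms, max: {worst_ms} ms")
    print(dict(counts))
```

In this made-up sample the average stays under a second even though one request took 66 seconds, which is exactly the kind of transaction that only shows up when you look at the maximum and the slow/very slow counts.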

Let’s take a look at the execution of one individual “Product Display” business transaction that was classified as very slow with a 66 second response time.

When we drill into the code execution and SQL activity, we can see that a simple SELECT SQL query had a response time of 588ms. The problem in this transaction was that this query was invoked 102 times, resulting in a whopping 59.9 seconds of latency. It’s therefore no surprise that thread concurrency inside the JVM was high, with threads waiting for transactions like these to complete. If these queries are simply pulling back product data, then there is no reason why a distributed cache can’t be used to store the data instead of making expensive calls to a remote database like DB2.

Let’s look at the other “Checkout” transaction which was breaching during the performance spike. Here is a checkout which took 9.1 seconds and deviated significantly from its performance baseline. You can see from the screenshot below the latency or bottleneck is again coming from the DB2 database:

Hardly surprising given most application scalability issues these days still relate to data persistence between the JVM and database. So let’s drill down into the JVM for this transaction and understand what exactly is being invoked in the DB2 database:

Above is the code execution of that transaction and you immediately see 8.5 seconds of latency is spent in an EJB call which is performing an update. Let’s take a look at the invoked queries as part of that update:

Nice: a simple update query was taking 8.4 seconds. Notice all the other SQL queries associated with a single execution of the “Checkout” transaction. The application during this performance spike was clearly database bound, and as a result a few code changes were made overnight that reduced the number of database calls the application was making. We had one retail e-commerce customer last year who found a similar bottleneck; a fix was applied that reduced the number of database calls per minute from 500,000 to a little under 150,000. While the problem may at first appear to be a database issue (for the DBA), the cause was actually application logic, and it was the developers who were responsible for resolving it.

You can see in the first screenshot that application response time was stable throughout the rest of the Thanksgiving period. No spikes or outages occurred for this customer and all was well. While every customer will do their best to catch performance defects in pre-production and test, sometimes it’s not possible to reproduce or simulate real application usage or patterns, especially in large scale, high throughput production environments. This is where Application Performance Management (APM) solutions like AppDynamics can help, by monitoring your application in production so you can see what’s happening. Get started today with a free 30-day trial.