Improving Client Service Through Cloud Application Monitoring

This article originally appeared on Kurtosys’ blog.

In the last two years, there has been a big change in the way Kurtosys delivers applications to clients, most of which can be attributed to the use of cloud-based infrastructure services. To keep pace with this shift, we have had to make significant adjustments to how we monitor and track our applications.

Many of our clients operate and consume our services on a continuous 24/7 basis, and the scale of our operations continues to grow. Both factors have led to a different approach to support, and both introduce new requirements for our monitoring strategy.

Make changes to improve processes

At Kurtosys, we aim to provide a proactive strategy that detects and alerts us to problems before they become visible to end users. This is not an easy goal to achieve, but it is one we must continue to pursue in order to improve our client service. It doesn’t happen overnight; it evolves progressively with the introduction of new tools and changes in the culture and behaviour of the team.

In making these changes we have considered many factors that should be included in our monitoring capabilities. This starts with implementing traditional measures such as performance, capacity, uptime and throughput. However, we are now concerned with wider issues such as security, end user experience, service level analysis, data validation and other performance indicators and usage statistics so we can track the development of a client engagement.

Cloud application monitoring with AppDynamics

In the past, we were somewhat limited by the tools available. These tools did provide us with good environmental indicators for our infrastructure — by using these metrics we were able to determine the likelihood of events and problems. However, they were not very user-friendly to anyone who was unfamiliar with our infrastructure and hosting model. Now, by using a new generation of tools like AppDynamics, we are able to monitor transactions within the applications as well as those at the infrastructure level. This enables us to look in much more detail at different layers within our platform and provide meaningful information to a wider range of users including clients.

In addition to the greater transparency that our monitoring promises, we are also able to drill into issues and diagnose and rectify problems early and quickly. We no longer need to spend as much time inferring causes from symptoms, a process that so often leads to the wrong conclusions. Now we can focus on remediation and prevention, which is a good thing for all concerned.

The path to better client servicing

Of course, this is just the first step and our goals expand as we get better at what we do. Some ideas on our monitoring roadmap include optimisation of our cloud environments and tracking business KPIs. With each new step we see more potential and opportunity from our monitoring systems. First, we establish the robustness of our infrastructure, then we move on to address our client service and support obligations, and finally we drive out better business decisions through a more informed process.

Cloud Auto Scaling using AppDynamics

Are your applications moving to an elastic cloud infrastructure? The question is no longer if, but when – whether that is a public cloud, a private cloud, or a hybrid cloud.

Classic computing capacity models clearly indicate that over-provisioning is essential to keep up with peak loads of traffic while the over-provisioned capacity is largely left under-utilized during non-peak periods. Such over-provisioning and under-utilization can be avoided by moving to an elastic cloud-computing capacity model where just-in-time provisioning and deprovisioning can be achieved by automatically scaling up and down on-demand.

(Source: http://blog.maartenballiauw.be)
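As a rough back-of-the-envelope illustration of that trade-off (all prices, capacities and traffic figures below are hypothetical), compare paying for peak capacity around the clock with paying only for what each hour actually needs:

```python
import math

RATE_PER_INSTANCE_HOUR = 0.10    # hypothetical on-demand price
REQS_PER_INSTANCE_HOUR = 10_000  # hypothetical capacity of one instance

# Hypothetical hourly traffic profile: quiet overnight, one daily peak.
hourly_load = [2_000] * 8 + [30_000] * 4 + [90_000] * 2 + [30_000] * 6 + [2_000] * 4

def instances_needed(reqs):
    return max(1, math.ceil(reqs / REQS_PER_INSTANCE_HOUR))

# Classic model: provision for the peak hour, 24/7.
peak_cost = instances_needed(max(hourly_load)) * len(hourly_load) * RATE_PER_INSTANCE_HOUR

# Elastic model: provision each hour for that hour's load.
elastic_cost = sum(instances_needed(r) for r in hourly_load) * RATE_PER_INSTANCE_HOUR

print(f"peak-provisioned: ${peak_cost:.2f}/day, elastic: ${elastic_cost:.2f}/day")
```

With this made-up profile the elastic model costs a fraction of peak provisioning, because most hours need only one or three instances rather than the nine the peak demands.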

Cloud auto-scaling decisions are often made based on infrastructure metrics such as CPU Utilization. However, in a cloud or virtualized environment, infrastructure metrics may not be reliable enough for making auto-scaling decisions. Auto-scaling decisions based on application metrics, such as request-queue depth or requests per minute, are much more useful since the application is intimately familiar with conditions such as:

  • When the existing number of compute instances cannot handle the incoming arrival rate of traffic and must elastically scale up additional instances based on a high-watermark threshold on a given application metric

  • When it’s time to scale back down based on a low-watermark threshold on the same application metric.

Every application service can be expressed as a statistical model of traffic, queues and resources as shown in the diagram below.

  • For a given arrival rate λ, we need to maximize the service rate μ with an optimum value of n resources. Monitoring either the arrival rate λ itself for synchronous requests, or the queue depth for asynchronous requests, helps us tune the system and decide whether additional compute instances are needed to meet the current arrival rate.

  • Having visibility into this data allows us not only to find bottlenecks in the code but also possibly flaws in design and architecture. AppDynamics provides visibility into these application metrics.
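The high/low-watermark idea above can be sketched in a few lines of Python. This is an illustrative model only, with made-up thresholds and parameter names, not the AppDynamics implementation:

```python
import math

def target_instances(arrival_rate, service_rate_per_instance, headroom=0.8):
    """For arrival rate λ and per-instance service rate μ, return the number
    of instances n needed so each instance runs at no more than `headroom`
    of its capacity (i.e. λ <= n * μ * headroom)."""
    return max(1, math.ceil(arrival_rate / (service_rate_per_instance * headroom)))

def scaling_decision(arrival_rate, current_n, service_rate_per_instance,
                     high_watermark=0.9, low_watermark=0.5):
    """Compare λ against high/low watermarks on total capacity n * μ and
    return 'scale-up', 'scale-down', or 'hold'."""
    capacity = current_n * service_rate_per_instance
    if arrival_rate > high_watermark * capacity:
        return "scale-up"
    if current_n > 1 and arrival_rate < low_watermark * capacity:
        return "scale-down"
    return "hold"

print(target_instances(3500, 1000))    # instances needed for λ=3500, μ=1000
print(scaling_decision(3500, 3, 1000)) # 3 instances are past the high watermark
```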

The basic flow for auto-scaling using AppDynamics is shown in the diagram below:

Let’s take an example to illustrate how this actually works in AppDynamics. ACME Corporation has a multi-tier distributed online bookstore application running on AWS EC2:

The front-end E-Commerce tier is experiencing a very heavy volume of requests resulting in the tier going into a Warning (Yellow) state.

Now we will walk through the six simple steps that ACME Corporation uses to exploit the Cloud Auto Scaling features of AppDynamics.

 

Step 1: Enable display of Cloud Auto Scaling features

To do this, they first select “Setup -> My Preferences” and check the box “Show Cloud Auto Scaling features” under “Advanced Features”:

Step 2: Define a Compute Cloud and an Image

Then they click on the Cloud Auto Scaling option at the bottom left of the screen:

 Next, they click on Compute Clouds and register a new Compute Cloud:

and fill in their AWS EC2 account info and credentials:

Next, they register a new image from which new instances of the E-Commerce tier nodes can be spawned:

 

and provide the details of that machine image:

Using the Launch Instance button, they can manually test that an instance launches successfully from this image.

Step 3: Define a scale-up and a scale-down workflow

 Then, they define a scale-up workflow for the E-Commerce tier with a step to create a new compute instance from the AMI defined earlier:

Next, they define a scale-down workflow for the E-Commerce tier with a step to terminate a running compute instance from the same AMI:

Now, you may be wondering why these workflows are so simplistic and why there are no additional steps to rebalance the load-balancer after every new compute instance gets added or terminated. Well, the magic for that lies in the Ubuntu AMI that bootstraps the Tomcat JVM for the E-Commerce tier. It has the startup logic to automatically join the cluster and also has a shutdown-hook to automatically leave the cluster, by communicating directly with Apache load-balancer mod_proxy.

Step 4: Define an auto-scaling health rule

Now, they define an auto-scaling health rule for the E-Commerce tier and select the E-Commerce Server tier as the scope for the health rule:

 

and specify a Critical Condition of “Calls per Minute > 3500”, which in this case represents the arrival rate λ:

and a Warning Condition of “Calls per Minute > 3000”:

Note: Choose the Calls per Minute thresholds for the Critical and Warning conditions carefully; thresholds that are too close together can result in scaling thrash.
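A common way to avoid the thrash this note warns about is to keep the two thresholds well apart and to enforce a cooldown period after every scaling action. A minimal sketch (the thresholds and timings are hypothetical, not AppDynamics defaults):

```python
class HysteresisScaler:
    """Scale up above `high`, down below `low`, and ignore further signals
    for `cooldown` ticks after any action. Keeping high well above low,
    plus the cooldown, prevents rapid up/down oscillation ("thrash")."""

    def __init__(self, high=3500, low=1500, cooldown=5):
        assert high > low, "watermarks too close together invite thrash"
        self.high, self.low, self.cooldown = high, low, cooldown
        self.wait = 0  # ticks remaining before the next action is allowed

    def observe(self, calls_per_minute):
        if self.wait > 0:
            self.wait -= 1
            return "hold"
        if calls_per_minute > self.high:
            self.wait = self.cooldown
            return "scale-up"
        if calls_per_minute < self.low:
            self.wait = self.cooldown
            return "scale-down"
        return "hold"

scaler = HysteresisScaler()
print([scaler.observe(cpm) for cpm in (4000, 3600, 1200, 1200, 1200, 1200, 1000)])
```

Note how the brief dip to 1,200 calls per minute right after the scale-up does not trigger an immediate scale-down; the cooldown absorbs it.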

Step 5: Define a scale-up policy

Now, they define a Scale Up Policy, which binds their newly defined health rule to a Cloud Auto-Scaling action:



Step 6: Define a scale-down policy

Finally, they define another policy that will invoke the Scale-down workflow when the Health rule violation is resolved.

And they’re done!

After a period of time in which Calls per Minute exceeds the configured threshold, they see that the auto-scaling health rule has been violated, as it shows up under the Events list:

 

When they drill down into the event, they can see the details of the Health Rule violation:

 

And when they click on the Actions Executed for the Cloud Auto-Scaling Workflows, they see:

 

Also, under Workflow executions, they see:

and when they drill down into it, they see:

 

Finally, under the Machines item under Cloud Auto Scaling, they can see the actual compute instance that was started as a result of Auto Scaling:

Thus, without any manual intervention, whenever the E-Commerce tier needs additional capacity, as indicated by the Calls Per Minute threshold in the auto-scaling health rule, that capacity is provisioned automatically. These additional instances are then released automatically when Calls Per Minute falls back below that threshold.

 

AppDynamics has cloud connectors for all the major cloud providers:

 

Contribute

If you have your own cloud platform, you can always develop your own Cloud Connector using the AppDynamics Cloud Connector API and SDKs that are available via the AppDynamics Community. Find out more in the AppDynamics Connector Development Guide. Our cloud connector code is all open-source and can be found on GitHub.

Take five minutes to get complete visibility into the performance of your production applications with AppDynamics Pro today.

Cloud Migration Tips Part 4: Failure Breeds Success

Welcome back to my series on migration to the cloud. In my last post we discussed all of the effort you need to put into the planning phase of your migration. In this post we are going to focus on what should happen directly after the migration has been completed.

Regardless of how well you planned or if you just decided to dive right in without any forethought, there are steps that need to be taken after your migration to ensure your application is working properly and performing up to snuff. These steps need to be performed whether you chose to use a public, private or hybrid cloud implementation.

Step 1: Take Your New Cloud Based Application for a Test Drive

Go easy at first and just roll through the functionality as a user would. If it doesn’t work well for you, then you know it won’t work well when there are a bunch of users hitting it.

Assuming things went well with your functional test it’s time to go bigger. Lay down a load test and see step 2 below.

Step 2: Monitoring is Not the Job of Your Users

If you’re relying on the users of your application to let you know when there are performance or stability issues, you are already a major step behind your competition. If you planned properly, then you have a monitoring system in place. If you’re just winging it, put a monitoring system in place now!

Here are the things your monitoring tool should help you understand:

  • Architecture and Flow: You design an application architecture to support the type of application you are building. How do you really know if you have deployed the architecture you designed in the first place? How do you know if your application flow changes over time and causes problems? Cloud computing environments are dynamic and can shift at any given time. You need a tool in place that lets you know exactly what happened, when it happened, and whether it caused any impact.

E-Commerce Website Architecture

What happens if you don’t have a flow map? Simple, when there’s a problem you waste a bunch of time trying to figure out what components were involved in the problematic transaction so that you can isolate the problem to the right component.

  • Response Times: Slow sucks! You moved to the cloud for many potential reasons, but one thing is certain: your users don’t want your application(s) to run slowly. It seems obvious to monitor the response time of your applications, but I’m constantly amazed by how many organizations still don’t have this type of monitoring in place. There are really only two options in this category: let your users tell you when (notice I didn’t say if) your application is slow, or have a monitoring tool alert you right away.


  • Resources: You need to keep an eye on the resources you are consuming in the cloud. New instances of your application can quickly add up to a large expense if your code is inefficient. You need to understand how well your application scales under load and fix the resource hogs so that you can drive better value out of your application as usage increases.


Step 3: Elasticity

Elasticity is a key benefit of migrating your application to the cloud. Traditional application architectures accounted for periodic spikes in workload by permanently over-allocating resources. Put simply, we used to buy a bunch of servers so that we could handle the monthly or yearly spikes in activity. Most of these servers sat nearly idle the rest of the year and generated heat.

If you’re going to take advantage of the inherent elasticity within your cloud environment you need to understand exactly how your application will respond to being overloaded and how your infrastructure adapts to this condition. Cloud providers have tools to execute the dynamic shift in resources but ultimately you need a tool to detect the trigger conditions and then interface with the dynamic provisioning features of your cloud.

The combination of slow transactions AND resource exhaustion would be a great trigger to spin up new application instances. Each condition on its own does not justify adding a new resource.
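That compound trigger is simple to express in code. A sketch, with hypothetical threshold values:

```python
def should_scale_up(avg_response_ms, cpu_pct,
                    slow_threshold_ms=2000, cpu_threshold_pct=85):
    """Scale up only when transactions are slow AND resources are
    exhausted. Slowness alone may be a code bug; high CPU alone with
    fast responses just means the hardware is well utilized.
    (Thresholds are hypothetical.)"""
    return avg_response_ms > slow_threshold_ms and cpu_pct > cpu_threshold_pct

print(should_scale_up(3500, 95))  # slow AND exhausted -> add an instance
print(should_scale_up(3500, 40))  # slow but idle -> investigate the code instead
print(should_scale_up(200, 95))   # busy but fast -> no action needed
```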


The point here is that migrating to the cloud is not a magic bullet. You need to know how to use the features that are available and you need the right tools to help you understand exactly when to use those features. You need to stress your new cloud application to the point of failure and understand how to respond BEFORE you set users free on your application. Your users will certainly break your application and during an event is not the proper time to figure out how to manage your application in the cloud.

Let failure be your guide to success. Fail when it doesn’t matter so that you can succeed when the pressure is on. The cloud auto-scaling features shown in this post are part of AppDynamics Pro 3.7. Click here to start your free trial today.

Don’t Deploy Your Cloud Application Without Reading This – Part 1

Public cloud, private cloud, hybrid cloud, cloud bursting, cloud storming, elastic compute, IaaS, PaaS, SaaS, the list of terms goes on and on ad-nauseam. Like it or not, cloud computing has taken hold as an important design consideration in companies ranging from small startups to large established enterprises. The concepts and technologies behind cloud computing have been around for quite a long time now so why is it taking so long for so many companies to move their applications and realize the benefits that cloud computing offers?

Getting beyond the ridiculous fear of the unknown, security concerns are a major inhibitor to cloud adoption, but between private clouds and a slew of security technologies and methods, those concerns should only impact a small portion of applications. The real problem, in my opinion, is that nobody wants to fail and suffer damage to their personal and/or corporate brands. I’ve seen so many companies make a poor transition to cloud computing, and it impacts their revenue and customer retention.


Companies like Netflix, Orbitz, and Family Search have been tremendously successful with their cloud computing initiatives. Do they have better technologists than other companies? Are their processes better than others? Do they have special tools that nobody else has? Or have they made a commitment that it is okay to fail, as long as they fail fast and don’t repeat their mistakes? The answer might be a combination of all of the above, depending on which organization we are talking about.

There is a wealth of information published on the internet about deploying applications to the cloud; there are companies that exist solely to help you move applications to the cloud; there are even companies that exist to help you figure out IF you should move your application(s) to the cloud. I used to work for one of those companies, and what we saw over and over again was that our clients really didn’t know how to get started down the path of moving their existing applications to a cloud environment. Even worse were the companies that thought they knew what it took to successfully migrate their application(s) but didn’t. All of these companies were missing crucial bits of information that would make the difference between a smooth, painless migration and a rough, frustrating one.

The tools, processes, and information you use in the planning, execution, and ongoing management of your cloud applications will make all of the difference between success and failure.

In this blog series I’ll discuss some of the key considerations related to planning and execution of migrating your applications to the cloud. I’ll cover a few important aspects of deciding IF you should move your applications to the cloud and then focus mostly on what happens after you’ve decided to go for it. Everything I discuss will be directly from my experience moving and monitoring cloud applications within an enterprise and as a consultant.

In my opinion it’s much harder to move an existing application than it is to set up a new application in the cloud. The good news is that there are common considerations for each of these scenarios so next week I’ll discuss the following:

  • Should we move or deploy to the cloud?

  • What can I monitor to ensure my users are not impacted in a negative way?

In future posts I’ll discuss the planning and migration phases, how to take advantage of cloud elasticity, and good ongoing management practices. I might even preview some awesome new features we’re cooking up to make management of your applications faster and easier (shhhhh, it’ll be our little secret).

AppDynamics and Apica: A Partnership Made for Performance

At AppDynamics we’re always looking to partner with vendors that can significantly enhance the visibility we provide our customers when it comes to managing application performance in production. One area we see synergies is in synthetic monitoring and load testing, specifically for the next generation of cloud and mobile applications. So when Apica, the performance and load testing company for cloud, web and mobile applications came to us to join forces, we were stoked.

On June 26th, Apica and AppDynamics announced a partnership to provide DevOps teams an integrated monitoring solution with 10x visibility into application availability and performance, and the power to identify the root cause of slowdowns and outages in as few as three clicks.

We’ve seen how development agility can be directly proportional to production fragility. Reproducing complex, distributed production architectures in test is tough, because data volumes, computing resources and user behavior always differ. This is why many organizations are starting to test application performance in production, so they can stress test out of hours and proactively identify severity-1 incidents before end users and the business are impacted.

Monitoring visibility across geo-locations, browsers and mobile devices

Giving DevOps teams the visibility to see how their applications perform in production, under load, and across geo-locations, browsers and mobile devices helps them manage application performance proactively. This is where Apica’s synthetic monitoring and load testing products help organizations understand how an application is performing from an end user, location, browser or device perspective.

“Users expect a fast and reliable web, cloud, and mobile experience. Every second delay can cost businesses valuable customers and revenue,” says Sven Hammar, CEO of Apica. “Together with AppDynamics, we’re providing users with best-of-breed solutions to ensure uptime and availability for revenue-critical applications. They’ll have the most complete understanding available of the metrics that are powering or causing problems for their applications so they can take measures to improve performance.”

AppDynamics integrates with the following three Apica products:

Apica WebPerformance – verifies the performance and availability of your mission-critical applications from over 80 different locations worldwide.

Apica ProxySniffer – automates the creation of load test scripts and scenarios by recording real end-user traffic.

Apica LoadTest – a cloud-based load testing tool for your mission-critical applications.

You can access the Apica portal via any web browser or mobile device. Through this new partnership, AppDynamics real-time monitoring data is now seamlessly available in the Apica portal, so DevOps teams can drill down a level further to understand the root cause of slowdowns and availability issues.

Apica WebPerformance

Like AppDynamics, Apica aligns closely with application performance management (APM):

Complete Lifecycle Visibility – We have customers that deploy our solutions in both production and pre-production environments. Similarly for Apica, scripts built for load testing through LoadTest can also be applied in WebPerformance to test application availability and performance in production.

Real-time Monitoring – understanding the true end user experience 24/7, specifically around business transactions and the application infrastructure, so proactive alert notifications can be sent to DevOps teams when service degradation is identified.

Built for the Cloud – Apica LoadTest is cloud-ready so organizations can automate load tests and find the optimal configuration for their cloud deployments.

Below is a short clip of the direct integration of AppDynamics with Apica, allowing DevOps teams to monitor all of their business transactions and system resources in real time. There is also a contextual business transaction drill-down that takes a user from the Apica portal into the AppDynamics user interface, so the root cause of performance issues can rapidly be found.


You can start your free trial with Apica’s performance and load testing solution by clicking here. For more details on monitoring your applications with Apica and AppDynamics, please visit Apica’s AppDynamics partner page.

 

Top 5 Gotchas for Monitoring Applications in the Cloud

Many IT organizations are migrating some of their applications to the cloud to become more agile, alleviate operational complexity, and spend less time managing infrastructure and servers. If you’re among them, the next question you may ask yourself is, “How will we monitor these applications, and where should we even begin with so many monitoring tools on the market?”

I’m glad you asked. Here is a list of gotchas you should look out for. If you have your own list, feel free to comment below and share with us.

1. Lack of End User or Business Context – With apps running in the cloud, monitoring infrastructure metrics indicates very little about your end-user experience, or the performance of your apps or business running in the cloud. End users experience business transactions, so make sure your monitoring gives you this visibility.

2. Node Churn – How well does your application monitoring solution deal with node churn – the provisioning and de-provisioning of servers and application nodes? The monitoring solution has to work in dynamic, virtual and elastic environments where change is constant; otherwise you’ll end up with blind spots in your application monitoring. Many current monitoring solutions are unable to monitor and adapt to dynamic cloud infrastructure changes, requiring manual intervention by operations before new nodes can be registered and monitored.

3. Agent-less is Tough in the Cloud – You may not have any major issues with installing a packet sniffer or network-monitoring appliance in your own private cloud or data center, but you won’t be able to place these kinds of devices in PaaS or IaaS environments to monitor your application performance. Monitoring agents, in comparison, can easily be embedded or piggy-backed as part of an application deployment in the cloud. Agent-less may not be an option when trying to monitor many cloud applications.

4. High Network Bandwidth Costs – Cloud providers typically charge per gigabyte of inbound and outbound traffic. If your cloud application has 100 nodes and you’re collecting megabytes of performance data every minute, all of that data has to travel outside of the cloud to your monitoring solution’s management server, which can be on-premise or in another cloud. Monitoring what’s relevant in your application, rather than monitoring everything, means you’ll avoid exorbitant bandwidth charges for transferring monitoring data.
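To get a feel for the numbers, here is a back-of-the-envelope calculation. The per-node data volume and the egress price are illustrative assumptions, not any particular provider’s rates:

```python
NODES = 100
EGRESS_PRICE_PER_GB = 0.09  # hypothetical egress rate, $/GB

def monthly_egress_cost(mb_per_node_per_min, nodes=NODES,
                        price_per_gb=EGRESS_PRICE_PER_GB, days=30):
    """Monthly cost of shipping monitoring data out of the cloud."""
    gb = mb_per_node_per_min * nodes * 60 * 24 * days / 1024
    return gb * price_per_gb

# Shipping everything vs. shipping only aggregated, relevant metrics:
print(f"raw: ${monthly_egress_cost(2.0):,.2f}/month")
print(f"aggregated: ${monthly_egress_cost(0.05):,.2f}/month")
```

At a hypothetical 2 MB per node per minute, 100 nodes ship roughly 8.4 TB of monitoring data a month; trimming that to aggregated metrics cuts the egress bill by an order of magnitude or more.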

5. Inflexible Licensing – If you want to monitor specific nodes, will your application monitoring vendor lock each license down to a physical server, hostname or IP, OR can your licenses float to monitor any server or node? This can be a severe limitation, as your agents are then locked down to a specific node indefinitely. Even if you weren’t monitoring applications running in the cloud, it’s still a nuisance to have a monitoring agent handcuffed to a physical server without the licensing flexibility to move agents around to monitor different servers or nodes. As stated above, with node churn occurring frequently in cloud environments, you need a monitoring solution that is as flexible as possible, so you can deploy agents anywhere, at any time.

The good news is, monitoring application performance in the cloud is hardly a new concept for AppDynamics as we address all of these requirements with flying colors. If you’re interested in a robust application monitoring solution in the cloud, you can take a free 30-day trial of AppDynamics Pro.

 

What’s the big deal with AppDynamics Azure Monitoring?

Let me guess: you’re probably expecting AppDynamics to be “another monitoring solution” for Windows Azure. You’re expecting it to show you basic server metrics like CPU, memory and disk I/O, along with a few CLR counters thrown in for good measure. Well, I’m sorry to disappoint you, but these metrics in isolation are about as useful as a chocolate teapot for monitoring your applications and business in the cloud.

When end users interact with your Windows Azure application, they don’t experience servers, Azure roles, CPU, or CLR counters. Rather, they experience business transactions. Monitoring infrastructure metrics in the cloud is probably the worst KPI for managing quality of service (QoS) and application performance. Another problem in the cloud is application architecture, which has become virtual, dynamic and distributed. Applications are no longer just an app server and a database. SOA design principles and cloud application services mean that business transaction logic now executes across many distributed application tiers and services.

For example, the Windows Azure PaaS platform has several services that allow organizations to run their mission-critical applications in the cloud. Azure Compute provides web and worker roles, Azure AppFabric provides messaging, security and caching, and SQL Azure and Azure Storage provide data management capabilities.

A major challenge for organizations is gaining end-to-end visibility of how their business (transactions) perform in the Azure Cloud. AppDynamics was founded in 2008 to address this problem, and it’s a key reason why Microsoft announced AppDynamics as a partner to help their customers manage application performance within Windows Azure. For AppDynamics, it’s an enormous privilege and responsibility to be selected by Microsoft, given Windows Azure is a huge strategic play as Microsoft looks to dominate the Cloud Computing market.

So what’s the big deal with AppDynamics in Azure? Let’s take a look at what unique capabilities AppDynamics offers for Windows Azure customers:

Application Mapping for Windows Azure Roles and Services
AppDynamics is able to automatically discover, map, and visualize the living topology and performance of a production application running inside Windows Azure. This allows organizations to understand the application tiers, Azure roles, and service dependencies inside the Azure Cloud so they can rapidly isolate which components are responsible for slow performance or poor QoS. For example, take a look at the following screenshot of an application hosted inside Windows Azure:

Business Transaction-centric Monitoring
What’s unique about AppDynamics production monitoring is that our key unit of measurement is business transactions, rather than, say, web pages, servers, or CLR metrics. AppDynamics can auto-discover the business transactions that end users are experiencing in the application. This gives IT Operations and development teams the context and visibility they need to monitor and manage the end user experience, along with an understanding of how transactions flow across and inside each application tier, Azure role, and service.

For example, take a look at this screenshot, which shows what business transaction monitoring looks like in AppDynamics:

Deep Diagnostics for Rapid Root-Cause Analysis
For situations where end users experience poor QoS, slow downs, or outages, it’s possible for organizations to pinpoint the root cause of issues in minutes using AppDynamics. AppDynamics uses intelligent analytics to self-learn and baseline the normal performance of every business transaction in an application. When performance deviates, AppDynamics captures complete diagnostic snapshots of business transactions that breach their performance baseline. Snapshots represent the complete distributed flow and code execution of slow business transactions as they execute across and inside each application tier, Azure role and service. This means IT Operations and developers see right down to the line of code responsible for issues in production that impact their end users and QoS. Root cause is therefore only minutes away.
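The baseline-and-deviation idea can be illustrated with a toy rolling-statistics detector. This is a simplified stand-in for AppDynamics’ actual analytics, which also account for factors like seasonality:

```python
from collections import deque
import statistics

class BaselineDetector:
    """Learn a rolling baseline of response times and flag any measurement
    more than `n_sigma` standard deviations above the mean. A toy
    illustration of self-learned baselining, not a production algorithm."""

    def __init__(self, window=50, n_sigma=3.0):
        self.samples = deque(maxlen=window)
        self.n_sigma = n_sigma

    def observe(self, response_ms):
        breach = False
        if len(self.samples) >= 10:  # need some history before judging
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples)
            breach = response_ms > mean + self.n_sigma * stdev
        self.samples.append(response_ms)
        return breach  # True -> capture a diagnostic snapshot

detector = BaselineDetector()
for ms in [100, 105, 98, 102, 101, 99, 103, 100, 97, 104]:
    detector.observe(ms)          # build the baseline
print(detector.observe(101))      # within the baseline -> no snapshot
print(detector.observe(900))      # breaches the baseline -> snapshot
```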

Here is a screenshot of how a single user business transaction executes within Windows Azure:

Here is the code execution of that same user business transaction within the 2nd Application Tier “LowValueOrders” worker role:

Auto-Scaling for provisioning Windows Azure resources on the fly
A key benefit of the cloud is resource elasticity: the ability for an application to leverage additional computing resources on the fly, so it’s able to scale up and down by itself. For auto-scaling to happen in Windows Azure, the application needs to know when to scale up and when to scale down. AppDynamics provides a comprehensive policy wizard that can auto-scale an application in Windows Azure based on business transaction KPIs like end user response time or throughput. For example, if order throughput in your application were to drop due to increased load, you might want to spin up more Azure compute (web/worker roles) to compensate for the additional demand until throughput stabilizes or load reduces.

For example, here’s a screenshot of the AppDynamics Azure auto-scaling policy wizard:

If you want to try AppDynamics for Windows Azure, you can sign up at the marketplace right here.

App Man