Application Runbook Automation – A Detailed Walkthrough

On Monday, AppDynamics announced a new feature called Application Runbook Automation (RBA). The response to this announcement has been great, and many people want to see the details of how we implement RBA within AppDynamics. If you attended one of our “AppDynamics 3.7 Sneak Peek” customer webinars, you saw RBA in action during a live demo, and the webinar recording is available if you want to see it for yourself. After all, a video is worth a million words. Otherwise, I’m going to walk you through it step by step in this blog post.

If you don’t already know WHY we built Application RBA, please read “Don’t Be An ‘Also-Ran’ – Application Runbook Automation for World Class IT”.

Let’s jump right in…

I have an application that is tuned perfectly for normal load, but every once in a while a massive rush of activity makes it really slow. My APM tool tells me this happens because the excess load exhausts my database connection pool.

Instead of manually adjusting my connection pool size every time a major utilization spike occurs, I can create a Runbook and associate it with a policy so that it fires automatically when the connection pool is exhausted. Here’s how we do it…

The RBA menus (Policies, Health Rules, and Actions) are found in the “Alert and Respond” section of the AppDynamics UI.

RBA Menus

We select the Policies menu item and click the “Create Policy” button, which opens the Create Policy dialog.

Create Policy

The first thing we want to do is give our new policy a sensible name; I chose “Extend Resource Pool”. Next, we select the event that triggers our Runbook. In this case we choose the “Resource Pool Limit Reached” event and click the “Next” button, which opens the “Actions” dialog shown below.

Policy Actions

Clicking the green + icon lets us add pre-existing actions or create new ones to use within our policy. In this case, we click the “Create Action” button to create the action needed to remediate this problem.

Create Action

In the Create Action dialog, we select the “Run a script or executable on problematic Nodes” radio button and click “OK” to continue. This opens the “Create Remediation Script Action” dialog, where we provide a name for our action (“Increase Resource Pool Script”), the path to our script, the location where our log files should be saved, and a script timeout threshold. We also decide whether a human must authorize the action before it is executed.
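To make the fields in this dialog concrete, here is a minimal Python sketch of what running such an action could look like: execute a command with a timeout, append its output to a log file, and hold it back until approved when approval is required. This is not how the AppDynamics agent is implemented; the ScriptAction class, the run_action function, the status strings, and the “<name>.log” file naming are all hypothetical.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class ScriptAction:
    """Hypothetical mirror of the dialog fields (not an AppDynamics API)."""
    name: str
    command: List[str]              # the script/executable plus its arguments
    log_dir: str                    # where the action's output is written
    timeout_s: float                # script timeout threshold, in seconds
    requires_approval: bool = False  # must a human authorize it first?


def run_action(action: ScriptAction, approved: bool = False) -> str:
    """Run the action once and return a status string.

    Returns 'pending-approval', 'ok', 'failed' (non-zero exit) or 'timed-out'.
    """
    # Gate on human approval before doing anything.
    if action.requires_approval and not approved:
        return "pending-approval"

    log_path = Path(action.log_dir) / f"{action.name}.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a") as log:
        try:
            # stdout and stderr both go to the action's log file.
            result = subprocess.run(
                action.command,
                stdout=log,
                stderr=subprocess.STDOUT,
                timeout=action.timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before re-raising.
            return "timed-out"
    return "ok" if result.returncode == 0 else "failed"
```

Returning a status instead of raising keeps the caller simple: whatever records the event can log “timed-out” or “pending-approval” alongside the event description.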

Remediation Script

Once we click OK, the next dialog appears, and it is an important and powerful one. This is where we decide whether the remediation action runs on all impacted nodes, a percentage of impacted nodes, or a defined number of nodes. You probably don’t want to run thread dumps on all of your impacted nodes at the same time, so this is a great way to limit the scope of a remediation action when needed. In our case, we want every node repaired right away, so we select 100% of impacted nodes.
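The three scoping options map onto a small selection rule. The Python sketch below is a hypothetical illustration, not AppDynamics’ actual logic: a percentage is rounded up so that any non-zero percentage picks at least one node, a count is capped at the number of impacted nodes, and when only some nodes are chosen they are sampled at random. The rounding and random-sampling choices are assumptions.

```python
import math
import random
from typing import List, Optional, Sequence


def select_nodes(
    impacted: Sequence[str],
    percent: Optional[float] = None,
    count: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Choose which impacted nodes an action runs on (hypothetical helper).

    Give at most one of percent (0 < percent <= 100) or count (>= 0);
    with neither, every impacted node is chosen.
    """
    if percent is not None and count is not None:
        raise ValueError("give percent or count, not both")
    nodes = list(impacted)

    if percent is not None:
        if not 0 < percent <= 100:
            raise ValueError("percent must be in (0, 100]")
        # Round up so that, e.g., 10% of 3 nodes still selects one node.
        k = math.ceil(len(nodes) * percent / 100)
    elif count is not None:
        if count < 0:
            raise ValueError("count must be >= 0")
        k = min(count, len(nodes))
    else:
        k = len(nodes)

    if k == len(nodes):
        return nodes  # all impacted nodes, in their original order
    # Pick a random subset so the same nodes are not always the ones touched.
    return (rng or random).sample(nodes, k)
```

For example, select_nodes(nodes, count=1) would limit a thread-dump action to a single node, while the “100% of impacted nodes” setting used in this walkthrough corresponds to select_nodes(nodes, percent=100).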

Configure Action

We save our action and see that it now appears in the “Extend Resource Pool” policy’s actions box. A single policy can hold as many actions as are required to gather data, remediate, and alert. When we are done adding actions, we save our work, and our new policy appears in the list of Policies in the AppDynamics UI.
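Conceptually, a policy ties one trigger event to an ordered list of actions. The Python sketch below models that relationship under stated assumptions; it is not the AppDynamics data model or API. The Policy class, the dispatch function, and the idea of an action as a callable that receives event details are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical: an action receives the event's details and returns a status.
Action = Callable[[Dict[str, str]], str]


@dataclass
class Policy:
    name: str
    trigger_event: str
    actions: List[Action] = field(default_factory=list)


def dispatch(
    policies: List[Policy], event_type: str, details: Dict[str, str]
) -> Dict[str, List[str]]:
    """Run, in order, every action of every policy triggered by event_type.

    Returns a mapping of policy name to the statuses its actions returned.
    """
    results: Dict[str, List[str]] = {}
    for policy in policies:
        if policy.trigger_event == event_type:
            results[policy.name] = [action(details) for action in policy.actions]
    return results
```

Running actions in list order lets a policy gather diagnostic data before it remediates and then alerts, which matches the “gather data, remediate, and alert” sequence described above.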

New Policy With Action

So what’s the end result of our work? Our application is currently running under load. The Events panel in the top right corner of the application flow map shows one event, categorized as a “Code Problem”.

Code Problems

Clicking that event opens the events workspace, where we see a description of the event and confirmation that our remediation script was executed (it increased the size of the database connection pool). We could explore the event further, but for this blog post we will jump straight to the actual results of our action.
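The post doesn’t show the remediation script itself. As a purely hypothetical illustration, the Python sketch below multiplies a pool’s maximum size in a key=value properties file and caps it at a ceiling. The property name db.pool.maxSize, the file format, the doubling factor, and the cap of 200 are all assumptions. Editing a file does not by itself resize a pool inside a running application, so a real script would more likely go through the pool’s management interface (for example JMX or an admin CLI).

```python
import re
import sys
from pathlib import Path

# Hypothetical property name; real pool settings depend on your server/pool library.
POOL_KEY = "db.pool.maxSize"


def grow_pool(config_path: str, factor: float = 2.0, ceiling: int = 200) -> int:
    """Multiply the pool max size in a key=value file, capped at ceiling.

    Rewrites the file in place and returns the new size.
    """
    path = Path(config_path)
    text = path.read_text()
    # Match e.g. "db.pool.maxSize = 20" at the start of a line.
    pattern = re.compile(
        rf"^({re.escape(POOL_KEY)}[ \t]*=[ \t]*)(\d+)", re.MULTILINE
    )
    match = pattern.search(text)
    if match is None:
        raise KeyError(f"{POOL_KEY} not found in {config_path}")

    new_size = min(int(int(match.group(2)) * factor), ceiling)
    # Replace only the number, leaving the rest of the file untouched.
    text = text[: match.start(2)] + str(new_size) + text[match.end(2):]
    path.write_text(text)
    return new_size


if __name__ == "__main__":
    # An RBA action could invoke: python grow_pool.py /path/to/pool.properties
    print(grow_pool(sys.argv[1]))
```

The ceiling matters because the policy may fire more than once during a sustained spike, and the database itself limits how many connections it will accept.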

Resource Limit Reached

In the chart shown below, we can see that as load increased, the average response time of our transactions (blue line) climbed steadily to almost 10 seconds, while transaction throughput (green and orange bars) stayed low for as long as the connection pool was the bottleneck. At 8:17 AM you can see the point where the remediation runbook kicked in automatically and increased the size of the connection pool for us. This relieved the resource contention: throughput increased dramatically, and response time improved to around 1 second.
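The shape of that chart follows from a simple bound. By Little’s law (L = λW), a pool of N connections that each request holds for about S seconds can serve at most N / S such requests per second. Once load exceeds that rate, requests queue for a connection and response time grows with the queue. The short Python sketch below computes that ceiling. The pool sizes and hold time used in it are made-up numbers, not measurements from this application.

```python
def pool_throughput_ceiling(pool_size: int, hold_time_s: float) -> float:
    """Max requests/second that need a DB connection, from Little's law L = lambda * W.

    At most pool_size connections are in use at once (L), and each request
    holds one for hold_time_s seconds (W), so throughput lambda <= L / W.
    """
    if pool_size < 0 or hold_time_s <= 0:
        raise ValueError("pool_size must be >= 0 and hold_time_s > 0")
    return pool_size / hold_time_s


# Hypothetical numbers: 10 connections held ~0.2 s each cap throughput near
# 50 requests/s; doubling the pool doubles the ceiling, provided the database
# can keep query times steady under the extra concurrency.
print(pool_throughput_ceiling(10, 0.2))
print(pool_throughput_ceiling(20, 0.2))
```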

Review of results

This is just one simple but powerful example of what you can do with Application Runbook Automation from AppDynamics. Request your free trial of AppDynamics Pro today and see what we can do for your applications.