Successfully Deploying AIOps, Part 2: Automating Problem Time

In part one of our Successfully Deploying AIOps series, we identified how the time spent reacting to an anomaly breaks into two broad areas: problem time and solution time. The first phase in deploying AIOps focuses on reducing problem time, with some benefit in solution time as well. This simply requires turning on machine learning within an AIOps-powered APM solution. Existing operations processes will still be defining, selecting and implementing anomaly rectifications. When you automate problem time, solution time commences much sooner, significantly reducing an anomaly’s impact.

AIOps: Not Just for Production

Anomalies in test and quality assurance (QA) environments cost the enterprise time and resources. AIOps can deliver significant benefits here. Applying the anomaly resolution processes seen in production will assist developers navigating the deployment cycle.

Test and QA environments are expected to identify problems before production deployment. Agile and DevOps approaches have introduced rapid, automated building and testing of applications. Though mean time to resolution (MTTR) is commonly not measured in test and QA environments (which aren’t as critical as those supporting customers), the time and resource savings still pay off.

Beginning your deployment in test and QA environments allows a lower-risk, yet still valuable, introduction to AIOps. These pre-production environments have less business impact, as they are not visited by customers. Understanding performance changes between application updates is critical to successful deployment. Remember, since test and QA environments won’t see production workloads, it’s best to recreate them with simulated workloads through synthetic testing.
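
If you need a starting point, here is a minimal sketch of such a synthetic workload generator; the base URL, endpoints, weights and pacing are hypothetical placeholders for your own transaction mix:

```python
# Minimal synthetic-workload sketch (all endpoints and pacing are made up).
# Replays a weighted mix of business transactions against a test environment
# so the APM agent has realistic traffic to baseline.
import random
import time

import requests

BASE_URL = "https://qa.example.com"  # hypothetical test environment
TRANSACTIONS = [                     # (path, weight) approximating production mix
    ("/login", 0.2),
    ("/search?q=widgets", 0.5),
    ("/checkout", 0.3),
]

def run(duration_s=3600, pacing_s=1.0):
    deadline = time.time() + duration_s
    while time.time() < deadline:
        path = random.choices([p for p, _ in TRANSACTIONS],
                              weights=[w for _, w in TRANSACTIONS])[0]
        try:
            resp = requests.get(BASE_URL + path, timeout=10)
            print(path, resp.status_code, resp.elapsed.total_seconds())
        except requests.RequestException as exc:
            print(path, "error:", exc)
        time.sleep(pacing_s)  # crude fixed pacing; real tools ramp and jitter

if __name__ == "__main__":
    run(duration_s=60)
```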

With trust in AIOps built from first applying AIOps to mean time to detect (MTTD), mean time to know (MTTK) and mean time to verify (MTTV) in your test and QA environments, your next step will be to apply these benefits to production. Let’s analyze where you’ll find these initial benefits.

Apply AI/ML to Detection (MTTD)

An anomaly deviates from what is expected or normal. Detecting an anomaly requires a definition of “normal” and a monitoring of live, streaming metrics to see when they become abnormal. A crashing application is clearly an anomaly, as is one that responds poorly or inconsistently after an update.

With legacy monitoring tools, defining “normal” was no easy task. Manually setting thresholds required operations or SRE professionals to guesstimate thresholds for all metrics measured by applications, frameworks, containers, databases, operating systems, virtual machines, hypervisors and underlying storage.

AIOps removes the stress of threshold-setting by letting machine learning baseline your environment. AI/ML applies mathematical algorithms to different data features seeking correlations. With AppDynamics, for example, you simply run APM for a week. AppDynamics observes your application over time and creates baselines, with ML observing existing behavioral metrics and defining a range of normal behavior with time-based and contextual correlation. Time-based correlation removes alerts related to the normal flow of business—for example, the login spike that occurs each morning as the workday begins; or the Black Friday or Guanggun Jie traffic spikes driven by cultural events. Contextual correlation pairs metrics that track together, enabling anomaly identification and alerts later when the metrics don’t track together.
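
To make this concrete, here is a toy sketch of time-based baselining (illustrative only, not AppDynamics’ actual algorithm): learn a mean and standard deviation for each hour-of-week bucket from history, then flag live values that fall outside the learned band. The Monday-morning login spike simply becomes part of Monday morning’s bucket.

```python
# Toy time-based baseline (illustrative only, not the product's algorithm).
# history: iterable of (unix_timestamp, value) samples for one metric.
import math
from collections import defaultdict
from datetime import datetime, timezone

def build_baseline(history):
    """Learn mean/stddev of the metric for each hour-of-week bucket."""
    buckets = defaultdict(list)
    for ts, value in history:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        buckets[(dt.weekday(), dt.hour)].append(value)
    baseline = {}
    for bucket, values in buckets.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        baseline[bucket] = (mean, math.sqrt(var))
    return baseline

def is_anomalous(baseline, ts, value, n_sigma=3.0):
    """Flag values outside mean +/- n_sigma for the sample's hour-of-week."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    # Unseen buckets fall back to treating the sample as normal.
    mean, std = baseline.get((dt.weekday(), dt.hour), (value, 0.0))
    return abs(value - mean) > n_sigma * max(std, 1e-9)
```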

AIOps defines “normal” by letting built-in ML watch the application and automatically create a baseline. So again, install APM and let it run. If you have specific KPIs, you can add these on top of the automatic baselines as health rules. With baselines defining normal, AIOps watches live metric streams, with the model tuned to flag anomalies in real time.

Apply AI/ML to Root Cause Analysis (MTTK)

The first step to legacy root cause analysis (RCA) is to recreate the timeline: When did the anomaly begin, and what significant events occurred afterward? You could search manually through error logs to uncover the time of the first error. This can be misleading, however, as sometimes the first error is an outcome, not a cause (e.g., a crash caused by a memory overrun is the result of a memory leak running for a period of time before the crash).

In the midst of an anomaly, multiple signifiers often will indicate fault. Logs will show screeds of errors caused by stress introduced by the fault, but fail to identify the underlying defect. The operational challenge is unpacking the layers of resultant faults to identify root cause. By pinpointing this cause, we can move on to identifying the required fix or reconfiguration to resolve the issue.

AIOps creates this anomaly timeline automatically. It observes data streams in real time and uses historical and contextual correlation to identify the anomaly’s origin, as well as any important state changes during the anomaly. Even with a complete timeline, it’s still a challenge to reduce the overall noise level. AIOps addresses this by correlating across domains to filter out symptoms from possible causes.
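
As a toy illustration of the timeline idea (again, not the product’s implementation), suppose we already have a per-metric anomaly test like the is_anomalous helper sketched earlier. Finding the first timestamp at which each metric stream deviates and sorting by onset separates early movers (candidate causes) from late movers (likely symptoms):

```python
# Toy anomaly timeline: order metrics by when each first deviated.
# streams: {metric_name: [(timestamp, value), ...]} in time order;
# baselines: {metric_name: baseline} as built by build_baseline above.
def build_timeline(streams, baselines):
    onsets = {}
    for name, samples in streams.items():
        for ts, value in samples:
            if is_anomalous(baselines[name], ts, value):
                onsets[name] = ts  # first deviation for this metric
                break
    # Earliest onsets are candidate causes; later ones, downstream symptoms.
    return sorted(onsets.items(), key=lambda item: item[1])
```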

There’s a good reason why AIOps’ RCA output may not always identify a single cause. Trained AI/ML models do not always produce a zero or one outcome, but rather work in a world of probabilities or likelihoods. The output of a self-taught ML algorithm will be a percentage likelihood that the resulting classification is accurate. As more data is fed to the algorithm, these outcome percentages may change if new data makes a specific output classification more likely. Early snapshots may indicate a priority list of probable causes that later refine down to a single cause, as more data runs through the ML models.
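
The shape of that output is a ranked list of likelihoods rather than a verdict. A hypothetical sketch, with invented cause names and model scores:

```python
# Hypothetical probabilistic RCA output: likelihoods, not a verdict.
import math

def rank_causes(scores):
    """Turn raw model scores into a ranked list of (cause, probability)."""
    exp = {cause: math.exp(score) for cause, score in scores.items()}
    total = sum(exp.values())
    return sorted(
        ((cause, value / total) for cause, value in exp.items()),
        key=lambda item: item[1],
        reverse=True,
    )

# Early snapshot: several plausible causes with similar likelihoods.
print(rank_causes({"db_conn_pool": 1.2, "gc_pause": 1.0, "bad_deploy": 0.9}))
# After more data arrives, one cause dominates the ranking.
print(rank_causes({"db_conn_pool": 3.5, "gc_pause": 0.4, "bad_deploy": 0.2}))
```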

RCA is one area where AI/ML delivers the most value, and the time spent on RCA is the mean time to know (MTTK). While operations is working on RCA, the anomaly is still impacting customers. The pressure to conclude RCA quickly is why war rooms get filled with every possible I-shaped professional (a deep expert in a particular silo of skills) in order to eliminate the noise and get to the signal.

Apply AI/ML to Verification

Mean time to verify (MTTV) is the remaining MTTR portion automated in phase one of an AIOps rollout. An anomaly concludes when the environment returns to normal, or even to a new normal. The same ML mechanisms used for detection will minimize MTTV, as baselines already provide the definition of normal you’re seeking to regain. ML models monitoring live ETL streams of metrics from all sources provide rapid identification when the status returns to normal and the anomaly is over.
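
A minimal sketch of automated verification, reusing the is_anomalous baseline check from the detection example: declare the anomaly over only after the metric stays inside its baseline band for N consecutive samples, so a brief dip back to normal doesn’t end the incident prematurely.

```python
# Toy return-to-normal verifier: require N consecutive in-baseline samples.
def verify_recovery(baseline, samples, required_consecutive=10):
    streak = 0
    for ts, value in samples:  # samples arrive in time order
        if is_anomalous(baseline, ts, value):
            streak = 0  # any deviation resets the count
        else:
            streak += 1
            if streak >= required_consecutive:
                return ts  # recovery verified at this timestamp
    return None  # anomaly not yet verified as over
```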

Later in your rollout when AIOps is powering fully automated responses, this rapid observation and response is critical, as anomalies are resolved without human intervention.  Part three of this series will discuss connecting this visibility and insight to action.

Successfully Deploying AIOps, Part 1: Deconstructing MTTR

Somewhere between waking up today and reading this blog post, AI/ML has done something for you. Maybe Netflix suggested a show, or DuckDuckGo recommended a website. Perhaps it was your photos application asking you to confirm the tag of a specific friend in your latest photo. In short, AI/ML is already embedded into our lives.

The sheer quantity of metrics in development, operations and infrastructure makes these disciplines perfect partners for machine learning. Given this general acceptance of AI/ML, it is surprising that organizations are lagging in implementing machine learning in operations automation, according to Gartner.

The level of responsibility you will assign to AIOps and automation comes from two factors:

  • The level of business risk in the automated action
  • The observed success of AI/ML matching real world experiences

The good news is this is not new territory; there is a tried-and-true path for automating operations that can easily be adjusted for AIOps.

It Feels Like Operations is the Last to Know

The primary goal of the operations team is to keep business applications functional for enterprise customers or users. They design, “rack and stack,” monitor performance, and support infrastructure, operating systems, cloud providers and more. But their ability to focus on this prime directive is undermined by application anomalies that consume time and resources, reducing team bandwidth for preemptive work.

An anomaly deviates from what is expected or normal. A crashing application is clearly an anomaly, yet so too is one that was updated and now responds poorly or inconsistently. Detecting an anomaly requires a definition of “normal,” accompanied with monitoring of live streaming metrics to spot when the environment exhibits abnormal behaviour.

The majority of enterprises are alerted to an anomaly by users or non-IT teams before IT detects the problem, according to a recent AppDynamics survey of 6,000 global IT leaders. This disappointing outcome can be traced to three trends:

  • Exponential growth of uncorrelated log and metric data, triggered by DevOps and Continuous Integration/Continuous Delivery (CI/CD) pipelines automating the build and deployment of applications.
  • Exploding application architecture complexity with service architectures, multi-cloud, serverless, isolation of system logic and system state—all adding dynamic qualities defying static or human visualization.
  • Siloed IT operations and operational data within infrastructure teams.

Complexity and data growth overload development, operations and SRE professionals with data rather than insight, while siloed data prevents each team from seeing the full application anomaly picture.

Enterprises adopted agile development methods in the early 2000s to wash away the time and expense of waterfall approaches. This focus on speed came with technical debt and lower reliability. In the mid-2000s, manual builds and testing were identified as the chief impediment, leading to DevOps and later to CI/CD.

DevOps allowed development to survive agile and extreme approaches by transforming development—and particularly by automating testing and deployment—while leaving production operations basically unchanged. The operator’s role in maintaining highly available and consistent applications still consisted of waiting for someone or something to tell them a problem existed, then manually pushing through a solution. Standard operating procedures (SOPs) were introduced for recurring repairs to prevent the operator from accidentally making a situation worse. There were pockets of successful automation (e.g., tuning the network), but mostly the response was still reactive. AIOps is now stepping up to let operations survive in this complex environment, as DevOps did for the agile transformation.

Reacting to Anomalies

DevOps automation removed a portion of production issues. But in the real world there’s always the unpredictable SQL query, API call, or even the forklift driving through the network cable. The good news is that the lean manufacturing approach that inspired DevOps can be applied to incident management.

To understand how to deploy AIOps, we need to break down the “assembly line” used to address an anomaly. The time spent reacting to an anomaly can be broken into two key areas: problem time and solution time.

Problem time: The period when the anomaly has not yet been addressed.

Anomaly management begins with time spent detecting a problem. The AppDynamics survey found that 58% of enterprises still find out about performance issues or full outages from their users. Calls arrive and service tickets get created, triggering professionals to examine whether there really is a problem or just user error. Once an anomaly is accepted as real, the next step generally is to create a war room (physical or Slack channel), enabling all the stakeholders to begin root cause analysis (RCA). This analysis requires visibility into the current and historical system to answer questions like:

  • How do we recreate the timeline?
  • When did things last work normally, and when did the anomaly begin?
  • How are the application and underlying systems currently structured?
  • What has changed since then?
  • Are all the errors in the logs the result of one or multiple problems?
  • What can we correlate?
  • Who is impacted?
  • Which change is most likely to have caused this event?

Answering these questions leads to the root cause. During this investigative work, the anomaly is still active and users are still impacted. While the war room is working tirelessly, no action to actually rectify the anomaly has begun.

Solution time: The time spent resolving the issues and verifying return-to-normal state.

With the root cause and impact identified, incident management finally crosses over to spending time on the actual solution. The questions in this phase are:

  • What will fix the issue?
  • Where are these changes to be made?
  • Who will make them?
  • How will we record them?
  • What side effects could there be?
  • When will we do this?
  • How will we know it is fixed?
  • Was it fixed?

Solution time is where we solve the incident rather than merely understanding it. Mean time to resolution (MTTR) is the key metric we use to measure the operational response to application anomalies. After deploying the fix and verifying return-to-normal state, we get to go home and sleep.

Deconstructing MTTR

MTTR originated in the hardware world as “mean time to repair”— the full time from error detection to hardware replacement and reinstatement into full service (e.g., swapping out a hard drive and rebuilding the data stored on it). In the software world, MTTR is the time from software running abnormally (an anomaly) to the time when the software has been verified as functioning normally.

Measuring the value of AIOps requires breaking MTTR into subset components. Different phases in deploying AIOps will improve different portions of MTTR. Tracking these subdivisions before and after deployment allows the value of AIOps to be justified throughout.
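
As an illustration (all numbers invented), a decomposed MTTR measurement might look like this:

```python
# Hypothetical MTTR decomposition; every number here is invented.
mttr_minutes = {
    "MTTD (detect)":   18,  # alert fires or users report the anomaly
    "MTTK (know/RCA)": 45,  # war room isolates the root cause
    "fix (solution)":  30,  # remediation implemented and deployed
    "MTTV (verify)":   12,  # return to normal confirmed
}
total = sum(mttr_minutes.values())
for phase, minutes in mttr_minutes.items():
    print(f"{phase:17} {minutes:3d} min ({minutes / total:.0%} of MTTR)")
print(f"{'MTTR total':17} {total:3d} min")
```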

With this understanding and measurement of existing processes, the strategic adoption of AIOps can begin, which we discuss in part two of this series.

Why Your Business Needs a ‘Wrapper’ to Transform Itself

How do you treat your most precious items? Chances are you wrap them up nicely to protect  them for longevity. Since your applications are the most precious items to your business, consider wrapping them with a solution that protects them for the long haul.

Modern applications are changing the way we conduct our day-to-day lives. Disruptive companies like Uber and Lyft are capitalizing on the consumer need for better, faster access to in-demand products and services. Could they have done this with a poor performing application? Most likely, no. This transformation is causing a ripple effect, as companies change their internal processes to develop projects faster. But despite the critical need to fully understand the complete software lifecycle—from planning and design to rollout and business impact—companies are using lackluster monitoring tools that provide only a siloed glimpse of their total environment, not a comprehensive view.

Business leaders increasingly are keen to see how their software impacts the bottom line. But problems arise when independent factions actively involved in the software development lifecycle are unable to see how their actions impact other teams. This shows the critical need for a cultural shift inside corporations, one that tightly aligns multiple teams throughout the entire development lifecycle. By changing the culture to enable cohesive team interaction and full lifecycle visibility, companies will find it far easier to verify if development changes are positively impacting the business. Conversely, when changes are not beneficial, firms will be able to quickly course-correct to reduce or eliminate negative trends.

Your Team Affects My Team

In recent years, companies have made tremendous strides in improving processes to enable faster software development, releases and MTTR. And yet many groups within these organizations remain unaware of how their actions, changes and implementations affect other teams. Say, for example, an automobile insurance provider releases a new application, a crucial component of its digital transformation. This application proves wildly successful, and many groups within the company develop an interest in its performance. For example:

  • The development team leverages the newest container technology to ensure proper scalability.
  • The infrastructure and network teams enable additional capacity and cloud capabilities.
  • The security team keeps a close eye on fraud and hacking.
  • Multiple teams ensure the best user experience.
  • Lastly, the business, keen on revenue performance, sees the application as a big revenue driver that will lower the cost of customer acquisition.

Ideally this leads to closer scrutiny of each group’s performance, which ultimately leads to greater customer satisfaction. This poses a problem, though, when each group operates within its own silo. For instance, when the network team fixes a problem or makes an upgrade or enhancement, it may not be aware which groups along the application flow are being impacted. Conversely, other groups may see an impact to their application without knowing the reason for the change.

Granted, most of us have change management procedures in place. But full visibility enables you to quickly triage and understand how all teams in the organization are being impacted, both positively and negatively. This visibility has become a fundamental requirement of today’s digital transformational efforts, and is essential for every team following the path of the application. Even modifications to marketing campaigns can cause a flurry of team activity if the company doesn’t quickly see the gains it’s after.  We’ve all sat in conference rooms to draw out a lifecycle that resembles the diagram below, where each group is part of the overall flow: A never-ending cycle of dependencies.

As part of this ongoing process, each group enacts changes to enhance its efficiency. The DevOps movement, for example, is a culture shift designed to help companies deploy applications faster and respond more adeptly to customer expectations. But ultimately, connecting every team within an organization requires a “wrapper” of sorts around the entire workflow process—one tying all domains together, including the lines of business.

This is easier said than done in some organizations, however, particularly those that have operated under a tried-and-true process and culture for many years. But with today’s business environments evolving at breakneck speed, companies must adapt much faster to survive. This brings us back to the concept of a wrapper—a comprehensive tool covering multiple domains to help provide full visibility of the application and the user journey throughout your business environment. By delivering these real-time insights, the wrapper ensures your business is moving in the right direction, enabling you to justify budgetary needs for future investment.

This is where AppDynamics comes in. Think about the demand placed on IT and business leaders, and the need to transform the enterprise. The critical element here is the need for the right tools. One of the first steps to consider: how can you gain a full view of your development, testing, implementation, production and business systems? The best solution must provide multiple benefits to ensure success, enabling you to detect both technology and business-related problems. It should help you understand how your end users are impacted by your technology, and even deliver insights to help you determine where to prioritize future  enhancements.

By leveraging AppDynamics, you’ll gain a full view of your critical applications across all stacks, as well as deep insights into how well your business is performing. A successful AIOps strategy with automated root cause analysis will provide the core framework for understanding all the working intricacies in your environment—a major first step toward maintaining a competitive edge.

Mean Time to Repair: What it Means to You

We’ve all been there: Flying home, late at night, a few delays. Our flight arrives at the airport and we’re anxious to get out of the tin can. Looking outside, we see no one is connecting the jet bridge to the aircraft. Seconds seem like minutes as the jet bridge just sits there. “This is not a random event; they should have been expecting the flight,” you tell yourself over and over again. Finally, a collective sigh of relief as the jet bridge starts to light up and inch ever closer to your freedom.

Even though the jet bridge was not broken per se, the process of attaching the bridge seemed broken to the end user, a.k.a. “the passenger.” The latency of this highly anticipated action was angst-causing.

As technologists, we deal with increasingly complex systems and platforms. The advent of the discipline around site reliability/chaos engineering brings rigor to mean-time-to-repair (MTTR) and mean-time-between-failure (MTBF) metrics.

For failure to occur, a system doesn’t have to be in a nonresponsive or crashed state. Going back to my jet bridge example, even high latency can be perceived as “failure” by your customers. This is why we have service level agreements (SLAs), which establish acceptable levels of service and the consequences of noncompliance. Violate an SLA, for example, and your business could find itself facing a sudden drop in customer sentiment as well as a hefty monetary fine.
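
As a toy example of what such a check looks like (the 500 ms limit is made up), flag a violation when the 99th-percentile latency over a window exceeds the agreed level:

```python
# Toy SLA check: violation when p99 latency exceeds the agreed limit.
import math

def p99(latencies_ms):
    """Nearest-rank 99th percentile."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

def sla_violated(latencies_ms, limit_ms=500):  # 500 ms is a made-up SLA
    return p99(latencies_ms) > limit_ms

window = [120, 180, 95, 240, 2100, 130, 160, 110, 90, 105]
print("p99:", p99(window), "ms; violation:", sla_violated(window))
```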

Site reliability engineers (SREs) push for elastic and self-healing infrastructure that can anticipate and recover from SLA violations. However, these infrastructures are not without complexity to implement and instrument.

Mobile Launch Meltdown

I remember back when I was a consulting engineer with a major mobile carrier as a client. This was about a decade ago, when ordering a popular smartphone on its annual release date was an exercise in futility. I would wait up into the wee hours of the morning to be one of the first to preorder the device. After doing so on one occasion, I headed into the office.

By midday, after preordering had been open for some time, a cascading failure was occurring at my company, one of many vendors crucial to this preorder process. My manager called me to her office to listen in on a bridge call with the carrier. Stakeholders from the carrier were rightfully upset: “We will make more in an hour today than your entire company makes in a year,” they repeated multiple times.

The pressure was on to rectify the issues and allow the business to continue. As in the novel The Phoenix Project, representatives from different technology verticals joined forces in a war room to fix things fast.

The failure was complex—multiple transaction and network boundaries, and speed of incoming orders on a massive scale. However, the notion of a large set of orders coming in on a specific date was not random, since the device manufacturer had set the launch date well in advance.

The Importance of Planning Ahead

The ability to tell when a violation state is going to occur—and to take corrective action ahead of time—is crucial. The more insight and time you have, the easier it is to get ahead of a violation, and the less pressure you’ll feel to push out a deployment or provision additional infrastructure.
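
One simple way to buy that time is to extrapolate a metric’s recent trend and estimate when it will cross a threshold. A rough least-squares sketch, with invented sample data:

```python
# Toy early warning: fit a linear trend and predict the threshold crossing.
def time_to_violation(samples, threshold):
    """samples: [(t_seconds, value), ...]; returns estimated seconds from
    the last sample until the trend crosses threshold, or None if the
    trend is flat or improving."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        return None
    slope = cov / var
    if slope <= 0:
        return None  # flat or improving: no predicted crossing
    intercept = mean_v - slope * mean_t
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - samples[-1][0])

# Queue depth creeping up ~2 units/minute toward a limit of 100.
trend = [(0, 40), (60, 42), (120, 44), (180, 46)]
print("seconds until predicted violation:", time_to_violation(trend, 100))
```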

With the rise of cloud-native systems, platforms and applications are increasingly distributed across multiple infrastructure providers. Design patterns such as Martin Fowler’s Strangler Pattern have become cemented as legacy applications evolve to handle the next generation of workloads. Managing a hybrid infrastructure becomes a challenge, a delicate balance between the granular control of an on-prem environment and the convenience and scalability of a public cloud provider.

Usually there is no silver bullet for fixing problems at scale. If there were a single glaring issue, the old adage holds: it would have been addressed already. In performance testing, death by a thousand paper cuts plays itself out in complex distributed systems, and fixing issues is an iterative process. During a production-impacting event, haste can make waste. With all of the investment in infrastructure-as-code and CI/CD, deployments can systematically occur faster than ever.

We might not all experience an incident as major as a mobile phone preorder meltdown, but as technologists we strive to make our systems as robust as possible. We also invest in technologies that enable us to change our systems faster—an essential capability today when so many of us are under the gun to fix what’s broken rather than adding new features that delight the customer.

I am very excited to join AppDynamics! I’ve been building and implementing large, distributed web-scale systems for many years now, and I’m looking forward to my new role as evangelist in the cloud and DevOps spaces. With the ever-increasing complexity of architectures, and the focus on performance and availability to enhance the end user experience, it’s crucial to have the right data to make insightful changes and investments. And with the synergies and velocity of the DevOps movement, it’s equally important to make educated changes, too.

Cost of Performance Issues [INFOGRAPHIC]

We all know performance issues — application crashes, stalls, slow downs, etc. — can hurt our business and reputation. However, how can we put an exact dollar amount to these issues, and are other companies experiencing the same problems? Surely Fortune 500 companies can’t afford to have performance issues, right?

Wrong. Everyone experiences issues; the key is limiting these problems and resolving them as quickly as possible to curb the impact to your bottom line.

99% uptime, more commonly known as “two nines” in the IT world, still means you’re down more than three and a half days per year. If one of those days happened to be Cyber Monday or another peak period, Amazon, Walmart, or Best Buy would certainly see the actual cost and consequences of performance issues.
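
The arithmetic behind the nines is worth a quick sketch:

```python
# Downtime per year implied by each availability level.
HOURS_PER_YEAR = 365 * 24  # 8,760

for label, availability in [("two nines", 0.99),
                            ("three nines", 0.999),
                            ("four nines", 0.9999)]:
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{label:12} ({availability:.2%} up): "
          f"{downtime_h:6.1f} h (~{downtime_h / 24:.2f} days) down per year")
```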

So we decided to do some research and create this infographic around the prevalence of performance issues and how they impact the bottom line. What we found might surprise you.

Don’t settle for being reactive with your performance issues; check out a FREE trial of AppDynamics today!

Beyond DevOps … APM as a Collaboration Engine

In the beginning there was a simple acronym: MTTI (mean time to innocence). Weary after years of costly and time-consuming war-room battles, IT organisations turned to AppDynamics to get an objective, application-level view of production incidents. As a result, application issues are swiftly pinpointed and fixed, accelerating time to repair by up to 90%.

In fact, gravitation towards fact-based constructive issue management spawned a whole new movement – DevOps – with the goal of ingraining this maturity and cooperative spirit into IT organisations from the ground up. The movement was discussed by Jim in a previous blog post. Of course, AppDynamics (or at least, easily accessible fact-based information about application behaviour in production) is a necessary prerequisite to this.

Looking back, before DevOps or even MTTI were topical buzzwords, this basic ability to foster communication between teams proved to be an invaluable benefit to the more drab and well-worn business realities of offshoring and outsourcing.

This blog reviews 3 real-life examples of this:

  • Managing external offshore development organisations
  • Facilitating near shore development teams
  • Bringing external developments in-house

Managing external offshore development organisations

Some months ago, I did some work with a then-prospect who had started a self-service trial of AppDynamics.

When I spoke to them, they were delighted with the visibility that AppDynamics provided out of the box for their .NET application, a SaaS Learning Management System.

Digging into what had sparked their interest in AppDynamics, they told me they had commissioned an outside development firm to rewrite their flagship application, which was somewhat dated and not architecturally fit to support some newer services the business wanted to offer to customers.

The good news was the new version of the app was live, and supporting around 10% of their existing customer base. The bad news?  This 10% used the same hardware footprint as the remaining 90% on the old system. Extrapolating this hardware requirement for the entire user base would not only require a new datacentre, but also entirely break the business model for the application from an operational cost perspective (not the first time that hardware savings alone could pay for AppDynamics!)

For months prior to trying AppDynamics, the external developers had been under huge pressure to optimize the application footprint (and some pretty lacklustre performance, too). Armed with only Windows performance counters and intuition, they had spent weeks optimizing slow database queries, which turned out to account for only 5% of the errant response time at a transaction level.

Having put AppDynamics in place, the prospect:

  • Easily found specific application bottlenecks, allowing them to focus developers on high-impact remediation
  • Could verify the developers had made the required improvements with each new release

Clearly, huge benefits at a technical level.

At a higher level, this helped lead to a more constructive relationship between the development shop and their customer – moving things away from the edge of litigation, constant finger-pointing, and blame shifting.

Facilitating near shore development teams

Another group I have worked with recently are responsible for a settlement system within a large global investment bank based in London. The system is developed in-house and, as is typical of most financial services institutions, the development team itself is located ‘near-shore’ in Eastern Europe to cut costs. The development process is Agile, with new releases every few weeks.

Inevitably, with new releases can come new production issues and – of course – the best people to deal with these during the “bedding in” period are the developers themselves.

Another thing that is very common in the financial services industry is regulation, and this poses a problem in this scenario. Nobody is permitted to directly access the production systems from outside the UK due to data privacy regulations.

This means hands-on troubleshooting must be left to the on-shore architect staff, who are not only expensive but also not as well equipped as the developers themselves to dig into issues in new code.

Enter AppDynamics. Our agents deployed in production made all the performance data readily available to anyone with the appropriate credentials, but – critically – having access to this does not expose ANY business data from the production system. Now, the near-shore development team can look directly at the non-functional behaviour of their code in production, eliminating the time spent gathering sufficient log data to enable reproduction of issues in test environments.  Bingo, the business case for the AppDynamics purchase is made!

There is an interesting side note to this, which applies much more widely too. Many customers have observed an “organic” improvement in service levels once AppDynamics is installed in production. For the first time, developers can see how their code is actually working in the wild. Developer pride kicks in, and suddenly non-functional stories are added to development backlogs to fix latent issues that get observed, which would previously have gone unnoticed.

Bringing external developments in-house

Of course, as we all know, the only constant in life is change, so no outsourcing move is a one-way journey. As a result, I have come across several organisations that are now working on projects which were previously outsourced. Once these customers have completed the initial challenge of recruiting a new development team, they then need to get their arms around the existing codebase. Handover workshops can help with this, but in many cases these systems have been out- and in-sourced several times, with many changes of personnel along the way. There is only so much you can distill onto a whiteboard in a brain-dump session, however long and well-intentioned.

It is here that the high-level visibility AppDynamics provides can be invaluable. Out of the box, AppDynamics instruments previously unseen systems, automatically detecting and following transactions and drawing up flow maps. The end-to-end visibility of the entire system greatly eases the process. In fact, this system overview (and the ability to view how it changes over time) has proved invaluable to many customers for reasons beyond wholesale in- (or out-) sourcing, such as onboarding new team members and verifying that externally developed code changes comply with architectural governance.
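
To illustrate the flow-map idea (a toy sketch, not how AppDynamics implements it), a flow map is essentially a directed graph aggregated from the tiers each traced transaction touched:

```python
# Toy flow-map builder: aggregate traced transaction hops into edges.
from collections import Counter

# Each trace lists the tiers a business transaction passed through, in order.
traces = [
    ["web", "order-service", "payment-service", "db"],
    ["web", "order-service", "inventory-service", "db"],
    ["web", "search-service", "db"],
]

edges = Counter()
for trace in traces:
    for upstream, downstream in zip(trace, trace[1:]):
        edges[(upstream, downstream)] += 1

for (src, dst), calls in sorted(edges.items()):
    print(f"{src} -> {dst}  ({calls} calls)")
```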

Conclusion

In summary, AppDynamics does not have to be all about troubleshooting and MTTI.  Nor even necessarily about DevOps and brave new worlds. The easily configured deep insight that we provide into the dynamic behaviour of your applications has many uses – and business cases – beyond the traditional MTTI/MTTR domain.  APM is, after all, just one use-case (albeit an important one) for our Application Intelligence Platform.

Take five minutes to get complete visibility and control into the performance of your production applications with AppDynamics Pro today.

The APPrentice

In this week’s episode, Donald Trump enlists Team ROI and Team Overhead to solve a Severity 1 incident on the “Trump Towers Website”. Team Overhead used “Dynoscope” and took 3 weeks to solve the incident, while Team ROI took 15 minutes by using AppDynamics.

Data Stinks, Information Rocks!

I’ve got many years of performance geekery under my belt and I’ve learned many lessons during that time. One of the most important lessons is paying close attention to the distinction between data and information. Let’s take a look at how the dictionary defines each term:

Data – facts and statistics collected together for reference or analysis.
Information – facts provided or learned about something or someone.

What do these definitions reveal to us about data and information and how does it apply to monitoring tools? Let’s explore that together. I’ll provide specific examples along the way to illustrate my points.

The Problems with Data

Data is fundamental to problem solving, but I don’t want to have to dig through a bunch of data while my business-critical, mission-critical, revenue-generating, etc. applications are down. To me, data is just like this picture…

[Image: a steaming pile]

Covis Software GmbH Speeds up CRM Application by 33x with AppDynamics

I received an X-Ray from Covis Software GmbH, a CRM solutions provider in Germany. They’ve been managing their .NET application performance with AppDynamics Pro in production for several months now. In the X-Ray below (as documented by the customer), Covis were able to improve the performance of a mission-critical business transaction from over 10 seconds to around 300 milliseconds, a 33x improvement. The business impact: average call time for some CRM agents dropped by almost 10 seconds.

If you would like to get started with AppDynamics you can download AppDynamics Lite (our free version) or you can take a free 30-day trial of AppDynamics Pro.

App Man.

4 Lessons that Operations Can Learn From Batman

In honor of the summer blockbuster The Dark Knight Rises, let’s take a look at one of the most celebrated heroes in history—Batman. (Okay, granted, he’s not real. But that doesn’t mean he can’t help us understand some best practices regarding application uptime and availability, right? Look, just bear with me here.)

1. Always have the right tools.

Batman knows his limitations: he doesn’t have superpowers (unlike my esteemed colleague, App Man). He’s just an ordinary guy with a weird-looking suit and five tons of cash. So what does he do? He buys batarangs, batcopters, bat anti-shark repellant, etc. He knows that in order to give himself an edge against the forces of darkness, he needs serious gadgets.

Similarly, Application Operations teams need to recognize that they shouldn’t wade into a production outage or firefight without some kind of application management solution. It’s amazing how many times we talk to teams that are simply using log files to manage a production application. Log files are not effective tools in a true firefight, and they’re no way to get to root cause quickly when your end users experience poor performance. Get an APM solution that equips you with 10x visibility and code-level detail in production, at less than 2% overhead. Otherwise, your office may start to feel like the rough & tumble streets of Gotham.

2. The bad guys always come back.

How many times has Batman stuck the Joker in Arkham Asylum, only to see him escape and start inventing new death traps? But the Dark Knight doesn’t grumble—well, actually, he does, but let’s not worry about that right now—he just suits up and goes back on patrol.

Similarly, poor-performing business transactions, memory leaks, and slow SQL calls are never going to disappear entirely, no matter how many iterations your developers perform. Agile release cycles and complex code mean that no environment will be completely free of evil. It’s important to have the right web application monitoring solution to put your rogues’ gallery on ice, time and time again.

3. War Room sessions are a waste of time.

“It’s the network.” “No, it’s the database.” “No, it’s the application.” How much time does your team spend pointing fingers and establishing innocence? And in the meantime, how many more issues are bound to terrorize your end users?

Batman doesn’t sit in a fire circle and hold hands with his colleagues. He doesn’t parley with the police and he doesn’t make long speeches. He gets the job done. The right application performance management tool can do the same for your own attempts at application heroism: less finger-pointing, more mean time to resolution.

4. Never stop until you find root cause.

One of the things that people forget about Batman is that he’s not just a superhero: he’s also the World’s Greatest Detective. He uses his brains as often as his fists. And when it comes to a mystery, he’s not content to just get a vague sense of the story, or poke around the edges of a crime scene. Rather, he wants to find the killer and the smoking gun.

In the world of application performance, that means getting to code-level detail in a production environment. It does not mean using a tool to monitor infrastructure or a network monitor that skims the surface of the application layer.  Those tools can get close, but “close” only works in horseshoes and hand grenades. What’s necessary is to find the true culprit, the method and class that’s causing a production outage or performance issue. And if you have the right tool for the job, you won’t be patrolling all night in order to do it. Rather, you can solve the mystery in literally minutes.

One lesson you might not want to take from the Caped Crusader is to cough up a huge P.O. for your APM solution. Not everyone can be a bazillionaire like Bruce Wayne. And it’s possible to arm yourself for battle without making your IT budget as big and bloated as the Penguin.

Don’t believe me? Check out the pricing for AppDynamics Pro for yourself—then send up a signal to take our 30-Day Trial. When you get a taste of what we can do, you may find yourself swinging from the rooftops.