How to Identify Impactful Business Transactions in AppDynamics

New users of APM software often believe their company has hundreds of critical business transactions that must be monitored. But that’s not the case. In my role as Professional Services Consultant (EMEA) at AppDynamics, I’ve worked at dozens of customer sites, and the question of “What to monitor?” is always foremost in new users’ minds.

AppDynamics’ Business Transactions (BTs) reflect the core value of your applications. Since our inception a decade ago, we’ve built our APM solution around this concept. Given the critical importance of Business Transactions, you’ll want to configure them the right way. While AppDynamics will automatically create BTs for you, you’ll benefit greatly by taking a few extra steps to optimize your monitoring environment.

APM users often think of a BT as a technical transaction in their system, but it’s much more than that. The BT is a key component of effective application monitoring. It consists of all the services within your environment—things like login, search and checkout—that work together to fulfill and respond to a user-initiated request. These transactions reflect the logical way users interact with your applications. Activities such as adding an item to a shopping cart or checking out will invoke various applications, databases, third-party APIs and web services.

If you’re new to APM, you may find yourself asking “Where should I begin?” By applying essential best practices, BT configuration can be a smooth and orderly process.

Start by asking yourself two key questions:

  1. What are my business goals for monitoring?
  2. What pain points am I trying to address by using APM?

You may already know the answers. Perhaps you want to resolve major problems that consume a lot of your time and resources, or ensure that your most critical business operations are performing optimally. From there, you can drill down to more specific goals and operations to focus on. A retail website, for instance, may choose to focus on its checkout or catalog operation. Financial services firms may focus on the most-used APIs provided for their mobile clients. By prioritizing your business goals early in the process, you’ll find BTs much easier to configure.

AppDynamics automatically discovers and maps Business Transactions for you. Actions like Add to Cart are tagged and traced across every component of your application and visualized on a topology map, helping you to better understand performance across an entire application.

It’s tempting to think configuration is complete once you’ve instrumented with an agent and start seeing traffic coming in. But that’s just the technical side of things. You’ll also need to align with the business, asking questions like, “Do we have SLAs on this?” and “What’s the performance requirement?” You’ll also need to establish health rules and work with the business to determine, for instance, what action to take if a particular rule is violated.

Choose Your BTs Wisely

At a high level, a Business Transaction is more like a use case, even though users often think of it as a technical transaction. Sometimes I must remind users: “No, this activity you want to monitor is not a business transaction. It’s just a technical function of the system; it’s not being used by a customer or an API user.” Such cross-cutting activities are often better monitored through views like Service Endpoints or specific technical metrics.

Be very selective when choosing your Business Transactions. Here’s a rule of thumb: Configure up to 20 to 30 BTs per business application. This may not seem like a lot, but really it is. One of AppDynamics’ largest banking customers identified that 90% of its business activity was reflected in just 25 or so business transactions.

It’s not uncommon for new users to balk at this. They may say, “But we have many more important processes to track!” Fear not: the recommended number of BTs isn’t set in stone, although our 20-to-30 guideline is a good starting point. You may have 20 key Business Transactions and another 20 that are less critical, but you really want to monitor all 40. You can do this, of course, but you’ll need to prioritize these transactions. Capturing too many BTs can lead users to miss the transactions that are truly important to the business.

Best Practices

During APM setup, you’ll have many questions. Should you work exclusively with your own technical team? With the application owner? The business that’s using the application?

Start with these three key steps:

  1. Get to know your business.
  2. Identify the major flows.
  3. Talk to the application owner.


Whenever I’m onsite with a customer, the first thing I advise is that we log in as an end user to see how they use the system. For example, we’ll order a product or renew a subscription, and then track these transactions end-to-end through the system. This very important step will help you identify the transactions you want to monitor.

It’s also critical to check the current major incidents you have, or at least the P1s and P2s. Find out what problems you’re experiencing right now. What are the major complaints involving the application?

Focus on the low-hanging fruit—your most troublesome applications—which you’ll find by instrumenting systems and talking to application owners. This will deliver value in the early setup stage, providing information you can take to the business to make them more receptive to working with you.

Prioritize Your Operations

Business Transactions are key to configuring APM. Before starting configuration, ask yourself these critical questions:

  1. What are my business goals for monitoring?
  2. What pain points am I trying to solve with AppDynamics?
  3. What are the typical problems that take up my time and resources?
  4. What are the most critical business operations that need to perform optimally?


Then take a closer look at your application. Decide which operations you must focus on to achieve your goals.

These key steps will help you prioritize operations and make it easier to configure them as Business Transactions. Go here to learn more!

Accelerate Your Digital Business with AppDynamics Winter ‘17 Product Release

Last month at AppD Summit New York, we unveiled the latest innovations in our Business iQ and App iQ platforms, paving the way for a new era of the CIO and digital business. Delivering on this vision, we’re excited to announce the general availability of AppDynamics’ Winter ‘17 Release for our customers.

As application and business success become indistinguishable, enterprises are increasing their investment in digital initiatives. According to Gartner, 71% of enterprises are actively implementing digital strategies, and IDC predicts that companies will spend $1.2 trillion on their digital transformation in 2017 alone.

But without effective tools to correlate application and business performance – and without end-to-end visibility across customer touchpoints, application code, infrastructure, and network – customer experiences and employee productivity are degraded, and executives can’t analyze or justify technology investments. In fact, according to McKinsey, the digital promise still seems more of a hope than a reality, with only 12% of technology and C-level executives confident that IT organizations have been effective in this shift.

Winter ‘17 Release is Here

Business iQ just got better. Bridging the gap between the app and the business, BiQ capabilities have expanded to include:

Business Journeys

With AppDynamics Business Journeys, application teams can link multiple, distributed business events into a single process that reflects the way customers interact with the business. Business events can include transaction, log, mobile, browser, synthetic, or custom events, and journeys can be long-running, spanning hours to days.

Application teams can create performance thresholds and quickly visualize where performance issues are impacting the customer experience. KPIs for each Business Journey inform technology investments and effectively prioritize code development and release.

In the two figures below, you can see how easy it is to set up a new Business Journey for loan approvals and visualize the impact of delays through the lens of the business.


Fig 1: Author an end-to-end Business Journey by joining multiple distributed events.


Fig 2: Quickly and easily create custom dashboards visualizing business performance.

Experience Level Management (XLM)

With XLM, enterprises can establish custom service-level thresholds by customer segment, location, or device. For example, the CIO of a major retailer may deliver tailored experiences to its top customers by setting performance thresholds across its customer channels — including website, mobile apps, in-store wireless, and in-store checkout. XLM also provides an immutable audit trail for service-level agreements with your customers or internal business units. The product images below show the service-level setup for a connected streaming device, giving an instant view of how services are performing against set SLAs.


Fig 3: Service levels setup for a connected streaming device.

Network Visibility

Application developers, IT Ops, and network teams often work in silos using a myriad of different monitoring tools. To troubleshoot application performance issues, war rooms are created, and the lack of a common language and visibility across different tools results in finger pointing, endless debates, and slower Mean Time to Resolution (MTTR).

With the introduction of AppDynamics Network Visibility, a capability AppDynamics is uniquely positioned to address now as part of Cisco, enterprises will be able to understand the impact that the network is having on application and business performance. Network performance measurements are automatically correlated with application performance in the context of the Business Transaction. IT teams will be able to triage network issues from a single pane of glass and provide the right information to network teams before there is an impact on the end-user experience. Finally, an answer to end-to-end visibility from customer, to code, to network is here.

AppDynamics automatically discovers network devices such as reverse proxy load balancers deployed on-premises and in cloud environments and eliminates the need to use expensive network tools such as SPAN/TAP to capture and analyze network traffic.

The animation below shows out-of-box visibility into network flow maps, network metrics such as latency, throughput, retransmission rates, and critical errors, enabling IT Ops to quickly identify and isolate root cause without the need to engage network teams.


Fig 4: Correlated and out-of-box view of network performance in context of application performance.

AppDynamics IoT

IoT devices create another channel to engage with customers and, if properly measured and optimized, can create game-changing business benefits. With new IoT visibility, businesses can gain rich and invaluable insights into consumer behavior, buying patterns, and business impact. IoT visibility includes:

Device analytics — Together with Business iQ, IoT visibility provides unprecedented insight into how IoT devices are driving business impact. And because these insights are delivered through a single platform, IoT visibility is the first and only solution that maps and correlates entire customer journeys — from the device to customer touchpoint, to business conversions.

Device application visibility and troubleshooting — AppDynamics’ new IoT visibility provides an aggregated view into device uptime, version status, and performance, enabling drill-down views into the device to simplify the troubleshooting of IoT applications. The screenshot below shows a list view of all active devices. A simple double-click on a specific device takes you to the device details.

Custom dashboards — Every company measures success differently. With custom dashboards in IoT visibility, companies from any vertical can quickly build new visualizations to measure the business impact of IoT devices — from the revenue impact of a slow checkout for a brick and mortar retailer, to the customer impact of a software change in a connected car.


Fig 5: Consolidated list view of all active smart-shelf IoT devices and key KPIs.

Synthetic Private Agent

AppDynamics Winter ‘17 Release brings Browser Synthetic Monitoring to your internal network. By running Synthetic Private Agent on-premises, you can monitor the availability and performance of internal websites and services that aren’t accessible from the public Internet. You can also test from specific locations within your company, set alerts when performance issues occur, and fix them before end-user experience is impacted.

Cross-Controller Federation

As application teams adopt microservices architectures, scalability requirements have exploded, and APM must scale with them. With Cross-Controller Federation, AppDynamics is taking unified monitoring to the next level. Our customers can achieve limitless scalability and flexibility to deploy application components across multiple public and private clouds.

Only with AppDynamics can customers get complete correlated visibility and quick drill-down to the line of code, irrespective of where application components and controllers are deployed, because controllers can participate in a federation. Another important use case is keeping APM data isolated by deploying multiple controllers yet maintaining correlated visibility for compliance, architecture, and business reasons.

KPI Analyzer

KPI Analyzer applies machine learning to automate root cause analysis. With the KPI analyzer, customers can isolate the metrics that are the most likely contributors to poor performance, and identify the likely degree of impact on the KPI for each metric, automatically. The KPI analyzer makes troubleshooting root cause as simple as clicking a prompt to surface the underlying issue most likely to be the root cause of degraded performance.

The following figure shows KPI Analyzer in action. KPIs such as average response time are displayed with metrics that are automatically identified as the root cause and scored in ranked order for quick resolution.


Fig 6: Key application KPIs and automatically-detected root causes in ranked order.

Learn More

AppDynamics’ Winter ‘17 Release is rich with other important features such as Universal Agent to simplify agent installation and configuration, Enterprise Console for streamlined controller lifecycle management, and Node.js flame graph for deeper visibility, among several other features.

Join us for a webinar on November 16th to get an in-depth look into the latest innovations and features in our Winter ‘17 Release. You can also get started with the free trial of AppDynamics Winter ‘17 today!

Introducing Real-Time Business Metrics

I’ve been looking forward to writing this blog for some time. I have worked with many enterprise customers to document the pain they solved using AppDynamics, and a common question I always ask is “What was the actual business impact of that slowdown or outage?” The result is that most customers guesstimate the revenue impact of slow performance, and are generally nervous about calculating such a number.

They’re nervous because it might expose to the business how much revenue IT is costing it each year through incidents and outages. That’s certainly one way to look at things. However, if you flip the problem around, IT could actually show the business how much revenue it created as a result of agile releases or initiatives such as SOA, cloud, and virtualization.

Imagine a new application feature suddenly caused a 5% increase in revenue. Wouldn’t it be cool for IT to share this fact with the business? With AppDynamics’ new real-time business metrics, IT can do just that. Here’s how it works…

1. Monitoring Business Transactions

A business transaction is a type of user request in your application. AppDynamics can auto-discover these and monitor the response time of each request, allowing IT to see the real end-user experience and detect problems instantly as they happen. For example, below is a Checkout transaction from one of our customers that was requested 4,639 times; 53 of those requests returned errors and over 700 were classified as slow relative to their normal performance baseline.


2. Extracting Revenue Metrics from Business Transactions

Once you start discovering and monitoring the performance of business transactions, the next step is to define which key business data you want to extract and report on. In AppDynamics you can define “Information Points,” which are essentially custom metrics extracted from method parameters or return values in your application code. For example, in the screenshot below I created an information point called “Checkout Revenue” and specified the application code from which AppDynamics should extract the revenue values; in this example it was the method signature:

com.online.business.action.CheckOut.confirm() 

I then created a custom metric called Checkout Revenue based on a SUM operation on the getter chain:

getShoppingCart().getAdjustedTotal().getAmount().intValue()

AppDynamics will now extract all the checkout revenue values from every transaction and make this available as a new metric “Checkout Revenue” which can be reported in real-time just like any other AppDynamics metric.

Business Metrics Wizard
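To make the getter chain concrete, here is a minimal, hypothetical sketch of what application classes behind that signature might look like. The class and field names simply mirror the configuration above and are assumptions for illustration only, not AppDynamics APIs or any customer’s actual code:

public class CheckOut {
    private final ShoppingCart shoppingCart = new ShoppingCart();

    // The instrumented method: AppDynamics matches this signature and, on each
    // invocation, evaluates the configured getter chain against the CheckOut instance.
    public void confirm() {
        // getShoppingCart().getAdjustedTotal().getAmount().intValue()
        // would resolve to the order total for this checkout (100 in this sketch).
    }

    public ShoppingCart getShoppingCart() { return shoppingCart; }
}

class ShoppingCart {
    public Money getAdjustedTotal() { return new Money(new java.math.BigDecimal("100.00")); }
}

class Money {
    private final java.math.BigDecimal amount;
    Money(java.math.BigDecimal amount) { this.amount = amount; }
    public java.math.BigDecimal getAmount() { return amount; }
}

With the SUM operation, every call to confirm() adds the value returned by the chain to the Checkout Revenue metric, so a minute with ten $100 checkouts reports $1,000 of revenue for that minute.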

3. Correlating Application Response Times with Application Revenue

Now that AppDynamics is monitoring the performance and revenue of your business transactions, it’s possible to correlate and report these metrics over time so IT can understand their relationship. Take the below example, which shows the revenue per minute vs. the response time per minute of the application. As the screenshot shows, it’s pretty clear what the real business impact of this slowdown was to the business. Now imagine the reverse: imagine the application got faster and that had a positive impact on revenue and transaction throughput. Wouldn’t it be great to track this information over time so you can see the real impact of agile release cycles?

Correlating revenue and performance

4. Creating Real-time Business Dashboards

Today nearly every monitoring dashboard is about application response times, or the health and resource usage of infrastructure. So when something glows red or flashes on a dashboard, it denotes something very bad is happening. The reality is that most dashboards glow red every day when performance and resource usage spike. When is a problem really a problem? With real-time business metrics you can now mash and fuse business KPIs with your application and infrastructure metrics. So when something turns red, you can see the revenue impact of the issue.

real-time business metrics dashboard

5. Being Pro-Active with Business Alerts

Looking at monitoring dashboards periodically (like the above) is the first step to being pro-active with business impact. However, if you want to be truly pro-active you need to automate this entire process and let your monitoring solution do the alerting for you. The great thing with AppDynamics is that it can self-learn the normal value of every metric it collects, and create a dynamic baseline (threshold) over time. This allows it to accurately detect deviations caused by abnormal activity. So just as we can detect deviations in application performance, we can now do the same for your application revenue or order throughput. For example, one of our customers, Orbitz, said:

“If we’ve sold less than $1,000 in five minutes, there is probably a problem, even if it’s 2 o’clock in the morning. If our sales have flatlined, that’s a critical problem. I don’t know how to be any clearer.”

Geoff Kramer, Manager of Quality Engineering at Orbitz Worldwide

The ability to alert on business impact vs. application or infrastructure performance can be a game changer. It helps IT truly align with the priorities and needs of the business, allowing them to speak the same language and manage the bottom line.
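To illustrate the idea behind such an alert — a rough sketch of the general concept, not AppDynamics’ actual baselining algorithm — imagine keeping a rolling view of revenue per minute and flagging any sample that falls far outside the learned range:

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: learn a rolling baseline for a business metric
// (revenue per minute) and flag samples that deviate sharply from it.
public class RevenueBaseline {

    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;

    public RevenueBaseline(int windowSize) { this.windowSize = windowSize; }

    // Record the latest revenue-per-minute sample; return true if it looks abnormal.
    public boolean recordAndCheck(double revenuePerMinute) {
        boolean abnormal = window.size() == windowSize
                && Math.abs(revenuePerMinute - mean()) > 3 * stdDev();
        window.addLast(revenuePerMinute);
        if (window.size() > windowSize) {
            window.removeFirst();   // the baseline keeps adapting as new samples arrive
        }
        return abnormal;
    }

    private double mean() {
        return window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }

    private double stdDev() {
        double m = mean();
        double variance = window.stream()
                .mapToDouble(v -> (v - m) * (v - m)).average().orElse(0);
        return Math.sqrt(variance);
    }
}

A monitoring solution doing this for every metric it collects is what turns “sales have flatlined at 2 o’clock in the morning” from a morning-after discovery into an immediate, actionable alert.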

You can get started today with real-time business metrics by signing up and taking a free trial of AppDynamics Pro here.


Turnkey APMaaS by AppDynamics

Since we launched our Managed Service Provider program late last year, we’ve signed up many MSPs that were interested in adding Application Performance Management-as-a-Service (APMaaS) to their service catalogs.  Wouldn’t you be excited to add a service that’s easy to manage but more importantly easy to sell to your existing customer base?

Service providers like Scicom definitely were (check out the case study), because they are being held responsible for the performance of their customers’ complex, distributed applications, but oftentimes don’t have visibility inside the actual application.  That’s like being asked to officiate an NFL game with your eyes closed.


The sad truth is that many MSPs still think that high visibility in app environments equates to high configuration, high cost, and high overhead.

Thankfully this is 2013.  People send emails instead of snail mail, play Call of Duty instead of Pac-Man, listen to Pandora instead of cassettes, and can have high visibility in app environments with low configuration, low cost, and low overhead with AppDynamics.

Not only do we have a great APM service to help MSPs increase their Monthly Recurring Revenue (MRR), we make it extremely easy for them to deploy this service in their own environments, which, to be candid, is half the battle.  MSPs can’t spend countless hours deploying a new service.  It takes focus and attention away from their core business, which in turn could endanger the SLAs they have with their customers.  Plus, it’s just really annoying.

Introducing: APMaaS in a Box

Here at AppDynamics, we take pride in delivering value quickly.  Most of our customers go from nothing to full-fledged production performance monitoring across their entire environment in a matter of hours in both on-premise and SaaS deployments.  MSPs are now leveraging that same rapid SaaS deployment model in their own environments with something that we like to call ‘APMaaS in a Box’.

At a high level, APMaaS in a Box is large cardboard box with air holes and a fragile sticker wherein we pack a support engineer, a few management servers, an instruction manual, and a return label…just kidding…sorry, couldn’t resist.


Simply put, APMaaS in a Box is a set of files and scripts that allows MSPs to provision multi-tenant controllers in their own data center or private cloud and provision AppDynamics licenses for customers themselves…basically it’s the ultimate turnkey APMaaS.

By utilizing AppDynamics’ APMaaS in a Box, MSPs across the world are leveraging our quick deployment, self-service license provisioning, and flexibility in the way we do business to differentiate themselves and gain net new revenue.

Quick Deployment

Within 6 hours, MSPs like NTT Europe who use our APMaaS in a Box capabilities will have all the pieces they need in place to start monitoring the performance of their customers’ apps.  Now that’s some rapid time to value!

Self-Service License Provisioning

MSPs can provision licenses directly through the AppDynamics partner portal.  This gives you complete control over who gets licenses and makes it very easy to manage this process across your customer base.

Flexibility

An MSP can get started on a month-to-month basis with no commitment.  Only paying for what you sell eliminates the cost of shelfware.  MSPs can also sell AppDynamics however they would like to position it and can float licenses across customers.  NTT Europe uses a 3-tier service offering so customers can pick and choose the APM services they’d like to pay for.  Feel free to get creative when packaging this service for customers!

Conclusion

As more and more MSPs move up the stack from infrastructure management to monitoring the performance of their customers’ distributed applications, choosing an APM partner that understands the Managed Services business is of utmost importance.  AppDynamics’ APMaaS in a Box capabilities align well with internal MSP infrastructures, and our pricing model aligns with the business needs of Managed Service Providers – we’re a perfect fit.

MSPs who continue to evolve their service offerings to keep pace with customer demands will be well positioned to reap the benefits and future revenue that comes along with staying ahead of the market.  To paraphrase The Great One, MSPs need to “skate where the puck is going to be, not where it has been.”  I encourage all you MSPs out there to contact us today to see how we can help you skate ahead of the curve and take advantage of the growing APM market with our easy to use, easy to deploy APMaaS in a Box.  If you don’t, your competition will…

How Fast are your Web Services?

Every day in our lives we rely on services provided by other people. Making a phone call, getting a car fixed, or ordering a pizza – we want those things to happen as quickly as possible, because time often means money. If you take your car to a Mercedes or BMW dealer, you will understand this point better than anyone. Our productivity (and often happiness) is therefore controlled, every day, by different organizations and people. When things slow down or don’t happen we get upset, frustrated, and sometimes rant on Twitter.

If your application today has SOA design principles, is heavily distributed and relies on 3rd party service providers, then you’ve probably become frustrated at some point when your application slows down or crashes. The problem is this: your end user experience and quality of service (QoS) is only as good as the QoS of your service providers. So, unless you monitor QoS you can’t measure QoS–and if you can’t measure QoS, you can’t manage your service providers and your end user experience. For example, take a look at this customer e-commerce application, which has 7 JVMs, 1 database and 7 external web service providers:

This customer recently had a slowdown with their e-commerce production application. After a few minutes browsing AppDynamics, they successfully identified that one of their web service providers was having latency issues (AppDynamics automatically baselines performance and flags deviations for each web service provider as shown in the above screenshot). The customer called their service provider, and sure enough the service provider admitted to having issues. A few hours later the service provider called back and said “we fixed the problem, everything should be back to normal”–yet the customer could clearly see latency issues still occurring in AppDynamics. So they sent their service provider a screenshot showing the evidence. The service provider then checked again, and called back a few minutes later saying “Yes, sorry a few customers are still being impacted.” Without this level of visibility, many organizations are simply blind to how external service providers impact their end user experience and business.

Being able to troubleshoot slow performance in minutes is helpful, but what about being able to report the exact service level you receive–say, from each of your service providers over a period of time? Did your service improve over time or did it regress? How many outages or severity 1 incidents did your service providers cause this week for your application?

Take the below screenshot from AppDynamics, which plots the maximum response time for five different web services consumed by an application over the last week. You can see that three out of the five web services (denoted by pink, blue and turquoise lines) consistently deliver sub-second response times and provide a great service level. However, the other two web services (red and green lines) show performance spikes with response times of between 14 and 22 seconds. The green web service in particular is very inconsistent and shows several performance spikes in two days.

Below is the response time of another web service (PayPal) for a customer application over the last 3 months. Notice the spikes in response time and look at the deviation between average and maximum response time over the time period. What’s impressive is that despite the occasional service blip, the PayPal service has slowly improved by 14%, from 450 milliseconds to around 385 milliseconds. It has also been very stable over the last few weeks, delivering a consistent service (small deviation between average and maximum response time).

If your application relies on one or more 3rd party web services, you should periodically check and report what level of service you are receiving each week. That way, you can truly understand your service provider QoS and its impact on your end user experience and application performance. You can also keep your service providers honest, with complete visibility of whether QoS is improving or degrading over time as service outages occur and are fixed.

The next time you experience a slowdown or outage in your application, you should first check external web services before you start to troubleshoot your own. The last thing you want to be doing is debugging your own code, when it could be someone else’s service and code that is causing the issue. Using AppDynamics it’s possible to monitor, measure, and manage the QoS from each of your web service providers. You can get started right now by downloading AppDynamics Lite (our free edition) for a single JVM or IIS web server, or you can request a 30-day trial of AppDynamics Pro (our commercial edition) for Java or .NET applications with multiple JVMs and IIS web servers.

UX – Monitor the Application or the Network?

Last week I flew into Las Vegas for #Interop fully suited and booted in my big blue costume (no joke). I’d been invited to speak in a vendor debate on User eXperience (UX): Monitor the Application or the Network? NetScout represented the Network, AppDynamics (and me) represented the Application, and “Compuware dynaTrace Gomez” sat on the fence representing both. Moderating was Jim Frey from EMA, who did a great job introducing the subject, asking the questions and keeping the debate flowing.

At the start each vendor gave their usual intro and company pitch, followed by their own definition of what User Experience is.

Defining User Experience

So at this point you’d probably expect me to blabber on about how application code and agents are critical for monitoring the UX? Wrong. For me, users experience “Business Transactions”–they don’t experience applications, infrastructure, or networks. When a user complains, they normally say something like “I can’t Login” or “My checkout timed out.” I can honestly say I’ve never heard them say –  “The CPU utilization on your machine is too high” or “I don’t think you have enough memory allocated.”

Now think about that from a monitoring perspective. Do most organizations today monitor business transactions? Or do they monitor application infrastructure and networks? The truth is the latter, normally with several toolsets. So the question “Monitor the Application or the Network?” is really the wrong question for me. Unless you monitor business transactions, you are never going to understand what your end users actually experience.

Monitoring Business Transactions

So how do you monitor business transactions? The reality is that both Application and Network monitoring tools are capable, but most solutions have been designed not to, instead providing a more technical view for application developers and network engineers. This is wrong, very wrong, and a primary reason why IT never sees what the end user sees or complains about. Today, SOA means applications are more complex and distributed, meaning a single business transaction could traverse multiple applications that potentially share services and infrastructure. If your monitoring solution doesn’t have business transaction context, you’re basically blind to how application infrastructure is impacting your UX.

The debate then switched to how monitoring the UX differs from an application and network perspective. Simply put, application monitoring relies on agents, while network monitoring relies on sniffing network traffic passively. My point here was that you can either monitor user experience with the network or you can manage it with the application. For example, with network monitoring you only see business transactions and the application infrastructure, because you’re monitoring at the network layer. In contrast, with application monitoring you see business transactions, application infrastructure, and the application logic (hence why it’s called application monitoring).

Monitor or Manage the UX?

Both application and network monitoring can identify and isolate UX degradation, because they see how a business transaction executes across the application infrastructure. However, you can only manage UX if you can understand what’s causing the degradation. To do this you need deep visibility into the application run-time and logic (code). Operations telling a Development team that their JVM is responsible for a user experience issue is a bit like Fedex telling a customer their package is lost somewhere in Alaska. Identifying and Isolating pain is useful, but one could argue it’s pointless without being able to manage and resolve the pain (through finding the root cause).

Netscout made the point that with network monitoring you can identify common bottlenecks in the network that are responsible for degrading the UX. I have no doubt you could, but if you look at the most common reason for UX issues, it’s related to change–and if you look at what changes the most, it’s application logic. Why? Because Development and Operations teams want to be agile, so their applications and business remain competitive in the marketplace. Agile release cycles mean application logic (code) constantly changes. It’s therefore not unusual for an application to change several times a week, and that’s before you count hotfixes and patches. So if applications change more than the network, one could argue that application monitoring is more effective for monitoring and managing the end user experience.

UX and Web Applications

We then debated which monitoring concept was better for web-based applications. Obviously, network monitoring is able to monitor the UX by sniffing HTTP packets passively, so it’s possible to get granular visibility on QoS in the network and application. However, the recent adoption of Web 2.0 technologies (ajax, GWT, Dojo) means application logic is now moving from the application server to the users browser. This means browser processing time becomes a critical part of the UX. Unfortunately, Network monitoring solutions can’t monitor browser processing latency (because they monitor the network), unlike application monitoring solutions that can use techniques like client-side instrumentation or web-page injection to obtain browser latency for the UX.

The C Word

We then got to the Cloud and debated which made more sense for monitoring UX. Well, network monitoring solutions are normally hardware appliances which plug directly into a network tap or span port. I’ve never asked, but I’d imagine the guys in Seattle (Amazon) and Redmond (Windows Azure) probably wouldn’t let you wheel a network monitoring appliance into their data-centre. More importantly, why would you need to if you’re already paying someone else to manage your infrastructure and network for you? Moving to the Cloud is about agility, and letting someone else deal with the hardware and pipes so you can focus on making your application and business competitive. It’s actually very easy for application monitoring solutions to monitor UX in the cloud. Agents can piggyback on application code libraries when they’re deployed to the cloud, or cloud providers can embed and provision vendor agents as part of their server builds and provisioning process.

What’s also interesting is that Cloud is highlighting a trend towards DevOps (or NoOps for a few organizations) where Operations become more focused on applications vs infrastructure. As the network and infrastructure become abstracted in the Public Cloud, the focus naturally shifts to the application and deployment of code. For private clouds you’ll still have network Ops and Engineering teams that build and support the Cloud platform, but they wouldn’t be the people who care about user experience. Those people would be the Line of Business or application owners whom the UX impacts.

In reality most organizations today already monitor the application infrastructure and network. However, if you want to start monitoring the true UX, you should monitor what your users experience, and that is business transactions. If you can’t see your users’ business transactions, you can’t manage their experience.

What are your thoughts on this?

AppDynamics is an application monitoring solution that helps you monitor business transactions and manage the true user experience. To get started sign-up for a 30-day free trial here.

I did have an hour spare at #Interop after my debate to meet and greet our competitors, before flying back to AppDynamics HQ. It was nice to see many of them meet and greet the APM Caped Crusader.

App Man.

Finding the Root Cause of Application Performance Issues in Production

The most enjoyable part of my job at AppDynamics is to witness and evangelize customer success. What’s slightly strange is that for this to happen, an application has to slow down or crash.

It’s a bittersweet feeling when End Users, Operations, Developers and many Businesses suffer application performance pain. Outages cost the business money, but sometimes they cost people their jobs–which is truly unfortunate. However, when people solve performance issues, they become overnight heroes with a great sense of achievement, pride, and obviously relief.

To explain the complexity of managing application performance, imagine your application is 100 haystacks that represent tiers, and somewhere a needle is hurting your end user experience. It’s your job to find the needle as quickly as possible! The problem is, each haystack has over half a million pieces of hay, and they each represent lines of code in your application. It’s therefore no surprise that organizations can take days or weeks to find the root cause of performance issues in large, complex, distributed production environments.

End User Experience Monitoring, Application Mapping and Transaction profiling will help you identify unhappy users, slow business transactions, and problematic haystacks (tiers) in your application, but they won’t find needles. To do this, you’ll need x-ray visibility inside haystacks to see which pieces of hay (lines of code) are holding the needle (root cause) that is hurting your end users. This X-Ray visibility is known as “Deep Diagnostics” in application monitoring terms, and it represents the difference between isolating performance issues and resolving them.

For example, AppDynamics has great End User Monitoring, Business Transaction Monitoring, Application Flow Maps and very cool analytics all integrated into a single product. They all look and sound great (honestly they do), but they only identify and isolate performance issues to an application tier. This is largely what Business Transaction Management (BTM) and Network Performance Management (NPM) solutions do today. They’ll tell you what and where a business transaction slows down, but they won’t tell you the root cause so you can resolve the issues.

Why Deep Diagnostics for Production Monitoring Matters

A key reason why AppDynamics has become very successful in just a few years is because our Deep Diagnostics, behavioral learning, and analytics technology is 18 months ahead of the nearest vendor. A bold claim? Perhaps, but it’s backed up by bold customer case studies such as Edmunds.com and Karavel, who compared us against some of the top vendors in the application performance management (APM) market in 2011. Yes, End User Monitoring, Application Mapping and Transaction Profiling are important–but these capabilities will only help you isolate performance pain, not resolve it.

AppDynamics has the ability to instantly show the complete code execution and timing of slow user requests or business transactions for any Java or .NET application, in production, with incredibly small overhead and no configuration. We basically give customers a metal detector and X-Ray vision to help them find needles in haystacks. Locating the exact line of code responsible for a performance issue means Operations and Developers solve business pain faster, and this is a key reason why AppDynamics technology is disrupting the market.

Below is a small collection of needles that customers found using AppDynamics in production. The simple fact is that complete code visibility allows customers to troubleshoot in minutes as opposed to days and weeks. Monitoring with blind spots and configuring instrumentation are a thing of the past with AppDynamics.

Needle #1 – Slow SQL Statement

Industry: Education
Pain: Key Business Transaction with 5 sec response times
Root Cause: Slow JDBC query with full-table scan

Needle #2 – Slice of Death in Cassandra

Industry: SaaS Provider
Pain: Key Business Transaction with 2.5 sec response times
Root Cause: Slow Thrift query in Cassandra

Needle #3 – Slow & Chatty Web Service Calls

Industry: Media
Pain: Several Business Transactions with 2.5 min response times
Root Cause: Excessive Web Service Invocation (5+ per trx)

Needle #4 – Extreme XML Processing

Industry: Retail/E-Commerce
Pain: Key Business Transaction with 17 sec response times
Root Cause: XML serialization over the wire.

Needle #5 – Mail Server Connectivity

Industry: Retail/E-Commerce
Pain: Key Business Transaction with 20 sec response times
Root Cause: Slow Mail Server Connectivity

Needle #6 – Slow ResultSet Iteration

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 30+ sec response times
Root Cause: Querying too much data

Needle #7 – Slow Security 3rd Party Framework

Industry: Education
Pain: All Business Transactions with > 3 sec response times
Root Cause: Slow 3rd party code

Needle #8 – Excessive SQL Queries

Industry: Education
Pain: Key Business Transactions with 2 min response times
Root Cause: Thousands of SQL queries per transaction

Needle #9 – Commit Happy

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 25+ sec response times
Root Cause: Unnecessary use of commits and transaction management.

Needle #10 – Locking under Concurrency

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 5+ sec response times
Root Cause: Non-Thread safe cache forces locking for read/write consistency

Needle #11 – Slow 3rd Party Search Service

Industry: SaaS Provider
Pain: Key Business Transaction with 2+ min response times
Root Cause: Slow 3rd Party code

Needle #12 – Connection Pool Exhaustion

Industry: Financial Services
Pain: Several Business Transactions with 7+ sec response times
Root Cause: DB Connection Pool Exhaustion caused by excessive connection pool invocation & queries

Needle #13 – Excessive Cache Usage

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 50+ sec response times
Root Cause: Cache Sizing & Configuration
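To give a flavor of the kind of code that hides behind a needle like #8, here is a hypothetical sketch of the classic N+1 query anti-pattern — invented table and class names, not any customer’s actual code. One query fetches the orders, then one extra query runs per order, so a single transaction can quietly issue thousands of SQL calls:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the N+1 query anti-pattern behind "Excessive SQL Queries".
public class OrderHistoryDao {

    // Anti-pattern: one query for the orders, then one additional query per order.
    // With thousands of orders, that is thousands of database round trips per transaction.
    List<String> loadOrderSummariesSlow(Connection conn, long customerId) throws SQLException {
        List<String> summaries = new ArrayList<>();
        try (PreparedStatement orders = conn.prepareStatement(
                "SELECT id FROM orders WHERE customer_id = ?")) {
            orders.setLong(1, customerId);
            try (ResultSet rs = orders.executeQuery()) {
                while (rs.next()) {
                    long orderId = rs.getLong("id");
                    try (PreparedStatement items = conn.prepareStatement(
                            "SELECT COUNT(*) AS item_count FROM order_items WHERE order_id = ?")) {
                        items.setLong(1, orderId);
                        try (ResultSet count = items.executeQuery()) {
                            count.next();
                            summaries.add(orderId + ": " + count.getInt("item_count") + " items");
                        }
                    }
                }
            }
        }
        return summaries;
    }

    // Fix: a single JOIN returns the same data in one round trip.
    List<String> loadOrderSummariesFast(Connection conn, long customerId) throws SQLException {
        List<String> summaries = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(
                "SELECT o.id, COUNT(i.id) AS item_count "
                + "FROM orders o LEFT JOIN order_items i ON i.order_id = o.id "
                + "WHERE o.customer_id = ? GROUP BY o.id")) {
            stmt.setLong(1, customerId);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    summaries.add(rs.getLong("id") + ": " + rs.getInt("item_count") + " items");
                }
            }
        }
        return summaries;
    }
}

In a code-level snapshot, the slow version shows up as the same JDBC call repeated hundreds or thousands of times within one business transaction — exactly the signature behind Needle #8.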

If you want to manage and troubleshoot application performance in production, you should seriously consider AppDynamics. We’re the fastest growing on-premise and SaaS based APM vendor in the market right now. You can download our free product AppDynamics Lite or take a free 30-day trial of AppDynamics Pro – our commercial product.

Now go find those needles that are hurting your end users!

App Man.

Code Deadlock – A Usual Suspect

Imagine you’re an operations guy and you’ve just received a phone call or alert notifying you that the application you’re responsible for is running slow. You bring up your console, check all related processes, and notice one java.exe process isn’t using any CPU but the other Java processes are.  The average sys admin at this point would just kill and restart the Java process, cross their fingers, and hope everything returns to normal (this actually does work most of the time). An experienced sys admin might perform a kill -3 on the Java process, capture a thread dump, and pass this back to dev for analysis. Now suppose your application returns to normal–end users stop complaining, you pat yourself on the back and beat your chest, and basically resume what you were doing before you were rudely interrupted.

The story I’ve just told may seem contrived, but I’ve witnessed it several times with customers over the years. The stark reality is that no one in operations has the time or visibility to figure out the real business impact behind issues like this. Therefore, little pressure is applied to development to investigate data like thread dumps so that root causes can be found and production slowdowns avoided in the future. It’s true that restarting a JVM or CLR will solve a fair few issues in production, but it’s only a temporary fix that papers over the real problems in the application logic and configuration.

Now imagine for one minute that operations could actually figure out the business impact of production issues, along with identifying the root cause, and communicate this information to Dev so problems could be fixed rapidly. Sounds too good to be true, right? Well, a few weeks ago an AppDynamics customer did just that and the story they told was quite compelling.

Code Deadlock in a distributed E-Commerce Application

The customer application in question was a busy e-commerce retail website in the US. The architecture was heavily distributed, with several hundred application tiers that included JVMs, LDAP servers, a CMS server, message queues, databases and 3rd party web services. Here is a quick glimpse of what that architecture looked like from a high level:

Detecting Code Deadlock

If we look at the AppDynamics problem pane (right) as the customer saw things, it shows the severity of their issues. At peak, the application was handling just over 4,000 business transactions per minute, and just under 1 million transactions a day in total. Approximately 2.5% of these transactions were impacted by the slowdown, which was the result of the 92 code deadlocks you see here that occurred during peak hours.

AppDynamics is able to dynamically baseline the performance of every business transaction type before classifying each execution as normal, slow, very slow or stalled depending on its deviation from its unique performance baseline. This is critical for understanding the true business impact of every issue or slowdown because operations can immediately see how many user requests were impacted relative to the total requests being processed by the application.

From this pane, operations were able to drill down into the 92 code deadlocks and see the events that took place as each code deadlock occurred. As you can see from the screenshot (below left), the sys admins during the slowdown kept restarting the JVMs (as shown) to try and make the issues go away. Unfortunately, this didn’t work given that the application was experiencing high concurrency under peak load.

By drilling into each Code Deadlock event, operations were able to analyze the various thread contentions and locate the root cause of the issue. The root cause of the slowdown turned out to be an application cache which wasn’t thread-safe. If you look at the screenshot below, showing the final execution of the threads in deadlock accessing the cache, you can see that one thread was trying to remove an item, another was trying to get an item, and the last thread was trying to put an item. 3 threads were trying to do a put, get and remove at the same time! This caused a deadlock to occur on cache access, thus causing the related JVM to hang until those threads were released via a restart.
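As a hypothetical illustration of how that happens (not the customer’s actual cache code), here is a minimal cache that takes two internal locks in inconsistent order. Under concurrent put, get, and remove calls, two threads can each hold one lock and wait forever for the other, hanging the JVM until it is restarted:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a non-thread-safe cache that can deadlock under concurrency.
public class DeadlockProneCache<K, V> {

    private final Object mapLock = new Object();
    private final Object statsLock = new Object();
    private final Map<K, V> entries = new HashMap<>();
    private long writes = 0;

    public void put(K key, V value) {
        synchronized (mapLock) {           // lock order: mapLock -> statsLock
            synchronized (statsLock) {
                entries.put(key, value);
                writes++;
            }
        }
    }

    public void remove(K key) {
        synchronized (statsLock) {         // lock order: statsLock -> mapLock (reversed!)
            synchronized (mapLock) {
                entries.remove(key);
                writes++;
            }
        }
    }

    public V get(K key) {
        synchronized (mapLock) {           // readers queue behind whichever writer holds mapLock
            return entries.get(key);
        }
    }
}

If one thread enters put() and grabs mapLock while another enters remove() and grabs statsLock, each then waits on the lock the other holds, and every subsequent get() queues up behind them. The fix is to acquire locks in a single consistent order or, better still, to use a purpose-built thread-safe structure such as ConcurrentHashMap.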

Analyzing Thread Dumps

Below you can see the thread dump that AppDynamics collected for one of the code deadlocks, which clearly shows where each thread was deadlocked. By copying the full thread dumps to clipboard, operations were able to see the full stack trace of each thread, thus identifying which business transactions, classes, and methods were responsible for cache access.

The root cause for this production slowdown may have been identified and passed to dev for resolution, but the most compelling part of this customer story was how they identified the real business impact that occurred. The application was clearly running slow, but what did the end user experience during the slowdown, and what impact would this have had on the business?

What was the Actual Business Impact?

The screenshot below shows all business transactions that were executing on the e-commerce web application during the five hour window before, during, and after the slowdown occurred.

Here are some hard hitting facts for the two most important business transactions inside this e-commerce application:

  • 46,463 Checkouts processed
    • 482 returned an error, 1325 were slow, 576 were very slow and 111 stalled.
  • 3,956 Payments processed
    • 12 returned an error, 242 were slow, 96 were very slow and 79 stalled

  • Error – the transaction failed with an exception.
  • Slow – the business transaction deviated from its baseline by more than 3 standard deviations.
  • Very Slow – the business transaction deviated from its baseline by more than 4 standard deviations.
  • Stalled – the transaction timed out.
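Here is a rough sketch of that classification logic. The threshold multiples mirror the definitions above, while the stall timeout and the example baseline numbers are illustrative assumptions, not AppDynamics defaults:

public class TransactionClassifier {

    enum Outcome { NORMAL, SLOW, VERY_SLOW, STALLED, ERROR }

    private static final double STALL_TIMEOUT_MS = 60_000;  // illustrative timeout only

    private final double baselineAvgMs;
    private final double baselineStdDevMs;

    public TransactionClassifier(double baselineAvgMs, double baselineStdDevMs) {
        this.baselineAvgMs = baselineAvgMs;
        this.baselineStdDevMs = baselineStdDevMs;
    }

    public Outcome classify(double responseTimeMs, boolean threwException) {
        if (threwException) return Outcome.ERROR;
        if (responseTimeMs >= STALL_TIMEOUT_MS) return Outcome.STALLED;
        double deviations = (responseTimeMs - baselineAvgMs) / baselineStdDevMs;
        if (deviations > 4) return Outcome.VERY_SLOW;
        if (deviations > 3) return Outcome.SLOW;
        return Outcome.NORMAL;
    }
}

For example, with a hypothetical Checkout baseline of 800ms average and 400ms standard deviation, a 2.5-second response would classify as very slow (4.25 standard deviations above baseline).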

If you take these raw facts and assume the average revenue per order is $100, then the potential revenue risk/impact of this slowdown was easily into six digits when you consider the end user experience for checkout and payment. Even if you take the 482 Errors and 111 Stalls relating to the Checkout business transaction alone – this still equates to around $60,000 of revenue at risk. And that’s a fairly conservative estimate!

If you add up all the errors, slow, very slow and stalls you see in the screenshot above, you start to picture how serious this issue was in production. The harsh reality is that incidents like this happen every day in production environments, but no one has visibility into the true business impact of them, meaning little pressure is applied to development to fix “glitches.”

Agile isn’t about Change, It’s about Results

If development teams want to be truly agile, they need to forget about constant change and focus on what impact their releases have on the business. The next time your application slows down or crashes in production, ask yourself one question: “What impact did that just have on the business?” I guarantee just thinking about that answer will make you feel cold. If development teams found out more often the real business impact of their work, they’d learn pretty quickly how fast, reliable and robust their application code really is.

I’m pleased to say no developers were injured or fired during the making of this real-life customer story; they were simply educated on what impact their non-thread safe cache had on the business. Failure is OK–that’s how we learn and build better applications.

App Man.

Why Alerts Suck and Monitoring Solutions need to become Smarter

I have yet to meet anyone in Dev or Ops who likes alerts. I’ve also yet to meet anyone who was fast enough to acknowledge an alert, so they could prevent an application from slowing down or crashing. In the real world alerts just don’t work: nobody has the time or patience anymore, alerts are truly evil, and no one trusts them. The most efficient alert today is an angry end user phone call, because Dev and Ops physically hear and feel the pain of someone suffering 🙂

Why? There is little or no intelligence in how a monitoring solution determines what is normal or abnormal for application performance. Today, monitoring solutions are only as good as the users that configure them, which is bad news because humans make mistakes, configuration takes time, and time is something many of us have little of.

It’s therefore no surprise to learn that behavioral learning and analytics are becoming key requirements for modern application performance monitoring (APM) solutions. In fact, Will Capelli from Gartner recently published a report on IT Operational Analytics and pattern-based strategies in the data center. The report covered the role of Complex Event Processing (CEP), behavior learning engines (BLEs) and analytics as a means for monitoring solutions to deliver better intelligence and quality information to Dev and Ops. Rather than just collect, store and report data, monitoring solutions must now learn and make sense of the data they collect, thus enabling them to become smarter and deliver better intelligence back to their users.

Change is constant for applications and infrastructure thanks to agile cycles, so monitoring solutions must also change so they can adapt and stay relevant. For example, if the performance of a business transaction is 2.5 secs one week and drops to 200ms the week after because of a development fix, then 200ms should become the new performance baseline for that transaction; otherwise the monitoring solution won’t learn or alert on any performance regression. If the end user experience of a business transaction goes from 2.5 secs to 200ms, then end user expectations change instantly, and users become used to an instant response. Monitoring solutions have to keep up with user expectations, otherwise IT will become blind to the one thing that impacts customer loyalty and experience the most.
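As a rough sketch of what “learning the new normal” can look like — an illustration of the general idea, not AppDynamics’ implementation — an exponentially weighted moving average gradually forgets old behavior, so once a transaction improves from 2.5 seconds to 200ms, 200ms becomes its baseline and a drift back toward 2.5 seconds is flagged as a regression:

// Sketch of an adaptive baseline using an exponentially weighted moving average.
// Illustrative only; a real baselining engine also accounts for time of day, seasonality,
// minimum deviation floors, and so on.
public class AdaptiveBaseline {

    private double avgMs;        // learned "normal" response time
    private double deviationMs;  // learned typical deviation
    private final double alpha;  // how quickly the baseline forgets old behavior
    private boolean initialized = false;

    public AdaptiveBaseline(double alpha) { this.alpha = alpha; }

    // Feed the latest response time; returns true if it regresses well beyond the baseline.
    public boolean observe(double responseTimeMs) {
        if (!initialized) {
            avgMs = responseTimeMs;
            deviationMs = responseTimeMs / 10;
            initialized = true;
            return false;
        }
        boolean regression = responseTimeMs > avgMs + 3 * deviationMs;
        // Update the baseline so improvements (2.5s -> 200ms) become the new normal.
        deviationMs = (1 - alpha) * deviationMs + alpha * Math.abs(responseTimeMs - avgMs);
        avgMs = (1 - alpha) * avgMs + alpha * responseTimeMs;
        return regression;
    }

    public static void main(String[] args) {
        AdaptiveBaseline login = new AdaptiveBaseline(0.05);
        for (int i = 0; i < 200; i++) login.observe(2_500);  // old behavior: ~2.5s responses
        for (int i = 0; i < 200; i++) login.observe(200);    // after the fix, the baseline drifts to ~200ms
        System.out.println(login.observe(2_500));            // prints true: 2.5s is now a regression
    }
}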

AppDynamics Growth – Not Bad for 3 Years

AppDynamics has experienced significant growth over the past three years; here’s a quick summary of our key highlights.
