Dev and Ops Continue to Revisit Their Roles in the Enterprise in 2017

If there are two departments that have undergone huge changes in recent years, it’s development and operations, and 2017 looks likely to see this trend continue. What’s on the horizon for the often dev-centric DevOps teams and operations in particular?

A test of faith for DevOps

According to Gartner Research, DevOps is at the peak of inflated expectations and staring into the trough of disillusionment. Is this correct? At the Gartner Data Center London event last month, it was revealed that 38% of Gartner Circle members said they were already using DevOps today; equally, as presented at the same event, 87% of attendees across Gartner conferences stated that DevOps had not delivered the benefits they were expecting. “Changing the behaviors and culture are fundamental to the success of a bimodal IT approach. We estimate that, by 2018, 90 percent of I&O organizations attempting to use DevOps without specifically addressing their cultural foundations will fail,” according to Gartner Research Director Ian Head. So a lot of buzz surrounds the topic, which by my count earlier this month returned nearly 17.5 million search results.

Making the move to a DevOps approach is not for the faint-hearted or easily discouraged. It takes persistence, belief, and superior internal sales skills to lead others on the journey. The good news is that there is now a critical mass of enterprises that have made the move and are experiencing significant benefits. What it takes is a tactical approach to advocating and overseeing change in the face of opposition, momentum-sapping inertia, and difficulty adapting. In doing so, initial wins can be achieved, upon which further initiatives can be built.

An equal partner with the business

As DevOps makes its impact (high-performing IT organizations deploy 200 times more frequently than low performers, with 2,555 times faster lead times, for example), the relationship between IT and the business becomes intrinsically interlinked. The capabilities that deliver better-quality, more robust applications faster and with less waste open up significant potential for new customer offerings, improved customer relationships, and faster time-to-market. In short, DevOps adoption can mean a critical competitive edge. Decisions about what to build, when to release it, and when to update it should be the result of an ongoing peer-to-peer dialogue between tech and business leads. In parallel, IT teams overall need to position themselves as enablers of transformation, not inhibitors.

Organizational change

A few years ago, any DevOps introduction would almost inevitably include a classic “silo” picture of the dev team on one side and the ops folks on the other (and yes, I was one of those offenders). This situation is evolving now as new roles that blur this division emerge such as DevOps Engineer, Site Reliability Engineer, and Cloud Architect. They don’t sit easily 100% in one camp or another and possess hybrid skills that can prove a real asset in delivering robust, scalable, and rapidly deployable applications. Expect new structures to emerge that resemble a flotilla of speedboats rather than an ocean liner in terms of ability to respond to changing demands.

No metric blind spots

In an increasingly data-driven world, the complexity of today's applications should not be an excuse for the unavailability of insights into how they are performing. Whether an organization has chosen public, private, or hybrid cloud, is experimenting with microservices, or is embracing the potential of containers, rich, detailed, yet easy-to-understand metrics around every aspect should be at hand in real time.

IoT and DevOps: a brave new world

The IoT topic has been around for several years, but 2017 could be the inflection point. Gartner Research estimates that by 2020, more than half of major new business processes and systems will incorporate some element of the Internet of Things. While wearables may have the consumer eye, industrial IoT usage dwarfs that in consumer markets. As a counterpoint to the industries that have led in digital transformation (retail, banking, and telecommunications), the heaviest IoT users are likely to be in the oil, gas, utilities, and manufacturing industries, according to a global survey released March 3 by Gartner. No sector is immune to the need to review and evolve its application development approaches. In parallel, the DevOps relationship to IoT is an interesting one, particularly around the end user experience (it's a myriad of small devices now, not a tablet or smartphone) and security (think public Wi-Fi issues), to name but two. Expect more DevOps teams to work on applications with an IoT use case and go through a major learning curve.

DevOps and its relationship with data

It's not just big data that will matter, but all data, including databases themselves (which came up often at AppSphere 2016 as a cause of latency and downtime). From the rise of the data scientist role to the explosion of IoT data, DevOps teams cannot ignore this all-too-important area, and they need a point of view on how it should be managed. One particularly interesting area will be the increased use of business algorithms, with much of the data needed to build them held within the remit of IT operations teams. Machine learning APIs can play a significant role here, as they help developers apply machine learning to a dataset, enabling predictive features to be added to the applications they are building. One example is the Google Prediction API, a cloud-based machine learning and pattern matching tool. It helps with upsell opportunity analysis, provides details of customer sentiment and churn analysis, and detects spam, amongst many other features. Stephen Thair of DevOpsGuys has written a solid exploratory piece on this topic, which may not be as hot as containers but is still an essential consideration. DataOps experts share a related goal (faster time-to-market), and DevOps-centric organizations would be wise to keep a line of communication open between the two teams.

Security

This subject could be a blog all on its own. How can DevOps work more closely with security teams as data breaches threaten to damage a brand as much as a slow-responding app? What can DevOps do to ensure applications are built with security in mind from day one? How can anomalies or outliers in business transactions provide an early warning system for fraud? The potential for APM insights to assist in fraud detection is an interesting use case, which PayU, one of our customers, raised in their presentation at AppSphere 2016. Look out for more data breach headlines in 2017, though maybe not at the scale of the recently announced Yahoo hack, which is understood to have involved one billion accounts.

But what about operations, specifically? In some ways, the less prominent player in DevOps adoption, ops teams are perhaps more likely to undergo even greater changes than their dev counterparts in 2017.

Move from being a cost center to an innovation hub

Enterprise technology spending in 2017 is set to rise 3%, with $3.5 trillion expected to be invested in technology, according to Gartner Research. Yet within this environment, operations leads will still need to pivot from being seen as an area that risks being cut year-on-year to one that helps build the business and is in step with its goals. This requires many behavioral shifts: from running a tight ship to being a negotiator and intermediary between multiple cloud, software, and infrastructure vendors, and the lines of business themselves. The ability to market one's team and its contributions becomes almost as important as deep technical knowledge when pursuing investment. The established focus on being the data center guardian, measured only on stability, has to shift toward accepting that I&O must be anti-fragile rather than simply unchanging. This will take a fine balance between calculated gambles on new approaches and technologies, and acceptance of the fact that, as Ian Head of Gartner Research pointed out, a small amount of risk has to be accepted because it is impossible to innovate otherwise.

No ops team is an island

Having just misquoted John Donne, those in operations will need to accept that they will both lose and gain going forward. As Gartner Research pointed out at the Data Center Summit, infrastructure and operations environments have traditionally been the center of the ops team's world view. However, as the organization expands, operations staff will find that while the number of activities they are involved with increases, their ultimate control over specific areas reduces at the same time. Organizations are now taking a more collaborative approach. This means the creation of flexible and agile networks and ecosystems becomes ever more important, with the innate capability to scale and roll back investment areas as the business demands. This requires an astute ambassador with a business-savvy mind who builds agility into the I&O mindset, rather than a rigid enforcer who views change with suspicion.

Customer experience is key

As DevOps delivers an improved customer experience, there is potential for more insight into exactly what the end user sees and interacts with, bringing development in particular closer to understanding whether what they have built works, or doesn't, for its target audience. At AppDynamics, we have seen how end user monitoring of experiences across mobile, tablet, and PC platforms and devices is essential for understanding how, when, and where customers are engaging with an application. This has been critical in catching degradations in response time before they impact customers and the company's reputation.

The skills gap

How do enterprises build and nurture teams that are equipped for the digital business platform? Relying on rockstars and contractors is a short-term fix. How can the classic Gartner Research I&O employees, working predominantly on Mode 1 systems, refresh their skill sets and feel part of the “new way”? How can up-and-coming young talent be attracted not just by salaries, but by a culture where they feel valued and listened to? DevOps can't be just for the rockstars and stellar contractors; buy-in needs to come from colleagues who see the move as inclusive and non-elitist. There are many external consultancies that can provide excellent stewardship and direction in the tricky subject of enterprise DevOps, but the best ones are those who teach others how to fish, pairing up with and coaching them, rather than treating knowledge transfer as a threat to additional consulting fees.

If there is one standout theme for 2017, it’s the overwhelming need to revisit how dev and ops view their role in the organization, how they contribute value, their scope of responsibility, and the new mindsets needed to thrive in an ultra-high velocity world. In the words of Stephen Covey, continually “sharpen the saw”.

Be hungry to learn, question perceived wisdom when it doesn't fit with evolving demands, and remain open to trying new approaches without fear of failure: launching new platform initiatives, onboarding technology partners, and being measured on more outcome-based metrics.

Learn More

Download the eBook, 10 Things Your CIO Should Know About DevOps.

What’s new in the Summer ‘14 release for the Ops Team

AppDynamics recently announced our summer release, which builds upon our history of delivering game-changing functionality and innovations in application performance management. In our latest version, we've added many features that cater to operations-focused professionals. Let's take a closer look at some of those features.

Percentile metrics

AppDynamics has always had robust behavior-learning capabilities that automatically baseline the metrics we collect. Instead of having to tell our platform what normal behavior is, AppDynamics continually collects data and adjusts its dynamic baselines in real time. Percentile metrics give customers the added ability to analyze metrics at percentiles like the 90th, 95th, or 99th to get a better understanding of the distribution of those metrics. Basically, it allows operations teams to exclude outlier data to get a better understanding of what “normal” application behavior is. You're in full control, with the ability to set five different percentile levels across the entire application or even down to the individual transaction level.
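To make the intuition concrete, here is a minimal sketch (plain Python, not AppDynamics code) of how a percentile sidesteps the long-tail outliers that skew an average:

```python
import math

def percentile(values, pct):
    """pct-th percentile using the nearest-rank method (no interpolation)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# 20 response-time samples in milliseconds; two long-tail outliers at the end.
samples = [96, 101, 99, 104, 110, 95, 102, 107, 98, 103,
           100, 105, 97, 108, 112, 94, 106, 109, 4200, 9800]

print(sum(samples) / len(samples))  # mean ~= 792 ms, badly skewed by the tail
print(percentile(samples, 90))      # p90 = 112 ms, a truer picture of "normal"
```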

Configure percentile metrics and easily apply them to specific business transactions or to all existing business transactions.

This feature allows operations teams to improve problem detection with fewer false positives for long-tail distribution profiles.  Having to dig through less noise to find the problems that matter is always a good thing.

Smart dashboards

Building meaningful dashboards takes time and effort. AppDynamics has added dashboard template functionality that allows you to build once and deploy as many times as needed to new applications without any configuration changes. Just build a new dashboard and associate it to the proper application(s) or component(s). It’s that easy.


Business transaction discovery tools

AppDynamics has always automatically discovered business transactions out of the box for customers. This capability has saved our enterprise customers countless hours defining and configuring dozens or even hundreds of different business transactions. However, sometimes customers have unusual circumstances that require configuration rules to get their business transactions defined in a way that aligns with their unique business needs. Well, we've created another time-saving feature that allows AppDynamics administrators to model business transaction configurations in real time and only commit the configuration once they're getting the exact results they intended.

This feature is purely intended to make life as easy as possible for the operations staff who manage AppDynamics. It's another improvement that further reduces total cost of ownership, ensuring AppDynamics offers the best possible ROI.

Step 1: Pick a node.

Step 2: Edit the discovery configuration.

Step 3: Wait for load (real-time) and validate the results.

Step 4: Apply the configuration.

Windows Machine snapshots

Oftentimes operations teams can narrow the root cause of application issues down to the infrastructure, but can't automatically correlate the performance of the machine back to the application issue. This results in manual correlation efforts that increase the amount of time needed to isolate and fix these types of issues. AppDynamics now has a feature called machine snapshots that enables operations teams to get visibility into the status of machine resources and associate that with application health.
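Conceptually, a machine snapshot is a point-in-time capture of per-process resource usage that can be lined up against application health. The product gathers this through its agents; purely as an illustration of the idea, here is a sketch using the third-party psutil library (my choice for the example, not what the product uses):

```python
import time
import psutil  # third-party: pip install psutil

def machine_snapshot(top_n=5):
    """Capture a point-in-time view of the busiest processes on this machine."""
    procs = list(psutil.process_iter())
    for p in procs:
        try:
            p.cpu_percent(interval=None)  # prime the per-process CPU counters
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    time.sleep(1.0)  # measure over a one-second window
    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(interval=None),
                          p.memory_info().rss // (1024 * 1024),
                          p.pid, p.name()))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # processes can vanish mid-iteration
    usage.sort(reverse=True)  # busiest CPU consumers first
    return {"taken_at": time.time(), "top": usage[:top_n]}

for cpu, rss_mb, pid, name in machine_snapshot()["top"]:
    print(f"{name} (pid {pid}): {cpu:.1f}% CPU, {rss_mb} MB RSS")
```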

This feature has been released first for .NET. Let's face it, it's rare to have features released for .NET applications before Java apps get them, but we thought it was about time to even the score a bit. Java folks, you'll have to wait a little while longer for this feature.

View health metrics, like CPU and memory consumption per process, on the machine.

Monitor the state of IIS app pool queues.

Correlate physical machine health with business transaction performance using snapshots.

.NET: Object Instance Tracking

Memory leaks happen. When they happen in production, it's operations' responsibility to assist the development team by collecting the data that leads to the fix. Object Instance Tracking from AppDynamics adds this critical capability to its ever-growing arsenal of .NET monitoring and troubleshooting capabilities.


.NET: CLR Crash alerts and information for IIS Server

When CLRs crash, it is the job of the operations team to know that it happened and to figure out why. AppDynamics can now alert on your CLR crashes and automatically pull the details from the Windows Event Viewer, so you don't even have to log in to the Windows server to see what went wrong. This is another time-saving feature that is ideal for operations teams supporting .NET applications.


EUEM resource timing waterfall view

Today, many operations teams have difficulty understanding how load times for static resources impact the end user experience. For example, your team might wonder how a social media widget, a banner ad, or a set of images affects the web application's response time from the end user's perspective. AppDynamics now supports resource timing for web applications, providing the details you need to understand why end users are experiencing sub-optimal performance.

View resource timing details by type, domains requested, and waterfall breakdown.

By gaining visibility into how static resources are performing, operations teams can quickly identify performance issues affecting end users and enforce SLAs with third parties.

EUEM Analyze

Customers want to combine multiple metrics to drive the best insights into end user experience. The traditional hierarchical metric model doesn't scale to the exponential number of possible combinations customers might want. Thanks to the new analytics platform, the EUEM team has created a new page called Analyze that allows customers to perform multi-faceted searches on their end user data, uncovering answers to questions like “How was the end user experience for a specific page, in California, using the IE browser?”
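Since the platform's event service is built on Elasticsearch (see the beta notes later in this post), a multi-faceted search of this kind conceptually boils down to a filtered query against an event store. A hypothetical sketch (the index and field names are invented for illustration and are not the real AppDynamics schema):

```python
import json
import urllib.request

# Invented index and field names, purely to illustrate the shape of the query.
query = {
    "query": {"bool": {"filter": [
        {"term": {"page": "/checkout"}},
        {"term": {"geo.region": "California"}},
        {"term": {"browser": "IE"}},
    ]}},
    "aggs": {"avg_load_time": {"avg": {"field": "page_load_ms"}}},
    "size": 0,
}

req = urllib.request.Request(
    "http://localhost:9200/eum-events/_search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
print(result["aggregations"]["avg_load_time"]["value"])
```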


Mobile EUEM: Unique Crash Dashboard and Reports

Crashes on mobile applications aren't usually unique. Now customers can see how many crashes of a particular type have occurred and how many users were impacted, allowing them to better prioritize and focus their work. We also provide supporting details, like typical environments and stack traces, for each crash type.


EUEM on-premise

AppDynamics has always had secure data transmission and storage protocols; however, some enterprise customers have company policies that prevent them from utilizing our hosted EUEM cloud. Now, customers can host their own AppDynamics EUEM cloud on-premise so they can take advantage of our granular end user experience metrics and snapshots.

C/C++, Analytics, and Database monitoring beta programs

The last major point to make in this announcement is the fact that we have also released 3 new major feature sets into beta.

C/C++(beta):

  • SDK: The AppDynamics C/C++ SDK provides an industry-standard compile-and-deploy monitoring solution. Use the AppDynamics SDK if you have access to the source code and are willing to modify and recompile it to include the C++ Agent SDK.
  • Dynamic Agent: The C/C++ Dynamic Agent is an industry-leading monitoring solution that extends AppDynamics' breadth of monitoring capability to C/C++ applications. It requires no modifications to your proprietary applications and enables monitoring of third-party libraries for which you might not have access to the source code.

Analytics (beta):

Our Analytics module is layered upon the underlying data collection capabilities of AppDynamics. In this beta release of analytics, we have done major architecture and feature work, culminating in the ability to search and analyze business data collected in real time from Java applications as well as from logs. The key to this rich analytics capability lies in our new horizontally scalable events service, which will power exciting new analytics and search use cases for EUEM, APM, and database monitoring in our upcoming releases.

Our new infinitely scalable event service is a major part of our new Platform architecture. It is built on Elasticsearch and Kafka and will become the core foundation of event storage, processing, and analytics for all our products. It went live in 3.9, with the EUEM, database monitoring, and Analytics modules already using the new event service data platform. When we say that AppDynamics is an Application Intelligence Platform, we mean business.


Fully integrated database monitoring (beta):

The same UI, the same installer, and the same Intelligence Platform as the rest of the AppDynamics product ecosystem, available for both SaaS and on-premise deployments.

Our new fully integrated database monitoring (beta) product brings a whole new level of database information into the familiar AppDynamics UI. It’s fast, scalable, and 100% SaaS-ready. The new architecture allows us to build a very tight integration between APM and database monitoring, bridging the gaps between DBA, dev, and ops. Database monitoring now gains the added benefit of our behavioral learning engine, automatically creating baselines of “normal” behavior to compare current data against. It’s a major step forward for the database monitoring world.

Screens in the new product include detailed query analysis, a main database dashboard, and top queries analysis.

To try these features out today, sign up for a free trial here.

A Newbie Guide to APM

Today's blog post is headed back to the basics. I've been using and talking about APM tools for so many years that sometimes it's hard to remember the feeling of not knowing the associated terms and concepts. So for anyone who is looking to learn about APM, this blog is for you.

What does the term APM stand for?

APM is an acronym for Application Performance Management. You'll also hear the term Application Performance Monitoring used interchangeably, and that is just fine. Some will debate the details of monitoring versus management, and in reality there is an important difference, but from a terminology perspective it's a bit nit-picky.

What’s the difference between monitoring and management?

Monitoring is a term used when you are collecting data and presenting it to the end user. Management is when you have the ability to take action on your monitored systems. Management tasks can include restarting components, making configuration changes, collecting more information through the execution of scripts, etc… If you want to read more about the management functionality in APM tools, click here.
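A deliberately tiny sketch of the difference: the first function only observes (monitoring), while the second acts on what it sees (management). The health URL and the restart command are placeholder assumptions, not a real product integration:

```python
import subprocess
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # placeholder endpoint

def monitor():
    """Monitoring: collect a fact and present it, nothing more."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def manage():
    """Management: take action on the monitored system."""
    if not monitor():
        # Placeholder remediation; real management tasks include restarting
        # components, changing configuration, or running diagnostic scripts.
        subprocess.run(["systemctl", "restart", "my-app"], check=True)
```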

What is APM?

There is a lot of confusion about the term APM. Most of this confusion is caused by software vendors trying to convince people that their software is useful for monitoring applications. In an effort to create a standard definition for grouping software products, Gartner introduced a definition that we will review here.

Gartner lists five key dimensions of APM in their terms glossary found here… http://www.gartner.com/it-glossary/application-performance-monitoring-apm

End user experience monitoring – EUM and RUM are the common acronyms for this dimension of monitoring. This type of monitoring provides information about the response times and errors end users are seeing on their devices (mobile, browser, etc…). This information is very useful for identifying compatibility issues (the website doesn't work properly with IE8), regional issues (users in northern California are seeing slow response times), and issues with certain pages and functions (the JavaScript is throwing an error on the search page).


Runtime application architecture discovery modeling and display – This is a graphical representation of the components in an application or group of applications that communicate with each other to deliver business functionality. APM tools should automatically discover these relationships and update the graphical representation as soon as anything changes. This graphical view is a great starting point for understanding how applications have been deployed and for identifying and troubleshooting problems.


User-defined transaction profiling – This is functionality that tracks user activity within your applications across all of the components that service those transactions. A common term associated with transaction profiling is business transactions (BTs). A BT is very different from a web page. Here's an example… As a user of a website, I go to the login page, type in my username and password, then hit the submit button. As soon as I hit submit, a BT is started on the application servers. The app servers may communicate with many different components (LDAP, database, message queue, etc…) in order to authenticate my credentials. All of this activity is tracked, measured, and associated with a single “login” BT. This is a very important concept in APM.
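To see why a BT is more than a page, it helps to picture every downstream call being tagged with one transaction identifier. A toy sketch of that idea (real APM agents do this transparently; the component names here are invented):

```python
import contextvars
import uuid

# One identifier shared by everything that happens on behalf of a transaction.
current_bt = contextvars.ContextVar("business_transaction", default=None)

def start_business_transaction(name):
    current_bt.set({"name": name, "id": uuid.uuid4().hex})

def record(component, elapsed_ms):
    bt = current_bt.get()
    print(f"[BT {bt['name']}/{bt['id'][:8]}] {component}: {elapsed_ms} ms")

start_business_transaction("login")
record("LDAP", 12)          # credential lookup
record("Database", 31)      # session write
record("MessageQueue", 4)   # audit event
```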


Component deep-dive monitoring in application context – Deep dive monitoring is when you record and measure the internal workings of application components. For application servers, this would entail recording the call stack of code execution and the timing associated with each method. For a database server this would entail recording all of the SQL queries, stored procedure executions, and database statistics. This information is used to troubleshoot complex code issues that are responsible for poor performance or errors.
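As a rough illustration of what recording code execution with timing means, here is a toy decorator that times each call; production agents achieve the same effect through bytecode instrumentation rather than code changes:

```python
import functools
import time

def traced(func):
    """Record wall-clock time per call, the way a deep-dive agent would."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{func.__qualname__} took {elapsed_ms:.2f} ms")
    return wrapper

@traced
def authenticate(user):
    time.sleep(0.05)  # stand-in for an LDAP lookup
    return True

authenticate("jim")
```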


Analytics – This term leaves a lot to be desired, since it can be, and often is, very liberally interpreted. To me, analytics (in the context of APM) means baselining and correlating data to provide actionable information. To others, analytics can be as basic as providing reporting capabilities that simply format the raw data in a more consumable manner. I think analytics should help identify and solve problems and be more than just reporting, but that is my personal opinion.
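As a concrete, if simplistic, example of baselining: flag a metric only when it deviates from its own history rather than from a fixed threshold. A minimal sketch:

```python
from statistics import mean, stdev

def is_anomalous(history, current, sigmas=3.0):
    """Flag current only when it strays well outside its own baseline."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline, spread = mean(history), stdev(history)
    return abs(current - baseline) > sigmas * max(spread, 1e-9)

response_times = [102, 98, 110, 105, 99, 101, 107, 103]  # ms
print(is_anomalous(response_times, 104))  # False: within normal variation
print(is_anomalous(response_times, 400))  # True: actionable deviation
```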


Do I need APM?

APM tools have many use cases. If you provide support for application components or the infrastructure components that service the applications, then APM is an invaluable tool for your job. If you are a developer, then absolutely yes: APM fits right in with the entire software development lifecycle. If your company is adopting a DevOps philosophy, APM is a tool that is collaborative at its core and enables developers and operations staff to work together more effectively. Companies that are using APM tools consider them a competitive advantage because they resolve problems faster, solve more issues over time, and gain meaningful business insight.

How can I get started with APM?

First off, you need an application to monitor. Assuming you have access to one, you can try AppDynamics for free. If you want to understand more about the process most companies use to purchase APM tools, you can read about it by clicking here.

Hopefully this introduction has provided you with a foundation for starting an APM journey. If there are more related topics that you want me to write about please let me know in the comments section below.

Check out our complimentary ebook, Top 10 Java Performance Problems!

DevOps Is No Replacement for Ops

DevOps is gaining traction, according to a study at the end of 2012 by Puppet Labs, which concluded that DevOps adoption within companies grew by 26% year over year. DevOps is still misunderstood and still has tremendous room for greater adoption, but let's be clear about one very important thing: DevOps is not a replacement for operations!

If you're not as well versed in the term DevOps as you want to be, I suggest you read my “DevOps Scares Me” series.

Worst Developer on the Planet

The real question in my mind is where you draw the line between dev and ops. Before you get upset and start yelling that dev and ops should collaborate and not have lines between them, let me explain. If developers are to take on more aspects of operations, and operations personnel are to work more closely with developers, what are the duties that can be shared and what are the duties that must be held separate?


To illustrate this point I'll use myself as an example. I am a battle-hardened operations guy. I know how to get servers racked, stacked, cabled, loaded, partitioned, secured, tuned, monitored, etc… What I don't know how to do is write code. Simply put, I'm probably the worst developer on the planet. I would never trust myself to write code in any language to run a business. Now, in this particular case it comes down to the fact that I simply don't currently possess the proper skills to be a developer, but that's exactly my point. In order for DevOps to succeed, there needs to be a clear delineation of duties where expertise can be developed and applied.

Experts Needed

Operations is a mix of implementing standard processes, planning for the future, and sometimes working feverishly to solve problems that are impacting the business. Just like writing good code, providing good operational support takes experience. There are many areas where you need to develop expertise, and you shouldn't just start doing operations work unless you have some experience and a peer review of your work plan.


You do have a work plan, don't you? A work plan could be something as formal as a change control (ITIL would be good to read about) or something as informal as a set of steps written on a bar napkin (hopefully not, but it's possible). The point is that you need a plan, and you need someone to review your plan to see if it makes sense. Here are some of the things you need to consider when creating your plan:

  • Implementation process – What are you going to actually do?
  • Test plan – How will you be able to tell if everything is working properly after you implement your change?
  • Backout process – How do you revert back if things don’t go right?
  • Impacted application – What application are you actually working on?
  • Dependent application(s) – What applications are dependent upon the application you are changing?

The point I'm trying to make here is that DevOps does not change the fact that well-defined operations processes and practices are what keep companies' IT departments working smoothly. Adopting DevOps doesn't mean that you can revert back to the days of the wild west and just run around making changes however you see fit. If you are going to adopt DevOps as a philosophy within your organization, you need to blend the best of what each practice has to offer and figure out how to combine the practices in a manner that best meets your organization's requirements and goals.

Bad operations decisions can cost companies a lot of money and significant reputation damage in a short period of time. Bad code written by inexperienced developers can have the same effect but can be detected before causing impact. Did someone just release new code for your application to production without testing it properly? Find the application problems and bad code before they find you by trying AppDynamics for free today.

If all you have is logs, you are doing it wrong

Ten years ago, the standard way to troubleshoot an application issue was to look at the logs. Users would complain about a problem, you’d go to operations and ask for a thread dump, and then you’d spend some time poring over log files looking for errors, exceptions, or anything that might indicate a problem. There are some people who still use this approach today with some success, but for most modern applications logging is simply not enough. If you’re depending on log files to find and troubleshoot performance problems, then chances are your users are suffering – and you’re losing money for your business. In this blog we’ll look at how and why logging is no longer enough for managing application performance.

The Legacy Approach

The typical legacy web application was monolithic and fairly static, with a single application tier talking to a single database that was updated every six months. The legacy approach to monitoring production web applications was essentially a customer support loop. A customer would contact the support team to report an outage or bug, the customer support team would report the incident to the operations team, and the operations team would investigate by looking at the logs with whatever useful information they had from the customer (username, timestamps, etc.). If the operations team was lucky and the application had ample logging, they would spot the error and bring in developers to find the root cause and provide a resolution. This was the ideal scenario, but more often than not the logs were of very little use, and the operations team would have to wait for another user to complain about a similar problem and kick off the process again. Ten years ago, this was what production monitoring looked like. Apart from some rudimentary server monitoring tools that could alert the operations team if a server was unavailable, it was the end users who were counted on to report problems.


Logging is inherently reactive

The most important reason that logging was never truly an application performance management strategy is that logging is an inherently reactive approach to performance. Typically this means an end user is the one alerting you to a problem, which means that they were affected by the issue – and (therefore) so was your business. A reactive approach to application performance loses you money and damages your reputation. So logging isn’t going to cut it in production.

You’re looking for a needle in a haystack

Another reason why logging was never a perfect strategy is that system logs have a particularly low signal-to-noise ratio. This means that most of the data you're looking at (which can amount to terabytes for some organizations) isn't helpful. Sifting through log files can be a very time-consuming process, especially as your application scales, and every minute you spend looking for a problem is time that your customers are being affected by a performance issue. Of course, newer tools like Splunk, Loggly, SumoLogic, and others have made sorting through log files easier, but you're still looking for a needle in a haystack.

Logging requires an application expert

Which brings us to another reason logging never worked: Even with tools like Loggly and Splunk, you need to know exactly what to search for before you start, whether it’s a specific string, a time range, or a particular file. This means the person searching needs to be someone who knows the application well, usually a developer or an architect. Even then, their hunches could be wrong, especially if it’s a performance issue that you’ve never encountered before.

Not everyone has access to logs

Logging is a great tool for developers to debug their code on their laptops, but things get more complicated in production, especially if the application is dealing with sensitive data like credit card numbers. There are usually restrictions on the production system that prevent people like developers from accessing the production logs. In some organizations, log excerpts can be requested from the operations team, but this step can take a while. In a crisis, every second counts, and these processes (while important) can cost an organization money if your application is down.

It doesn’t work in production

Even in a perfect world where you have complete access to your application’s log files, you still won’t have complete visibility into what’s going on in your application. The developer who wrote the code is ultimately the one who decides what gets logged, and the verbosity of those logs is often limited by performance constraints in production. So even if you do everything right there’s still a chance you’ll never find what you’re looking for.

The Modern Approach

Today, enterprise web applications are much more complex than they were ten years ago. The new normal for these applications includes multiple application tiers communicating via a service-oriented architecture (SOA) that interacts with several databases and third-party web services while processing items out of caches and queues. The modern application has multiple clients from browser-based desktops to native applications on mobile. As a result, it can be difficult just to know where to start if you’re depending on log files for troubleshooting performance issues.


Large, complex, and distributed applications require a new approach to monitoring. Even if you have centralized log collection, finding which tier a problem is on is often a large challenge on its own. Usually, increasing logging verbosity is not an option, as production does not run with debugging enabled due to performance constraints. First the operations team must figure out which log file to look through, and once they find something that resembles an error, reach out to the developers to learn what it means and how to resolve it. Furthermore, with the evolution of web 2.0 applications, which depend heavily on JavaScript and the capabilities of the browser, properly verifying that the application is operating as expected requires end user monitoring that can capture errors on the client side and in native mobile applications.

Logging is simply not enough

Logging is not enough: modern applications require application performance management to keep application owners informed and to minimize the business impact of performance degradation and downtime.

Logging is simply not enough information to get to the root cause of problems in modern distributed applications. The problems of production monitoring have changed and so has the solution. Your end users are demanding and fickle, and you can’t afford to let them down. This means you need the fastest and most effective way to troubleshoot and solve performance problems, and you can’t rely on the chance that you might find the message in the log. Business owners, developers, and operations need in-depth visibility into the app, and the only way to get that is by using application performance monitoring.

Get started with AppDynamics Pro today for in-depth application performance management.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

My Top 3 Automated Tasks for Finding and Fixing Problems

Automation sets apart organizations at the top of their game from the rest of the pack. The limiting factor in most organizations is that they are usually too busy putting out fires and keeping up with all of their other obligations to expend the effort required to envision and build out an automation strategy. With that in mind, I have created a small list of the automation tasks that I feel provide the most value to an organization. Along with these tasks, I explain the type of effort involved and the reward associated with each one. All of the information presented is based upon my 15 years of troubleshooting within enterprise operations and application environments.

IT Operations

1. Collect troubleshooting metrics – To me this is the no-brainer of automation tasks. For each particular type of problem, you always want certain information to help resolve that issue. This is also the easiest of my top three to implement, though it may provide less value than the other two. Here are some examples (a minimal sketch of this kind of automation follows the list)…

  • Hung/unresponsive JVM/CLR – Initiate and store a thread dump, then restart the application on the offending node.
  • Slow transactions plus high server CPU utilization – Collect a process listing to determine the CPU contributors and spin up an extra instance of the application.
  • Transactions throwing excessive errors – Search log files for errors and send the list to the appropriate personnel; based upon the error type, possibly probe individual components more deeply (see #2 below).
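Here is a minimal sketch of the first example: detect an unresponsive JVM, store a thread dump, then restart. The health endpoint, pid file, and service name are assumptions for illustration; jstack ships with the JDK:

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical health endpoint
PID_FILE = "/var/run/my-app.pid"             # hypothetical pid file

def jvm_unresponsive():
    try:
        urllib.request.urlopen(HEALTH_URL, timeout=10)
        return False
    except OSError:
        return True

def capture_then_restart():
    pid = open(PID_FILE).read().strip()
    dump_file = f"/var/tmp/threaddump-{int(time.time())}.txt"
    # Capture evidence BEFORE the restart destroys it.
    with open(dump_file, "w") as out:
        subprocess.run(["jstack", "-l", pid], stdout=out, check=False)
    subprocess.run(["systemctl", "restart", "my-app"], check=True)
    return dump_file

if jvm_unresponsive():
    print("thread dump saved to", capture_then_restart())
```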

Application Operations

2. Probe application components – This one is really useful for figuring out difficult application problems but requires more effort to set up than #1. The basic concept is that you need to find out from the application support team what steps they would take manually to troubleshoot their application if it were slow or broken. The usual responses are things like “Check this log file for this word”, “Run this query against this database, and if the output is -3 then I know what the problem is”, or “Hit this URL to see what the response looks like. If the page does not return properly, I know there is a problem with this component”, etc…

It may seem like a lot of work to set up this type of automated probing at first, but once it is in place it becomes an invaluable troubleshooting tool. Imagine that application response times get slow, these troubleshooting measures are automatically invoked, and you get an email with the exact root cause within minutes of the performance degradation. With this type of automation there is usually a known resolution based upon the probing results, so that could be automated too. How handy is that?
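Encoding those manual answers as probes can be as direct as the sketch below. The log path, search term, and status URL are invented placeholders; the point is that each "what would you check?" answer becomes one small function:

```python
import urllib.request

def check_log(path="/var/log/my-app/app.log", word="OutOfMemoryError"):
    """'Check this log file for this word.'"""
    with open(path) as fh:
        return [line.rstrip() for line in fh if word in line]

def check_url(url="http://localhost:8080/status"):
    """'Hit this URL and see what the response looks like.'"""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, resp.read(200)
    except OSError as exc:
        return None, str(exc)

def run_probes():
    findings = {"log_hits": check_log(), "status_page": check_url()}
    return findings  # email or page this to the on-call engineer

print(run_probes())
```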

Business Operations

3. Alert the business to changing conditions – This is where the IT staff has the ability to really make an impression on, and an impact for, the business. One of the most overlooked aspects of monitoring and automation is the ability to gather, baseline, alert, and act based upon pure business metrics. Here's an example scenario…

Bob's business has an e-commerce website that sells many different products. They use AppDynamics Pro to track the quantity of each item sold, along with the total revenue of all sales throughout the business day (using the information point functionality). One day Bob gets an alert that the latest, greatest widgets are selling way below their typical volume, but the automation engine has searched the prices of major competitors and found that one competitor lowered prices and is undercutting the business. With this actionable information, Bob is able to immediately match the competition's pricing, and sales rates return to normal before too much damage is done.

Obviously business operations automation can be the most complex but also the most rewarding. Reaching out to the business and having a conversation about their activities and processes can seem daunting but I can tell you from personal experience that the business is more than willing to participate in initiatives that save them time and money. This type of conversation also normally leads to the business asking about other ways to collaborate to make them more effective so it is a great way to improve the overall level of communication with the business.

AppDynamics Pro enables a new level of automation because it knows exactly what is going on inside of your applications from both a business and IT perspective. Your level of automation is limited only by your imagination. I recommend that you start out small with a single automation use case and build outward from there. You can use AppDynamics Pro for free to try out our application runbook automation functionality on your specific use case by clicking here.

DevOps Scares Me – Part 3

Hey, who invited the developer to our operations conversation? In DevOps Scares Me – Part 2, my colleague Dustin Whittle shared his developer point of view on DevOps. I found his viewpoint quite interesting, and it made me realize that I take for granted the knowledge I have about what it takes to get an application into production in a large enterprise. As Dustin called out, there are many considerations, including but not limited to code management, infrastructure management, configuration management, event management, log management, performance management, and general monitoring. In his blog post Dustin went on to cover some of the many tools available to help automate and manage all of the considerations previously mentioned. In my post I plan to explore whether DevOps is only for loosely managed e-commerce providers or whether it can really be applied to more traditional and highly regulated enterprises.

Out With the Old

In the operations environments I have worked in, there were always strict controls on who could access production environments, who could make changes, when changes could be made, who could physically touch hardware, who could access which data centers, etc… In these highly regulated and process-oriented enterprises, the thought of blurring the lines between development and operations seems like a non-starter. There is so much process and tradition standing in the way of using a DevOps approach that it seems nearly impossible. Let's break it down into small pieces and see if it could be feasible.

Here are the basic steps to getting a new application built and deployed from scratch (from an operations perspective) in a stodgy Financial Services environment. If you’ve never worked in this type of environment some of the timing of these steps might surprise you (or be very familiar to you). We are going to assume this new application project has already been approved by management and we have the green light to proceed.

  1. Place order for dev, test, uat, prod, dr, etc… infrastructure. (~8 weeks lead time, all hardware ordered up front)
  2. Development team does dev stuff while us ops personnel are filling out miles of virtual paperwork to get the infrastructure in place. Much discussion occurs about failover, redundancy, disaster recovery, data center locations, storage requirements, etc… None of this discussion includes developers, just operations and architects…oops.
  3. New application is added to CMDB (or similar) to include new infrastructure components, application components, and dependencies.
  4. Operations is hopeful that the developers are making good progress in the 8 weeks lead time provided by the operational request process (actually the ops teams don’t usually even think about what dev might be working on). Servers have landed and are being racked and stacked. Hopefully we guessed right when we estimated the number of users, efficiency of code, storage requirements, etc… that were used to size this hardware. In reality we will have to see what happens during load testing and make adjustments (i.e. tell the developers to make it use fewer resources or order more hardware).
  5. We’re closing in on one week until the scheduled go-live date, but the application isn’t ready for testing yet. It’s not the developers’ fault that the functional requirements keep changing, but it is going to squeeze the testing and deployment phases.
  6. The monitoring team has installed their standard monitoring agents (usually just traditional server monitoring) and marked off that checkbox from the deployment checklist.
  7. It’s 2 days before go-live and we have an application to test. The load test team has coded some form of synthetic load to be applied to the servers. Functional testing showed that the application worked. Load testing shows slow response times and lots of errors. Another test is scheduled for tomorrow while the development team works frantically to figure out what went wrong with this test.
  8. One day until go-live, load test session 2, still some slow response time and a few errors but nothing that will stop this application from going into production. We call the load test a “success” and give the green light to deploy the application onto the production servers. The app is deployed, functional testing looks good, and we wait until tomorrow for the real test…production users!
  9. Go-Live … Users hit the application, the application pukes and falls over, and the operations team checks the infrastructure and gets the log files to the developers to look at. Management is upset. Everyone is asking if we have any monitoring tools that can show what is happening in the application.
  10. Week one is a mess, with the application working, crashing, restarting, working again, and new emergency code releases going into production to fix the problems. Week 2 and each subsequent week will get better until new functionality gets released in the next major change window.
Nobody wins with a "toss it over the wall" mentality.

Nobody wins with a “toss it over the wall” mentality.

In With the New

Part of the problem with the scenario above is that the development and operations teams are so far removed from each other that there is little to no communication during the build and test phases of the development lifecycle. What if we took a small step towards a more collaborative approach as recommended by DevOps? How would this process change? Let’s explore (modified process steps are highlighted using bold font)…

  1. Place order for dev, test, uat, prod, dr, etc… infrastructure. (~8 weeks lead time, all hardware ordered up front)
  2. Development and operations personnel fill out virtual paperwork together which creates a much more accurate picture of infrastructure requirements. Discussions about failover, redundancy, disaster recovery, data center locations, storage requirements, etc… progress more quickly with better estimations of sizing and understanding of overall environment.
  3. New application is added to CMDB (or similar) to include new infrastructure components, application components, and dependencies.
  4. Operations is fully aware of the progress the developers are making. This gives the operations staff an opportunity to discuss monitoring requirements, from both a business and an IT perspective, with the developers. Operations starts designing the monitoring architecture while the servers land and are racked and stacked. Both the development and operations teams are comfortable with the hardware requirement estimates but understand that they will have to see what happens during load testing and make adjustments (i.e. tell the developers to make it use fewer resources or order more hardware). Developers start using the monitoring tools in their dev environment to identify issues before the application ever makes it to test.
  5. We’re closing in on one week until the scheduled go-live date, but the application isn’t ready for testing yet. It’s not the developers’ fault that the functional requirements keep changing, but it is going to squeeze the testing and deployment phases.
  6. The monitoring team has installed their standard monitoring agents (usually just traditional server monitoring) as well as the more advanced application performance monitoring (APM) agents across all environments. This provides the foundation for rapid triage during development, load testing, and production.
  7. It’s 2 days before go-live and we have an application to test. The load test team has coded a robust set of synthetic load based upon application monitoring data gathered during development. This load is applied to the application which reveals some slow response times and some errors. The developers and operations staff use the APM tool together during the load test to immediately identify the problematic code and have a new release available by the end of the original load test. This process is repeated until the slow response times and errors are resolved.
  8. One day until go-live, we were able to stress test overnight and everything looks good. We have the green light to deploy the application onto the production servers. The app is deployed, functional testing looks good, business and IT metric dashboard looks good, and we wait until tomorrow for the real test…production users!
  9. Go-Live … Users hit the application, the application works well for the most part. The APM tool is showing some slow response time and a couple of errors to the developers and the operations staff. The team agrees to implement a fix after business hours as the business dashboard shows that things are generally going well. After hours the development and operations team collaborate on the build, test, and deploy of the new code to fix the issues identified that day. Management is happy.
  10. Week one is highly successful with issues being rapidly identified and dealt with as they come up. Week 2 and each subsequent week are business as usual and the development team is actively focused on releasing new functionality while operations adapts monitoring and dashboards when needed.

Developers and operations personnel living together in harmony!

So what scenario sounds better to you? Have you ever been in a situation where increased collaboration caused more problems than it solved? In this example the overall process was kept mostly intact to ensure compliance with regulatory audit procedures. Developers were never granted access to production (regulatory issue for Financial Services companies) but by being tightly coupled with operations they had access to all of the information they needed to solve the issues.

It seems to me that you can make a big impact across the lifecycle of an application by implementing parts of the DevOps philosophy in even a minor way. In this example we didn’t even touch the automation aspects of DevOps. That’s where all of those fun and useful tools come into play so that is where we will pick up next time.

If you’re interested in adding an APM tool to your DevOps, development, or operations toolbox you can take a free self guided trial by clicking here and following the prompts.

Click here for DevOps Scares Me Part 4.

DevOps Scares Me – Part 2

What is DevOps?

In the first post of this series, my colleague Jim Hirschauer defined what DevOps means and how it impacts organizations. He concluded that DevOps is defined as “a software development method that stresses communication, collaboration and integration between software developers and information technology professionals with the goal of automating as much as possible different operational processes.”

I like to think DevOps can be explained simply as operations working together with engineers to get things done faster in an automated and repeatable way.


From developer to operations – one and the same?

As a developer, I have always dabbled lightly in operations. I always wanted to focus on making my code great and let an operations team worry about setting up the production infrastructure. It used to be easy! I could just FTP my files to production, and voila! My app was live, and it was time for a beer. Real applications are much more complex. As I evolved my skill set, I started to do more and expand my operations knowledge.

Infrastructure Automation

When I was young, you actually had to build a server from scratch, buy power and connectivity in a data center, and manually plug a machine into the network. After wearing the operations hat for a few years, I have learned that many operations tasks are mundane, manual, and often have to be done at two in the morning after something has gone wrong. DevOps is predicated on the idea that all elements of technology infrastructure can be controlled through code. With the rise of the cloud, it can all be done in real time via a web service.

Growing pains

When you are responsible for large distributed applications the operations complexity grows quickly.

  • How do you provision virtual machines?
  • How do you configure network devices and servers?
  • How do you deploy applications?
  • How do you collect and aggregate logs?
  • How do you monitor services?
  • How do you monitor network performance?
  • How do you monitor application performance?
  • How do you alert and remediate when there are problems?

Combining the power of developers and operations

The focus on the developer/operations collaboration enables a new approach to managing the complexity of real world operations. I believe the operations complexity breaks down into a few main categories: infrastructure automation, configuration management, deployment automation, log management, performance management, and monitoring. Below are some tools I have used to help solve these tasks.

Infrastructure Automation

Infrastructure automation solves the problem of having to be physically present in a data center to provision hardware and make network changes. The benefit of using cloud services is that costs scale linearly with demand, and you can provision automatically as needed without having to pay for hardware up front.
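For instance, with AWS the "web service" call that replaces racking a server is a single API request. A minimal sketch using the boto3 SDK (my choice for illustration; the AMI ID and instance type are placeholders):

```python
import boto3  # third-party AWS SDK: pip install boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Provision a server in code instead of plugging one into the network.
response = ec2.run_instances(
    ImageId="ami-12345678",  # placeholder AMI
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```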

Configuration Management

Configuration management solves the problem of having to manually install and configure packages once the hardware is in place. The benefit of using configuration automation solutions is that servers are deployed exactly the same way every time. If you need to make a change across ten thousand servers you only need to make the change in one place.
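The property that makes this work is idempotence: applying the same configuration twice leaves the server unchanged. Tools like Puppet express this declaratively at scale; here is a stripped-down sketch of the idea in plain Python, with an invented config path:

```python
import os

def ensure_file(path, desired_content):
    """Idempotent: write only when the file differs from the desired state."""
    current = None
    if os.path.exists(path):
        with open(path) as fh:
            current = fh.read()
    if current == desired_content:
        return "unchanged"  # re-running the configuration is a no-op
    with open(path, "w") as fh:
        fh.write(desired_content)
    return "updated"

# Change the desired state once; every server converges on its next run.
print(ensure_file("/etc/my-app/app.conf", "max_connections = 500\n"))
```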

There are other vendor-specific DevOps tools as well.

Deployment Automation

Deployment automation solves the problem of deploying an application with an automated and repeatable process.
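In practice a repeatable deploy is a scripted sequence: point at a new release, restart, verify, and keep the old version around for rollback. A bare-bones sketch with an invented directory layout and service name:

```python
import os
import subprocess
import urllib.request

RELEASES = "/srv/my-app/releases"  # invented layout: one directory per build
CURRENT = "/srv/my-app/current"    # symlink to the live release

def deploy(version):
    target = os.path.join(RELEASES, version)
    # Assumes the build artifact is already unpacked into `target`.
    tmp_link = CURRENT + ".tmp"
    os.symlink(target, tmp_link)
    os.replace(tmp_link, CURRENT)  # atomic swap on POSIX; easy rollback
    subprocess.run(["systemctl", "restart", "my-app"], check=True)
    # Fail loudly if the new release does not come up healthy.
    urllib.request.urlopen("http://localhost:8080/health", timeout=10)

deploy("build-42")
```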

Log Management

Log management solves the problem of aggregating, storing, and analyzing all logs in one place.

Performance Management

Performance management is about ensuring your network and application are performing as expected and providing intelligence when you encounter problems.

Monitoring

Monitoring and alerting are a crucial part of managing operations and making sure people are notified when infrastructure and related services go down.

In my next post of this series, we will dive into each of these categories and explore the best tools available in the DevOps space. As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

Click here for DevOps Scares Me Part 3.

DevOps Scares Me – Part 1

DevOps is scary stuff for us pure Ops folks who thought we left coding behind a long, long time ago. Most of us Ops people can hack out some basic (or maybe even advanced) shell scripts in Perl, ksh, bash, csh, etc… But the term DevOps alone makes me cringe and think I might really need to know how to write code for real (which I don't enjoy; that's why I'm an ops guy in the first place).

So here’s my plan. I’m going to do a bunch of research, play with relevant tools (what fun is IT without tools?), and document everything I discover here in a series of blog posts. My goal is to educate myself and others so that we operations types can get more comfortable with DevOps. By breaking down this concept and figuring out what it really means, hopefully we can figure out how to transition pure Ops guys into this new IT management paradigm.

What is DevOps

Here we go, I'm probably about to open up Pandora's box by trying to define what DevOps means, but to me that is the foundation of everything else I will discuss in this series. I started my research by asking Google “what is devops”. Naturally, Wikipedia was the first result, so that is where we will begin. The first sentence on Wikipedia defines DevOps as “a software development method that stresses communication, collaboration and integration between software developers and information technology (IT) professionals.” Hmmm… This is not a great start for us Ops folks who don't really want anything to do with programming.

Reading further on down the page I see something more interesting to me … “The goal is to automate as much as possible different operational processes.” Now that is an idea I can stand behind. I have always been a fan of automating whatever repetitive processes that I can (usually by way of shell scripts).

Taking Comfort

My next stop on this DevOps train led me to a very interesting blog post by the folks at the agile admin. In it they discuss the definition and history of DevOps. Here are some of the nuggets that were of particular interest to me:

  • “Effectively, you can define DevOps as system administrators participating in an agile development process alongside developers and using many of the same agile techniques for their systems work.”
  • “It’s a misconception that DevOps is coming from the development side of the house – DevOps, and its antecedents in agile operations, are largely being initiated out of operations teams.”
  • “The point is that all the participants in creating a product or system should collaborate from the beginning – business folks of various stripes, developers of various stripes, and operations folks of various stripes, and all this includes security, network, and whoever else.”

Wow, that’s a lot more comforting to my fragile psyche. The idea that DevOps is being largely initiated out of the operations side of the house makes me feel like I misunderstood the whole concept right from the start.

For even more perspective, I read a great article on O'Reilly Radar by Mike Loukides. In it he explains the origins of dev and ops and shows how operations has been changing over the years to include much more automation of tasks and configurations. He also explains that there is no expectation of all-knowing developer/operations superhumans, but instead that operations staff need to work closely with, or even be in the same group as, the development team.

When it comes right down to it there are developers and there are operations staff. The two groups have worked too far apart for far too long. The DevOps movement is an attempt to bring these worlds together so that they can achieve the effectiveness and efficiency that the business deserves. I really do feel a lot better about DevOps now that I have done more research into the basic meaning and I hope this helps some of you who were feeling intimidated like I was. In my next post I plan to break down common operations tasks and talk about the tools that are available to help automate those tasks and their associated processes.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

Click here for DevOps Scares Me Part 2.