Driving Business Decisions Off Data

There are multiple systems in use for driving business decisions. The popular business intelligence (BI) market focuses on back-office data that must be aggregated or otherwise centralized, then sliced and diced to support decisions. This is clearly critical data, and BI is a $14.4B market according to Gartner (calendar year 2013). But software-defined businesses need a far more real-time view of the front office and systems of engagement. These systems of engagement change far more quickly and require real-time response, much as running IT Operations differs from many other parts of IT. These analytics technologies will be used not only to guide product decisions, but also to enable a fluid organization.

Gartner predicts that by 2017, 70 percent of successful digital business models will rely on deliberately unstable processes designed to shift as customer needs morph. (See: http://www.gartner.com/newsroom/id/2866617)

A fluid business model, and the fluid decision-making that drives innovation, require processes that adjust and adapt to change. That adaptation creates instability in processes, but it’s essential to meeting customer preferences and demands. Agile products, agile development, and agile organizations all require adaptation and experimentation based on customer interaction. The business transaction marries the user to the processes customers interact with, making it the discrete focus of monitoring. This monitoring data must be captured, typically by APM tools such as ours or by custom instrumentation written into the software, and then fed into analytics engines that can provide the insight.
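For teams writing their own instrumentation, the pattern is straightforward: time each business transaction and emit an event to the analytics engine. Here is a minimal Python sketch of that idea; the `send_to_analytics` transport and the event fields are illustrative assumptions, not any particular vendor’s API.

```python
import functools
import time
import uuid

def send_to_analytics(event):
    # Placeholder transport: ship the event to your analytics engine
    # (e.g. over HTTP). Stubbed with a print for the sketch.
    print(event)

def track_transaction(name):
    """Time a business transaction and emit one event per execution."""
    def decorator(func):
        @functools.wraps(func)
        def timed(*args, **kwargs):
            start = time.time()
            status = "success"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                send_to_analytics({
                    "transaction": name,
                    "id": str(uuid.uuid4()),
                    "duration_ms": round((time.time() - start) * 1000, 2),
                    "status": status,
                })
        return timed
    return decorator

@track_transaction("checkout")
def checkout(cart):
    pass  # business logic goes here
```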

We still have a significant problem in the industry, and it is evident in the monitoring space: we have a dashboard problem. Buyers make decisions based on dashboards, not insight, and the love of dashboards continues to drive tool fragmentation and yet more dashboards.

Today’s analytics are user-driven: the user of the tool or product drives the analysis and makes the decisions. With advances in machine learning (see: http://www.quora.com/What-are-the-important-advances-in-machine-learning-over-the-past-decade) we are starting to see new algorithms applied to IT Operations data and IT operations analytics (ITOA), enabling machine-driven analytics and insight. This significant shift will have repercussions across IT Operations Management.

Source: Executive Summary, Flipping to Digital Leadership: The 2015 CIO Agenda

Further reading:

Priorities for Application Operations Teams in 2015 [INFOGRAPHIC]

As we start 2015, IT and Application Operations teams are prioritizing goals such as improving overall efficiency, migrating to the cloud, and getting value from big data, among other department goals.

In case you’ve been living under a rock, and haven’t heard about the monumental success of AppDynamics AppSphere™ 2014, well now you have some required reading. As part of the event, we surveyed all those present on their IT priorities, and the results were quite surprising.

Here are a few noteworthy stats:

  • Docker and other containers are growing. 25% said they plan to use a container solution in the next year.
  • Nearly half of respondents (46%) listed “Improving Operational Efficiency” as their number one priority.
  • Enterprises still prefer private cloud to public.

Check out the full infographic below…

 

Interested in attending the next AppDynamics AppSphere? You can pre-register now and save your seat!

Want to see how AppDynamics can help your IT and Application Operations teams? Download a FREE trial now!

 

Too Big To Scale – Data Visualizations at Web Scale

Big news last week was the JPMC data breach that could potentially impact millions of customers. This news brought about the return of “banks that are too big to fail” discussions, on a much smaller scale than we saw during the financial crisis a few years back. The fact is, we live in a world where data breaches can and will happen, and the response to these incidents is critically important to the company as well as the consumer.

Data at Web Scale

This sequence of events got me thinking about just how big many of our enterprise and consumer applications have become, and about the problems associated with this new world of “web scale”. From a monitoring tool perspective, web scale means collecting massive amounts of data, making sense of that data using analytics, and visualizing it using a combination of analytics and new technologies.

At AppDynamics, we’re monitoring single application environments with up to 15,000 components delivering mission-critical application services. This is a unique challenge, and one that requires the right combination of analytics and visualization tools. In August we announced that we had brought big data technologies into our Summer release (http://blog.appdynamics.com/news/appdynamics-summer–2014-release/), and this blog post shows the result with real customer data.

Visualization of a massive application environment

While other APM companies are busy re-architecting their platforms to figure out how to scale their data collection technology to meet the demands of web scale, AppDynamics is years past that challenge and is solving the analytics and visualization hurdles. As you can see in the screenshots below, we’ve got technology that automatically determines which details should be shown and which should be hidden as you zoom in and out on the application flow map.


Shifting the flow map into fullscreen mode lets you easily visualize your entire web scale application, as seen below.


Clicking on any node in the flow map shows more detail about the health of the associated components along with information useful for troubleshooting problems.


Zooming in on fewer application components automatically shows a much higher level of detail to ensure you have all of the information you need to solve problems.


Built to scale

Monitoring web scale applications is difficult! Trying to retrofit old technology with architectural band-aids just doesn’t scale in today’s modern application environments. If you’ve got web scale applications, or if they’re even on your roadmap, then you need a system that was built to scale with your business. Try AppDynamics for free today and see what a modern application monitoring platform can do for you.

AppDynamics Brings Big Data Science to APM in Summer ’14 Release

Today I am pleased to announce the availability of the AppDynamics Summer ‘14 Release. With this release, AppDynamics brings to the APM industry the first event store that captures and processes big data streams in real time. Large and complex applications generate data at extremely high velocity, requiring a monitoring platform that scales along with them. Many business-critical and operational insights are hidden in the data these applications generate. The result is a unified, central, massively scalable platform for managing all tiers of the application infrastructure.

This release has major enhancements for each of the three layers of the Application Intelligence platform:

Clear, meaningful data visualization

AppDynamics was the first to market with transaction-based topology views of applications that make managing and scaling service-oriented architectures easier than ever. In our latest release, we’ve raised the bar again by offering clear, meaningful data visualization powered by self-learning algorithms for today’s leading enterprise companies.

Advanced flow map visualizations

Self-aggregating flow maps

AppDynamics introduces advanced flow maps powered by sophisticated algorithms that make complex architectures more manageable, condensing and expanding information to enable intelligent zooming in and out of the topology. These visualization techniques also deliver a level of granularity in application health indicators and traffic reports to match the zoom level.


Self-organizing layouts

In the Summer ‘14 release, the dashboards self-organize complex graphs of service and tier dependencies, using auto-grouping heuristics to dynamically determine tier and node weightings. These heuristics rely on patterns detected across static data, such as application tier profiles, and dynamic KPIs, such as transaction response times and business data complexity. The algorithms then surface the business-critical nodes and tiers to application owners and administrators for appropriate attention.
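As a rough illustration of weight-based grouping, here is a Python sketch; the KPI blend and coefficients are invented for the example and are not AppDynamics’ actual heuristics.

```python
def node_weight(node):
    # Invented blend of dynamic KPIs; illustrative only.
    return (0.5 * node["calls_per_min"]
            + 0.3 * node["avg_response_ms"]
            + 0.2 * node["errors_per_min"])

def auto_group(nodes, visible_count=10):
    """Keep the heaviest nodes visible; collapse the long tail
    into a single grouped node on the flow map."""
    ranked = sorted(nodes, key=node_weight, reverse=True)
    collapsed = {"name": "grouped nodes",
                 "members": [n["name"] for n in ranked[visible_count:]]}
    return ranked[:visible_count], collapsed
```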


Self-learning transaction engine

Application owners benefit greatly from smart engines that automatically identify and group business transactions, taking the guesswork out of the exercise. The grouping is based on a combination of historical and statistical analysis of large volumes of live execution data. AppDynamics uniquely introspects live traffic, creating groupings of business transactions from millions of live requests to improve business manageability.
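The core of such grouping can be illustrated by normalizing request URLs so that requests differing only in volatile segments fall into the same logical transaction; a toy Python sketch (the patterns are invented for the example):

```python
import re
from collections import Counter

def normalize(path):
    """Collapse volatile URL segments so requests that differ only
    by an ID group into the same logical business transaction."""
    path = re.sub(r"/[0-9a-f]{8,}", "/{hash}", path)  # long hex tokens
    return re.sub(r"/\d+", "/{id}", path)             # numeric IDs

requests = ["/order/123/item/456", "/order/789/item/42", "/search"]
print(Counter(normalize(p) for p in requests))
# Counter({'/order/{id}/item/{id}': 2, '/search': 1})
```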


Smart dashboards

Managing and monitoring large deployments with thousands of nodes and tiers can be overwhelming for the APM professional. Creating a separate dashboard for each of those thousands of nodes and tiers individually is a near-impossible task. In this release, AppDynamics introduces powerful dashboard templates that auto-generate dashboards based on configurable, parameterized characteristics of the nodes or tiers. This new feature enhances monitoring productivity by making dashboards reusable across all nodes without duplicated effort.
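Conceptually, a dashboard template is a document with per-node or per-tier parameters substituted at generation time. Here is a sketch of that mechanism in Python; the template schema is invented for illustration, not AppDynamics’ actual dashboard format.

```python
TEMPLATE = {
    "title": "Health: {tier}/{node}",
    "widgets": [
        {"metric": "{tier}|{node}|avg_response_ms", "type": "timeseries"},
        {"metric": "{tier}|{node}|errors_per_min", "type": "timeseries"},
    ],
}

def render(value, params):
    """Recursively substitute {param} placeholders in a template."""
    if isinstance(value, str):
        return value.format(**params)
    if isinstance(value, dict):
        return {k: render(v, params) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, params) for v in value]
    return value

nodes = [{"tier": "web", "node": "web-01"}, {"tier": "web", "node": "web-02"}]
dashboards = [render(TEMPLATE, n) for n in nodes]  # one dashboard per node
```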


 

Platform to capture and process Big Data streams in real-time


A new, infinitely scalable event service captures real-time events generated by an application. With this event service, organizations can flexibly define structured and unstructured events and start capturing them through a public API. The service has been certified for up to 10 trillion events. Events can be archived indefinitely and used for historical analysis as well.
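In practice, capturing a custom event through a public API boils down to an HTTP POST of a structured payload. A hedged Python sketch of the idea; the endpoint URL and payload schema below are placeholders for illustration, not the documented AppDynamics events API.

```python
import json
import urllib.request

EVENT_ENDPOINT = "https://analytics.example.com/events"  # hypothetical URL

def publish_event(event_type, properties):
    """POST one structured event to an event service over HTTP."""
    body = json.dumps({"type": event_type, "properties": properties}).encode()
    req = urllib.request.Request(
        EVENT_ENDPOINT, data=body,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

publish_event("order_placed", {"sku": "A-100", "amount_usd": 42.50})
```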

A new Hadoop-powered metrics service crunches massive volumes of time-series data to deliver key application and business metrics in real time. With the new enhancements, organizations can easily roll up metrics at the tier, application, or time-series level with no loss in granularity. Leveraging new algorithms that can crunch billions of metrics, the service generates self-learning baselines that are continuously refreshed to reflect up-to-the-minute application and business performance.
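The roll-up idea itself is simple: aggregate the same per-node samples at the tier and application level while keeping the raw values available for drill-down. A toy Python sketch, assuming samples shaped as (app, tier, node, metric, value); the real service does this over billions of metrics in Hadoop.

```python
from collections import defaultdict

def rollup(samples):
    """Aggregate per-node metric samples to tier and application level."""
    tier_values, app_values = defaultdict(list), defaultdict(list)
    for app, tier, _node, metric, value in samples:
        tier_values[(app, tier, metric)].append(value)
        app_values[(app, metric)].append(value)
    mean = lambda xs: sum(xs) / len(xs)
    return ({k: mean(v) for k, v in tier_values.items()},
            {k: mean(v) for k, v in app_values.items()})

samples = [("shop", "web", "web-01", "resp_ms", 120),
           ("shop", "web", "web-02", "resp_ms", 80),
           ("shop", "db", "db-01", "resp_ms", 15)]
tiers, apps = rollup(samples)
print(tiers[("shop", "web", "resp_ms")])  # 100.0
print(apps[("shop", "resp_ms")])          # 71.666...
```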

We’ve also improved our real-time percentile metrics, which put application and business performance in context. Metrics without a statistical context often don’t reveal the real picture: SLA metrics are more meaningful when presented with percentiles, and when outliers are automatically identified and surfaced with alerts for immediate attention or automated remediation. The percentile functionality in AppDynamics is now configurable, allowing teams to define which percentiles they want collected.
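To see why percentiles matter, consider a nearest-rank percentile over made-up response times; the p95/p99 values expose the outliers that a plain average (about 123 ms here) smooths over. A minimal Python sketch:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of response times."""
    ranked = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

response_ms = [12, 15, 14, 13, 220, 16, 14, 15, 13, 900]
for pct in (50, 95, 99):  # configurable percentile levels
    print(f"p{pct} = {percentile(response_ms, pct)} ms")
# p50 = 14 ms, p95 = 900 ms, p99 = 900 ms
```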


The unified platform gives IT organizations a central, linearly scalable way to manage all tiers of their application infrastructure. Through a single pane of glass, they can break down application tier silos and monitor their business with comprehensive end-to-end visibility, lowering the total cost of ownership of the application infrastructure along with time to issue resolution.

Industry’s most comprehensive monitoring and data collection offering

The AppDynamics Summer ‘14 Release includes several new and enhanced features related to data collection and monitoring, including distributed transaction correlation among all of the languages we support.


With the industry’s first Node.js distributed transaction monitoring, users can now monitor distributed Node.js transactions across all application tiers, including Java, .NET and PHP. The Node.js agent automatically correlates downstream calls to quickly and efficiently isolate and troubleshoot performance bottlenecks.
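Under the hood, distributed transaction correlation generally works by propagating a transaction identifier on every downstream call so that the tiers can be stitched back together. A language-agnostic sketch of that idea in Python; the header name and mechanics are a generic illustration, not AppDynamics’ actual correlation format.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-Id"  # illustrative header name

def handle_request(incoming_headers):
    """Reuse the caller's correlation ID, or mint one at the edge,
    and forward it on every downstream call."""
    corr_id = incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    outgoing = {CORRELATION_HEADER: corr_id}
    # e.g. http_client.get(inventory_service_url, headers=outgoing)
    return outgoing
```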


AppDynamics adds support for instrumenting native C++ applications with the beta release of the AppDynamics C++ SDK, which provides visibility into C++ applications and tiers. We’ve also added support for Java 8, which makes it easier for businesses to deploy and integrate AppDynamics into the latest generation of Java and Scala applications.

Finally, we’ve announced support for monitoring .NET asynchronous transactions. AppDynamics gives customers the ability to automatically identify asynchronous transactions in dashboards, troubleshoot asynchronous calls in transaction snapshots and analyze async activity in the metric browser.


For a detailed look at these advancements, check out our webinar recording.

If you’d like to try these new capabilities out for yourself, start your free trial of AppDynamics today.

 

Big Data Monitoring

The term “Big Data” is quite possibly one of the most difficult IT-related terms ever to pin down. There are so many potential types of, and applications for, Big Data that it can be a bit daunting to consider all of the possibilities. Thankfully, for IT operations staff, Big Data is mostly a set of new technologies being used together to solve some sort of business problem. In this blog post I’m going to focus on what IT Operations teams need to know about big data technology and support.

Big Data Repositories

At the heart of any big data architecture is going to be some sort of NoSQL data repository. If you’re not very familiar with the various types of NoSQL databases out there today, I recommend reading this article on the MongoDB website. These repositories are designed to run in a distributed/clustered manner so they can process incoming queries as fast as possible on extremely large data sets.

MongoDB Request Diagram

Source: MongoDB

An important concept to understand when discussing big data repositories is sharding. Sharding means breaking a large database into smaller sets of data that are distributed across server instances. This improves performance: the database can be highly distributed, and each shard has less data to query than the same database without sharding. It also lets you keep scaling horizontally, which is usually much easier than scaling vertically. If you want more details on sharding, you can reference this Wikipedia page.
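To make the idea concrete, here is a toy hash-based shard router in Python; real systems such as MongoDB layer range- or hash-based shard keys, config metadata, and rebalancing on top of this basic mapping.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key):
    """Deterministically map a record key to one of N shards, so each
    server stores and queries only a slice of the full data set."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:1001"))  # the same key always lands on the same shard
```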

Application Performance Considerations

Monitoring the performance of big data repositories is just as important as monitoring the performance of any other type of database. Applications that use the data stored in these repositories submit queries in much the same way as traditional applications querying relational databases like Oracle, SQL Server, Sybase, DB2, MySQL, PostgreSQL, etc. Let’s take a look at more information from the MongoDB website. In their documentation there is a section on monitoring MongoDB that states, “Monitoring is a critical component of all database administration.” This simple statement is overlooked all too often when new technology is deployed. Monitoring is usually only considered once major problems start to crop up, and by that time the users and the business have already been impacted.


Dashboard showing Redis key metrics.

One thing we can’t forget is just how important it is to monitor not only the big data repository, but also the applications that query it. After all, those applications are the direct clients that could be responsible for creating a performance issue, and they certainly rely on the repository to perform well when queried. The application viewpoint is where you will first discover whether a problem with the data repository is actually impacting the performance or functionality of the app itself.

Monitoring Examples

So now that we have built a quick foundation of big data knowledge, how do we monitor these systems in the real world?

End to end flow – As we already discussed, you need to understand whether your big data applications are being impacted by the performance of your big data repositories. You do that by tracking all of the application transactions across all of the application tiers and analyzing their response times. Given this information, it’s easy to identify exactly which components are experiencing problems at any given time.
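Once per-tier timings exist for a transaction, finding the problem component is trivial; a toy Python sketch with made-up numbers:

```python
def slowest_tier(tier_timings_ms):
    """Return the tier contributing the most latency to one
    distributed transaction, with its share of the total."""
    tier, ms = max(tier_timings_ms.items(), key=lambda kv: kv[1])
    share = 100 * ms / sum(tier_timings_ms.values())
    return tier, ms, round(share, 1)

print(slowest_tier({"web": 40, "services": 55, "mongodb": 870}))
# ('mongodb', 870, 90.2)
```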

FS Transaction View

Code level details – When you’ve identified a performance problem in your big data application, you need to understand which portion of the code is responsible. The only way to do this is with a tool that provides deep code diagnostics and is capable of showing you the call stack of your problematic transactions.

Cassandra call stack view

Back end processing – Tracing transactions from the end user, through the application tier, and into the backend repository is required to properly identify and isolate performance problems. Identifying poorly performing backend tiers (big data repositories, relational databases, etc.) is easy if you have the proper tools in place to provide this view of your transactions.


AppDynamics detects and measures the response time of all backend calls.

Big data metrics – Each big data technology has its own set of relevant KPIs, just like any other technology used in the enterprise. The important part is to understand what normal behavior is for each metric while performance is good, and then identify when KPIs deviate from that norm. This, combined with end-to-end transaction tracking, will tell you if there is a problem, where the problem is, and possibly the root cause. AppDynamics currently has monitoring extensions for HBase, MongoDB, Redis, Hadoop, Cassandra, CouchBase, and CouchDB. You can find all AppDynamics platform extensions by clicking here.
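The “deviating from normal” part can be as simple as a baseline window and a sigma threshold. A minimal Python sketch, with invented compaction timings standing in for any big data KPI:

```python
from statistics import mean, stdev

def deviates(history, current, n_sigma=3.0):
    """Flag a KPI sample that strays more than n_sigma standard
    deviations from its learned baseline window."""
    return abs(current - mean(history)) > n_sigma * stdev(history)

compaction_ms = [110, 120, 95, 105, 115, 100, 108, 112]  # "normal" window
print(deviates(compaction_ms, 118))  # False: within normal variation
print(deviates(compaction_ms, 400))  # True: deviation worth an alert
```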


Hadoop KPI Dashboard 1


Hadoop KPI Dashboard 2

Big data deep dive – Sometimes KPIs aren’t enough to help solve your big data performance issues. That’s when you need to pull out the big guns and use a deep-dive tool to assist with troubleshooting. Deep-dive tools are very detailed and very specific to the big data repository you are using and monitoring. In the screenshots below you can see details of AppDynamics monitoring for MongoDB.

MongoDB monitoring dashboards (screenshots 1–4)

If your company is using big data technology, it’s IT operations’ responsibility to deploy and support a cohesive performance monitoring strategy for the inevitable performance degradation that will cause business impact. See what AppDynamics has to offer by signing up for our free trial today.

Buzzword Bingo by UrbanDictionary.com

It seems like in every article, tweet, or blog post I read, someone has a different definition of the same buzzwords – especially in technology. Mentioning cloud or big data on a tech blog is like bringing sand to the beach. That’s one of the reasons why we made The Real DevOps of Silicon Valley – to make fun of the hype. I got to thinking… has anyone taken the time to shed some light on these ambiguous terms? I investigated on Urban Dictionary, and this is what I found…


IT According to UrbanDictionary.com
(Not kidding, look it up…)

 


1. CLOUD COMPUTING
cloud com·put·ing, noun.

“Utilizing the resonance of water molecules in clouds when disturbed by wireless signals to transmit data around the globe from cloud to cloud. I use cloud computing so I don’t have to worry about viruses, I only have to worry about birds flying through my cloud.”

 

2. AGILE

ag·ile, adj.

“Agile is a generalized term for a group of anti-social behaviors used by office workers to avoid doing any work while simultaneously giving the appearance of being insanely busy. Agile methods include visual distraction, subterfuge, camouflage, psycho-babble, buzzwords, deception, disinformation, and ritual humiliation. It has nothing to do with the art and practice of software engineering.”

 

3. BIG DATA
big da·ta, noun.

“Modern day version of Big Brother. Online searches, store purchases, Facebook posts, Tweets or Foursquare check-ins, cell phone usage, etc. is creating a flood of data that, when organized and categorized and analyzed, reveals trends and habits about ourselves and society at large.”

 

4. DEVOPS
dev·ops, adj.

“When developers and operations get together to drink beer and color on whiteboards to avoid drama in the War Room.  Also a buzzword for recruiters to use to promote overpaid dev or ops jobs.”

Watch episode HERE.

 

5. SOFTWARE
soft·ware, noun.


“The parts of a computer that can’t be kicked, but ironically deserve it most.”

 

6. IT
i·t, noun. 

“The word the Knights of Ni cannot hear or say.”
(Monty Python & the Holy Grail reference)

 

FamilySearch Saves $4.8 Million with AppDynamics [Infographic]

Everyone and their mother is talking about big data these days – how to manage it, how to analyze it, how to gain insight from it – but very few organizations actually have big data that they have to worry about managing or analyzing. That’s not the case for FamilySearch, the world’s largest genealogy organization. FamilySearch has 10 petabytes of census records, photographs, immigration records, etc. in its database, and its data grows every day as volunteers upload more documents. Ironically, this organization that’s tasked with cataloging our past is now at the forefront of the big data trend, as it is forced to find new and innovative ways to manage and scale this data.

From 2011 to 2012, FamilySearch scaled almost every aspect of their application, from data to throughput to user concurrency. According to Bob Hartley, Principal Engineer and Development Manager at FamilySearch, AppDynamics was instrumental in this project. Hartley estimates that FamilySearch saved $4.8 million over two years by using AppDynamics to optimize the application instead of scaling infrastructure. That’s a pretty big number, so we broke it down for you in this infographic:


 

How FamilySearch Scaled

  • From 11,500 tpm to 122,000 tpm
  • From 6,000 users per minute to 12,000 users per minute
  • From 12 application releases per year to 20 application releases per year
  • From 10 PB of data to approaching 20 PB of data
  • No additional infrastructure
  • Response time reduced from minutes to seconds

Before AppDynamics

  • 227 Severity-1 incidents/year took 33 hours each to troubleshoot
  • 300 pre-production defects per year took 49 hours each to troubleshoot
  • This amounts to a total of 36,891 man-hours spent on troubleshooting every year

After AppDynamics

FamilySearch estimates that they saved $4.8 million with AppDynamics in two years. That’s a huge number, so let’s break it down:

Infrastructure Savings:

  • FamilySearch would have had to purchase 1,200 servers at approx. $1,000 each, amounting to $1,200,215 in savings
  • Those 1,200 servers would cost $2,064,370 in power and air conditioning
  • Those 1,200 servers would cost $200,000 in administrative costs over two years

Productivity Savings:

FamilySearch estimates that they’ve reduced troubleshooting time for both pre-production defects and production incidents by 45%, amounting to $885,170 in savings for pre-production and $460,836 in savings for production incidents (based on average salaries for those positions).

To learn more about what FamilySearch accomplished and how they use AppDynamics, check out their case study and Bob Hartley’s video interview on the FamilySearch ROI page.

AppJam 2012: When Big Data Meets SOA at FamilySearch

Bob Hartley, Principal Engineer from FamilySearch
FamilySearch’s distributed platform generates over 1.5 terabytes of data every day, servicing a user base of 3 million (and growing). Furthermore, FamilySearch’s application environment is highly dynamic, with code release cycles as short as 40 minutes. In this session Bob will discuss common challenges associated with managing big data in a dynamic and distributed environment, as well as the technologies, methodologies and processes his team uses to maintain uptime and availability in FamilySearch’s applications.

Slides:

[slideshare id=15678336&doc=familysearchfinal-121217200655-phpapp01]

AppDynamics Pumps up the Jam in San Francisco

It’s been a week since we hosted AppJam Americas, our first North American user conference, in San Francisco. With me as master of ceremonies, and a minor wardrobe malfunction at the start (see video at the end of this post), the entire day was a huge success for us and our customers. One thing that stuck in my mind was that applications today have become way more complex to manage, and strategic monitoring has become key to mastering that complexity. Simply put, SOA+Virtualization+Big Data+Cloud+Agile != Easy.

The day started with Jyoti Bansal, our CEO and Founder, outlining his vision to be the world’s #1 solution for managing modern web applications. The simple facts are that applications have become more dynamic, distributed and virtual. All of these factors have increased their operational complexity, and log files and legacy monitoring solutions are ill-suited to the task.

Jyoti then outlined our core design principles around Business Transaction Monitoring, Self-learning, intelligence and the need to keep app management simple. He then suggested what the audience could expect from AppJam: “AppJam is about sharing knowledge, learning best practices, guiding our direction and Jamming.” (We’re pretty sure by “jamming” he meant “partying.”)

With the intro from Jyoti done, it was time for me to nose dive the stage and introduce our first customer speaker – Ariel Tsetlin from Netflix.

How Netflix Operates & Monitors in the Cloud

With 27 million customers around the world, Netflix’s growth over the past three years has been meteoric. In fact, they found that they couldn’t build data centers fast enough. Hence, they moved to the public cloud on AWS for better agility.

In his session, Ariel talked about Netflix’s architecture in the cloud and how they built their own PaaS in terms of apps and clusters on top of Amazon’s IaaS. One unique thing Netflix does is bake their OS, middleware, apps and monitoring agents into a single image, rather than using a tool like Chef or Puppet to manage application configuration and deployment separately from the underlying OS, middleware and tools. Everything is automated and managed at the instance level, with developers given the freedom and responsibility to deploy whenever they want to. That’s pretty cool stuff when you consider that developers now manage their own capacity and auto-scaling within the cloud.

Ariel then talked about the assumption that failure is inevitable in the Cloud, with the need to plan and design around the fact that every part of the application can and will fail at some point. Testing for failure through “monkey theory” and Netflix’s “Simian Army” allows them to simulate failure at every level of the application, from randomly killing instances to taking out entire availability zones in AWS.

From a monitoring perspective, Netflix uses internally developed tools and AppDynamics, which are also baked into their AWS images. Doing so allows developers to live and die by monitoring in production through automated alerts and problem discovery. What’s perhaps different is that Netflix focuses their monitoring at the service level (e.g. app cluster), rather than at the infrastructure level–so they’re really not interested in CPU or memory unless it’s impacting their end users or business transactions.

Finally, Ariel spoke about AppDynamics at Netflix, touching on the fact they monitor over 1 million metrics per minute across 400+ business transactions and 300+ application services, giving them proactive alerts with URL drill-down into business transaction latency and errors from self-learned baselines. Overall, it was a great session for those looking to migrate and operate their application in the Cloud.

When Big Data Meets SOA

Next up was Bob Hartley, development manager from FamilySearch, who gave an excellent talk about managing SOA and big data behind the world’s largest genealogy architecture. With almost 3 billion names indexed and 550+ million high resolution digital images, FamilySearch has over 20 petabytes of data, managed by a distributed Java and Node.js architecture spanning 5,000 servers. What’s scary is that this architecture and data are growing at a rapid pace, meaning application performance and scalability are fundamental to the success of FamilySearch.

After a brief intro, Bob talked about his big data architecture in terms of the technologies used to manage search queries, images, and people records: clusters of Apache Lucene, Solr, and custom map-reduce, combined with traditional relational database technology such as Oracle, MySQL, and Postgres.

Bob then talked about his team’s mission – to enable business agility through visibility, responsiveness, standardization, and vendor independence. At the top of this list was to provide joy for customers and stakeholders through delivering features that matter faster.

Bob also emphasized the need for repeatable, reliable and automated processes, as well as the need to monitor everything so his team could manage the performance of their SOA and Big Data application through continuous agile release cycles. Family Search has gone from a 3-month release cycle to a continuous delivery model in which changes can be deployed in just 40 minutes. That’s pretty mind blowing stuff when you consider the size and complexity of their environment!

What’s interesting is that Release != Deploy at FamilySearch; they incrementally roll out new features to different sets of users using flags, allowing them to test and tease features before making them available to everyone. Monitoring is at the heart of their continuous release cycle, with Dev and Ops using baselines and trending to determine the impact of new features on application performance and scalability.

In terms of the evaluation process, the company looked at 20 different APM vendors over a 6-month period before finally settling on AppDynamics due to our dynamic discovery, baselining, trending, and alerting of business transactions. As Bob said, “AppDynamics gave us valuable performance data in less than one day. The closest competitors took over 2 weeks just to install their tools.”

Today, a single AppDynamics management server is used in production to monitor over 5,000 servers, 40+ application services, and 10 million business transactions a day. Since deployment, Family Search has managed to find dozens of problems they’ve had for years, and have managed to scale their application by 10x without increasing server resources. They’ve also seen MTTD drop from days to minutes and MTTR drop from months to hours and minutes.

Bob finished his talk with his lessons learned for managing SOA, Big Data and Agile applications: “Keep Architecture Simple,” “Speed of delivery is essential,” “Systems will eventually fail,” and “Working with SOA, Big Data and Agile is hard.”

How AppDynamics is accelerating DevOps culture at Edmunds.com

After lunch, John Martin, Senior Director of Production Engineering, spoke about DevOps culture at Edmunds.com and how AppDynamics has become central to driving team collaboration. After a brief architecture overview outlining his SOA environment of 30 application services, John outlined what DevOps meant to him and his team – “DevOps is really about Collaboration – the most challenging issues we faced were communication.” Openly honest and deeply passionate throughout his session, John talked about three key challenges his team faced over the years that were responsible for the move to DevOps:

1. Infrastructure Growth

2. Communication Failure

3. Go Faster & Be Efficient

In 2005 Edmunds.com had just 30 servers; by the end of this year that figure will have risen to 2,500. Through release automation using tools such as Bladelogic and Chef, John and his team are now able to perform a release in minutes versus the 8 hours it took back in 2005.

John gave an example of communication failure in which development was preparing for a major release at Edmunds.com on a new CMS platform. The release was performance-tested just two weeks prior to go-live. Unfortunately, the new platform showed massive scalability limitations, causing Ops to work around the clock to over-provision resources as a tactical fix. Fortunately, the release was delivered on time and the business was happy. However, they suffered as a technology organization from finding architecture flaws so late in the game – “We needed a clear picture of what went wrong and how we were going to prevent such breakdown in future.”

Another mistake with a release in 2010 forced a major re-think between development and operations. It was this occurrence that caused Edmunds.com to get really serious about DevOps. In fact, the technical leads got together and reorganized specialized teams within Dev and Ops to resolve deployment issues and shed preconceptions about who should do what. The result was improved relationships, better tooling, and a clearer perspective on how future projects could work.

John then touched on the tools that were accelerating DevOps culture, specifically Splunk for log files and AppDynamics for application monitoring. “AppDynamics provides a way for Dev and Ops to speak the same language. We’ve saved hundreds of hours in pre-release tests and discovered many new hotspots like the performance of our inventory business transaction which increased by 111%.” In fact, within the first year, AppDynamics generated a ROI of $795,166 with year 2 savings estimated at a further $420k. John laughed, “As you can see, AppDynamics wasn’t a bad investment.”

John ended his session with 5 tips for ensuring that DevOps succeeds in an organization: Be honest, communicate early and often, educate, criticize constructively, and create champions. Overall, a great session on why DevOps is needed in today’s IT teams.

Zero to Production APM in 30 days (while sending half a billion messages per day)

The final customer session of the day came from Kevin Siminski, Director of Infrastructure Operations at ExactTarget, and it was definitely worth waiting for. Kevin kicked off his talk by describing a weekly product tech sync meeting he had with his COO. The meeting was full of stakeholders from development and operations who were discussing a problem they were currently experiencing in production.

“I literally got my laptop out, brought up the AppDynamics UI and in one minute we’d found the root cause of the problem,” Kevin said. Not a bad way to get his point across of why the value of Application Performance Management (APM) in 2012 is so important.

Kevin then gave a brief intro to ExactTarget and the challenges of powering some of the world’s top brands like Nike, Best Buy and Priceline.com. ExactTarget’s .NET messaging environment is highly virtualized, with over 5,000 machines that generate north of 500 million messages per day across multiple terabytes of databases.

Kevin then touched on the role of his global operations team and how his team’s responsibility had shifted over the last four years. “My team went from just triaging system alerts to taking a more proactive approach on how we managed emails and our business. Today my team actively collaborates with development, infrastructure and support teams.” All these teams are now focused and aligned on innovation, stability, performance and high availability.

Kevin then outlined his 30-day implementation plan for deploying AppDynamics across his entire environment using a single dedicated systems engineer and an AppDynamics SaaS management server for production. Week 1 was spent onboarding the IT security team, reviewing configuration management and testing agent deployment to validate network and security paths. Week 2 involved deploying agents to a few of the production IIS pools and validating data collection on the AppDynamics management server. Week 3 saw all agents pushed to every IIS pool, with the collection mechanism set to disabled. The configuration management team then took over and “owned” the deployment process for go-live. Week 4 saw all services and AppDynamics agents enabled during a production change window, with all metrics closely monitored throughout the week to ensure no impact or unacceptable overhead.

AppDynamics’ first mission was to monitor the ExactTarget application as it underwent an upgrade of its mission-critical database from SQL Server 2003 to 2008. It was a high-risk migration, as Kevin’s team was unable to assess the full risk due to legacy application components, so with all hands on deck they watched AppDynamics as the migration happened in real time. As the switch was made, application calls per minute and response time remained constant, but application errors began to spike. By drilling down on these errors in AppDynamics, the dev team was quickly able to locate where they were coming from and resolve the application exceptions.

Today, AppDynamics is used for DevOps collaboration and feedback loops so engineers get to see the true impact of their releases in a production environment, a process that was requested by a product VP outside of Kevin’s global operations team. Overall, Kevin relayed an incredible story of how APM can be deployed rapidly across the enterprise to achieve tangible results in just 30 days.

A surprising statistic I realized later in the evening was that the total number of servers being monitored by AppDynamics across our four customer speakers was well over 20,000 nodes. Having been in the APM market for almost 10 years, I’m struggling to think of another vendor with such successful large-scale production deployments.

Here’s a link to the photo gallery of AppJam 2012 Americas. A big thank you to our customers for attending and we’ll see you all next year!

For those keen to see my stage nosedive here you go:

https://www.youtube.com/watch?v=WYW6Bur8FaA

Appman.

Data Clouds Part II: My Big Data Dashboard

In my previous blog, I wrote at length about the complexities of running a data cloud in production. This logical data set, spread across many nodes, requires a whole new set of tools and methodologies to run and maintain. Today we’ll look at one of the biggest challenges in managing a data cloud – monitoring.

Database monitoring used to be easy in the days before data clouds. Datasets were stored in a single large database, and there were hundreds of off-the-shelf products available to monitor the performance of that database. When problems occurred, one had simply to open up the monitoring tool and look at a set of graphs and metrics to diagnose the problem.

There are no off-the-shelf tools for monitoring a data cloud, however. There’s no easy way to get a comprehensive view of your entire data cloud, let alone diagnose problems and monitor performance. Database monitoring solutions simply don’t cut it in this kind of environment. So how do we monitor the performance of our data cloud? I’ll tell you what I did.

It just so happens I work at AppDynamics, maker of one of the most powerful application monitoring tools on the market. We monitor all parts of your application, including the data layer, with visibility into both relational and NoSQL systems like Cassandra. With AppDynamics I was able to create a dashboard that gives me a single pane-of-glass view into the performance of my data cloud.


My Big Data Dashboard

This dashboard is now used in several departments at AppDynamics including Operations, QA, Performance and development teams to see how our data cloud is running. All key metrics about all of our replicas are graphed side by side on one screen. This is the dream of anyone running big data systems in production!

Of course, not all problems are system wide. More often than not you need to drill into one replica or replica set to find a problem. To do that, I simply double click on any part of my big data dashboard to focus on a single replica, change the time range, and add more metrics.

Data clouds are difficult to run, and there aren’t any database monitoring tools fit to monitor them yet. But instead of sitting around waiting for data monitoring tools to catch up with our needs, I’ve built my own Big Data Dashboard with a monitoring tool designed for applications.

Of course the fun doesn’t stop here…I still need to find a way to set up alerts and do performance tuning for my data cloud. Stay tuned for more blogs in this series to see how I do it!