Why we chose HBase over other NOSQL Databases for our metric service rearchitecture

AppDynamics Application Intelligence provides business and operational insights into application performance, user experience, and the business impact of software applications. At the core of the Application Intelligence Platform is the metric processing engine that helps record, track, and compare performance metrics. As application complexity explodes and many businesses break their monolithic applications into microservices, the number of metrics we need to capture, store, and process has exploded exponentially as well.

Metric collection and processing from heterogeneous environments

AppDynamics Application Performance Management (APM) agents automatically instrument millions of lines of code across thousands of application tiers in production environments within minutes, all with minimal overhead. These APM agents provide support for a wide variety of languages and frameworks including Java, .NET, Node.js, PHP, and Python. Browser and mobile agents also collect end-user monitoring metrics from browsers and mobile devices.


Additionally, server monitoring agents collect server metrics, and AppDynamics extensions bring in metrics from heterogeneous applications and infrastructure components.

The metrics gathered by these agents are sent periodically to a metric processing engine. The metric processing engine aggregates the metrics along different dimensions and creates multiple views. These views are used extensively for reporting; a sophisticated rule engine also continuously evaluates policies on these aggregated views.

Challenges with initial metric service implementation

MySQL powered our initial metric processing engine. The MySQL engine worked well for our customers and our SaaS deployment, but as metrics grew exponentially, we risked hitting the physical boundaries of MySQL with the number of metrics we needed to process. The cost of processing also skyrocketed, so to reduce the cost of storage we rolled up data along the time dimension, providing lower resolution for older metrics. That allowed us to expire highly granular metrics much earlier and retain time-rolled-up data for longer periods. However, customers started asking us to increase these retention limits even further to accommodate different use cases.

Requirements for metric service re-architecture

To address these challenges, the AppDynamics Application Intelligence Platform’s metric service needed to:

  • increase total metric processing capacity many-fold.
  • increase the data retention period for all resolution levels.

Once these requirements solidified, we soon realized the best solution would be to move to a new platform that could support us at a much greater scale. In addition to addressing these two key requirements, we also had to enhance the metric service in the following areas:

Real-time Cluster Rollup

Cluster-wide performance metric aggregation is a crucial feature of the AppDynamics APM solution. Take, for example, an e-commerce application that consists of multiple services such as an inventory service, an order processing system, a web tier service, and an invoicing system. Each of these services could be hosted independently on a single server/node or on a distributed cluster consisting of hundreds or thousands of physical nodes. Using the information from applications, services, tiers, and nodes, AppDynamics creates a topological visualization of the entire application as shown below:

Application topology view

AppDynamics agents are deployed on each of the physical nodes or machines. As part of initialization, each agent registers its node or service, informing the metric processing engine which tier or service and which application it belongs to. Once registration is completed, agents start collecting metrics on the node and periodically send the gathered information to the metric processing engine. Each metric payload from each node carries the information about which tier and application it belongs to. Based on this cluster (node, tier, and application) information, the metric processing engine aggregates the metrics coming from different nodes up to their parent tier and then to their parent application. Metrics collected from each machine need to be aggregated cluster-wide based on the logical topology of our customers' applications, and the reporting views must be available in real time.
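To make the rollup idea concrete, here is a minimal, illustrative sketch (not AppDynamics code) of folding node-level metric payloads into per-tier and per-application totals; the class and field names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class ClusterRollup {
    // Each node reports a value tagged with its application and tier.
    record NodeMetric(String app, String tier, String node, long calls) {}

    public static void main(String[] args) {
        NodeMetric[] payload = {
            new NodeMetric("ecommerce", "inventory", "node-1", 120),
            new NodeMetric("ecommerce", "inventory", "node-2", 80),
            new NodeMetric("ecommerce", "order-processing", "node-3", 200)
        };

        Map<String, Long> perTier = new HashMap<>();
        Map<String, Long> perApp = new HashMap<>();
        for (NodeMetric m : payload) {
            // Node values roll up to their parent tier, and tiers roll up to the application.
            perTier.merge(m.app() + "/" + m.tier(), m.calls(), Long::sum);
            perApp.merge(m.app(), m.calls(), Long::sum);
        }
        System.out.println(perTier); // tier totals: ecommerce/inventory=200, ecommerce/order-processing=200
        System.out.println(perApp);  // application total: ecommerce=400
    }
}
```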

Support for batch processing

The platform itself had to support batch-processing jobs on the collected metrics so that they could be rolled up along the time dimension. Apart from this, we needed to be able to run analytics on the collected metrics.

Fault tolerance with no single point of failure

The platform needed to work in High Availability mode from day one.

Zero downtime software upgrade

As part of metric processing, we also calculate seasonality-based baselines and standard deviations. If we bring down our system for software upgrades, it creates data gaps in the system, corrupting the baseline and standard deviation calculations.

Also, the new metric service needed to accommodate the behavior of metric data and the types of operations required to create different views:

  • Stream of time series, append-only, immutable data sets
  • Requires ingesting and processing time stamped metric data that are appended to existing records rather than overwriting them
  • State is determined by the natural time-based ordering of the data
  • Requires multiple views in different dimensions
  • All operations on the data, including view creation, are idempotent
  • View creation should be near real time (NRT)

Options for Metric Service Re-architecture

Keeping all these requirements in mind, we started exploring various NOSQL databases and looked at a couple of time series database implementations.

Notable time series databases are OpenTSDB from StumbleUpon and Druid from MetaMarkets. We didn't select these solutions, as our requirements were more specialized and could not be met by these solutions as-is. However, both are excellent, and one should try them out before writing anything new.

Next, we started evaluating various Key-Value store databases that provide fault tolerance and scale-out features. Two mainstream open source NOSQL databases are used widely throughout the industry – HBase and Cassandra.

HBase vs. Cassandra

Based on our evaluation of these use cases, we selected HBase over Cassandra. The main reason for this selection was HBase's sharding strategy of ordered partitioning of its key ranges, which allows us to run queries over long time ranges and efficiently apply aggregates to the results at the level of a single shard.

It's critical to understand what a contiguous key range is and how it impacts the design and behavior of a system. HBase key-values are stored in tables, and each table is split into multiple regions. A region is a contiguous range within the key space, meaning all the rows in the table that sort between the region's start key and end key are stored in the same region. More details are available in this helpful article from Hortonworks.

Cassandra employs a different strategy. It arranges the Cassandra nodes into a consistent-hash ring, and any key stored in the system is converted to a Murmur3 hash code, which determines the node in the ring where the row is placed. This article from DataStax provides additional information on sharding strategies.

Why is it so important for us to have continuous key ranges?

Let's go over our use cases one more time. The AppDynamics Application Performance Dashboard (as shown below) provides a topological view of an entire application and its performance metrics. Dashboards provide information on the call flow between the different tiers of the application. Another powerful feature provided through the dashboard is overall performance statistics such as calls per minute, average response time, errors per minute, and exceptions per minute.

Application Performance Dashboard

The topological graph, program flow, and the performance statistics are pulled from the metrics store based on the time range selected by the user.

A typical dashboard aggregate query would look like this:

SELECT SUM(total calls) FROM METRICS WHERE metricID = m1 AND time IN RANGE (t1, t2)

SELECT AVG(response time) FROM METRICS WHERE metricID = m1 AND time IN RANGE (t1, t2)

SELECT SUM(total errors) FROM METRICS WHERE metricID = m1 AND time IN RANGE (t1, t2)

where t1 is the start time and t2 is the end time.

Let's start with a very simple design: make the metric ID the row key. All data points received for that metric would be stored against the row key as columns, each column labeled with the time the data point was received. Both HBase and Cassandra support this simple design, and in both systems all the data stored against a single row key is kept in a single shard. If we query a metric for a time range, we can retrieve all the data points and apply our aggregates as discussed above.

However, there is a physical and practical limitation to the maximum number of columns that can be stored against a single row key. In our metric processing system, we could receive one or more data points every minute; over a period of a day we may receive thousands of data points, over a week it could go into the millions, and over a year into the billions. The design of using metric IDs as row keys and storing all the metric data points against them as columns is not practical.

Another much-recommended design strategy is bucketing the row keys by time range: store the data received during a time period as columns against one row key, and create a new row key for the next time period. For example, if we bucket the row keys by one hour, then from 12:00 AM to 12:00 PM we would have 12 row keys.
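As an illustration of this bucketing scheme, here is a minimal sketch (a hypothetical helper, not AppDynamics code) that derives an hourly bucket from a data point's timestamp and combines it with the metric ID to form a row key:

```java
import java.util.concurrent.TimeUnit;

public class RowKeys {
    // e.g. metricId=42 and a timestamp falling in hour bucket 403931 (hours since
    // the epoch) produce the row key "42:403931"; every data point for that metric
    // in that hour lands under the same row key.
    static String hourBucketKey(long metricId, long epochMillis) {
        long hourBucket = TimeUnit.MILLISECONDS.toHours(epochMillis);
        return metricId + ":" + hourBucket;
    }

    public static void main(String[] args) {
        System.out.println(hourBucketKey(42L, System.currentTimeMillis()));
    }
}
```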

Here is a helpful article on how to model time series data in Cassandra that uses the technique mentioned above.

One may argue that in Cassandra we could create a composite key using a partition key and a clustering key: the metric ID as the partition key, time buckets as the clustering key, and the metrics received stored as columns.

Cassandra’s storage engine uses composite columns under the hood to store clustered rows. All the logical rows with the same partition key get stored as a single, physical wide row. Using this design, Cassandra supports up to 2 billion columns per (physical) row.
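For illustration, here is a hedged sketch of what such a composite-key table might look like, created through the older DataStax Java driver (Cluster/Session API). The keyspace, table, and column names are invented for this example, and the keyspace is assumed to already exist:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraMetricSchema {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // metric_id is the partition key (one wide physical row per metric);
            // time_bucket and event_time are clustering keys, so each data point
            // becomes a column within that physical row.
            session.execute(
                "CREATE TABLE IF NOT EXISTS metrics_ks.metrics ("
                + " metric_id bigint,"
                + " time_bucket bigint,"
                + " event_time timestamp,"
                + " value double,"
                + " PRIMARY KEY (metric_id, time_bucket, event_time))");
        }
    }
}
```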

Storing the same keys and values would work very differently in HBase. There is no concept of a composite key in HBase; a row key is simply an array of bytes. All keys are lexicographically sorted and stored by range in shards called regions. If we design the key carefully, we can store several years of metric data in a single shard, or region, in HBase.


This is a very powerful concept and was very useful for our use cases: we are able to perform aggregate operations over significant time ranges (from six months to two years) in the region server process itself, so only the aggregated value is returned to the calling program, reducing network traffic significantly. Imagine how inefficient a query would be if we had to collect data from 10 different shards (machines) into a client program and then apply aggregates on the values before returning the final aggregated value to the caller.
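The sketch below shows the contiguous-key-range idea in the HBase client API. It assumes a hypothetical "metrics" table whose row keys are the metric ID followed by a time bucket (encoded as raw bytes rather than the string form of the earlier sketch) and a hypothetical column family "m" with a "calls" qualifier. It sums on the client side for simplicity, whereas the approach described above pushes the aggregation into the region server itself:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetricRangeSum {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metrics"))) {

            long metricId = 42L;
            long startBucket = 400000L, endBucket = 403931L; // hour buckets

            // Because keys sort lexicographically, every data point for this metric
            // and time range lies in one contiguous key range (stop row is exclusive),
            // typically served by a single region.
            Scan scan = new Scan()
                    .withStartRow(Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(startBucket)))
                    .withStopRow(Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(endBucket)));

            long totalCalls = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] value = row.getValue(Bytes.toBytes("m"), Bytes.toBytes("calls"));
                    if (value != null) totalCalls += Bytes.toLong(value);
                }
            }
            System.out.println("SUM(total calls) = " + totalCalls);
        }
    }
}
```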

At the time of our evaluation, Cassandra also lacked several key features that were readily available in HBase:

  1. HBase allowed us to give metrics with different resolutions different retention periods: we put them in different column families and set the TTL at the column family level (see the sketch after this list). Cassandra stores the TTL information with every cell, repeating the same value for cells that share an expiration or retention time, which makes storage very bulky. Cassandra 3.0, released in November 2015, fixes this.
  2. HBase provides consistency over availability by design. This was critical for our team because our policy engine evaluates metrics continuously. You can read more on the CAP theorem in this article on DZone.
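As a minimal sketch of the column-family TTL idea from point 1, here is how a metric table with per-resolution families and retention periods could be created through the classic (pre-2.0) HBase admin API. The family names and TTL values are illustrative, not AppDynamics' actual settings:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateMetricTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("metrics"));
            // One column family per resolution, each with its own TTL in seconds:
            // minute-level data expires after 30 days, hourly rollups after a year.
            table.addFamily(new HColumnDescriptor("one_min").setTimeToLive(30 * 24 * 3600));
            table.addFamily(new HColumnDescriptor("one_hour").setTimeToLive(365 * 24 * 3600));
            admin.createTable(table);
        }
    }
}
```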

Cassandra is catching up fast and has overtaken HBase in many areas. Here is another article comparing HBase and Cassandra.

In my next blog, I’ll address the challenges we face with HBase and suggest ideas for improvement. Looking forward to your feedback and comments.

Big Data Monitoring

The term “Big Data” is quite possibly one of the most difficult IT-related terms to pin down. There are so many potential types of, and applications for, Big Data that it can be a bit daunting to consider all of the possibilities. Thankfully, for IT operations staff, Big Data is mostly a set of new technologies being used together to solve some sort of business problem. In this blog post I’m going to focus on what IT operations teams need to know about big data technology and support.

Big Data Repositories

At the heart of any big data architecture is going to be some sort of NoSQL data repository. If you’re not very familiar with the various types of NoSQL databases that are out there today, I recommend reading this article on the MongoDB website. These repositories are designed to run in a distributed/clustered manner so they can process incoming queries as fast as possible on extremely large data sets.

MongoDB Request Diagram

Source: MongoDB

An important concept to understand when discussing big data repositories is sharding. Sharding is when you take a large database and break it down into smaller sets of data that are distributed across server instances. This improves performance because the database can be highly distributed and each instance has less data to query than the same database without sharding. It also allows you to keep scaling horizontally, which is usually much easier than scaling vertically. If you want more details on sharding you can reference this Wikipedia page.
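As a toy illustration of the routing side of sharding (not tied to any particular database), the sketch below hashes a key and maps it to one of a fixed set of shard instances; the shard names and the simple modulo scheme are purely illustrative:

```java
import java.util.List;

public class ShardRouter {
    private final List<String> shards;

    ShardRouter(List<String> shards) { this.shards = shards; }

    // Hash the key and map it to one of the shards, so each instance
    // holds (and queries) only a slice of the overall data set.
    String shardFor(String key) {
        int idx = Math.floorMod(key.hashCode(), shards.size());
        return shards.get(idx);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("shard-a", "shard-b", "shard-c"));
        System.out.println(router.shardFor("customer:12345")); // routes this key to one shard
    }
}
```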

Application Performance Considerations

Monitoring the performance of big data repositories is just as important as monitoring the performance of any other type of database. Applications that want to use the data stored in these repositories submit queries in much the same way as traditional applications querying relational databases like Oracle, SQL Server, Sybase, DB2, MySQL, or PostgreSQL. Let’s take a look at more information from the MongoDB website. In their documentation there is a section on monitoring MongoDB that states “Monitoring is a critical component of all database administration.” This is a simple statement that is overlooked all too often when deploying new technology in most organizations. Monitoring is usually only considered once major problems start to crop up, and by that time the users and the business have already been impacted.


Dashboard showing Redis key metrics.

One thing that we can’t forget is just how important it is to monitor not only the big data repository, but to also monitor the applications that are querying the repository. After all, those applications are the direct clients that could be responsible for creating a performance issue and that certainly rely on the repository to perform well when queried. The application viewpoint is where you will first discover if there is a problem with the data repository that is actually impacting the performance and/or functionality of the app itself.

Monitoring Examples

So now that we have built a quick foundation of big data knowledge, how do we monitor these systems in the real world?

End to end flow – As we already discussed, you need to understand if your big data applications are being impacted by the performance of your big data repositories. You do that by tracking all of the application transactions across all of the application tiers and analyzing their response times. Given this information it’s easy to identify exactly which components are experiencing problems at any given time.

FS Transaction View

Code level details – When you’ve identified that there is a performance problem in your big data application you need to understand what portion of the code is responsible for the problems. The only way to do this is by using a tool that provides deep code diagnostics and is capable of showing you the call stack of your problematic transactions.

Cassandra_Call_Stack

Back end processing – Tracing transactions from the end user, through the application tier, and into the backend repository is required to properly identify and isolate performance problems. Identification of poor performing backend tiers (big data repositories, relational databases, etc…) is easy if you have the proper tools in place to provide the view of your transactions.

Backend_Detection

AppDynamics detects and measures the response time of all backend calls.

Big data metrics – Each big data technology has its own set of relevant KPIs, just like any other technology used in the enterprise. The important part is to understand what normal behavior is for each metric while performance is good, and then identify when KPIs deviate from that normal. This, combined with end-to-end transaction tracking, will tell you if there is a problem, where the problem is, and possibly the root cause. AppDynamics currently has monitoring extensions for HBase, MongoDB, Redis, Hadoop, Cassandra, Couchbase, and CouchDB. You can find all AppDynamics platform extensions by clicking here.
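As a toy sketch of "deviating from normal," the example below compares the latest KPI sample against the mean and standard deviation of its recent history; the three-sigma threshold and the sample values are illustrative choices, not AppDynamics defaults:

```java
public class KpiBaseline {
    // Flag the latest sample as anomalous if it is more than `sigmas`
    // standard deviations away from the mean of the recent history.
    static boolean isAnomalous(double[] history, double latest, double sigmas) {
        double mean = 0;
        for (double v : history) mean += v;
        mean /= history.length;

        double variance = 0;
        for (double v : history) variance += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(variance / history.length);

        return Math.abs(latest - mean) > sigmas * stdDev;
    }

    public static void main(String[] args) {
        double[] callsPerMinute = {1200, 1180, 1250, 1220, 1210, 1190};
        System.out.println(isAnomalous(callsPerMinute, 2500, 3.0)); // true: far outside normal
    }
}
```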


Hadoop KPI Dashboard 1


Hadoop KPI Dashboard 2

Big data deep dive – Sometimes KPIs aren’t enough to help solve your big data performance issues. That’s when you need to pull out the big guns and use a deep-dive tool to assist with troubleshooting. Deep-dive tools are very detailed and very specific to the big data repository type that you are using and monitoring. In the screenshots below you can see details of AppDynamics monitoring for MongoDB.

MongoDB Monitoring 1

MongoDB Monitoring 2

MongoDB Monitoring 3

MongoDB Monitoring 4

If your company is using big data technology, it’s IT operations’ responsibility to deploy and support a cohesive performance monitoring strategy for the inevitable performance degradation that will cause business impact. See what AppDynamics has to offer by signing up for our free trial today.

Data Clouds Part II: My Big Data Dashboard

In my previous blog, I wrote at length about the complexities of running a data cloud in production. This logical data set, spread across many nodes, requires a whole new set of tools and methodologies to run and maintain. Today we’ll look at one of the biggest challenges in managing a data cloud – monitoring.

Database monitoring used to be easy in the days before data clouds. Datasets were stored in a single large database, and there were hundreds of off-the-shelf products available to monitor the performance of that database. When problems occurred, one had simply to open up the monitoring tool and look at a set of graphs and metrics to diagnose the problem.

There are no off-the-shelf tools for monitoring a data cloud, however. There’s no easy way to get a comprehensive view of your entire data cloud, let alone diagnose problems and monitor performance. Database monitoring solutions simply don’t cut it in this kind of environment. So how do we monitor the performance of our data cloud? I’ll tell you what I did.

It just so happens I work at AppDynamics, maker of one of the most powerful application monitoring tools on the market. We monitor all parts of your application, including the data layer, with visibility into both relational and NoSQL systems like Cassandra. With AppDynamics I was able to create a dashboard that gives me a single pane-of-glass view into the performance of my data cloud.


My Big Data Dashboard

This dashboard is now used by several teams at AppDynamics, including Operations, QA, Performance, and Development, to see how our data cloud is running. All key metrics for all of our replicas are graphed side by side on one screen. This is the dream of anyone running big data systems in production!

Of course, not all problems are system wide. More often than not you need to drill into one replica or replica set to find a problem. To do that, I simply double click on any part of my big data dashboard to focus on a single replica, change the time range, and add more metrics.

Data clouds are difficult to run, and there aren’t any database monitoring tools fit to monitor them yet. But instead of sitting around waiting for data monitoring tools to catch up with our needs, I’ve built my own Big Data Dashboard with a monitoring tool designed for applications.

Of course the fun doesn’t stop here…I still need to find a way to set up alerts and do performance tuning for my data cloud. Stay tuned for more blogs in this series to see how I do it!

An Introduction to the Data Cloud

As data has grown exponentially at many sites, companies have been forced to horizontally scale their data.  Some have turned to sharding of databases like Postgres or MySQL, while others have switched to newer NoSQL data systems.  There have been many debates in the last few years about SQL vs. NoSQL data management systems and which is better.  What many have failed to grasp, though, is how similar these systems are and how complex they both are to run in production in high scale.

Both of these systems represent what I call a Data Cloud: a logical data set spread across many nodes. While developers have heated debates about which system is better and how to design code around it, those in DevOps usually struggle with very similar issues because the two systems are mostly the same. Both systems:

  • Run across many nodes with large amounts of data flowing between them and from/to the application
  • Strain both the hardware of all nodes, and the network connecting them
  • Maintain duplicate data across nodes for fault tolerance, and must have failover ability
  • Must be tuned on a per-node and cluster-wide basis
  • Must allow for growth by adding additional nodes

Running this Data Cloud in production presents a new set of challenges for DevOps, many of which are not well understood or addressed.  One of the main challenges is the management and monitoring of these systems, for which few (if any) tools or products exist at this time.

When systems were smaller and you ran a single database in production, you probably had all the necessary systems in place. With a plethora of products for management, monitoring, data visualization, and backups, it was not hard to be successful and meet your SLAs.

But now all this is much more complex once you move into the world of the Data Cloud.  Now you have a large number of nodes, all representing the same system and still needing to meet the same SLAs as the old simple DB from before.  Let us look at the challenges for running a production Data Cloud successfully.

Capacity Planning

Do you know how many nodes you need?  How many nodes do you put in each replica set?  How much latency and throughput do you need in your network for the nodes to communicate fast enough?  What is the ideal hardware to use for each node to balance performance with costs?

Monitoring

How do you monitor dozens, hundreds or even thousands of nodes all at once?   How do you get a unified view of your data cloud, and then drill down to the problem nodes?   Are there even any off-the-shelf monitoring tools that can help?  Your old monitoring tool won’t be very useful anymore unless you are willing to look at every node one by one to see what is going on there.

Alerting

How do you set up a common set of alerts across all nodes? And how do you keep your alert thresholds in sync as you add and remove nodes? More importantly, even assuming you have alerting in place, once staff receives a critical alert, how will they know where to find the troubled node in the massive cloud, or whether it’s a node-level issue or something more global in nature? This must be done quickly during critical outages.

Data Visualization

How does your staff view the data when it is distributed?  In case of data inaccuracy, how can they quickly identify the faulty nodes and fix up the data?

Performance Tuning

As performance degrades, how do you troubleshoot and identify the bottlenecks? How do you find which nodes may be the cause of the problem? How do you improve performance across all the nodes?

Data Cloud Management

How do you back up all the data while consistently tracking which nodes were backed up successfully and when?  How do you make schema changes across all the nodes in one consistent step without breaking your app? And how do you make configuration changes on various nodes or across all nodes?  And how do you track the configurations of each node and keep them consistent across your system?

By now you should see that there is a lot to think about before endeavoring to launch a production Data Cloud.  Too many companies focus all their energies on deciding which DB or NoSQL system to use and developing their apps for it.  But that might turn out to be the lesser of your challenges once you struggle to put the system into production.  Be sure you can answer all the questions I have listed above before your launch.

Boris.

AppDynamics takes Cassandra out for Beer and Big Data

Welcome Cassandra, let’s start with a quick introduction of who you are and what you can do.
I am a high-performance distributed scalable database.  I am really good at doing lots of operations and growing as your data needs increase, and to top that all off I run on commodity hardware which means I’m really great for running on the cloud.

Awesome, oh by the way, what beer are you drinking tonight?
Tonight I’m drinking a Shiner from good old Texas.

Excellent. First, I heard it through the grapevine that you’re highly available. Can you confirm or deny these rumors, and what does that really mean?
Highly available means that a distributed system can lose one or more of its servers and still keep the cluster as a whole up and running.  In other words, because I’m designed to run on commodity hardware, failure should be expected.  In my scenario, I can lose a server and keep on chugging along without the cluster going down.

AppDynamics gains Velocity

AppDynamics had a great time last week at Velocity; it was probably the most relevant event we’ll attend this year. Great keynotes, great sessions, and an audience that truly understood the significance of application performance. At our booth, it was refreshing to hear questions about monitoring application performance in the cloud (specifically EC2), as well as for new web technologies like NoSQL (Cassandra, MongoDB, etc.). Many attendees were considering or were already in the process of moving their apps to the cloud, and it was good to see that Application Performance Management (APM) was a key topic on their agenda.

AppDynamics Monitoring for Cassandra and Big Data

With AppDynamics 3.2, we recently announced support for Big Data. This essentially allows our customers to manage application performance in production applications while aligning with the growing trend of NoSQL technologies. As applications become more distributed and complex, many organizations are looking to exploit the benefits of Cassandra, Hadoop and MongoDB (to name a few), which offer more agile data models along with superior read/write performance. They are also significantly cheaper to own and scale compared to the traditional relational database.

But as I mentioned in my previous blog, NoSQL is about “Not Just SQL” rather than waving goodbye to the relational database. So as a vendor that provides innovative APM solutions, it was a logical step for AppDynamics to provide support for NoSQL technologies for our customers from the outset, so they could monitor the performance of both their relational and non-relational databases.

This blog takes a look at the new capabilities AppDynamics Pro provides for monitoring Cassandra, a highly scalable, second-generation distributed database that has seen modest adoption over the last 18 months. This adoption has been helped significantly by the publicity generated by heavyweight Cassandra users such as Facebook and Twitter. It isn’t just hype, though; at AppDynamics we’re seeing more and more of this technology in our customers’ environments, and they love it.