Will NoSQL and Big Data Kill the DBA?

May 18, 2011

Remember the good old days when developers and DBAs would argue over who and what was killing the relational database? “It’s your crap SQL,” “You forgot to create an index,” “You don’t know what an index is”…and so on. Do you remember when the DBA occasionally spoke up and served out humble pie on how to make SQL statements go faster? Well, my friends, those biblical days could soon be over with the adoption of NoSQL technologies. Or is that not true? Will the DBA perform a mighty rollback?

If you look back at the last decade, there is no doubt that nearly all business today is connected or conducted online. In fact I can’t physically remember the last time Mrs. AppMan and I walked into the bank, wrote a letter, or returned a movie rental. Half our life is done online and it’s not stopping anytime soon now that Mrs. AppMan has a shiny new MacBook to shop on. IDC and EMC recently published a report stating that 70% of the digital universe this year will be generated by users at home, at work, and on the go. In total that amounts to 880 billion gigabytes of data being created this year. All that data has got to go somewhere, and it’s left to applications to query it and make it available to users wherever they are in the world. Data availability and performance have never been so important.

In the olden days I spent many days and nights working with customers to help scale their applications through the rise and fall of the Internet. It’s fair to say most of the performance issues I resolved came down to data access between the application server and the relational database. The verbosity and complexity of APIs like JDBC meant that developers made a lot of mistakes. To make the problem worse, very few developers knew how to fully exploit the features of a relational database like Oracle. Admitting to knowing PL/SQL was almost like Luke Skywalker moving to the dark side. Developers and DBAs rarely got along, largely because most of the DBAs broadcast the fact they were earning more money than Bill Gates contracting. Frameworks like Spring did a rather good job of taming API complexity by helping developers write less code, so they made fewer mistakes. You then had a series of Object Relational Mapping (ORM) frameworks that provided a virtual database so developers could map their application objects to database tables with relative automation and transparency. Whilst ORMs helped the developer with data persistence, when wrongly implemented they often abused the database and, in turn, the DBA (not that any developer was offered sympathy). The bottom line is that data access and persistence was, and still is, a key contributor to application performance.
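To illustrate just how much ceremony a single lookup took, here is a minimal plain-JDBC sketch; the connection URL, credentials, and the customers table are invented for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CustomerLookup {

    // Hypothetical connection details, for illustration only.
    private static final String URL = "jdbc:oracle:thin:@//dbhost:1521/ORCL";

    public String findCustomerName(long customerId) throws SQLException {
        Connection conn = null;
        PreparedStatement stmt = null;
        ResultSet rs = null;
        try {
            conn = DriverManager.getConnection(URL, "app_user", "secret");
            stmt = conn.prepareStatement(
                    "SELECT name FROM customers WHERE id = ?");
            stmt.setLong(1, customerId);
            rs = stmt.executeQuery();
            return rs.next() ? rs.getString("name") : null;
        } finally {
            // Forgetting any one of these closes was a classic cause of
            // connection-pool exhaustion and mysterious slowdowns.
            if (rs != null) try { rs.close(); } catch (SQLException ignored) {}
            if (stmt != null) try { stmt.close(); } catch (SQLException ignored) {}
            if (conn != null) try { conn.close(); } catch (SQLException ignored) {}
        }
    }
}
```

Spring’s JdbcTemplate reduces the same lookup to roughly one call and handles the resource cleanup for you, which is exactly the “write less code, make fewer mistakes” point.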

The problem today is that relational databases make it really difficult for organizations to scale their applications to deal with high user concurrency and transaction volumes. They dictate that data be stored in rigid, fixed table schemas that enforce relationships, so queries become more and more complex, and slower. Another problem is that they are very difficult to scale horizontally, and when that is achieved it comes at a very high cost of ownership in software licenses and hardware. Scaling your applications with relational database technology is not a cheap hobby, from either a cost or a latency perspective. It’s why companies like Amazon, Google and Facebook created their own technologies to better scale their applications. They were also super smart in sharing that work, through papers and open source, so the wider community could leverage and evolve what they started.

It’s why many organizations today are looking to NoSQL technologies to address the round-trip time, cost, and scalability limitations of the big fat relational database. You’ve probably seen social media networks like Twitter and Facebook grabbing lots of publicity around how they manage several hundred million users processing billions of transactions a day against petabytes of data. Technologies like Hadoop, BigTable and the lovely Cassandra are becoming “sexy.” Memory and disk are cheap as chips these days, which means developers can store, process and do more with data in memory and on disk than ever before. And as you know, memory is fast and disk is slow. The closer you bring data to business logic, the faster an application will respond and the less dependent you become on things like disk I/O and contention latency, which limit transaction response times and throughput. The only problem is that memory isn’t permanent or limitless; at some point you need to write data to disk.

Organizations must also consider Cloud Computing, a platform that can allocate IT resources to applications as and when they need them, thus providing elasticity. The bad news is that relational databases don’t do elasticity very well because their architectures make them difficult and expensive to scale. They also rely heavily on disk, which in the cloud is still a problem because most of it is shared and virtualized. You’d have to be mad to run Oracle on a VM with cheap disk. It’s kinda like asking Usain Bolt to hop the 100 metres on one leg and still win. To put things in perspective, the read/write latency of a NoSQL database like Cassandra can be up to 30 times lower than that of an equivalent relational database like MySQL when both are loaded with 50+ GB of data. That is a huge difference when it means you can service 30 times more users and transactions.

So how can you leverage NoSQL and make your applications more elastic? You can start by moving your non-critical data from the relational database to NoSQL technologies or distributed data stores, which are faster, cheaper, highly available, elastic, and thus easier to scale and manage. Not all data has to live in the relational database, even though it may be convenient to keep it all in one place. You want to structure, store and query your data in the way that makes the most sense for what you want to achieve. If your application has data that rarely changes (e.g. product catalogs, page content) and you want lightning-fast read times, you might store that data in distributed caches. If your application has data that changes a lot, isn’t mission critical, and requires low latency, then you might read and write that from a NoSQL database and save the relational database for when you need transaction atomicity. You can scale your distributed caches and/or NoSQL databases on the fly so that they take most of the load away from your relational database. The fewer trips you make to the relational database, the better your application’s concurrency and throughput will be, because most of your data will be read from memory.
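As a rough illustration of the cache-aside idea for rarely-changing data, here is a minimal sketch; the KeyValueCache and ProductDao interfaces are hypothetical stand-ins for whatever distributed cache client and relational DAO you actually use.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical abstractions: swap in your real distributed cache client
// and your real relational DAO.
interface KeyValueCache {
    String get(String key);
    void put(String key, String value, long ttl, TimeUnit unit);
}

interface ProductDao {
    String loadDescription(String productId); // hits the relational database
}

public class ProductCatalogService {

    private final KeyValueCache cache;
    private final ProductDao dao;

    public ProductCatalogService(KeyValueCache cache, ProductDao dao) {
        this.cache = cache;
        this.dao = dao;
    }

    /** Cache-aside read: catalog data rarely changes, so serve it from memory. */
    public String productDescription(String productId) {
        String key = "product:" + productId;
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                                // fast path, no database trip
        }
        String fromDb = dao.loadDescription(productId);  // slow path, one trip
        if (fromDb != null) {
            // A generous TTL is fine for data that rarely changes.
            cache.put(key, fromDb, 12, TimeUnit.HOURS);
        }
        return fromDb;
    }
}
```

Only a cache miss ever costs you a trip to the relational database, which is where the concurrency and throughput win comes from.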

However, not all business transactions have equal priority or importance. A user search transaction might be important, but if it fails every now and then it’s not the end of the world. If a user’s credit card payment fails, though, that has a direct impact on the business. Also, moving to NoSQL-based application architectures isn’t a fast or easy project. Developers need to learn new skills and master non-functional requirements so they can implement the right patterns and technologies to ensure data is accessible, available and consistent for each business transaction a user executes.
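One way to picture the “right pattern per business transaction” point is a thin routing layer that sends must-not-fail writes to the relational database and tolerant, high-volume writes to the NoSQL store. The store interfaces below are invented for illustration, not any particular product’s API.

```java
// Hypothetical store interfaces; real projects would use their JDBC/ORM
// layer and a NoSQL client respectively.
interface RelationalStore {
    void saveWithinTransaction(String entity, String payload); // ACID commit
}

interface NoSqlStore {
    void saveEventually(String entity, String payload);        // eventually consistent
}

public class PersistenceRouter {

    private final RelationalStore relational;
    private final NoSqlStore noSql;

    public PersistenceRouter(RelationalStore relational, NoSqlStore noSql) {
        this.relational = relational;
        this.noSql = noSql;
    }

    /** A failed payment hurts the business, so it must be atomic and durable. */
    public void recordPayment(String paymentJson) {
        relational.saveWithinTransaction("payment", paymentJson);
    }

    /** Losing the odd search log entry is tolerable, so favour throughput. */
    public void recordSearch(String searchJson) {
        noSql.saveEventually("search_event", searchJson);
    }
}
```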

So will the DBA die? I’m afraid it’s unlikely. You see, every new or old technology has its advantages and drawbacks. The reality is that NoSQL stands for “Not Only SQL” rather than “No SQL,” so the relational database isn’t going away anytime soon. Organizations still need transaction atomicity for their mission-critical business transactions. When a customer makes a credit card payment or a trader executes a trade, those business transactions need to be committed to disk in real time. The state of the transaction and its data needs to be consistent wherever that data is updated and accessed from. The reason NoSQL solutions are so fast, elastic, scalable, and available is that they are distributed data stores that keep much of their working data in memory and run across multiple nodes and physical servers, meaning that if one node fails there are plenty of others to take over. When a transaction writes to one node it’s incredibly fast, but that data then has to be replicated across the other nodes in the NoSQL grid for it to be truly consistent.

Read/write consistency is still something that NoSQL-based solutions have yet to conquer; they sacrifice ACID guarantees for read/write performance, elasticity, and high availability. So while NoSQL enables applications to scale to high transaction concurrency and data volumes, it needs to work in parallel with relational databases to ensure data consistency.
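A toy model makes the trade-off concrete: the sketch below acknowledges a write as soon as a single replica has it and copies it to the other replicas in the background, so a read against another replica can briefly return stale (or no) data. Everything here is illustrative and not modelled on any specific NoSQL product.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative only: a write is acknowledged after hitting one replica,
 *  then propagated in the background (eventual consistency). */
public class ToyReplicatedStore {

    private final List<Map<String, String>> replicas = List.of(
            new ConcurrentHashMap<>(), new ConcurrentHashMap<>(), new ConcurrentHashMap<>());
    private final ScheduledExecutorService replicator =
            Executors.newSingleThreadScheduledExecutor();

    public void write(String key, String value) {
        replicas.get(0).put(key, value);             // fast local write, acked immediately
        replicator.schedule(() -> {                  // replication happens later
            for (int i = 1; i < replicas.size(); i++) {
                replicas.get(i).put(key, value);
            }
        }, 50, TimeUnit.MILLISECONDS);
    }

    public String readFrom(int replica, String key) {
        return replicas.get(replica).get(key);       // may be stale until replication lands
    }

    public static void main(String[] args) throws InterruptedException {
        ToyReplicatedStore store = new ToyReplicatedStore();
        store.write("balance:42", "100.00");
        System.out.println(store.readFrom(2, "balance:42")); // likely null: not yet replicated
        Thread.sleep(100);
        System.out.println(store.readFrom(2, "balance:42")); // "100.00" once replication lands
        store.replicator.shutdown();
    }
}
```

Real systems let you tune this trade-off (for example with quorum reads and writes), paying back some latency in exchange for stronger guarantees.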

Now that Oracle owns Java, it’s going to be interesting to see how they embrace the open source community and its NoSQL technologies.

App Man.

Sandy Mappic
