At the recent NoSQL conference in San Jose, California, I had a chance to chat with Scott Jarr, cofounder and chief strategy officer of VoltDB. I wrote an overview blog post that touched on VoltDB, and this is a more detailed account of my conversation with Scott.
When I was researching the NoSQL segment, I found it confusing enough, but there is also a NewSQL movement, which confused me further. The NoSQL movement began as an effort to accommodate the Big Data phenomenon. In the traditional database segment, ACID (atomicity, consistency, isolation, and durability) is of utmost importance. The relational database was developed for transaction-oriented applications and to guarantee ACID. The traditional relational or SQL database is fine as long as the data comes in at a reasonable speed and volume and is of limited variety. But at some point, these parameters exceeded what the traditional SQL database could handle, and new ways to cope with them were increasingly required. That is where NoSQL comes in. NoSQL, in general, relaxes some of the rigid SQL rules (abandoning SQL, and thus ACID, partially or altogether) to accommodate these new requirements: scale-out, high availability (HA), replication, and performance. Therefore, NoSQL in general does not have SQL, a relational schema, joins, or ACID, since these are traits of the relational/SQL database. Scott put the comparison of Old SQL, NoSQL, and NewSQL on a piece of paper as we spoke, and I have reproduced it here. Old SQL (yet another term) refers to the traditional relational/SQL database that dominates the enterprise world.
Comparison of Old SQL and NoSQL
In other words, in order to gain scale-out, HA, replication, and performance, NoSQL abandoned SQL and the relational schema partially or altogether. What NewSQL claims is that it can deliver every feature in the table above while keeping the relational/SQL schema (and therefore ACID and joins).
If that table is expanded with NewSQL, we have the following.
Comparison of Old SQL, NoSQL, and NewSQL
How can that be possible with NewSQL? The performance gain is a result of new architectures that remove the old baggage of Old SQL, and many of them also leverage memory for additional improvements. Actually, Gigaspaces, which I also interviewed and will write about later, has an in-memory cache technology that works with other NoSQL companies. However, Michael Stonebraker, CTO of VoltDB, said in one of his talks (available in a 30-minute video) that running in memory alone does not guarantee the performance gain needed to accommodate the speed at which Big Data comes in.
Mike explained in his talk that there is nothing wrong with the concept of SQL itself. It is the implementation of SQL that causes the problems shown in the table. Because of the less than perfect implementation, 96% of the time is spent on overhead and only 4% on useful work, as indicated in the following graph extracted from his presentation.
CPU cycle use in a typical SQL implementation. Most of it—96%—is used for overhead.
Unless these overheads are removed, placing all data into memory will not bring an extreme performance improvement, because that addresses only the 4% and leaves the other 96% untouched. Typical NoSQL databases bypassed this problem by abandoning SQL or supporting it only partially. VoltDB instead confronted the inefficient implementation of SQL head-on and developed its own SQL database from the ground up to eliminate these overheads. I am not covering each overhead in detail, but you can watch Stonebraker's easy-to-follow video.
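To make the 96%/4% argument concrete, here is a rough back-of-the-envelope sketch. It is my own illustration, not something from Stonebraker's talk: it assumes the two percentages can be treated as fractions of total execution time and applies Amdahl's law to them. The function name `overall_speedup` and the 10x figure are hypothetical choices for the example.

```python
def overall_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Amdahl's law: whole-system speedup when only part of the work gets faster.

    fraction_improved: share of total time affected by the optimization (0..1)
    local_speedup:     how much faster that share becomes (may be infinite)
    """
    untouched = 1.0 - fraction_improved
    return 1.0 / (untouched + fraction_improved / local_speedup)


# Figures quoted in the article (treated here as fractions of run time).
USEFUL_WORK = 0.04
OVERHEAD = 0.96

# Assumption: an optimization makes only the 4% of useful work 10x faster.
only_useful_work = overall_speedup(USEFUL_WORK, 10.0)

# Upper bound if the 96% overhead were eliminated completely.
no_overhead = overall_speedup(OVERHEAD, float("inf"))

print(f"Speeding up only the useful work: {only_useful_work:.2f}x")
print(f"Removing all overhead (upper bound): {no_overhead:.1f}x")
```

Under these assumptions, speeding up only the useful work barely moves the total (about 1.04x), while eliminating the overhead has a ceiling of about 25x. That gap is the reason for rebuilding the implementation rather than just adding memory.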
OK, I get it. Then what does this mean for the whole area of NoSQL? Does it mean that the whole area gets consolidated into a single technology like NewSQL? Scott drew me a good figure to explain this, which he had already published in his own blog posts (the figure below came from part-1 and part-2).
From Scott Jarr’s blog. On the Y axis, data speed, size, and complexity grow upwards.
There are five areas to address in the enterprise in terms of data collection and analysis (analytics): interactive, real-time analytics, record lookup, historical analytics, and exploratory analytics.
The five areas are further explained in the following figure with applications and time scale.
Note that VoltDB is colored differently from the rest of NewSQL in the graph. That is only meant to highlight its position within the NewSQL group; VoltDB does fall into the NewSQL camp. Scott emphasized its performance superiority over others. The performance benchmark they share is 3 million transactions per second (TPS). According to Scott, the traditional RDBMS is trying to cope with the Big Data problem (velocity, volume, and variety) by scaling up, that is, by throwing in more CPU and storage power rather than using parallel computing.
In Stonebraker’s video, he said that VoltDB was five times faster than Cassandra and also faster (he did not say by how much) than an unnamed incumbent’s database. When I consulted for MySQL, it was very fast (before version 5.0, which incorporated enterprise-ready features) and faster than this incumbent’s database, but they could not publish the benchmark for fear of a lawsuit. I can understand that. When I consulted for JBoss, a Japanese open-source consortium compared its performance with that of other products like IBM’s WebSphere, without any tuning. The numbers were not very good, mainly because those who ran the benchmarks did not know how to tune JBoss as well as they knew how to tune IBM’s product. After a JBoss engineer flew over there and tuned it, the results improved drastically. So when we conduct a performance comparison, we need to set ground rules that apply to every participant.
Of course, there are some overlaps among those technologies and their areas of application, but this figure gives a good picture of how each technology suits its application area. Hadoop is based on batch processing and is not suitable for real-time analysis. Many people think Big Data and Hadoop are synonymous, but this clearly shows they are related but not the same. In the utilities business, a large amount of meter-read data gets collected, aggregated, and stored. By analyzing power usage for a particular area daily or monthly, a utilities company can probe into usage patterns and trends. Actually, some utilities are using Hadoop now, according to the Soft Grid conference.
Scott thinks NewSQL, NoSQL, the data warehouse, and Hadoop will remain separate technologies because each of them is suited to a specific area of data collection and analytics. But he advocated that these areas and their tools be tightly integrated to provide analytics and thus effective real-time actions, as in the following figure. By feeding the results of analytics over long time spans into short-time-scale analytics, more effective actions could be obtained.
Finally, he showed the current applications of VoltDB, as follows.