At the recent NoSQL Now Conference in San Jose, I attended several sessions and realized how well established open source packages have become.
During the conference, many people talked about how open source is creeping into the enterprise. Because of my past involvement with MySQL and JBoss, I’ve experienced pushback from early customers firsthand. But now open source is moving into the enterprise mainstream, as described here.
Open source packages mentioned at the conference
Before attending the conference, I already knew of packages such as Hadoop, Cassandra, MongoDB, Storm, Kafka, Spark, and ZooKeeper. They have been discussed over and over again at conferences and in the media. During the conference, I also noticed some packages that were new to me, and I would like to touch on them briefly in this blog. Many other articles describe them in more detail; my intent here is simply to let you know they exist.
These packages are:
There are many more packages, but these appeared on my radar screen.
Drill can be described as a schema-free SQL engine that delivers good performance over many kinds of storage, including Hadoop, as shown in Shiran’s slide below.
Shiran is a founder and Project Management Committee (PMC) member of Apache Drill, an open source version of Google’s Dremel system. He pushed the Drill project while he was with MapR and presented the idea (YouTube video, about 45 minutes) at a Google Developer Group Meetup back in September 2012. Shiran blogged the initial idea of Drill here. The rest is history. The video is long but may be of interest for understanding how the Drill idea came about; his presentation starts after one hour and two minutes. MapR provides Hadoop solutions to Google, and its products include a commercial version of Hadoop. This also explains the many contributions MapR has made to the Drill project. MapR’s competitors include Cloudera and Hortonworks.
Dremio is very new (founded in June 2015), and more detailed information about it is accessible from these two Twitter accounts, @ApacheDrill and @DremioHQ.
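To make the "schema-free SQL engine" description of Drill a little more concrete, here is a minimal Python sketch that builds a SQL request for Drill's REST interface. It queries a raw JSON file directly, with no table definition created beforehand. The host and port, the file path `/tmp/customers.json`, the field names, and the helper `build_drill_request` are all assumptions made for illustration. The sketch only constructs the request and does not send it.

```python
import json
import urllib.request


def build_drill_request(sql, host="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's REST query endpoint.

    The host and port are assumed defaults for a local Drill instance.
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical file and fields: the query reads the JSON file in place,
# including a nested field, without any schema declared up front.
sql = (
    "SELECT t.name, t.address.city "
    "FROM dfs.`/tmp/customers.json` AS t LIMIT 5"
)
request = build_drill_request(sql)

# With a Drill instance running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The point of the sketch is the query itself: the file path stands in for a table name, which is what "schema-free" means in practice here.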
Druid is (according to the Druid website)
an open-source analytics data store designed for business intelligence (OLAP) queries on event data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation.
Among the open source packages I mention in this blog, Druid is one of the two that are not Apache projects. Yang and Gian Merlino (still with MetaMarkets) gave a presentation that was recorded at the O’Reilly Strata Conference in 2014 (about a 33-minute YouTube video). It is very similar to the one Yang gave at the NoSQL conference. Yang was with MetaMarkets, which introduced Druid in April 2011 and made it available as open source in October 2012. Druid is designed as a data store for streaming analysis.
The Lambda architecture is designed for both streaming and batch data analysis. In developing it, Nathan Marz used Storm, but Yang showed a design with Samza, as shown in the slide below. However, Storm can be used in place of Samza.
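The idea of combining streaming and batch analysis can be sketched in a few lines of plain Python. This is only a toy illustration of the Lambda pattern, not Yang's design: ordinary functions stand in for the batch layer (for example, Hadoop) and the streaming layer (Storm or Samza), and the event data and function names are made up for the example.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute page counts from the full historical log."""
    return Counter(e["page"] for e in events)


def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(e["page"] for e in recent_events)


def query(batch, speed, page):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch.get(page, 0) + speed.get(page, 0)


# Made-up events: three already processed by the batch job, one still in flight.
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent = [{"page": "/home"}]

batch = batch_view(historical)
speed = speed_view(recent)
print(query(batch, speed, "/home"))  # 3
```

The batch view is accurate but stale, and the speed view is fresh but covers only recent data. Merging the two at query time is what lets the architecture serve both.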
Druid is available here.
Continued in Part 2 (Flink and Samza)