This was the second time I’ve attended the NoSQL conference. I attended one of the tutorials and two keynotes and interviewed one company this year. My interest in the NoSQL marketplace is to follow and monitor Big Data produced by instrumented sensors in the smart grid, and to produce useful information for its operations.
Last year I attended a tutorial titled NoSQL 101, presented by Dan McCreary of Kelly-McCreary & Associates, that gave some details of NoSQL. This year I attended Introduction to the Hadoop Ecosystem, presented by Vladimir Bacvanski of SciSpike, which covered how Hadoop works, as well as its ecosystem.
Hadoop has been used synonymously with Big Data and is described in many places, like Wikipedia. I do not know about you, but when I am new to some technology area, reading its description alone does not help very much.
I need to dig a little deeper into the subject, sampling it here and there, before I can get the full picture. As with anything else, there is a big gap between the technical keyword and the details. My blog posts sit in the middle, neither too high-level nor too detailed.
Some of Vladimir’s material is well known, but it was still useful for refreshing my memory and filling gaps in my knowledge. His presentation helped me to fully understand what Hadoop is, and what it is not, at a high level. I highly recommend his tutorial. The following is a random list of what I thought useful.
- Big Data sources are transactions (business systems), documents (text, images, sound, and video), social media (Twitter, Facebook, and blogs), and sensors (instrumented devices). Sensors are of special interest to me because of their use in smart grid. Many heterogeneous types of data are constantly being generated and transmitted asynchronously at different speeds. My interest is to find out how to collect/aggregate/merge/analyze them to produce useful information about what actions to take to optimize power distribution.
- The three Vs of Big Data are volume, variety, and velocity. There are some more Vs, such as veracity (truthfulness) and viability.
- The two key components of Hadoop are MapReduce and the Hadoop Distributed File System (HDFS). MapReduce jobs run over data stored in HDFS.
- MapReduce is used for analysis. Input is split into pieces that are distributed to many nodes in the cluster, where each piece is processed in parallel (map). The intermediate results are then grouped and aggregated into a smaller result (reduce). The distribution and scheduling are automated and transparent to the user.
- The idea is to move processes to data rather than move data to processes (the traditional way).
- Hive is a data warehousing infrastructure, including HiveQL (query language), based on Hadoop.
- Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
- HBase is a column-oriented database built on top of HDFS.
- ZooKeeper provides a distributed coordination service.
- Other relevant pieces include HCatalog, a part of Hive that enables the interoperability of Pig, Hive, and MapReduce; Sqoop, a tool for automated import/export between Hadoop and relational databases; and Flume, a distributed streaming tool for collecting, aggregating, and moving large amounts of log data.
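To make the map and reduce steps in the list above a little more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the meter readings are made up for illustration. In a real cluster, each split would be processed on a different node and the grouping step would be handled by the framework.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one raw 'meter_id,kwh' record into a (key, value) pair."""
    meter_id, kwh = record.split(",")
    yield meter_id, float(kwh)


def shuffle(pairs):
    """Shuffle: group all values by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single, smaller result."""
    return key, sum(values)


def run_job(splits):
    # Each split stands in for a block of input handled by a separate node;
    # here we simply loop over them on one machine.
    mapped = [
        pair
        for split in splits
        for record in split
        for pair in map_phase(record)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical smart meter readings, already divided into two input splits.
splits = [
    ["meter-a,1.5", "meter-b,2.0"],
    ["meter-a,0.5", "meter-c,4.0"],
]
totals = run_job(splits)
print(totals)
```

The point of the sketch is only the shape of the computation: many independent map calls, a grouping by key, and a reduce call per key that produces the total energy per meter.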
Analysis conducted with Hadoop is primarily batch processing rather than real-time processing. In the smart grid, the data collected and transmitted include both real-time and non-real-time data. Real-time processing of incoming data is gaining a lot of interest these days. It requires stream processing to capture data as they arrive and apply analytics to them. Vladimir also described some of the popular stream processing systems.
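Independent of any particular streaming system, the difference from batch processing can be pictured with a small Python sketch: instead of waiting for the whole data set, the analytic (here a rolling average) is updated as each reading arrives. The function names and the voltage values are hypothetical, and a real sensor feed would be unbounded rather than a short list.

```python
from collections import deque


def rolling_average(readings, window=3):
    """Yield the mean of the most recent `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # old readings fall out automatically
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)


def sensor_feed():
    # Hypothetical voltage readings standing in for a live sensor stream.
    for voltage in [120.0, 121.0, 119.0, 126.0]:
        yield voltage


# One result is produced per incoming reading, without storing the full history.
averages = list(rolling_average(sensor_feed(), window=2))
print(averages)
```

Because only the last few readings are kept in memory, the same loop would work on a feed that never ends, which is the property that batch-oriented analysis lacks.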
There is still a lot to learn. The good news is that many of the technology companies providing NoSQL and analytics engines seem to be in search of areas for application in addition to SNS and business transactions. Smart grid is known to intersect power, communications, and IT. Those emerging ICT technologies certainly can be applied much more to smart grid, providing a larger market for ICT companies.