In addition to the tutorial I attended at NoSQL Now 2013, I heard two keynote speech sessions. They had similar themes but different emphases. The power grid transmits not just power but also a vast amount of data/information about power delivery. This is similar to the telephone system’s consisting of signaling (control) and voice transmission (data). Those systems that constitute the nation’s invaluable infrastructures must be able to sustain any foreseen and unforeseen failures. To keep our communications and lights on, they have to be constructed with redundancy, resiliency, self-healing, and fault tolerance.
A computing system is the same way. To be able to monitor and control enormously large and distributed systems, we need to factor in sudden but inevitable failures when we design and operate them. The nation’s infrastructures depend on them.
Keynote by Nathan Marz
The first tutorial was by Nathan Marz, a developer of Storm, who was with BackType and later Twitter. Twitter acquired BackType in mid-2011. After the acquisition, Storm was open-sourced and is available here. You can listen to Nathan’s talk about Storm here (about 55 minutes). He is currently with his stealth startup. The title of his talk was Your Code is Wrong, and his entire presentation is available on the SlideShare site.
In sum, his point was that any computing system has a lot of dependencies on many other undependable and unreliable components, whether they are software or hardware. Even if your code is perfect, other components your code depends on contain bugs, and therefore your code cannot be 100% correct. His argument was very convincing, and the following figure is awesome. He said your code dependency ultimately goes all the way to quantum mechanics, of which no living soul has complete understanding.
Dependency chain that may go all the way to quantum mechanics
In addition, your code cannot be correct because of:
· mismatches between design and real-life input space
· wrong logic implementation in the code
· shifting requirements
Therefore, unforeseen failures are inevitable and systems should be designed and operated with those factored in.
Nathan used Storm as an example of minimizing the deviation of software from its requirements.
He made five points:
- measure and monitor
- consider immutability
- minimize dependencies
- pay attention to input range
When I listened to his talk, #1, #3 and #4 were clear but not #2 and #5. So I did some research and found that he was actually talking about a subject that is much deeper. He was really talking about the Lambda Architecture, which will be discussed extensively in his upcoming book Big Data – Principles and best practices of scalable realtime data systems. The book has not been published, but you can read some of the chapters via Manning Early Access Program (MEAP). I will wait for its official publication, but James Kinley actually used MEAP and read it.
I like the concepts of building immutability and recomputation into a system, and it is the first architecture to really define how batch and stream processing can work togetherto solve a myriad of use cases. With the general emphasis moving more towards realtime, I see this book being a must read for all Big Data developers and architects alike.
There was also a blog post by Christian Gügi regarding the Lambda Architecture. Both James’s and Christian’s analyses are helpful for understanding Nathan’s points. Nathan also has another presentation available here, Runaway complexity in Big Data and a plan to stop it. This also helps with understanding his points.
The Lambda Architecture is pretty complicated and takes some time to explain, if I fully understand it, so I only touch on it at a very high level. You can read some or all of the materials I showed here (and more) to understand it fully. Nathan’s point is that any query is a function of all data. The “all data” part is important.
From Nathan Marz’s presentation, Runaway complexity in Big Data and a plan to stop it.
Data change as time goes by. In other words, data traditionally have been considered mutable (changeable). That is, the value of data changes as time passes, and they have a correct value at a particular point in time. If a wrong value is assigned to data by mistake or data get corrupted, we lose that data completely and forever. Instead, if all the versions of the same data are captured and stored, such problems could be avoided. This is what Nathan means by “immutable” (unchangeable). In this conference and other places, people talk about the “denormalization of data.” This is the same concept, a departure from “create, read, update, and delete (CRUD)” and a move to “create and read” without “update or delete.” In other words, in the new world of NoSQL, we do not update or delete data as they come in. We simply append them to the existing data storage.
If we capture all the versions of all the data, we can recompute with all these data and their versions at any time. But it would be impossibly slow and inconvenient to recompute with a vast amount of data accumulated in the database. The Lambda Architecture solves this problem by inserting precomputed views between data and application, as shown below.
This is what Nathan meant by #2 and #5 above. Finally, although he did not cover it in this talk, the following is a good figure to tie both real-time and batch analysis together.
Lambda Architecture: from Nathan Marz’s presentation, Runaway complexity in Big Data and a plan to stop it.
These are the descriptions for the software packages mentioned above:
· Kafka: publish-subscribe messaging rethought as a distributed commit log
· Storm: a free and open-source distributed real-time computation system
· Hadoop: provides open-source software for reliable, scalable, distributed computing
· Cassandra: a key-value NoSQL database
· ElephantDB: a database that specializes in exporting key/value data from Hadoop
· Voldemort: a distributed key-value storage system
Data produced from various sources in smart grid vary largely in velocity. Some are near real-time, like power demand/supply data, and others may stay static, like equipment configurations along the grid. With the last figure, we can capture all the relevant data in consideration when we perform analytics.