Judging by the discussions at the NoSQL Now conference and the way they are usually described, Storm, Spark, and Samza solve similar problems in different ways. This article discusses the differences among the three.
It is nice to see so many open source packages readily available in the analytics space, but it is also difficult to keep up with the pace of new introductions and the details of each.
Differences among Storm, Spark, and Samza in a table format
Source: Streaming Big Data: Storm, Spark and Samza by Toni Siciliani, https://tsicilian.wordpress.com/2015/02/16/streaming-big-data-storm-spark-and-samza/
Siciliani said that there are no hard rules, only a few general guidelines. He described use cases for each package as follows.
All: well suited to efficiently process continuous, massive amounts of real-time data.
| Package | Use case | Used by companies |
| --- | --- | --- |
| Storm | high-speed event processing system that allows for incremental computations | |
| Spark | must have stateful computations and exactly-once delivery; doesn't mind higher latency | Yahoo!, NASA JPL, eBay, Baidu |
| Samza | has a large amount of state to work with | LinkedIn, Yahoo!, Metamarkets |
Table created by the author from descriptions by Siciliani
What about Flink?
Well, what about Apache Flink? Siciliani’s article was published in February 2015, and Apache Flink, which became an Apache top-level project only in January 2015, was still very new. That was probably why it was not included in the comparison.
Some see Flink as a competitor to Spark in this space as a MapReduce replacement (see the article by Ian Pointer titled Apache Flink: New Hadoop contender squares off against Spark). I do not think I am qualified to add Flink to the comparison table above, but let me do the best I can to add some information that might help the comparison.
Pointer’s article said that Spark and Flink have similar syntax but pointed out one difference. Of Spark, he wrote:
Instead of being a pure stream-processing engine, it is in fact a fast-batch operation working on a small part of incoming data during a unit of time (known in Spark documentation as “micro-batching”). For many applications, this is not an issue, but where low latency is required (such as financial systems and real-time ad auctions) every millisecond lost can lead to monetary consequences.
Flink flips this on its head. Whereas Spark is a batch processing framework that can approximate stream processing, Flink is primarily a stream processing framework that can look like a batch processor. Immediately you get the benefit of being able to use the same algorithms in both streaming and batch modes (exactly as you do in Spark), but you no longer have to turn to a technology like Apache Storm if you require low-latency responsiveness. You get all you need in one framework, without the overhead of programming and maintaining a separate cluster with a different API.
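The latency difference Pointer describes can be illustrated with a toy simulation. This is not Spark or Flink code; the function names and the doubling transformation are made up purely to contrast the two models: buffering events into small batches versus transforming each event as it arrives.

```python
# Toy contrast of micro-batching vs. per-event stream processing.
# All names are illustrative; no Spark or Flink APIs are involved.

def micro_batch_process(events, batch_size=3):
    """Spark-style sketch: buffer events and emit results one batch at a time.

    The first event in each batch waits until the whole batch fills,
    which is the source of the extra latency discussed above.
    """
    results, batch = [], []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            results.append([x * 2 for x in batch])
            batch = []
    if batch:  # flush the final partial batch
        results.append([x * 2 for x in batch])
    return results

def per_event_process(events):
    """Flink-style sketch: each event is transformed as soon as it arrives."""
    return [e * 2 for e in events]

events = [1, 2, 3, 4, 5]
print(micro_batch_process(events))  # [[2, 4, 6], [8, 10]]
print(per_event_process(events))    # [2, 4, 6, 8, 10]
```

Both paths compute the same answers; the difference is when each answer becomes available, which is exactly the millisecond-level concern Pointer raises for financial systems and ad auctions.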
Pointer’s article includes more details, and I suggest reading it for a better understanding of Flink vs. Spark.
Now, what about comparing Flink with the rest of the packages? In a blog post from data Artisans titled Real-time stream processing: The next step for Apache Flink, Kostas Tzoumas and Stephan Ewen elaborated on Flink’s details and gave some comparisons. After more reading and research, I will try to put Flink into the tables above in a future blog post.
Lambda vs. Kappa architectures
I heard Nathan Marz present at a NoSQL Now conference a few years ago and have followed the Lambda architecture since. I read Marz’s book and found the Lambda architecture very powerful and useful.
Tzoumas and Ewen claimed that the stream processor in the Lambda architecture, Storm, is not powerful enough to handle both streaming and batch workloads, whereas in the Kappa architecture the stream processor, Flink, can handle both equally well.
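The structural difference between the two architectures can be sketched with toy code. This is only a conceptual illustration under my own assumptions (a shared event log and a trivial sum as the "view"); it is not how either architecture is actually implemented.

```python
# Toy illustration of Lambda vs. Kappa, assuming a shared event log
# and a running sum as the computed view. Names are illustrative.

events = [("user", 1), ("user", 2), ("user", 3)]

# Lambda: a batch layer periodically recomputes a full view from older
# data, while a speed layer keeps an incremental view of recent events;
# queries merge the two views. Two code paths, two systems to maintain.
batch_view = sum(v for _, v in events[:2])   # older, fully reprocessed data
speed_view = sum(v for _, v in events[2:])   # recent events, incremental
lambda_answer = batch_view + speed_view

# Kappa: a single stream processor folds over the entire log; to
# "re-batch", you simply replay the log through the same code path.
kappa_answer = 0
for _, v in events:
    kappa_answer += v

print(lambda_answer, kappa_answer)  # 6 6
```

The answers agree; Kappa's argument, as Tzoumas and Ewen frame it, is that one sufficiently powerful stream processor removes the need to maintain the second (batch) code path at all.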
They further stated that Flink includes a dedicated batch framework, which is required for the following reasons:
- the optimizer is applicable only to batch programs, as query optimization is very different for batch and stream use cases.
- the behavior of batch and streaming is different for moving intermediate data and state out-of-core. When memory runs out, operators in batch programs should spill to disk. In stream programs, spilling to disk is one option, the others being load shedding and scale-out.
- fault tolerance is somewhat easier in batch programs, as finite streams (e.g., static files) can be replayed completely, which enables an optimized code path for batch recovery.
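The second point above — that a stream program under memory pressure can spill to disk or shed load — can be sketched with a toy bounded buffer. The class and policy names here are invented for illustration; a plain Python list stands in for on-disk storage.

```python
from collections import deque

# Toy sketch of two responses to memory pressure in an operator:
# "spill" moves overflow out of memory (batch-style), while "shed"
# drops new events (stream-style load shedding). Illustrative only.

class BoundedBuffer:
    def __init__(self, capacity, policy="shed"):
        self.capacity = capacity
        self.policy = policy
        self.buffer = deque()   # in-memory working set
        self.spilled = []       # stand-in for on-disk storage
        self.dropped = 0        # count of shed events

    def offer(self, event):
        if len(self.buffer) < self.capacity:
            self.buffer.append(event)
        elif self.policy == "spill":
            self.spilled.append(event)   # keep the event, but out of memory
        else:
            self.dropped += 1            # shed load: the event is lost

shed = BoundedBuffer(capacity=3, policy="shed")
spill = BoundedBuffer(capacity=3, policy="spill")
for e in range(5):
    shed.offer(e)
    spill.offer(e)
print(len(shed.buffer), shed.dropped)    # 3 2
print(len(spill.buffer), spill.spilled)  # 3 [3, 4]
```

The trade-off is visible even in this sketch: spilling preserves every event at the cost of slower secondary storage, while shedding keeps latency low at the cost of completeness — which is why, as the list above notes, the right choice differs between batch and stream programs.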
I am not sure about you, but I am thoroughly confused by all of these packages. I will continue my research and publish some blogs on this very subject.