The true benefits of the cloud come down to distributed computing. World of Warcraft is a networked game on a distributed system. In “Lord of the Rings,” the filmmakers simulated entire battles using a distributed computer system. With ample resources, you can use a data center to download a chunk of the web, then process and analyze it.
It’s that last point that makes Amazon’s announcement so significant. Amazon announced that it has adopted the Hadoop framework, running it on the Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). What the Amazon service means is that you no longer need a data center of your own to do data mining.
Let’s first look at Amazon’s overall service before we get into Hadoop and, in particular, Cloudera, which also uses the Hadoop framework and now appears to have Amazon as its main competition. We will also share some screencasts and videos we found about Cloudera and the Hadoop framework.
Dana Gardner calls it a game changer for the business intelligence community:
Given the intriguing price points Amazon is providing, this service could be a game-changer. It will likely force other cloud providers to follow suit, which will make advanced BI services more available and affordable for more kinds of tasks. I can even imagine communities of similarly interested user parties sharing query formulations and search templates of myriad investigations. A whole third-party BI consulting and services industry could crop up virtually overnight.
The difference with the Amazon service is the ease of use. A bit more from Amazon:
Amazon Elastic MapReduce automatically spins up a Hadoop implementation of the MapReduce framework on Amazon EC2 instances, sub-dividing the data in a job flow into smaller chunks so that they can be processed (the “map” function) in parallel, and eventually recombining the processed data into the final solution (the “reduce” function). Amazon S3 serves as the source for the data being analyzed, and as the output destination for the end results.
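To make the map and reduce steps described above more concrete, here is a minimal single-machine sketch in Python. It is not Amazon’s or Hadoop’s code. It only imitates the flow of splitting data into chunks, mapping each chunk, grouping by key and reducing, using a word count as the job. The function names and the sample data are assumptions made up for illustration.

```python
from collections import defaultdict


def split_into_chunks(records, chunk_size):
    """Sub-divide the input into smaller chunks, as a job flow would."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """The "map" step: emit a (word, 1) pair for every word in the chunk."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(mapped_chunks):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """The "reduce" step: recombine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # On a real cluster each chunk would be mapped on a different machine;
    # here the chunks are simply processed one after another.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = ["the cloud", "the web", "cloud"]  # hypothetical input data
    print(word_count(sample))
```

Because each chunk is mapped independently, the map step is the part that a cluster can spread across many machines at once.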
To use Amazon Elastic MapReduce, you simply:
* Develop your data processing application authored in your choice of Java, Ruby, Perl, Python, PHP, R, or C++. There are several code samples available in the Getting Started Guide that will help you get up and running quickly.
* Upload your data and your processing application into Amazon S3. Amazon S3 provides reliable, scalable, easy-to-use storage for your input and output data.
* Log in to the AWS Management Console to start an Amazon Elastic MapReduce “job flow.” Simply choose the number and type of Amazon EC2 instances you want, specify the location of your data and/or application on Amazon S3, and then click the “Create Job Flow” button. Alternatively you can start a job flow by specifying the same information mentioned above via our Command Line Tools or APIs.
* Monitor the progress of your job flow(s) directly from the AWS Management Console, Command Line Tools or APIs. And, after the job flow is done, retrieve the output from Amazon S3.
* Pay only for the resources that you actually consume. Amazon Elastic MapReduce monitors your job flow, and unless you specify otherwise, shuts down your Amazon EC2 instances after the job completes.
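As a rough idea of what the “data processing application” in the first step might look like in Python, here is a sketch in the style of Hadoop Streaming, where a mapper and a reducer read lines and write tab-separated key/value lines. Whether a given job flow invokes a script exactly this way is an assumption, and the script layout, the function names and the `map`/`reduce` arguments are hypothetical.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" line for each word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word. Input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Imitate the framework on one machine: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Hypothetical convention: "map" or "reduce" selects the role and the
    # data arrives on standard input; with no argument, run a small demo.
    role = sys.argv[1] if len(sys.argv) > 1 else "demo"
    if role == "map":
        output = mapper(sys.stdin)
    elif role == "reduce":
        output = reducer(sys.stdin)
    else:
        output = run_locally(["the cloud", "the web"])
    for out_line in output:
        print(out_line)
```

Testing the mapper and reducer locally with `run_locally` on a small sample is a cheap way to catch bugs before uploading anything to Amazon S3 and paying for instances.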
What is Hadoop?
Hadoop is open-source software for scalable, distributed computing. The companies that use Hadoop read like a who’s who of the industry. They include Yahoo, Google and Facebook, which uses Hadoop “to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.”
Cloudera offers the service most similar to Amazon’s data mining offering. Cloudera installs, configures and runs Hadoop for large-scale data processing and analysis. Here’s a video about their service:
Here are some more videos and information about Hadoop, along with discussions about the cloud and distributed computing.
Configuration for Hadoop:
Here’s a conversation about Hadoop, MapReduce and big data sets between Sohrab Modi, VP, Chief Technology Office, Sun Microsystems, and Chris Wensel, founder of Concurrent, Inc. This is part one of a two-part series:
The Google channel on YouTube includes this lecture on cluster computing and MapReduce:
Finally, here’s a podcast that Dana Gardner moderated about MapReduce analytics and why they are a game changer for the Business Intelligence community.