A Secret to Data Science Revealed

Analytics is everywhere; at least, people talk and write about it constantly in conjunction with Big Data. But analytics is almost always discussed at a very high level, omitting the details. Discussions tend to stay at the level of what it can do, not how to implement it. It is as if a secret society of data scientists did not want to reveal the specifics to the rest of us.

Data science/engineering is an emerging field, and it seems that only a select few have a good understanding of what it is and how to implement and apply it. I have attended several Big Data conferences and read white papers on the subject. Many people who talk about Big Data emphasize the importance of analytics and, therefore, of data science/engineering. However, their coverage of analytics stops there. I understand that no two analytics applications are the same, and it is not easy to discuss them in general terms. But those of us who want to know more and want to jump in need more details.

So when I had a chance at the recent Tiecon conference to talk to Jim Kaskade, CEO of Infochimps and a well-known expert in analytics and data science, I wanted to ask what concrete steps I could take to become a data scientist.


Jim Kaskade

But first, let me summarize my chat with him.

Target segments

It would have been impolite to corner him right away and demand that he reveal the secret society, so I asked him about his company first. Infochimps provides a Big Data/analytics cloud service for the enterprise market. I wanted to know his precise definition of the enterprise market. He told me that the company targets Fortune 1000 companies in North America with more than $1B in sales and international reach. Then he drew me a figure to show how he views the current enterprise IT climate. (I rendered his handwriting into type here.) The figure turned out to describe very well what is happening in the world of IT from the enterprise point of view.




Jim showed four layers surrounding corporate IT. The innermost layer (I call it layer 0 for reference) is core IT itself. The next layer (layer 1) is the set of business units that might form shadow IT, bypassing corporate IT to meet their own IT needs, such as access to outside cloud services and BYOD. The next layer (layer 2) is outsourced data centers; it is a recent trend, and far more economical, to outsource data center operations rather than to own your own data centers. At this layer you can implement virtual private clouds. The outermost layer (layer 3) is the public cloud. Depending on the sensitivity of the data, applications, and services, it can make sense to use virtual private or public clouds rather than physical data centers under IT's control.

The market opportunity for Infochimps is 30% in layers 0 and 1 and 70% in layers 2 and 3, although the total revenue opportunity is highest at layers 0 and 1, medium at layer 2, and lowest at layer 3. Although the highest growth rate is in the public cloud (50%), it represents the smallest market opportunity in absolute terms and typically involves only non-critical applications (with no data sensitivities or security concerns). The growth rate in layer 2 (35%) is very attractive and addresses security concerns, presenting the most interesting target for Infochimps. Growth at layers 0 and 1 is the slowest (8%), but that is where the largest absolute spend exists, along with the fiercest competition from incumbent enterprise software vendors. Note that layer 0 is very hard to penetrate because it is the core of the enterprise. For this reason, Infochimps primarily targets layers 1 and 2, always working with business users to define specific use cases for Big Data. Both layers 0 and 1 have high security requirements, while layer 2's are medium and layer 3's are lower. Layer 2 is Infochimps' blue ocean, where competition is still relatively low but the market has good growth potential.

What they do

Infochimps develops and markets a suite of cloud services that provide three different types of analytics: real-time analytics, ad hoc analytics, and batch analytics, each a separate but well-integrated cloud service. Big Data analytics started with Hadoop, which is primarily for batch processing. But Hadoop isn't enough: it is also necessary to process real-time data, as in stock trading and power metering applications, as it arrives. Stream processing is used for real-time analytics, Hadoop is used for batch analytics, and NoSQL databases are used for ad hoc queries across a set of heterogeneous data stores (for text and big tables). Stream analytics is increasingly important, so I asked how many bytes of data per second this type can handle. Jim said that in theory, with distributed processing, parallel computing, and elastic cloud services, the sky is the limit; all that is necessary is to increase parallelism with more processing nodes. Let's look at the architecture.
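To make the batch case concrete, here is a minimal sketch (my own illustration, not Infochimps code) of the map/reduce pattern that Hadoop runs at massive scale, written as plain Python over a small in-memory "batch" of lines:

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit (word, 1) pairs, like a Hadoop mapper."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts per key, like a Hadoop reducer."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    batch = ["big data is big", "data science"]
    # → {'big': 2, 'data': 2, 'is': 1, 'science': 1}
    print(reduce_phase(map_phase(batch)))
```

Hadoop's contribution is not the logic, which is trivial here, but running the map and reduce steps in parallel across many nodes and huge files; stream processors apply a similar idea to data that never stops arriving.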


An overview

They need a set of interfaces to cope with collecting data from many different sources, in different formats or with no format at all. They have prepared the most often-used data-collection interfaces, as shown in the following figure. Depending on the company, a new interface may need to be developed; there is a provision to incorporate that into the system. As discussed above, there are three levels of analytics services: streaming (real-time), ad hoc, and batch.

Analytics packages

Infochimps provides a platform through which analytics packages can access data, but it does not yet provide the analytics packages themselves. Jim wants to develop starter kits of analytics packages for specific use cases within several target industries in the near future. That is quite understandable: each analytics application is unique to a particular company in a particular industry segment. To support a specific case, Infochimps needs to develop or acquire industry-specific expertise, and that takes partners, time, and money.

What they do not do

I was wondering how they work with clients, which differ greatly from one another. Businesses in different industry sectors require different information, and to give them what they want, you need expert knowledge of each particular vertical market. Also, a client does not necessarily have all the data needed to derive useful information for its business and operations. Without such expertise, you cannot tell which data need to be collected to produce the necessary information. If some important data are missing, the client needs to be advised to collect them. On top of that, the client may need to acquire additional hardware and software to collect new types of data. Does Infochimps do all of that? Jim showed me the high-level steps:






Before they talk about what information is needed, they talk about the business problem (through a process they refer to as “Business Discovery”), which is necessary to understand what the client does, how it is done, and the problem the client is trying to solve. Following the business discussion, the necessary information is discussed as it applies to the specific problem. Finally, an architecture reflecting the two discussions is decided on. Such tasks are given to emerging Big Data system integrators, such as ThinkBig Analytics, Zaloni, or Cloudwick, or to larger established SIs such as CSC, Wipro, Capgemini, and Accenture. This is actually a good way to form a strong ecosystem: those big and small system integrators bring expert knowledge to their vertical customers and give them sound advice about data collection and any additional computing equipment that may be necessary.

You can become a data scientist, too

Jim moderated a session with that title just before our meeting. The session was only 15 minutes but very informative. D.J. Patil, data scientist in residence at Greylock Partners and formerly Head of Data Products, Chief Scientist, and Chief Security Officer at LinkedIn, discussed some traits of a good data scientist. I plan to write about the session in a future blog. OK, that is fine, but it does not help me grasp the area in more detail.


D.J. Patil (left) and Jim Kaskade (right)

Give me more! So I asked Jim. His first answer was similar to D.J.’s: become a good coder, be curious, and ask a lot of questions. I did not let the opportunity to learn more slip by, so I gently but persistently pressed him for details. Here are the concrete steps he gave me for mastering data science:

· Download and install Hadoop and Storm (use Ironfan to orchestrate them).

· Use and understand MapReduce, Pig, Hive, Wukong, and Trident.

· Pick up a scripting language like Python, Ruby, or PHP.

· Master a simple selection of four analytics algorithms: Naïve Bayes, logistic regression, linear regression, and hierarchical clustering.
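To show what mastering that last bullet might look like, here is a minimal sketch of one of the four algorithms, simple one-variable linear regression fit by ordinary least squares, in plain Python (my own illustration; any statistics library would do this for you):

```python
def linear_regression(xs, ys):
    """Fit y = a*x + b by ordinary least squares (single feature).

    Slope a = cov(x, y) / var(x); intercept b = mean(y) - a * mean(x).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

if __name__ == "__main__":
    # Perfectly linear toy data: y = 2x
    print(linear_regression([1, 2, 3, 4], [2, 4, 6, 8]))  # → (2.0, 0.0)
```

Working through each of the four algorithms by hand like this, before reaching for a library, is the kind of exercise the advice above seems to point at.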

Because most of these tools are available as open source and I have a desktop running Ubuntu Linux, I have no more excuses or complaints. Of course, even though concrete steps were given, this is just a first step. I hope that after taking it, a new horizon opens up before me toward Zen the data scientist.

Applications to the power grid

See my previous blog for those. Now that I have a little more detail, I can appreciate those analytics more.

Zen Kishimoto

About Zen Kishimoto

Seasoned research and technology executive with expertise across functions, including roles as analyst, writer, CTO, VP of Engineering, general manager, and in sales and marketing, in diverse high-tech and cleantech industry segments, including software, mobile embedded systems, Web technologies, and networking. Current focus and expertise are in applying IT to energy, such as smart grid, green IT, building/data center energy efficiency, and cloud computing.

