New Paradigm and Thinking Required for Massively Distributed, Large, and Complex Systems – Part 2

The second keynote speech I attended at NoSQL Now 2013 was by Adrian Cockcroft of Netflix, titled Managing Scale and Complexity at Netflix. He covered a different aspect of massively distributed, large, and complex systems from the one Nathan Marz addressed. Between the two talks, I gained a better understanding of such systems.

In sum, Adrian talked about how to make a distributed, large, and very complex system like Netflix’s scalable and reliable despite occasional, inevitable, and unforeseen breakdowns. Because such breakdowns will happen, the system must be designed and implemented for redundancy and fault tolerance. To make sure it stays operational in spite of unforeseen problems, you should constantly test it by intentionally disrupting it. The theme is similar to what Nathan Marz discussed and what James Urquhart talked about. Adrian’s talk was not uploaded to SlideShare, but he has a similar presentation here that is quite useful.

Adrian Cockcroft

OODA loop

Adrian first showed how an enterprise can get to market quickly enough to stay competitive. The traditional organizational structure is too cumbersome to keep up with the “quick-to-the-market” trend. He explained a new approach based on the OODA loop, originally applied to fighter jet dogfights. As shown in the next picture, the loop consists of “observe,” “orient,” “decide,” and “act,” and those steps are associated with “innovation,” “Big Data,” “culture,” and “cloud,” respectively, in the enterprise environment.

OODA loop as applied to the enterprise environment

This is a closed loop: it runs from “observe” through “orient” and “decide” to “act,” and then returns to “observe.”

Rectangles associated with each circle:

  • Observe: measure customers, land-grab opportunities, competitive moves, customer pain points
  • Orient: research and analysis
  • Decide: plan response, get buy-in, and commit resources
  • Act: engage customers, deliver, and implement

Each point is self-explanatory.

What enterprises should do to cope with the new trend

Enterprises need to get to market faster. To do so, Adrian identified four transitions they need to make in their organization (the Netflix organization is shown on page 14 of the presentation):

  • Integrate business, development, and operations into one BusDevOps
  • Denormalize data using NoSQL
  • Transition responsibility from Ops to Dev for quick continuous delivery
  • Transition responsibility from Ops to Dev using cloud

How to design and operate large, distributed, and complex systems

Adrian said that no matter how you design and operate a distributed, complex system, failures are inevitable. The only way to make the system robust is to intentionally disrupt its operation from time to time, learn from the results, and make changes accordingly. Netflix’s streaming services are hosted on Amazon’s cloud, which consists of multiple geographically distributed regions. Failures can therefore come not only from Netflix’s own system but also from the cloud underneath it, which is known to break occasionally. To test for both, he introduced tools such as Chaos Monkey and Chaos Gorilla.

Briefly:

Chaos Monkey is a tool that randomly disables Netflix’s production instances to make sure the service can survive this common type of failure without any customer impact.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone.

These tools are part of the Simian Army, which is a suite of tools for keeping your cloud operating in top form.
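The idea behind both tools can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not Netflix’s implementation or the Simian Army API: the `Instance` and `Cluster` classes, the zone names, and the rule that the service is “available” while at least one instance is healthy are all hypothetical and chosen for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A hypothetical service instance running in one availability zone."""
    name: str
    zone: str
    healthy: bool = True


@dataclass
class Cluster:
    """A toy model of a service deployed redundantly across zones."""
    instances: list = field(default_factory=list)

    def healthy_instances(self):
        return [i for i in self.instances if i.healthy]

    def is_available(self):
        # Assumption: the service is available while any instance is healthy.
        return len(self.healthy_instances()) > 0


def kill_random_instance(cluster, rng):
    """Chaos Monkey-style fault: disable one randomly chosen healthy instance."""
    candidates = cluster.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def kill_zone(cluster, zone):
    """Chaos Gorilla-style fault: disable every healthy instance in one zone."""
    victims = [i for i in cluster.instances if i.zone == zone and i.healthy]
    for inst in victims:
        inst.healthy = False
    return victims


def build_cluster(zones=("zone-a", "zone-b", "zone-c"), per_zone=2):
    """Build a cluster with `per_zone` instances per zone (names are made up)."""
    return Cluster([Instance(f"{z}-{n}", z) for z in zones for n in range(per_zone)])


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run is repeatable
    cluster = build_cluster()
    victim = kill_random_instance(cluster, rng)
    print("killed", victim.name, "-> available:", cluster.is_available())
    kill_zone(cluster, "zone-a")
    print("zone-a down -> available:", cluster.is_available())
```

In this toy model, the service survives both the loss of a single instance and the loss of a whole zone only because instances are spread across several zones. That is the kind of redundancy these tools are meant to verify in production.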

Netflix open source software

Like many others working in these emerging areas, Netflix has open-sourced its software. Adrian discussed Netflix OSS (Netflix Open Source Software), a cloud-native open source platform that is available from here. Why does Netflix make its own technologies open source? His answer was fourfold.

To:

  • Establish their solutions as best practices/standards
  • Hire, retain, and engage top engineers
  • Build up the Netflix technology brand
  • Benefit from a shared ecosystem

He then showed how they actually disrupt their systems, which I do not cover here, for brevity’s sake. Those who are interested can look at his slides. Pay special attention to pages 23 to 27 in the presentation.

Netflix streaming in overall Internet traffic

Another thing that caught my attention was how much Internet traffic has grown and which applications and utilities make it up. Adrian shared Internet traffic data (November 2012 vs. March 2013) provided by Sandvine. Traffic increased by 39% between those two periods. Netflix accounted for about 33% of all Internet traffic in both periods. In March 2013, Netflix and YouTube combined accounted for more than 50% of all Internet traffic. I knew that streaming makes up a large percentage of the traffic, but I did not know it was this much. This shows that Netflix has a large impact on Internet traffic as a whole.

References

Because the references given in the slides are not clickable, I have reproduced them here for your convenience.

· Code

· Technical blog

· Slides

Final comments

The power grid is also a massively distributed, large, and complex system. We need an approach like Netflix’s to manage it, and working out how to apply such an approach is going to be a challenge.

Zen Kishimoto

About Zen Kishimoto

Seasoned research and technology executive with various functional expertise, including roles in analyst, writer, CTO, VP Engineering, general management, sales, and marketing in diverse high-tech and cleantech industry segments, including software, mobile embedded systems, Web technologies, and networking. Current focus and expertise are in the area of the IT application to energy, such as smart grid, green IT, building/data center energy efficiency, and cloud computing.
