I have heard James Urquhart speak several times, but the talk he gave at the recent CloudScale, hosted by Plug and Play Tech, was by far the best and most interesting. He has obviously given this talk a few times elsewhere. CloudScale had a good assortment of presentations and discussions, although each session was very short. James’s talk was only 15 minutes, but it stuck in my mind. You can find a longer version of his slides, used at another event, here.
There have been a lot of talks on cloud computing: what it is, what kinds exist, and how it changes the way IT is delivered. Virtualization, security, automation, provisioning, tools, and other details have been discussed many times by many people. But his talk was the first I’ve heard that specifically covered how to design the cloud as a whole system.
If I may oversimplify his point:
- Cloud is an interconnected complex adaptive system.
- The interconnected system as a whole may collapse, even if we carefully design and operate each component of the system.
- It is vital to build resilience into the design and operation of such a system.
He gave the example of a market crash known as the Flash Crash, which occurred on May 6, 2010. I myself did not remember it, and neither did most of the audience. An automated trading system executed a sell program extremely rapidly, in just 20 minutes. In response, the market became more active and more orders poured in. The system spun up more threads to process these orders, until it finally got out of hand and crashed. With this story, James introduced the term complex adaptive system.
It is defined as:
where emergent behavior or emergence is defined as:
Because each component of a complex system is relatively simple and easy to control, we tend to think we can control the system as a whole. But as the discussion above shows, that is not the case. If there is no way to predict future behavior, the only thing to do is to accept that failures will happen and prepare for them. And because there is no way of knowing how the system may fail, one approach is to deliberately inject a large number of random failures to test our systems. James showed one such tool that causes random chaos, Netflix’s Chaos Monkey.
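To make the idea of random failure injection a little more concrete, here is a minimal toy sketch in Python. It is not Netflix’s Chaos Monkey and does not use its API; the names `Service`, `chaos_round`, `handle_request`, and `run_experiment`, along with the 30% kill probability, are all my own assumptions for illustration. It simply knocks out replicas at random and checks whether a request can still be served.

```python
import random


class Service:
    """A hypothetical service replica that is either healthy or failed."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


def handle_request(replicas):
    """Return the name of the first healthy replica, or None if all are down."""
    for replica in replicas:
        if replica.healthy:
            return replica.name
    return None


def chaos_round(replicas, rng, kill_probability=0.3):
    """Randomly fail healthy replicas; return the names of those killed."""
    killed = []
    for replica in replicas:
        if replica.healthy and rng.random() < kill_probability:
            replica.healthy = False
            killed.append(replica.name)
    return killed


def run_experiment(num_replicas=3, rounds=5, seed=0):
    """Inject random failures each round and record who served the request."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    replicas = [Service(f"replica-{i}") for i in range(num_replicas)]
    results = []
    for _ in range(rounds):
        killed = chaos_round(replicas, rng)
        served_by = handle_request(replicas)
        results.append((killed, served_by))
        # Assumed recovery step: restart every replica before the next round.
        for replica in replicas:
            replica.healthy = True
    return results


if __name__ == "__main__":
    for killed, served_by in run_experiment():
        print(f"killed={killed} served_by={served_by}")
```

A round in which `served_by` is `None` means every replica was lost at once. In a real system, that is the kind of weakness such a test is meant to expose before an actual outage does.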
I usually relate the subject of this blog to green IT or energy. That’s hard to do in this case. One thing I can say is this: if a complex adaptive system fails and starts to behave erratically, that surely wastes energy, because the behavior does not produce any meaningful outcome. But I need to think further about what this means for green IT.
If you want to know more about it, you can read some of his blog posts.
James recommended reading Drift into Failure by Sidney Dekker of Griffith University, Australia, and other books, shown below, to further understand this.