At the recent Future Facilities (FF) 6SigmaDC conference, Chuck Rego, chief architect at Intel, delivered a keynote speech. Intel manages its data centers with a mix of homegrown DCIM tools and third-party products, including FF's.
Chuck opened his talk with stranded capacity: capacity that IT cannot use because of a data center's configuration and layout. When we design a data center, we provision enough power and cooling to meet projected IT needs, but depending on how facilities and IT equipment are deployed, we may not be able to use all of the capacity allocated. Chuck said he wanted to get a handle on stranded capacity by measuring and quantifying it.
The two most important things in data center operations are reliability and utilization. When we discuss data center energy efficiency, the most widely used metric is PUE (power usage effectiveness). Chuck was one of the original five people who discussed the definition of PUE, even before The Green Grid (TGG) defined it officially. What is missing from the current PUE is load information, and TGG has been working to incorporate IT utilization and other data to improve it. Another metric, CADE (corporate average data center efficiency), suggested jointly by the Uptime Institute and McKinsey, also considers the utilization of IT and facilities equipment in its definition. However, I am afraid it has not caught on with the majority of data center operators; PUE remains the dominant energy-efficiency metric for data centers.
Chuck wanted to find out how utilization affects PUE. He set up a model that assumes an average of 8 kW/rack and a peak of 12 kW/rack. Under these assumptions, the design PUE works out to a fairly low 2.0. But does that level of PUE hold when the operating pattern changes? If IT utilization is only 20%, what he calls the actual operating PUE climbs to 5.7, because the supporting mechanical and electrical equipment was sized for much higher loads. The point is that how you operate your data center can have a big impact on its actual efficiency, even if it was designed to be energy efficient at average utilization.
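Chuck's two figures can be reproduced with a simple model in which most of the facility overhead is fixed regardless of IT load. The 93% fixed split below is my own illustrative assumption, chosen so the numbers line up with his, not Intel's actual model:

```python
def actual_operating_pue(design_pue, utilization, fixed_fraction=0.93):
    """Estimate operating PUE when IT runs below its design load.

    Assumes the facility overhead at design load splits into a fixed
    part (chillers, fans, UPS losses that barely track load) and a
    part proportional to IT load. IT load is normalized to 1.0 at
    full utilization. All coefficients are illustrative assumptions.
    """
    design_overhead = design_pue - 1.0
    fixed = design_overhead * fixed_fraction
    variable = design_overhead * (1.0 - fixed_fraction) * utilization
    it_load = utilization
    return (it_load + fixed + variable) / it_load

# At full load the model returns the design PUE of 2.0; at 20%
# utilization it returns roughly 5.7, matching Chuck's example.
print(actual_operating_pue(2.0, 1.0))
print(actual_operating_pue(2.0, 0.2))
```

The key driver is the fixed fraction: the less the supporting infrastructure throttles down with load, the faster the operating PUE deteriorates at low utilization.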
Hassan Moezzi, director of FF, said that there is a disconnect between the operation of the data center as a whole on the one hand and server design, rack configuration, and layout on the other. Most IT folks, including me, do not know or care how each server is built; we are not going to open up a chassis and carefully review the components. Yet according to Chuck, factors like the following can make a 10% difference in energy efficiency:
- Shadowed or unshadowed processors (the relative positions of multiple CPUs affect how efficiently each CPU is cooled)
- Processor efficiency based on different levels of workloads
- Fan speed control
- Heat sinks
Even at 21°C, these factors affect efficiency, and under ASHRAE's raised temperature and humidity setting of 27°C, the difference would be much greater.
Chuck conducted experiments to find out what impact airflow has on data center operations, and he learned two things. The first is the importance of finding the optimal location to measure temperature. Traditionally, it is measured at the return points of each CRAC unit. His experiments indicated that temperature control should instead be driven from the supply points (the inlets to servers): airflow at the return points can be complex, so readings there may not accurately reflect the cooling the server loads actually require. In his experiments, the temperature oscillated widely at the return points, while the supply temperature stayed nearly constant. As the temperature is raised from 21°C to 27°C, this trend would be amplified.
The second finding concerned running cooling at a higher temperature. At a higher set point, cooling needs are relaxed, but the IT side may consume more power through higher fan speeds and silicon leakage (at a higher temperature, CPUs tend to draw more power). So the facilities-side gain and the IT-side loss must be carefully weighed against each other, and reliability and performance should not be compromised in raising the temperature. Chuck's experiment ran 900 servers for 10 months at several temperatures between 21°C and 35°C, and he observed no performance degradation or visible failures at all. This is quite impressive, with real data to back up the result.
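The balance described here can be sketched as a back-of-the-envelope model. Every coefficient below is an illustrative assumption of mine (not a number from Chuck's experiment): roughly 3% cooling energy saved per °C raised, fans at about 10% of IT power with power scaling as the cube of fan speed, and a small per-degree leakage increase:

```python
def net_saving_kw(delta_t_c, cooling_kw, it_kw,
                  chiller_gain_per_c=0.03,    # assumed cooling energy saved per degree C raised
                  fan_speed_rise_per_c=0.02,  # assumed fan speed increase per degree C
                  leakage_rise_per_c=0.005):  # assumed silicon leakage growth per degree C
    """Facility-side cooling savings minus IT-side fan and leakage penalties."""
    cooling_saving = cooling_kw * chiller_gain_per_c * delta_t_c
    # Fan power scales roughly with the cube of fan speed.
    fan_power_kw = 0.10 * it_kw
    fan_penalty = fan_power_kw * ((1 + fan_speed_rise_per_c * delta_t_c) ** 3 - 1)
    leakage_penalty = it_kw * leakage_rise_per_c * delta_t_c
    return cooling_saving - (fan_penalty + leakage_penalty)

# Raising the set point from 21°C to 27°C (delta of 6°C) in a room with
# 400 kW of cooling power and 1,000 kW of IT load:
print(net_saving_kw(6, 400, 1000))
```

With these assumed coefficients the net saving is barely positive, which is exactly why the gain by facilities and the loss by IT have to be weighed case by case rather than assumed.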
Chuck then talked about the placement of sensors. If we want useful data from each server, we need a sensor on each server; in a big data center, with servers numbering in the tens of thousands, attaching an external sensor to every one is not reasonable. He then described smart servers, which come with embedded sensors. The measurement of relevant information, such as temperature, is done beneath the OS, so it works with either Linux or Windows.
Moreover, cooling has traditionally been static, unchanged even as loads vary. But loads change dynamically, and cooling should change accordingly; otherwise some cooling capacity is wasted. When IT moves virtual machines (VMs) from one server to another, the loading factor of each server changes, and the cooling requirements change with it. Power and cooling provisioning could also be adjusted if more accurate loading and operating data were available. Intel has a prototype that feeds data back into the dynamically changing server environment and puts some servers to sleep to optimize the energy efficiency of the whole data center. The last time I talked to PowerAssure, their product had such a feature and worked with Intel.
Sherman Ikemoto of FF said that what ultimately decides energy efficiency for a data center is data from servers, not from facilities equipment. I was somewhat skeptical about that, but after Chuck's presentation I am more convinced of his opinion. Maybe we have been tackling the symptoms of the problem rather than its root cause: the problem is, in Sherman's phrase, "damage done by IT," and we were not dealing with the real issue of controlling IT equipment. Some time ago, Emerson issued a white paper on Energy Logic and claimed the following about power savings at the server level:
1 watt savings at the server-component level creates a reduction in facility energy consumption of approximately 2.84 watts
Although the white paper was making essentially the same point, it was not positioned to emphasize what Sherman and Chuck stressed. By changing our mindset, we may make real progress in improving data center energy efficiency.
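Emerson's 2.84 figure reflects a cascade: a watt saved at the server component never has to pass through the upstream power conversion chain, and its heat never has to be removed by the cooling plant. The sketch below shows the shape of that calculation; the efficiencies and the cooling ratio are illustrative values I chose so the cascade lands near 2.84, not the stage numbers from the Energy Logic white paper:

```python
def facility_saving_per_component_watt(psu_eff=0.85,   # assumed server power supply efficiency
                                       ups_eff=0.92,   # assumed UPS efficiency
                                       pdu_eff=0.98,   # assumed power distribution efficiency
                                       cooling_w_per_w=1.18):  # assumed cooling watts per watt of heat
    """Facility watts avoided for every watt saved at the server-component level."""
    # One watt not drawn by the component means conversion losses upstream
    # are avoided too, so more than one watt disappears at the utility feed.
    utility_watts = 1.0 / (psu_eff * ups_eff * pdu_eff)
    # Every watt that is not consumed is also heat that never needs removing.
    return utility_watts * (1.0 + cooling_w_per_w)

# With these assumed stage values, the multiplier comes out close to
# Emerson's quoted 2.84 watts of facility savings per component watt.
print(facility_saving_per_component_watt())
```

The exact multiplier depends on the facility, but the structure explains why attacking consumption at the server level is so leveraged: every downstream inefficiency compounds the saving.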