May 13, 2024 — Data centers need upgraded dashboards to guide their journey to greater energy efficiency, dashboards that show their progress running real-world applications. The formula for energy efficiency is simple: useful work done divided by energy used. Applying it to a data center, however, requires unpacking a few details.
The most widely used gauge today, Power Usage Effectiveness (PUE), compares the total energy a facility consumes to the amount its computing equipment actually uses. Over the past 17 years, PUE has brought the most efficient operators closer to the ideal of wasting almost no energy on processes such as power conversion and cooling.
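In code form, PUE is just that ratio. A minimal sketch in Python, using illustrative numbers rather than measurements from any real facility:

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy divided by the
    energy delivered to the computing equipment. 1.0 is the ideal;
    everything above it is overhead such as cooling and power conversion."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh

# A facility that draws 1,200 kWh while its IT gear consumes 1,000 kWh:
print(pue(1200.0, 1000.0))  # 1.2
```

A PUE of 1.2 means the facility spends 20% of its energy on overhead, which, as noted below, is roughly where today's best operators sit.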
Find the next metric
PUE has served data centers well during the rise of cloud computing and will remain useful in the future. But in today's era of generative AI, when workloads and the systems that run them are changing dramatically, PUE alone is not enough.
That's because PUE measures only the energy a data center consumes, not its useful output. It's like measuring the gasoline an engine burns without noting how far the car has traveled. And many standards already exist for data center efficiency: a 2017 paper lists nearly 30 of them, some focused on specific goals such as cooling, water use, security, and cost.
Understand what a watt is
When it comes to energy efficiency, the computer industry has a long and somewhat unfortunate history of describing systems and the processors they use in terms of power, usually measured in watts. Power is a valid metric, but few realize that watts measure only the input power at a point in time, not the actual energy a computer consumes or how efficiently it uses that energy.
So when modern systems and processors report rising input power levels in watts, that doesn't mean they are becoming less energy efficient. In fact, they often do far more work with the energy they consume. Modern data center metrics should focus on energy, known in engineering as kilowatt-hours or joules, and on how much useful work gets done with that energy.
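The distinction matters in practice. A short sketch of why a higher-wattage system can still consume less energy per job (the numbers are hypothetical, not measured systems):

```python
# Power (watts) is a rate at an instant; energy is power integrated over
# time, measured in kilowatt-hours or joules. A system drawing more watts
# can still be more efficient if it finishes the work sooner.

def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy consumed at a constant power draw over a duration."""
    return power_watts * hours / 1000.0

KWH_TO_JOULES = 3.6e6  # 1 kWh = 3.6 million joules

# Hypothetical: an older system runs a job at 400 W for 10 hours;
# a newer one draws 700 W but finishes the same job in 3 hours.
old_kwh = energy_kwh(400, 10)  # 4.0 kWh
new_kwh = energy_kwh(700, 3)   # 2.1 kWh: higher power, about half the energy
print(old_kwh, new_kwh, new_kwh * KWH_TO_JOULES)
```

Judged by its wattage alone, the newer system looks worse; judged by energy per job, it is nearly twice as efficient.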
Redefine what counts as work
There is also an industry habit of measuring work in abstract terms such as processor instructions or mathematical calculations. That's why MIPS (millions of instructions per second) and FLOPS (floating point operations per second) are so widely cited.
But only computer scientists care how well a system handles these low-level jobs. Users want to know how much useful work a system actually performs, though the definition of useful work is somewhat subjective.
AI-focused data centers may rely on the MLPerf benchmarks. Supercomputing centers working on scientific research typically use different workloads. Commercial data centers focused on streaming media may need yet another set.
The resulting suite of applications should evolve over time to reflect the state of the art and the most relevant use cases. For example, the latest MLPerf round added tests based on two generative AI models that didn't even exist five years ago.
Gauges for accelerated computing
Ideally, new benchmarks would measure advances in accelerated computing, the combination of parallel processing hardware, software, and methods that lets applications run dramatically faster and more efficiently than on CPUs for many modern workloads.
For example, in scientific applications, the National Energy Research Scientific Computing Center's Perlmutter supercomputer has demonstrated an average fivefold gain in energy efficiency using accelerated computing. That's one reason NVIDIA GPUs power 39 of the top 50 systems on the Green500 list of the world's most energy-efficient supercomputers, including the No. 1 system.
Companies across many industries report similar results. PayPal, for example, improved real-time fraud detection by 10% while cutting server energy consumption nearly 8x through accelerated computing.
The benefits continue to grow with each new generation of GPU hardware and software. In a recent report, Stanford University's Institute for Human-Centered AI estimated that since 2003, GPU performance has increased “approximately 7,000 times” and price per performance “5,600 times.”
Two experts weigh in
Experts also recognize the need for new energy efficiency metrics.
Today's data centers achieve scores of about 1.2 PUE, and the metric has largely done its job, said data center engineer Christian Belady, the inventor of PUE. “It made data centers more efficient back when things were bad, but 20 years on they're much better, and we need to focus on other metrics more relevant to today's problems.”

Looking ahead, “the holy grail is a performance metric,” he says. “You can't directly compare different workloads, but if you segment by workload, I think you have a better chance of success,” said Belady, who continues to work on initiatives that promote sustainability.
Jonathan Koomey, a researcher and author on computing efficiency and sustainability, agrees. “To make good decisions about efficiency, data center operators need a suite of benchmarks that measures the energy implications of today's most widely used AI workloads,” he said.

“Tokens per joule is a great example of what one element of such a suite might look like,” Koomey added. “Companies will need to engage in open discussions, share information about the nuances of their own workloads and experiments, and agree on realistic test procedures so these metrics accurately characterize the energy use of hardware running real-world applications.”

“Finally, we need an open public forum to conduct this important work,” he said.
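As one concrete illustration of the kind of metric Koomey describes, tokens per joule divides an AI system's useful output by the energy it consumed. The figures below are hypothetical, not benchmark results:

```python
def tokens_per_joule(tokens_generated: int, energy_joules: float) -> float:
    """Useful work per unit of energy for generative AI serving:
    output tokens divided by the energy consumed producing them."""
    return tokens_generated / energy_joules

# Hypothetical run: a server generates 5 million tokens while drawing
# an average of 10 kW for 30 minutes (10,000 W * 1,800 s = 1.8e7 J).
energy_j = 10_000 * 30 * 60
print(round(tokens_per_joule(5_000_000, energy_j), 3))  # 0.278
```

Unlike PUE, a metric like this rises when either the hardware gets faster or the software serves the same output with less energy, which is exactly the progress a workload-segmented benchmark would capture.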
It takes a village
Thanks to metrics like PUE and rankings like Green500, data centers and supercomputing centers have made great strides in energy efficiency.
More can and must be done to extend those efficiency advances into the era of generative AI. Metrics of the energy consumed doing useful work on today's top applications can take supercomputing and data center energy efficiency to the next level.
Source: Jeremy Rodriguez, NVIDIA