Most people don’t have any idea what the true cost of server electricity use is. This is for many reasons, but I want to highlight the main factors:
- In leasing / colo arrangements electricity cost is often flat-rated, and/or bundled up with many other costs.
- Many hosts don’t provide electricity usage reporting at all. And those that do sometimes provide it only obliquely as the nominal dollar cost, not in terms of actual Joules used (typically expressed as kWh or MWh, which are just different units for the same measure).
- If you build & operate your own datacentre, you do get to see your actual usage – at least in terms of power (Watts) and energy (Joules) if not more holistic cost – but very few people have any actual experience with that.
The sticker price is not the cost
Most of us have no idea what electricity actually costs in any context, even for our own residential use at home, or for electric car charging. Practically nobody pays the actual cost of electricity up front (more details). The full cost is usually many times larger than what is claimed on the bill from the utility provider. It’s largely disaggregated (mainly as air pollution & greenhouse gas emissions) and often grossly unfair (e.g. loss of lives due to dam failures & nuclear accidents, health harms geographically localised around power stations or around rivers downstream, etc).
This is broadly true for all sources of electricity, “green” or not, although it is dramatically less so for some forms of renewable energy (particularly photovoltaics and wind).
Electrical power is mostly not an operating expense
It’s often presumed that electricity is only an operating expense, i.e. you pay a power company per kWh and that’s it. That’s actually a small fraction of the cost – most of the cost is in capital expenses. Even factoring in the true cost of power, i.e. with subsidies stripped out and externalities included, electricity operating expenses are still only a fraction of the total.
Every bit of electricity you want to put into a datacentre requires infrastructure to handle it. The more energy you want to put in, the more it costs to build that capacity (economies of scale help dampen this impact only a little). Transmission lines, substations, transformers, power distribution throughout the building, rectifiers, backup systems (batteries, fly-wheels, emergency generators), etc.
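To make that concrete, here’s a rough back-of-envelope sketch in Python. Every figure in it (build-out cost per kW, amortisation period, utility rate) is an assumption picked purely for illustration, not a measurement from any real facility; the point is only that the amortised capital cost of provisioned capacity is of the same order as, or larger than, the electricity bill itself.

```python
# Back-of-envelope comparison of amortised capital cost vs the electricity bill
# for one kilowatt of provisioned datacentre capacity. All figures are
# illustrative assumptions, not measurements.

CAPEX_PER_KW = 10_000.0       # assumed build-out cost: $10k per kW of provisioned capacity
FACILITY_LIFETIME_YEARS = 15  # assumed amortisation period
ENERGY_PRICE_PER_KWH = 0.10   # assumed utility rate, $/kWh
HOURS_PER_YEAR = 24 * 365

def annual_costs(utilisation: float) -> tuple[float, float]:
    """Return (amortised capex, energy opex) per provisioned kW per year."""
    capex = CAPEX_PER_KW / FACILITY_LIFETIME_YEARS
    opex = ENERGY_PRICE_PER_KWH * HOURS_PER_YEAR * utilisation
    return capex, opex

for u in (0.3, 0.6, 1.0):
    capex, opex = annual_costs(u)
    print(f"utilisation {u:.0%}: capex ${capex:,.0f}/yr vs energy ${opex:,.0f}/yr")
```

Even with assumptions chosen generously in favour of the electricity bill, the infrastructure side doesn’t shrink when servers sit idle.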
And every bit of power in means heat out, which has to be removed, which takes yet more energy & money. Baffles and cabinets, fans, radiators, plumbing, heat exchangers, heat pumps (air conditioners) and evaporative coolers, etc. It’s not quite as bad as the tyranny of the rocket equation, but in the same vein.
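The usual shorthand for that overhead is PUE (Power Usage Effectiveness): total facility power divided by the power delivered to the IT equipment. A tiny sketch, with the PUE values chosen only as illustrations:

```python
# Every watt delivered to the servers also has to be removed as heat, plus
# losses in power distribution, backup systems, etc. The summary metric is
# PUE (total facility power / IT power). The PUE values below are assumed
# for illustration only.

def facility_power(it_load_kw: float, pue: float) -> float:
    """Total power drawn from the grid for a given IT load, in kW."""
    return it_load_kw * pue

for pue in (1.1, 1.5, 2.0):
    total = facility_power(1_000, pue)
    print(f"PUE {pue}: 1,000 kW of servers draws {total:,.0f} kW, "
          f"of which {total - 1_000:,.0f} kW is overhead")
```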
If you look at a datacentre, physically, most of it is actually about power (and the flip-side, heat). The servers themselves take up a minority of the space and actually cost less than the infrastructure that supports them.
Power costs are dominated by provisioned power, not actual usage
You pay for power infrastructure whether you use it or not.
If you operate at your power capacity all the time, then you at least have a technically efficient system in that respect.
Almost no datacentres (or individual servers) operate that way. Most are actually operating at way below full load, pretty much all the time. And that means you’ve paid a pretty penny for your power capacity and yet you’re wasting most of it (and therefore most of your money).
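One way to see the waste is to amortise the infrastructure cost over the energy you actually consume. Reusing the same illustrative assumptions as the earlier sketch (they’re still made up), the effective price of each kWh you actually use climbs steeply as utilisation falls:

```python
# The infrastructure bill is fixed by provisioned capacity, so the effective
# cost of each kWh you actually use rises sharply as utilisation falls.
# The capital and energy figures reuse the illustrative assumptions from
# the earlier sketch.

CAPEX_PER_KW_YEAR = 10_000.0 / 15   # assumed amortised build cost, $/kW/year
ENERGY_PRICE_PER_KWH = 0.10         # assumed utility rate, $/kWh
HOURS_PER_YEAR = 24 * 365

def effective_cost_per_kwh(utilisation: float) -> float:
    """All-in cost per kWh actually consumed, at a given average utilisation."""
    used_kwh = HOURS_PER_YEAR * utilisation            # per provisioned kW
    total_cost = CAPEX_PER_KW_YEAR + ENERGY_PRICE_PER_KWH * used_kwh
    return total_cost / used_kwh

for u in (0.1, 0.3, 0.6, 1.0):
    print(f"average utilisation {u:.0%}: ${effective_cost_per_kwh(u):.2f} per kWh used")
```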
Plus, what really matters is how much useful work you accomplish with that power – e.g. if you’re Bitcoin mining, all that power is wasted irrespective of its technical efficiency.
Pathologically wasteful workloads aside, this is made dramatically worse – even for well-meaning workloads – by non-linear efficiencies, the most horrific example being Intel’s Turbo Boost (and AMD’s equivalent). Those techniques effectively try to ensure the CPU operates at peak power usage irrespective of its actual performance. With them enabled you’re using much more of your provisioned power capacity all the time, but you’re not actually getting much more done. It’s incredibly wasteful and worse than not using the power at all (since while electricity operating expenses are a minor component, they’re still significant).
There are similar non-linearity problems throughout the server (and datacentre), e.g. RAM that’s always on and consuming most of its provisioned power even if completely unused. Though the processors (CPUs, GPUs, etc) are usually the most egregious and significant offenders.
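A hypothetical illustration of that non-linearity (the operating points below are invented, not measurements of any particular CPU): dynamic power grows roughly with frequency times voltage squared, so the last slice of clock speed buys a little extra throughput for a lot of extra power, and the energy spent per unit of work goes up.

```python
# Illustration of non-linear power vs performance, in the spirit of the
# Turbo Boost complaint above. The operating points are assumed values for
# illustration only: the higher the boost, the more power is spent per unit
# of work actually accomplished.

operating_points = [
    # (label, relative throughput, power in watts) -- assumed values
    ("base clock",    1.00, 150),
    ("modest boost",  1.15, 210),
    ("maximum boost", 1.25, 280),
]

for label, throughput, watts in operating_points:
    energy_per_work = watts / throughput   # proportional to joules per unit of work
    print(f"{label:>13}: {throughput:.2f}x throughput at {watts} W "
          f"-> {energy_per_work:.0f} W per unit of relative throughput")
```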
The reasons datacentres (and the servers in them) usually operate well below capacity are myriad, but in short the key ones are:
- Bad software (encompassing everything from top level architecture to implementation details, and fundamental decisions like choice of programming languages).
  - This is (in my personal estimation) by far the dominant factor.
  - This is a huge subtopic in its own right, which I’ll expand upon in a future post.
- Unused space. Whether space not yet filled with servers (because installation and growth takes time, and you usually want some spare capacity at all times), or space temporarily bereft of operating servers due to equipment upgrades and the like.
- Poor architecture amounting to (in a nutshell) insufficient over-provisioning.
Server workloads typically aren’t stable but usually are predictable – e.g. a diurnal cycle because most people are asleep at night rather than using your services, weekends differing from weekdays due to the typical western work week, etc. So even with good software it can be impractical to achieve complete utilisation of hardware. In that case, you can still save money by recognising that you don’t need every server in your datacentre to operate at peak power simultaneously – you can essentially move your power capacity around based on moment-by-moment need.
In a simplistic homogeneous environment – where every server is identical – this is somewhat pointless, but most non-trivial datacentres are not like that; they have not just many generations of servers, but also different kinds of servers (web servers vs storage servers etc). During peak traffic you might focus your power on your web servers, but when traffic ebbs you can shift that power to your batch-job servers.
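As a minimal sketch of what “moving power capacity around” can look like, assume a fixed facility budget and just two hypothetical pools (the pool names, demand curve, and power figures are all invented): the latency-sensitive pool gets what it needs at any given moment, and the batch pool soaks up whatever headroom is left, enforced in practice via per-server power caps (e.g. RAPL or BMC power limits).

```python
# A minimal sketch of re-allocating a fixed facility power budget between a
# latency-sensitive pool and a batch pool as demand ebbs and flows. All of
# the names and numbers are hypothetical.

FACILITY_BUDGET_KW = 800.0   # assumed total provisioned power

def allocate(web_demand_kw: float) -> dict[str, float]:
    """Give the web pool what it needs right now; batch gets the remainder."""
    web = min(web_demand_kw, FACILITY_BUDGET_KW)
    batch = FACILITY_BUDGET_KW - web
    return {"web": web, "batch": batch}

# A crude diurnal demand curve for the web pool (kW at a few hours of the day).
for hour, demand in [(4, 200), (12, 550), (20, 750)]:
    alloc = allocate(demand)
    print(f"{hour:02d}:00  web {alloc['web']:.0f} kW, batch {alloc['batch']:.0f} kW")
```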
Arguably you don’t need to over-provision your hardware if you have a good software architecture, e.g. if you can run any job on any machine and move jobs fluidly, and thus can always efficiently bin-pack your machines. That’s practically never perfectly the case, although you can get usefully close – nonetheless, which approach is fundamentally better remains a subject of reasonable debate within server architecture circles.
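For comparison, here’s a minimal sketch of the bin-packing approach mentioned above, using a simple first-fit-decreasing heuristic on a one-dimensional “power” measure. The job sizes and machine capacity are hypothetical, and real schedulers juggle many more dimensions (RAM, I/O, latency constraints, failure domains) than this toy does.

```python
# First-fit-decreasing bin packing: if any job can run on any machine, pack
# jobs onto as few machines as possible so the rest can be powered down.
# The job sizes (watts) and per-machine power cap are hypothetical.

def first_fit_decreasing(job_watts: list[float], machine_capacity: float) -> list[list[float]]:
    machines: list[list[float]] = []
    for job in sorted(job_watts, reverse=True):
        for machine in machines:
            if sum(machine) + job <= machine_capacity:
                machine.append(job)
                break
        else:
            machines.append([job])   # no existing machine fits; power on another
    return machines

jobs = [120, 80, 300, 220, 150, 90, 60, 240]      # assumed per-job power draw, in watts
packing = first_fit_decreasing(jobs, machine_capacity=500)
for i, machine in enumerate(packing):
    print(f"machine {i}: {machine} -> {sum(machine)} W of a 500 W cap")
```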