
The Environmental Cost of Serving LLMs

Research
Dr Nadine Kroher
Passion Labs

What Google’s real-world study says about energy, carbon, and water per prompt

Why focus on serving, not training?

A few years ago, conversations around AI sustainability focused almost entirely on the huge, one-off compute required to train foundation models. Once trained, these systems sat quietly in the background of a few specialised applications.

Today, the situation is very different. Millions of people (and countless automated back-end systems) are sending prompts to large language models every second. This shift means that the ongoing cost of inference, not the one-time training event, now dominates AI’s overall footprint.

This week we unpack the key findings from a recent Google study, Measuring the Environmental Impact of Delivering AI at Google Scale (Elsworth et al., 2025). The paper is notable because it reports empirical, production-scale measurements from live Gemini services, rather than theoretical lab estimates.

What the study actually measured

Google focused on three per-prompt metrics during inference:

  • Energy per prompt (on the data-centre side)
  • Carbon emissions per prompt (CO₂e, derived from energy and grid mix)
  • Water consumed per prompt (primarily for cooling)

Measurement boundary

Included:

  • AI accelerators (GPUs/TPUs) and internal data-centre networking
  • Host CPUs and DRAM supporting the accelerators
  • Idle or reserved capacity (to account for demand fluctuations)
  • Overhead energy to keep facilities running

Excluded:

  • End-user devices (phones and laptops)
  • Public network transfer between user and data centre
  • Model training and training-data storage
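This boundary implies a straightforward per-prompt accounting: sum the energy of everything inside the boundary, then divide by prompts served. A minimal sketch of that calculation (the fleet-level figures below are hypothetical placeholders chosen for illustration, not numbers from the paper):

```python
# Hypothetical per-prompt energy accounting following the study's boundary.
# All fleet-level inputs are illustrative placeholders, not figures from the paper.

def energy_per_prompt_wh(accelerator_wh, host_wh, idle_wh, overhead_wh, prompts):
    """Total in-boundary energy (Wh) divided by prompts served."""
    total_wh = accelerator_wh + host_wh + idle_wh + overhead_wh
    return total_wh / prompts

# Example: a fleet serving one billion prompts over some measurement window.
e = energy_per_prompt_wh(
    accelerator_wh=1.4e8,  # TPUs/GPUs plus internal networking
    host_wh=6.0e7,         # host CPUs and DRAM
    idle_wh=2.0e7,         # idle/reserved capacity
    overhead_wh=2.0e7,     # facility overhead (cooling, power delivery)
    prompts=1e9,
)
print(f"{e:.2f} Wh per prompt")  # → 0.24 Wh per prompt
```

Note how the accelerator term alone would give a noticeably lower figure; the host, idle, and overhead terms are exactly what single-accelerator lab measurements leave out.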

Headline numbers (per prompt)

  • Energy: ~0.24 Wh - roughly the energy of watching television for nine seconds.
    For comparison, controlled lab measurements using a single accelerator would appear lower (~0.1 Wh) because they omit real-world overheads.
  • Carbon: ~0.03 g CO₂e - calculated with market-based emission factors, including Scope 1 and 3 contributions such as hardware manufacturing and supply-chain impact.
  • Water: ~0.26 mL - a few drops per prompt. Around 80% of cooling water evaporates rather than being recycled.

Individually, these numbers seem small. But at scale, they add up quickly.
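To make that intuition concrete, here is the multiplication spelled out. The per-prompt figures are the study's headline numbers; the daily volume of one billion prompts is a hypothetical assumption for illustration:

```python
# Per-prompt figures from the study; daily volume is a hypothetical assumption.
PER_PROMPT = {"energy_wh": 0.24, "co2e_g": 0.03, "water_ml": 0.26}
DAILY_PROMPTS = 1e9  # assumed: one billion prompts per day

daily = {k: v * DAILY_PROMPTS for k, v in PER_PROMPT.items()}
print(f"Energy: {daily['energy_wh'] / 1e6:.0f} MWh/day")  # → 240 MWh/day
print(f"Carbon: {daily['co2e_g'] / 1e6:.0f} tCO2e/day")   # → 30 tCO2e/day
print(f"Water:  {daily['water_ml'] / 1e6:.0f} m^3/day")   # → 260 m^3/day
```

At that assumed volume, "a few drops per prompt" becomes hundreds of cubic metres of water every day.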

Small numbers × huge scale = real impact

LLMs now power search engines, productivity tools, creative apps, and fully automated back-end systems, from content moderation to recommendation pipelines. A single automated workflow can trigger more model calls in a day than a human would in a year.

That’s why per-prompt efficiency improvements matter: they compound massively in deployment.

How Google improved its efficiency over the past year

Google reports major efficiency gains from model and systems research:

  • ~33× reduction in energy per prompt through model-side innovations such as distillation, Mixture-of-Experts (MoE) architectures, and lighter inference paths.
  • ~44× reduction in total emissions per prompt over 12 months.
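As a quick sanity check on what those multipliers mean in relative terms, an N× reduction leaves 1/N of the original per-prompt footprint:

```python
# Express the reported reductions as the remaining fraction of the
# year-ago per-prompt footprint. Pure arithmetic on the stated multipliers.
energy_reduction = 33     # ~33x less energy per prompt
emissions_reduction = 44  # ~44x fewer emissions per prompt

print(f"Energy per prompt: {100 / energy_reduction:.1f}% of a year ago")     # → 3.0%
print(f"Emissions per prompt: {100 / emissions_reduction:.1f}% of a year ago")  # → 2.3%
```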

The takeaway: work on model and hardware efficiency doesn’t just lower costs, it directly reduces the environmental footprint of AI.

Why this study matters

What makes this study stand out is that it's empirical at production scale. Rather than relying on laboratory assumptions about prompt length or cluster load, it draws directly from live Gemini services, reflecting the realities of large-scale deployment.

It also offers a clear benchmarking framework, outlining transparent boundaries and methods that other providers could adopt to allow fair comparisons of environmental performance.

Finally, it takes a holistic view of impact, extending beyond GPU power draw to include idle capacity, data-centre overheads, and water consumption, before translating total energy use into CO₂e with recognised emission factors.

While this study offers valuable empirical insight, it's important to note that its findings still need assessment from specialists in environmental impact and sustainability analysis. As ML researchers, we can interpret the technical implications (how architecture choices, inference efficiency, and serving strategies influence resource use), but we can't fully evaluate the robustness of the environmental accounting itself. That's where expertise from environmental scientists, energy analysts, and life-cycle assessment researchers will be critical. Their perspective will help validate the assumptions behind these figures and clarify how they translate into real-world impact.

What to watch next

  • Will other LLM providers publish comparable real-world data?
  • How much footprint comes from consumer prompts versus automated back-end processes?
  • Can systems automatically direct queries to the lowest-impact capable model?
  • How do latency, context length, and response format influence total energy and emissions?

The bottom line

Per-prompt impacts may seem negligible until you multiply by billions of requests a day. The encouraging news is that architecture choices, serving strategies and hardware efficiency are already driving major reductions.

As AI adoption deepens, efficiency is no longer just an engineering concern. It’s a product strategy - shaping cost, performance, and sustainability.

Reference


Elsworth, Cooper, et al. Measuring the Environmental Impact of Delivering AI at Google Scale. arXiv preprint arXiv:2508.15734 (2025).
