The Environmental Cost of Serving LLMs


What Google’s real-world study says about energy, carbon, and water per prompt
Why focus on serving, not training?
A few years ago, conversations around AI sustainability focused almost entirely on the huge, one-off compute required to train foundation models. Once trained, these systems sat quietly in the background of a few specialised applications.
Today, the situation is very different. Millions of people (and countless automated back-end systems) are sending prompts to large language models every second. This shift means that the ongoing cost of inference, not the one-time training event, now dominates AI’s overall footprint.
This week we unpack the key findings from a recent Google study, Measuring the Environmental Impact of Delivering AI at Google Scale (Elsworth et al., 2025). The paper is notable because it reports empirical, production-scale measurements from live Gemini services, rather than theoretical lab estimates.
Google focused on three per-prompt metrics during inference: energy use, carbon emissions (CO₂e), and water consumption.
Included: the full serving stack, from active accelerator power draw to idle capacity and data-centre overheads.
Excluded: the one-off cost of training the model.
Individually, these numbers seem small. But at scale, they add up quickly.
LLMs now power search engines, productivity tools, creative apps, and fully automated back-end systems, from content moderation to recommendation pipelines. A single automated workflow can trigger more model calls in a day than a human would in a year.
That’s why per-prompt efficiency improvements matter: they compound massively in deployment.
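To make that compounding concrete, here is a minimal back-of-envelope sketch. The request volume and per-prompt figures below are hypothetical placeholders, not numbers from the Google study; the point is only how quickly small per-prompt costs scale.

```python
# Back-of-envelope: how small per-prompt costs compound at scale.
# All numbers below are hypothetical placeholders, NOT figures from the study.

PROMPTS_PER_DAY = 2_000_000_000   # assumed daily request volume
ENERGY_WH_PER_PROMPT = 0.3        # assumed energy per prompt (Wh)
WATER_ML_PER_PROMPT = 0.25        # assumed water per prompt (mL)

daily_energy_kwh = PROMPTS_PER_DAY * ENERGY_WH_PER_PROMPT / 1_000
daily_water_litres = PROMPTS_PER_DAY * WATER_ML_PER_PROMPT / 1_000

print(f"Energy per day: {daily_energy_kwh:,.0f} kWh")
print(f"Water per day:  {daily_water_litres:,.0f} litres")

# A 2x per-prompt efficiency gain halves these totals directly,
# which is why per-prompt improvements compound in deployment.
print(f"After a 2x efficiency gain: {daily_energy_kwh / 2:,.0f} kWh/day")
```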
Google reports major efficiency gains from model and systems research, spanning model architecture, serving strategies, and hardware efficiency.
The takeaway: work on model and hardware efficiency doesn’t just lower costs; it directly reduces the environmental footprint of AI.
What makes this study stand out is that it’s empirical at production scale. Rather than relying on laboratory assumptions about prompt length or cluster load, it draws directly from live Gemini services, reflecting the realities of large-scale deployment.
It also offers a clear benchmarking framework, outlining transparent boundaries and methods that other providers could adopt to allow fair comparisons of environmental performance.
Finally, it takes a holistic view of impact, extending beyond GPU power draw to include idle capacity, data-centre overheads, and water consumption, before translating total energy use into CO₂e with recognised emission factors.
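As a rough illustration of that accounting boundary, the sketch below folds active and idle machine energy together, applies a PUE-style multiplier for data-centre overhead, and converts the total into CO₂e and water. The PUE, emission factor, and water factor here are illustrative assumptions, not values from the paper.

```python
# Illustrative per-prompt accounting: energy -> CO2e and water.
# PUE, emission factor, and water factor are assumptions, not values from the paper.

ACTIVE_WH = 0.20              # assumed accelerator + host energy per prompt (Wh)
IDLE_SHARE_WH = 0.05          # assumed share of idle/reserved capacity per prompt (Wh)
PUE = 1.1                     # assumed data-centre overhead multiplier
GRID_CO2E_G_PER_KWH = 400.0   # assumed grid emission factor (g CO2e / kWh)
WATER_L_PER_KWH = 1.0         # assumed water use per unit of energy (L / kWh)

total_wh = (ACTIVE_WH + IDLE_SHARE_WH) * PUE   # facility-level energy per prompt
total_kwh = total_wh / 1_000

co2e_g = total_kwh * GRID_CO2E_G_PER_KWH       # carbon footprint per prompt
water_ml = total_kwh * WATER_L_PER_KWH * 1_000 # water consumption per prompt

print(f"Energy per prompt: {total_wh:.3f} Wh")
print(f"CO2e per prompt:   {co2e_g:.3f} g")
print(f"Water per prompt:  {water_ml:.3f} mL")
```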
While this study offers valuable empirical insight, its findings still need assessment from specialists in environmental impact and sustainability analysis. As ML researchers, we can interpret the technical implications (how architecture choices, inference efficiency, and serving strategies influence resource use), but we can’t fully evaluate the robustness of the environmental accounting itself. That’s where expertise from environmental scientists, energy analysts, and life-cycle assessment researchers will be critical: their perspective will help validate the assumptions behind these figures and clarify how they translate into real-world impact.
Per-prompt impacts may seem negligible until you multiply them by billions of requests a day. The encouraging news is that architecture choices, serving strategies, and hardware efficiency are already driving major reductions.
As AI adoption deepens, efficiency is no longer just an engineering concern. It’s a product strategy, shaping cost, performance, and sustainability.
Reference
Elsworth, Cooper, et al. Measuring the Environmental Impact of Delivering AI at Google Scale. arXiv preprint arXiv:2508.15734 (2025).