EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

📄 Abstract

We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to capture multi-GPU energy behavior accurately. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications, including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware mixture-of-experts (MoE) modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy, and 12.97% across streaming multiprocessor (SM) allocations for Megatron-style overlap. Our energy-driven exploration reveals up to 1.47x and 52.9x variation in energy efficiency across configurations for prefill and decode, respectively, which motivates distributed serving. We further show that compute-communication overlap is difficult to optimize by intuition alone, and that EnergyLens correctly identifies Pareto-optimal overlap configurations.
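The abstract relies on two metrics that are easy to misread: MAPE as the accuracy measure, and Pareto optimality as the selection criterion. The sketch below illustrates both under stated assumptions. It is not EnergyLens code. The function names, configuration labels, and all numbers are hypothetical and chosen only for illustration, and the choice of latency and energy as the two Pareto axes is an assumption rather than something the abstract specifies.

```python
from typing import List, Sequence, Tuple


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent, of predictions vs. measurements."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


def pareto_front(points: Sequence[Tuple[str, float, float]]) -> List[str]:
    """Return names of configurations not dominated by any other.

    Each point is (name, latency, energy); lower is better on both axes
    (an assumption made for this illustration).
    """
    front = []
    for name, lat, en in points:
        dominated = any(
            (l2 <= lat and e2 <= en) and (l2 < lat or e2 < en)
            for _, l2, e2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical measured vs. predicted energy values (joules), not from the paper.
measured_j = [100.0, 200.0]
predicted_j = [110.0, 180.0]
print(f"MAPE: {mape(measured_j, predicted_j):.2f}%")

# Hypothetical overlap configurations: (name, latency in ms, energy in J).
configs = [
    ("cfg_a", 1.0, 5.0),
    ("cfg_b", 2.0, 3.0),
    ("cfg_c", 2.5, 4.0),  # slower and costlier than cfg_b, so dominated
    ("cfg_d", 3.0, 2.0),
]
print("Pareto-optimal:", pareto_front(configs))
```

In this reading, a predictor is useful for exploration when its MAPE is low enough that the Pareto front computed from predicted values matches the one computed from measurements.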


1️⃣ One-Sentence Summary

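This paper proposes EnergyLens, a framework that accurately predicts the energy consumption of LLM inference in multi-GPU settings without expensive real-world profiling, helping developers quickly choose the best deployment configurations and optimization strategies to reduce energy use.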
