arXiv submission date: 2026-05-14
📄 Abstract - EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to accurately capture multi-GPU energy behavior. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among existing deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy, and 12.97% across SM allocations for Megatron-style overlap. Our energy-driven exploration reveals up to 1.47x and 52.9x energy variation across configurations in prefill and decode efficiency and motivates distributed serving. We further show that compute-communication overlap is difficult to optimize with intuition alone, but EnergyLens correctly identifies Pareto-optimal overlap configurations.
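The abstract reports accuracy as mean absolute percentage error (MAPE) and describes selecting Pareto-optimal configurations. As a minimal sketch of what those two terms mean computationally, the Python below defines both. The function names `mape` and `pareto_front` are hypothetical, and treating each configuration as a (latency, energy) pair where lower is better is an assumption for illustration; the abstract does not state which objectives EnergyLens trades off, and this is not the framework's own code.

```python
from typing import List, Sequence, Tuple


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error between measurements and predictions, in percent."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


def pareto_front(points: Sequence[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the points not dominated by any other point.

    Each point is an assumed (latency, energy) pair; lower is better on both axes.
    A point is dominated if another point is no worse on both axes and
    strictly better on at least one.
    """
    front = []
    for i, (lat_i, en_i) in enumerate(points):
        dominated = any(
            lat_j <= lat_i and en_j <= en_i and (lat_j < lat_i or en_j < en_i)
            for j, (lat_j, en_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((lat_i, en_i))
    return front


# Made-up numbers for illustration only; these are not results from the paper.
example_error = mape(measured=[100.0, 200.0], predicted=[110.0, 180.0])
example_front = pareto_front([(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)])
```

In this toy input, the configuration (3.0, 4.0) is dropped from the front because (2.0, 3.0) is better on both axes; the remaining points each represent a distinct trade-off.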

Top-level tag: llm systems
Detailed tags: energy optimization multi-gpu inference model serving efficiency

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization


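1️⃣ One-sentence summary

This paper proposes EnergyLens, a framework that accurately predicts the energy consumption of LLM inference in multi-GPU settings without expensive real-world profiling, helping developers quickly choose the best deployment configurations and optimization strategies and thereby reduce energy use.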

Source: arXiv: 2605.14249