arXiv submission date: 2026-07-02
📄 Abstract - WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs

Large Language Model (LLM) inference workloads are a rapidly growing contributor to data center energy consumption. Optimizing these deployments requires matching specific LLMs to the most efficient GPUs, but operators currently lack the tools to do so without exhaustively profiling each combination. While some predictive models exist, they still require profiling data and struggle to generalize to hardware unseen during training. To address this, we introduce WattGPU, featuring two predictive models for mean GPU power draw and Inter-Token Latency (ITL). Our approach leverages only publicly available LLM metadata and GPU specifications, eliminating the need for hardware access or profiling while enabling generalization to unseen NVIDIA server-grade GPUs and LLMs. We evaluate our models using rigorous leave-one-GPU-out and leave-one-LLM-out cross-validation on a dataset of 42 open-source LLMs (0.1B to 27B parameters) and 8 GPUs under both offline and server scenarios. The mean power draw model achieves a median absolute percentage error of $\leq 3.4\%$ for offline and $\leq 13.5\%$ for server scenarios on unseen GPUs, while the latency model achieves $\leq 8.5\%$ in server mode, both maintaining strong GPU ranking correlations for server scenarios (Kendall $\tau \geq 0.76$). Compared to standard physically grounded baselines (Load-Scaled Thermal Design Power (TDP) for power draw and roofline for latency), our models reduce median absolute percentage error by approximately $4\times$ on unseen LLM-GPU combinations for server scenarios or approximately $2\times$ for completely unseen GPUs. WattGPU's data and code are publicly available at this https URL.
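
The evaluation protocol named in the abstract (leave-one-GPU-out splits, median absolute percentage error, and Kendall $\tau$ for GPU ranking) can be illustrated with a minimal Python sketch. This is not the WattGPU code: the function names, the row layout (dictionaries with a `gpu` key), and the choice of the tau-a variant are assumptions made only for illustration.

```python
from itertools import combinations
from statistics import median


def mdape(actual, predicted):
    """Median absolute percentage error, returned in percent."""
    return median(abs(p - a) / abs(a) * 100.0 for a, p in zip(actual, predicted))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant (assumed variant).
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def leave_one_group_out(rows, key):
    """Yield (held_out, train, test), holding out one value of rows[i][key] per fold.

    With key="gpu" this gives leave-one-GPU-out folds; a hypothetical
    key="llm" would give leave-one-LLM-out folds.
    """
    for held_out in sorted({r[key] for r in rows}):
        train = [r for r in rows if r[key] != held_out]
        test = [r for r in rows if r[key] == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Toy rows with made-up numbers, only to show the fold structure.
    rows = [
        {"gpu": "gpu_a", "llm": "llm_1", "power_w": 250.0},
        {"gpu": "gpu_a", "llm": "llm_2", "power_w": 300.0},
        {"gpu": "gpu_b", "llm": "llm_1", "power_w": 180.0},
        {"gpu": "gpu_b", "llm": "llm_2", "power_w": 210.0},
    ]
    for held_out, train, test in leave_one_group_out(rows, "gpu"):
        print(held_out, len(train), len(test))
```

In each fold a model would be fitted on `train` and scored on `test`, so the held-out GPU never contributes training data; `mdape` then summarizes the per-sample error and `kendall_tau` checks whether predicted and measured values rank the GPUs in the same order.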

Top-level tags: llm systems, model evaluation
Detailed tags: energy prediction, gpu latency, inference optimization, cross-validation, power modeling

WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs


1️⃣ One-sentence summary

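This paper presents WattGPU, a method that uses only publicly available GPU specifications and LLM metadata, with no hardware measurements, to predict the power draw and latency of different LLMs running on different GPUs; its predictions are markedly more accurate than those of traditional physical models based on power and bandwidth (the Load-Scaled TDP and roofline baselines).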

Source: arXiv: 2607.02391