arXiv submission date: 2026-04-06
📄 Abstract - LP-GEMM: Integrating Layout Propagation into GEMM Operations

In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show average speedups of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. We demonstrate the practicality of the approach beyond microbenchmarks by implementing a standalone C++ version of the Llama-3.2 inference path using exclusively BLAS-level GEMM calls. These results confirm that leveraging data layout propagation between operations can significantly boost performance.
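The core idea of the abstract, propagating a packed layout from one GEMM's output into the next GEMM's input instead of restoring to a canonical layout in between, can be illustrated with a toy sketch. This is not the paper's actual kernel decomposition: the function names are hypothetical, and a simple transposed (column-major) layout stands in for a real BLAS packing format.

```python
# Toy sketch of layout propagation across two chained GEMMs, E = (A @ B) @ D.
# Assumption: "packed" here just means transposed; real packing formats
# (panel-major tiles, etc.) are more elaborate, but the flow is the same.

def matmul(a, b):
    """Canonical row-major GEMM on nested lists: returns a @ b."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]

def gemm_emit_packed(a, b):
    """GEMM 1: emit the product directly in the packed (transposed)
    layout rather than restoring it to row-major -- no unpacking pass."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # packed[j][i] == (a @ b)[i][j]
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for i in range(rows)]
            for j in range(cols)]

def gemm_consume_packed(c_packed, d):
    """GEMM 2: consume its left operand already in the packed layout
    (no repacking) and restore the final result to row-major only at
    the BLAS boundary."""
    inner, rows, cols = len(c_packed), len(c_packed[0]), len(d[0])
    # e[i][j] = sum_p c[i][p] * d[p][j], with c[i][p] == c_packed[p][i]
    return [[sum(c_packed[p][i] * d[p][j] for p in range(inner))
             for j in range(cols)] for i in range(rows)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
D = [[1, 2], [3, 4]]

baseline = matmul(matmul(A, B), D)                          # repack between calls
propagated = gemm_consume_packed(gemm_emit_packed(A, B), D)  # layout carried through
assert propagated == baseline
```

The intermediate matrix `C = A @ B` is never materialized in canonical layout; only the final result `E` is, which mirrors the paper's claim of preserving BLAS semantics at the boundaries while skipping interior pack/unpack work.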

Top-level tags: systems, model training, machine learning
Detailed tags: gemm optimization, layout propagation, blas libraries, performance, scientific computing

LP-GEMM: Integrating Layout Propagation into GEMM Operations


1️⃣ One-sentence summary

This paper proposes LP-GEMM, a method that lets consecutive matrix multiplications share their data's in-memory packing layout, avoiding redundant layout-conversion overhead and thereby significantly speeding up scientific-computing and machine-learning workloads.

Source: arXiv:2604.04599