📄
Abstract - CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels
Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achieve high efficiency but are difficult to adapt. Recent work explores large language models (LLMs) for GPU kernel generation, but prior studies report unstable correctness and significant performance gaps for complex operators such as attention. We present CuBridge, an LLM-based framework that adapts expert-written attention kernels through a structured lift-transfer-lower workflow. CuBridge starts from expert-written CUDA attention kernels and lifts them into an executable intermediate representation that makes execution orchestration explicit while abstracting low-level CUDA syntax. Given a user-provided PyTorch specification, CuBridge generates and verifies a target IR program, then reconstructs optimized CUDA code via reference-guided lowering. Across diverse attention variants and GPU platforms, CuBridge consistently produces correct kernels and substantially outperforms general frameworks, compiler-based approaches, and prior LLM-based methods.
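The lift-transfer-lower workflow described above can be sketched in Python. This is a minimal illustrative mock-up, not CuBridge's actual API: the function names (`lift`, `transfer`, `verify`, `lower`), the dict-based IR, and the string-matching verification are all assumptions standing in for the LLM-driven stages the abstract describes.

```python
# Hypothetical sketch of the lift-transfer-lower workflow (illustrative names,
# not CuBridge's real API; the IR here is a plain dict stand-in).

def lift(cuda_kernel_src: str) -> dict:
    """Lift an expert-written CUDA kernel into an intermediate representation
    that makes execution orchestration explicit while hiding CUDA syntax."""
    return {"source": cuda_kernel_src,
            "ops": ["load_qkv", "tile", "softmax", "matmul"]}  # assumed op list

def transfer(source_ir: dict, pytorch_spec: str) -> dict:
    """Generate a target IR program for the user's attention variant.
    In CuBridge this step is LLM-driven; here we just attach the spec."""
    target = dict(source_ir)
    target["spec"] = pytorch_spec
    return target

def verify(target_ir: dict, pytorch_spec: str) -> bool:
    """Check the target IR against the PyTorch specification. CuBridge's IR
    is executable, so real verification can compare numerical outputs."""
    return target_ir.get("spec") == pytorch_spec

def lower(target_ir: dict) -> str:
    """Reconstruct optimized CUDA code via reference-guided lowering,
    reusing low-level idioms from the original expert kernel."""
    return f"// CUDA kernel for spec: {target_ir['spec']}\n" + target_ir["source"]

# Usage: adapt an expert kernel to a user-provided attention specification.
expert_kernel = "__global__ void flash_attention(...) { /* expert code */ }"
spec = "softmax(Q @ K.T * scale + causal_mask) @ V"

ir = lift(expert_kernel)
target = transfer(ir, spec)
assert verify(target, spec)
print(lower(target).splitlines()[0])
```

The point of the structure is separation of concerns: the expensive expert knowledge (orchestration, memory staging) is captured once at lift time, while only the variant-specific logic changes during transfer, and verification happens on the IR before any CUDA is emitted.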
CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels
1️⃣ One-Sentence Summary
CuBridge is a framework that uses large language models to automatically adapt CUDA attention kernels: it lifts expert-written, highly efficient code into an intermediate representation, then regenerates optimized CUDA code according to the user's specification, making it easy to support many new attention variants while preserving high performance.