Qrita: High-performance Top-k and Top-p Algorithm for GPUs using Pivot-based Truncation and Selection
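1️⃣ One-sentence summary
This paper proposes Qrita, a new algorithm that uses pivot-based search and truncation to perform the Top-k and Top-p filtering steps of large language model text generation efficiently and deterministically, reaching up to twice the throughput and half the memory use of existing methods.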
Top-k and Top-p are the dominant truncation operators in the sampling of large language models. Despite their widespread use, implementing them efficiently over large vocabularies remains a significant challenge. Existing approaches often rely either on sorting, which incurs significant computation and memory overhead on GPUs, or on stochastic methods, which alter the algorithm's output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm built on a pivot-based selection strategy. Building on RTop-k, which uses a pivot-based search for node selection in graph neural networks, Qrita extends pivot-based search to both Top-k and Top-p with two key techniques: (1) Gaussian-based sigma-truncation, which greatly reduces the search space of the target elements, and (2) quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. We provide the full implementation of Qrita in Triton, a popular GPU programming language. Our evaluation of Qrita against the Top-k and Top-p kernels of high-performance LLM execution engines such as vLLM, SGLang, and FlashInfer shows that Qrita achieves up to 2x the throughput and half the memory use while producing the same output as the sorting-based algorithms.
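To make the idea of a sort-free, pivot-based Top-k more concrete, here is a minimal CPU sketch in plain Python. It is not the Qrita kernel and is not taken from the paper's Triton code: the function names `kth_largest_by_pivots` and `top_k_filter`, the four-bucket bookkeeping, and the lowest-index tie-break are all assumptions made for illustration, and the Gaussian sigma-truncation step is not modeled at all. The sketch assumes a non-empty list of finite floats (no `-inf` or NaN entries) and `1 <= k <= len(values)`.

```python
import math

def kth_largest_by_pivots(values, k):
    """Return (v, above): v is the k-th largest value, above = count(x > v).

    Illustrative only. Assumes finite floats and 1 <= k <= len(values).
    """
    lo, hi = min(values), max(values)
    above = 0  # number of elements strictly greater than hi
    while lo < hi:
        # Three pivots split [lo, hi] into four buckets. Clamping them into
        # (lo, hi] guarantees that every pass shrinks the candidate range.
        floor = math.nextafter(lo, math.inf)
        p1, p2, p3 = (min(max(lo + (hi - lo) * f, floor), hi)
                      for f in (0.25, 0.5, 0.75))
        count = [0, 0, 0, 0]
        bmin = [math.inf] * 4
        bmax = [-math.inf] * 4
        for x in values:
            if lo <= x <= hi:
                b = (x >= p1) + (x >= p2) + (x >= p3)  # bucket index 0..3
                count[b] += 1
                bmin[b] = min(bmin[b], x)
                bmax[b] = max(bmax[b], x)
        # Walk the buckets from the top until the k-th largest lands in one,
        # then narrow the range to that bucket's actual min and max.
        for b in (3, 2, 1, 0):
            if above + count[b] >= k:
                lo, hi = bmin[b], bmax[b]
                break
            above += count[b]
    return lo, above

def top_k_filter(logits, k):
    """Keep exactly k logits and mask the rest with -inf, without sorting.

    Ties at the threshold are kept in index order (an assumed tie-break).
    """
    v, above = kth_largest_by_pivots(logits, k)
    ties_left = k - above  # how many copies of v still fit in the top-k
    out = []
    for x in logits:
        keep = x > v
        if x == v and ties_left > 0:
            keep = True
            ties_left -= 1
        out.append(x if keep else -math.inf)
    return out

print(top_k_filter([3.0, 1.0, 2.0, 2.0, 5.0, 2.0], 3))
# prints [3.0, -inf, 2.0, -inf, 5.0, -inf]
```

Splitting the range with three pivots per pass, rather than one, shrinks it roughly twice as fast per iteration as a binary search would. Narrowing to each bucket's observed minimum and maximum lets the loop stop as soon as only duplicates of a single value remain. In principle the same loop could accumulate probability mass instead of counts to locate a Top-p threshold; that variant is not shown here.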
Source: arXiv:2602.01518