
arXiv submission date: 2026-04-14
📄 Abstract - TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning

Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explosion and data dependency issues typically caused by traditional multi-task learning. Extensive experiments validate the effectiveness of each individual enabler and the holistic TCL framework. When optimizing a range of mainstream DL models on both CPU and GPU platforms, TCL achieves, on average, 16.8x and 12.48x faster tuning time, and 1.20x and 1.13x lower inference latency, respectively, compared to Tenset-MLP.
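The abstract's first enabler, the RDU Sampler, greedily picks about 10% of tensor programs by jointly scoring Representativeness, Diversity, and Uncertainty. A minimal sketch of such an active-learning selection, assuming precomputed feature vectors and per-program uncertainty estimates; the function name, the normalization, and the unweighted sum of the three terms are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def rdu_select(features, uncertainties, budget_frac=0.10):
    """Greedily select a budgeted subset by combining three scores:
    representativeness (closeness to the dataset centroid), diversity
    (distance to the nearest already-selected point), and uncertainty.
    Illustrative sketch only -- not the paper's exact method."""
    n = len(features)
    k = max(1, int(budget_frac * n))

    # Representativeness: higher for points near the feature centroid.
    centroid = features.mean(axis=0)
    rep = -np.linalg.norm(features - centroid, axis=1)
    rep = (rep - rep.min()) / (np.ptp(rep) + 1e-9)

    # Uncertainty: normalized model-uncertainty estimates (e.g. ensemble variance).
    unc = (uncertainties - uncertainties.min()) / (np.ptp(uncertainties) + 1e-9)

    selected = []
    available = np.ones(n, dtype=bool)
    for _ in range(k):
        if selected:
            # Diversity: distance to the nearest point already selected.
            sel = features[selected]
            div = np.min(np.linalg.norm(features[:, None] - sel[None], axis=2), axis=1)
            div = div / (div.max() + 1e-9)
        else:
            div = np.ones(n)
        score = rep + div + unc
        score[~available] = -np.inf
        i = int(np.argmax(score))
        selected.append(i)
        available[i] = False
    return selected
```

With a 10% budget over, say, 100k candidate schedules, only the selected subset would be compiled and measured on hardware, which is where the data-collection savings come from.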

Top-level tags: systems, model training, machine learning
Detailed tags: deep learning compilers, tensor program optimization, cost modeling, continual learning, hardware transferability

TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning


1️⃣ One-Sentence Summary

This paper proposes TCL, a deep learning compiler framework that combines an efficient active-learning sampler, a lightweight new cost-prediction model, and a continual knowledge-distillation mechanism to substantially reduce the data-collection cost and tuning time needed to optimize tensor programs for different hardware platforms, while also improving the runtime performance of the optimized programs.
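The continual knowledge-distillation mechanism mentioned in the summary can be sketched as a combined training objective: when adapting the cost model to a new hardware platform, the student fits measured latencies there while a frozen teacher (trained on earlier platforms) regularizes its predictions. The function below is a hypothetical illustration of that idea; the loss form and the `alpha` weighting are assumptions, not the paper's actual objective:

```python
import numpy as np

def distill_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Illustrative continual-distillation objective (sketch, not the paper's).

    task:    fit measured costs on the new hardware platform.
    distill: stay close to the frozen teacher from previous platforms,
             so earlier-learned knowledge is retained without keeping
             the old training data or growing per-platform parameters.
    """
    task = np.mean((student_pred - target) ** 2)
    distill = np.mean((student_pred - teacher_pred) ** 2)
    return (1 - alpha) * task + alpha * distill
```

Sweeping `alpha` would trade plasticity on the new platform against retention of previously learned platforms.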

Source: arXiv 2604.12891