arXiv submission date: 2026-04-30
📄 Abstract - Strait: Perceiving Priority and Interference in ML Inference Serving

Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under concurrent execution may restrict their applicability in on-premises scenarios. We present \emph{Strait}, a serving system designed to enhance deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, Strait models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points while incurring acceptable costs on low-priority tasks. Compared to software-defined preemption approaches, Strait also exhibits more equitable performance.

Top-level tags: systems model evaluation
Detailed tags: inference serving priority scheduling latency estimation interference modeling gpu scheduling

Strait: Perceiving Priority and Interference in ML Inference Serving


1️⃣ One-sentence summary

This paper presents Strait, an ML inference serving system that predicts data-transfer contention and kernel-execution interference on GPUs and uses those predictions for priority-aware task scheduling, substantially reducing deadline violations for high-priority inference requests under heavy load while keeping the cost to low-priority tasks acceptable.
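To make the idea of priority-aware, prediction-driven scheduling concrete, here is a minimal sketch. It is not Strait's actual algorithm: the names (`predict_latency`, `interference_factor`) and the linear interference model are assumptions for illustration only. The sketch inflates a profiled baseline latency by a term that grows with the number of co-running kernels, then orders requests by priority first and predicted deadline slack second.

```python
# Hypothetical illustration only -- not Strait's published model or scheduler.

def predict_latency(base_ms: float, concurrency: int,
                    interference_factor: float = 0.15) -> float:
    """Inflate a profiled baseline latency by a simple linear interference
    term that grows with the number of co-running kernels (a stand-in for
    an adaptive interference prediction model)."""
    return base_ms * (1.0 + interference_factor * max(concurrency - 1, 0))

def schedule(requests, now_ms: float = 0.0, concurrency: int = 2):
    """Order requests: high priority (prio 0) before low (prio 1), and
    within a priority class, least predicted deadline slack first.

    Each request is a tuple: (name, priority, base_latency_ms, deadline_ms).
    """
    def sort_key(req):
        _name, prio, base_ms, deadline_ms = req
        slack = deadline_ms - (now_ms + predict_latency(base_ms, concurrency))
        return (prio, slack)

    return [name for name, *_ in sorted(requests, key=sort_key)]

if __name__ == "__main__":
    reqs = [
        ("lowA",  1, 10.0, 100.0),  # low priority, generous deadline
        ("highB", 0, 20.0,  40.0),  # high priority, tight deadline
        ("highC", 0,  5.0,  60.0),  # high priority, more slack
    ]
    print(schedule(reqs))  # highB runs first: highest priority, least slack
```

Under this toy model, `highB` is scheduled ahead of `highC` because its predicted completion leaves less slack before its deadline, and both preempt the queue position of `lowA`; a real system would additionally re-estimate interference as concurrency changes.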

Source: arXiv: 2604.28175