arXiv submission date: 2026-04-29
📄 Abstract - FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resources used by activated experts and the resources provisioned. This underutilization becomes even more pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. We implement a prototype on an open-source edge-oriented FaaS platform and evaluate it using Qwen1.5-moe-2.7B under multi-tenant workloads. Compared to a full-model baseline, FaaSMoE uses less than one third of the resources, demonstrating a practical and resource-efficient path towards scalable MoE serving in a multi-tenant environment.
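
To make the control-plane/execution-plane split concrete, here is a minimal sketch of how a gating network might route a token's hidden state to experts hosted as stateless FaaS functions. The gateway URL, the `expert-l{layer}-e{idx}` function naming scheme, and the JSON payload are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of serverless MoE routing: the control plane scores
# experts locally and invokes only the top-k over HTTP, so idle experts can
# be scaled to zero by the FaaS platform. All endpoint and payload details
# below are assumptions for illustration.
import json
import urllib.request

import numpy as np

FAAS_GATEWAY = "http://gateway.local:8080/function"  # assumed FaaS endpoint

def route_token(hidden: np.ndarray, gate_w: np.ndarray,
                layer: int, top_k: int = 2) -> np.ndarray:
    """Control plane: score all experts, invoke only the top-k as functions."""
    logits = hidden @ gate_w                     # (num_experts,)
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[-top_k:]             # indices of activated experts

    out = np.zeros_like(hidden)
    for idx in top:
        # Each expert lives in its own stateless function; only the experts
        # actually activated for this token consume memory and compute.
        req = urllib.request.Request(
            f"{FAAS_GATEWAY}/expert-l{layer}-e{idx}",
            data=json.dumps({"hidden": hidden.tolist()}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            expert_out = np.array(json.loads(resp.read())["output"])
        out += probs[idx] * expert_out           # gate-weighted combination
    return out
```

Because the functions are stateless, any tenant's request for the same expert can land on the same warm instance, which is what enables sharing across tenants.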

Top-level tags: systems, machine learning
Detailed tags: mixture-of-experts, serverless computing, multi-tenant serving, resource efficiency, function-as-a-service

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving


1️⃣ One-sentence summary

This paper proposes a new system called FaaSMoE, which exploits the on-demand scaling and scale-to-zero properties of serverless computing platforms to deploy each "expert" in a Mixture-of-Experts model as an independent, pay-per-use function, sharply reducing resource waste in multi-tenant settings: experiments show it saves more than two thirds of the compute resources compared with conventional full-model deployment.
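
The abstract also mentions a configurable expert granularity: packing several experts into one function trades per-expert elasticity for fewer (and cheaper) invocations. The sketch below illustrates that knob under the same hypothetical naming scheme as above; the grouping rule is an assumption, not the paper's documented mechanism.

```python
# Illustrative sketch of the granularity trade-off: with group_size=1 each
# expert scales independently (max elasticity); with larger groups, one
# invocation can cover several activated experts, but a whole group stays
# warm whenever any member is hot. Names are assumptions for illustration.
def functions_for_token(activated_experts: list[int], group_size: int) -> set[str]:
    """Map activated expert ids to the FaaS functions that must be invoked."""
    return {f"expert-group-{e // group_size}" for e in activated_experts}

# Top-2 routing activating experts 7 and 41:
print(functions_for_token([7, 41], group_size=1))  # {'expert-group-7', 'expert-group-41'}
print(functions_for_token([7, 41], group_size=4))  # {'expert-group-1', 'expert-group-10'}
```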

Source: arXiv 2604.26881