F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
1️⃣ One-sentence summary
This paper introduces F2LLM-v2, a family of multilingual embedding models that, through novel training methods, supports over 200 languages while delivering strong performance across a wide range of compute budgets, with the aim of making AI technology serve the world's languages more inclusively.
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a newly curated corpus of 60 million publicly available, high-quality samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we produce models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
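The abstract mentions matryoshka learning, which trains embeddings so that truncated prefixes of the vector remain useful on their own, letting one model serve multiple dimension/compute budgets. Below is a minimal sketch of the idea, assuming an in-batch InfoNCE contrastive objective; the prefix dimensions, temperature, and uniform weighting are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def info_nce(q, d, temperature=0.05):
    """In-batch InfoNCE: each query's positive is the same-index document."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal

def matryoshka_loss(q, d, dims=(64, 128, 256)):
    """Sum the contrastive loss over nested prefix dimensions (illustrative
    choice of dims) so truncated embeddings stay useful on their own."""
    return sum(info_nce(q[:, :k], d[:, :k]) for k in dims)

# Toy example: documents are noisy copies of their paired queries.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 256))
d = q + 0.1 * rng.normal(size=(8, 256))
loss = matryoshka_loss(q, d)
```

At inference, a deployment that can only afford 64-dimensional vectors simply keeps the first 64 components of each embedding, since the loss above has explicitly optimized that prefix.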
Source: arXiv:2603.19223