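📄 Paper Summary
TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
1️⃣ One-Sentence Summary
This paper builds the first comprehensive benchmark for Turkish information retrieval, shows that late-interaction models are far more parameter-efficient than conventional dense encoders (a model 600× smaller retains over 71% of the average mAP), and compares indexing algorithms to enable low-latency retrieval.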
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models, which retain token-level representations for fine-grained matching, have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600× smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3-5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33× faster than PLAID and offers +1.7% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
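As an illustration of the token-level matching that late-interaction models perform, the following is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. It is not the authors' code and not PyLate's API: the function names (`normalize`, `maxsim_score`) and the two-dimensional toy embeddings are assumptions made up for this example. Each query token is matched to its most similar document token by cosine similarity, and the per-token maxima are summed.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take the maximum cosine
    similarity to any document token, then sum over query tokens.
    Assumes doc_tokens is non-empty (hypothetical helper, illustration only)."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Toy token embeddings (made up): one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # one token, partially matching both

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

A dense bi-encoder would instead compress each text into a single vector before comparison; keeping one vector per token is what lets `doc_a` score higher here, since each query token finds its own best match.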