LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States
1️⃣ One-Sentence Summary
This paper finds that the "attention value" vectors extracted from a large language model's attention mechanism capture the overall meaning of a sentence better than the conventionally used final-layer hidden states, and it proposes a simple, effective aggregation method that reaches state-of-the-art sentence representation quality without any additional training.
Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that, when paired with suitable prompts, the attention outputs of a layer can be interpreted as aligned weighted value vectors: the attention scores of the last token serve as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.
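To make the Value Aggregation idea concrete, the sketch below extracts the per-layer attention value projections from a decoder-only HuggingFace model and mean-pools them over layers and tokens to form a sentence embedding. The backbone name, pooling over all layers, and plain mean pooling are illustrative assumptions; the paper's exact prompts, layer selection, and pooling scheme may differ.

```python
# A minimal sketch of Value Aggregation (VA), assuming a LLaMA-style decoder-only
# model from HuggingFace transformers. Model choice, pooling over all layers, and
# plain mean pooling are illustrative assumptions, not the paper's exact recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical choice of backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

captured_values = []  # one tensor per layer: (batch, seq_len, value_dim)

def capture_hook(module, inputs, output):
    # Store the value projection output for this layer.
    captured_values.append(output.detach())

# Hook each attention block's value projection (v_proj) to grab the value vectors.
for layer in model.model.layers:
    layer.self_attn.v_proj.register_forward_hook(capture_hook)

@torch.no_grad()
def value_aggregation_embedding(sentence: str) -> torch.Tensor:
    """Return a sentence embedding by mean-pooling value vectors over layers and tokens."""
    captured_values.clear()
    inputs = tokenizer(sentence, return_tensors="pt")
    model(**inputs)
    values = torch.stack([v[0] for v in captured_values])  # (layers, seq_len, value_dim)
    return values.mean(dim=(0, 1))

emb = value_aggregation_embedding("A quick brown fox jumps over the lazy dog.")
print(emb.shape)  # e.g. torch.Size([4096]) for a 7B full-attention model
```

AlignedWVA refines this by weighting each token's value vector with the last token's attention scores and projecting the result through $W_O$ into the residual stream, but the training-free pooling structure stays the same as in this sketch.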
Source: arXiv: 2602.01572