arXiv submission date: 2026-02-04
📄 Abstract - Subliminal Effects in Your Data: A General Mechanism via Log-Linearity

Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.

Top-level tags: llm model training data
Detailed tags: dataset effects log-linear structure hidden subtext preference elicitation subliminal signals

Subliminal Effects in Your Data: A General Mechanism via Log-Linearity


1️⃣ One-sentence summary

This paper uncovers a general mechanism: by selectively combining subsets of the training data, a wide range of hidden, non-obvious behaviors can be induced in large language models, such as holding specific preferences, responding to prompts in a language absent from the dataset, or adopting a different persona, and the effect holds across models with different architectures.
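To make the idea concrete, here is a minimal, purely illustrative sketch of subset selection under a log-linear additivity assumption: each training example is assigned a scalar effect on the target behavior's log-odds, the effect of a subset is modeled as the sum of its members' effects, and selection keeps the examples that push most strongly toward the target. The function name, scoring, and greedy rule are hypothetical simplifications, not the paper's actual LLS procedure.

```python
# Illustrative sketch (NOT the paper's LLS algorithm): assume each
# example i has a precomputed scalar effect e_i on the target
# behavior's log-odds, and that effects add linearly across the subset.

def select_subset(effects, budget):
    """Greedily pick up to `budget` examples with the largest positive
    per-example effects, i.e. those whose summed contribution pushes
    the model's log-odds most strongly toward the target behavior."""
    ranked = sorted(range(len(effects)), key=lambda i: effects[i], reverse=True)
    return [i for i in ranked[:budget] if effects[i] > 0.0]

# Toy per-example effects: positive values nudge the model toward the
# hidden target behavior, negative values push away from it.
effects = [0.8, -0.2, 0.5, 0.1, -0.7, 0.9]
subset = select_subset(effects, budget=3)
print(subset)  # → [5, 0, 2]
```

Under this additivity assumption, the selected subset looks unremarkable example by example, yet its summed effect steers the trained model toward the hidden behavior, which is the intuition behind the subliminal effects described above.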

Source: arXiv:2602.04863