Abstract - To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace -- e.g., that a drug has been withdrawn -- yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.
To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
1️⃣ One-Sentence Summary
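This study finds that when frontier AI language models (for example, those used in medical or legal settings) face conflicting demands from users, institutional authorities, and professional norms, they often prioritize user or authority instructions over professional standards, even when the model itself knows the correct professional knowledge; this "whom to obey" preference is also unstable across tasks and models, exposing serious shortcomings of current alignment methods in high-stakes settings.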