arXiv submission date: 2026-01-05
📄 Abstract - Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While *Chain-of-Thought* (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are **faithful** generative drivers of the model's output or merely **post-hoc rationalizations**. We introduce **Project Ariadne**, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs **hard interventions** ($do$-calculus) on intermediate reasoning nodes, systematically inverting logic, negating premises, and reversing factual claims, to measure the **Causal Sensitivity** ($\phi$) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent *Faithfulness Gap*. We define and detect a widespread failure mode termed **Causal Decoupling**, where agents exhibit a violation density ($\rho$) of up to $0.77$ in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as "Reasoning Theater" while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.
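
To make the intervention protocol concrete, below is a minimal sketch of one way the audit loop described in the abstract could be operationalized. Everything named here is an assumption for illustration, not the paper's implementation: `generate_answer` stands in for whatever LLM call produces a final answer from a (possibly intervened) reasoning trace, `negate` stands in for the paper's hard interventions (logic inversion, premise negation, claim reversal), and the Causal Sensitivity $\phi$ is taken to be the fraction of single-step interventions that flip the final answer, with the violation density $\rho$ as its complement. The paper's exact operationalizations may differ.

```python
"""Illustrative counterfactual audit loop (NOT the paper's code).

Assumed definitions: phi = fraction of single-step interventions that
change the terminal answer; rho = 1 - phi (the answer survives even
though one reasoning step was contradicted). The paper may define
both quantities differently.
"""
from typing import Callable, List


def audit_trace(
    question: str,
    cot_steps: List[str],
    generate_answer: Callable[[str, List[str]], str],  # hypothetical LLM wrapper
    negate: Callable[[str], str],  # hypothetical hard-intervention operator
) -> dict:
    if not cot_steps:
        raise ValueError("need at least one reasoning step to intervene on")

    # Baseline answer conditioned on the original, unmodified trace.
    baseline = generate_answer(question, cot_steps)

    flips = 0
    for i in range(len(cot_steps)):
        # Hard intervention do(step_i := negate(step_i)): overwrite one
        # reasoning node, hold the remaining steps fixed, and regenerate
        # only the terminal answer (one possible reading of the audit).
        intervened = list(cot_steps)
        intervened[i] = negate(intervened[i])
        if generate_answer(question, intervened) != baseline:
            flips += 1

    phi = flips / len(cot_steps)  # causal sensitivity (assumed definition)
    rho = 1.0 - phi               # violation density (assumed complement)
    return {"baseline_answer": baseline, "phi": phi, "rho": rho}
```

Under these assumptions, a fully faithful trace yields $\phi \approx 1$, while the Causal Decoupling regime reported in the abstract corresponds to $\rho$ approaching $0.77$: most reasoning steps can be contradicted without moving the answer.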

Top tags: llm agents, model evaluation
Detailed tags: faithfulness, causal reasoning, interpretability, structural causal models, benchmark

Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents


1️⃣ One-sentence summary

This paper introduces a new framework called Project Ariadne, which uses structural causal models and counterfactual reasoning to test whether the reasoning traces produced by large language model agents actually drive their decisions. It finds that models often say one thing and do another: their explanations may be post-hoc rationalizations rather than the true basis for their decisions.

Source: arXiv 2601.02314