arXiv submission date: 2026-05-04
📄 Abstract - Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

We present a method for diagnosing interpretations of neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange interventions. Rather than treating interchange intervention accuracy as a single global summary, we refine this framework by partitioning the input space into well-interpreted and under-interpreted regions according to pairwise interchange-intervention behavior. This turns causal abstraction from a purely global evaluation into a more diagnostic tool: it not only measures whether an interpretation works, but also reveals where it works, where it fails, and what distinguishes the two cases. This diagnostic view also provides practical heuristics for improving interpretations. By analyzing the structure of the well-interpreted and under-interpreted regions, we can identify missing distinctions in a high-level hypothesis, discover previously unmodeled intermediate variables, and combine complementary partial interpretations into a stronger one. We instantiate this idea as a simple four-step recipe and show that it yields informative error analyses across multiple causal abstraction settings. In a toy logic task, recursively applying the recipe recovers a high-level hypothesis from scratch. More broadly, our results suggest that partitioning the input space is a useful step toward more precise, constructive, and scalable mechanistic interpretability.
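To make the bucketing idea concrete, here is a minimal Python sketch of the kind of per-input scoring it relies on. The callables `low_level_intervene` (run the network on a base input with the probed activation patched in from a source input) and `high_level_intervene` (the analogous intervention on the high-level causal model), the pairwise scoring rule, and the 0.9 accuracy threshold are all illustrative assumptions, not the paper's exact four-step recipe.

```python
import itertools
from typing import Callable

def bucket_inputs(
    inputs: list,                    # assumed hashable, e.g. tuples of token ids
    low_level_intervene: Callable,   # hypothetical: network output on `base` with
                                     # the probed activation patched from `source`
    high_level_intervene: Callable,  # hypothetical: analogous intervention on the
                                     # high-level causal model
    threshold: float = 0.9,          # illustrative cutoff for "well-interpreted"
):
    """Partition inputs by per-input interchange-intervention accuracy (a sketch)."""
    hits = {x: 0 for x in inputs}
    trials = {x: 0 for x in inputs}
    # Score every ordered (base, source) pair of distinct inputs: do the low-level
    # and high-level models agree on the output after the interchange intervention?
    for base, source in itertools.permutations(inputs, 2):
        agree = low_level_intervene(base, source) == high_level_intervene(base, source)
        # Credit (or blame) both participants of the pair.
        for x in (base, source):
            trials[x] += 1
            hits[x] += int(agree)
    acc = {x: hits[x] / trials[x] for x in inputs}
    well = [x for x in inputs if acc[x] >= threshold]
    under = [x for x in inputs if acc[x] < threshold]
    return well, under, acc
```

Crediting each pair's agreement to both its base and source input is just one simple way to turn pairwise behavior into a per-input score; comparing what distinguishes the resulting buckets is what drives the error analysis and hypothesis refinement described above.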

Top-level tags: machine learning, model training, model evaluation
Detailed tags: causal abstraction, interpretability, interchange interventions, input space partitioning, faithfulness diagnosis

Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction


1️⃣ One-sentence summary

This paper proposes a method that locates the weaknesses of a causal interpretation of a neural network by partitioning the input data into "well-interpreted" and "under-interpreted" regions, then uses the differences between the two regions to improve the interpretation, turning causal abstraction from a method that could only be evaluated globally into one that can be diagnosed and optimized.

From arXiv: 2605.02234