Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CausalGym: Benchmarking causal interpretability methods on linguistic tasks

Aryaman Arora, Dan Jurafsky, Christopher Potts

Published 2024-02-19 · Tasks: Benchmarking, Interpretability Techniques for Deep Learning
Links: Paper · PDF · Code (official)

Abstract

Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behavior. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behaviour. To illustrate how CausalGym can be used, we study the pythia models (14M--6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler--gap dependencies. Our analysis shows that the mechanism implementing both of these tasks is learned in discrete stages, not gradually.
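Several of the baselines benchmarked here (difference-in-means, linear probing, PCA) reduce to finding a direction in activation space and swapping a hidden state's component along it with that of a counterfactual input. A minimal sketch of that idea, assuming a simple difference-in-means direction; the function names and the 1-D-subspace simplification are illustrative, not the paper's implementation:

```python
import numpy as np

def difference_in_means_direction(acts_a, acts_b):
    """Unit vector from the mean activation of class B to that of class A.

    acts_a, acts_b: (n, d) arrays of hidden states collected from
    minimal-pair inputs differing in one linguistic feature.
    """
    d = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return d / np.linalg.norm(d)

def interchange_along(h, source_h, direction):
    """Replace h's component along `direction` with source_h's component,
    leaving the orthogonal complement of h untouched."""
    return h + (source_h @ direction - h @ direction) * direction
```

For example, with class means differing only in the first coordinate, the direction is the first axis, and the intervention copies exactly that coordinate from the counterfactual state:

```python
acts_a = np.array([[2.0, 0.0], [2.0, 2.0]])   # mean (2, 1)
acts_b = np.array([[0.0, 0.0], [0.0, 2.0]])   # mean (0, 1)
v = difference_in_means_direction(acts_a, acts_b)           # (1, 0)
interchange_along(np.array([3.0, 5.0]), np.array([7.0, 1.0]), v)  # (7, 5)
```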

Results

Task: Interpretability Techniques for Deep Learning
Dataset: CausalGym
Metric: Log odds-ratio (pythia-6.9b)

Model                  Value
DAS                    9.95
Linear probe           3.42
Difference-in-means    2.91
k-means                1.87
PCA                    1.81
LDA                    0.27
Random                 0.01
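The log odds-ratio metric above measures how strongly an intervention shifts probability from the base-consistent next token toward the counterfactual one. A self-contained sketch of one common way to compute it from logits; the exact formulation used by the benchmark may differ in detail:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - z for x in logits]

def log_odds_ratio(base_logits, intervened_logits, src_tok, tgt_tok):
    """Change in log-odds of the counterfactual token (tgt_tok) versus the
    base token (src_tok) caused by the intervention. Zero means the
    intervention had no causal effect on this contrast."""
    base = log_softmax(base_logits)
    after = log_softmax(intervened_logits)
    return (after[tgt_tok] - after[src_tok]) - (base[tgt_tok] - base[src_tok])
```

Because softmax normalization cancels in the difference, the value depends only on the logit gaps: flipping a two-token preference from +2 to -2 logits yields a log odds-ratio of 4.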

Related Papers

- Visual Place Recognition for Large-Scale UAV Applications (2025-07-20)
- Training Transformers with Enforced Lipschitz Constants (2025-07-17)
- Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
- DCR: Quantifying Data Contamination in LLMs Evaluation (2025-07-15)
- A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion (2025-07-15)
- FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning (2025-07-15)