Mechanistic Interpretability Needs Philosophy
Iwan Williams, Ninell Oldenburg, Ruchira Dhar, Joshua Hatherley, Constanza Fierro, Nina Rajcic, Sandrine R. Schiller, Filippos Stamatiou, Anders Søgaard
Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems. Taking three open problems from the MI literature as examples, this position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
arXiv.org Artificial Intelligence
Jun-24-2025
- Country:
  - Asia
  - Europe
    - Austria > Vienna (0.14)
    - Denmark > Capital Region > Copenhagen (0.04)
    - France (0.04)
    - Germany > Hesse > Darmstadt Region > Frankfurt (0.04)
    - Switzerland > Zürich > Zürich (0.14)
    - United Kingdom > England > Cambridgeshire > Cambridge (0.04)
    - United Kingdom > England > Oxfordshire > Oxford (0.14)
  - North America > United States
    - California > Los Angeles County > Alhambra (0.04)
    - Illinois > Cook County > Chicago (0.04)
    - Massachusetts (0.04)
    - Washington > King County > Seattle (0.04)
  - Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.04)
- Genre:
  - Overview (0.68)
  - Research Report (0.64)
- Industry:
  - Health & Medicine > Therapeutic Area > Neurology (0.47)
- Technology:
  - Information Technology > Artificial Intelligence
    - Cognitive Science (1.00)
    - Machine Learning > Neural Networks (1.00)
    - Natural Language (1.00)
    - Representation & Reasoning (1.00)