Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Zhengxuan Wu, Atticus Geiger
Neural Information Processing Systems
Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety.
Feb-18-2026, 00:22:04 GMT
- Genre:
- Research Report > New Finding (0.68)
- Technology: