Faithfulness of LLM Self-Explanations for Commonsense Tasks: Larger Is Better, and Instruction-Tuning Allows Trade-Offs but Not Pareto Dominance
Siegel, Noah Y., Heess, Nicolas, Perez-Ortiz, Maria, Camburu, Oana-Maria
arXiv.org Artificial Intelligence
As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, encompassing both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.
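As a rough illustration of how a probability-free counterfactual test could be scored, the sketch below computes a phi coefficient (the Pearson correlation of two binary variables). It assumes, which the abstract does not spell out, that phi-CCT correlates two per-example indicators: whether a counterfactual intervention changed the model's prediction, and whether the model's explanation mentioned the intervention. The function name and the toy data are hypothetical, not taken from the paper.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length binary sequences.

    Assumed reading for this sketch (not confirmed by the abstract):
    x[i] = 1 if the intervention changed the model's prediction,
    y[i] = 1 if the explanation mentioned the intervention.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")

    # Counts for the 2x2 contingency table.
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)

    denom = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denom == 0:
        # Undefined when either variable is constant.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


# Toy, made-up indicators for eight interventions.
prediction_changed = [1, 1, 0, 0, 1, 0, 0, 0]
explanation_mentions = [1, 0, 0, 0, 1, 1, 0, 0]
print(phi_coefficient(prediction_changed, explanation_mentions))
```

Under this reading, a value near 1 would mean explanations mention an intervention mostly when it actually mattered to the prediction, while a value near 0 would mean mentions are unrelated to impact. Because only binary outcomes enter the computation, no token probabilities are needed.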
Mar-17-2025
- Country:
- Africa > Middle East (0.04)
- Asia
- Afghanistan (0.04)
- Japan (0.04)
- Middle East
- Jordan (0.04)
- Republic of Türkiye (0.06)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Atlantic Ocean > North Atlantic Ocean
- North Sea (0.04)
- Europe
- France (0.04)
- Italy > Tuscany
- Florence (0.04)
- Middle East (0.04)
- Monaco (0.04)
- Netherlands (0.04)
- North Sea (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- North America
- Canada > Quebec
- Montreal (0.04)
- Mexico (0.04)
- United States
- California > San Diego County
- San Diego (0.04)
- Colorado (0.04)
- District of Columbia > Washington (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- Illinois > Cook County
- Chicago (0.04)
- Michigan (0.04)
- New York (0.04)
- Texas (0.04)
- South America > Brazil (0.04)
- Genre:
- Research Report > New Finding (0.47)
- Industry:
- Leisure & Entertainment > Sports
- Football (0.67)
- Media (1.00)
- Retail (1.00)
- Education (1.00)
- Government (0.67)
- Transportation > Air (0.67)
- Health & Medicine (1.00)
- Consumer Products & Services (0.94)
- Law Enforcement & Public Safety (0.67)
- Technology: