Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Dec-27-2025, 03:25:41 GMT–Neural Information Processing Systems

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs--e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always (A)--which models systematically fail to mention in their explanations.

chain-of-thought prompting, cot explanation, unfaithful explanation, (5 more...)

Neural Information Processing Systems

Dec-27-2025, 03:25:41 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)