Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Neural Information Processing Systems 

We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, along a learned set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On Alpaca, an instruction-finetuned LLaMA model, ITI improves truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength.
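The core mechanism described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the variable names, shapes, and the choice of which heads to intervene on are all assumptions. It shows the essential operation of adding a scaled, learned direction vector to the activations of a few selected attention heads at inference time, with a strength parameter that controls the truthfulness/helpfulness tradeoff.

```python
import numpy as np

rng = np.random.default_rng(0)

n_heads, head_dim = 4, 8

# Hypothetical learned "truthful" directions for a subset of heads
# (in the paper these come from probes trained on activations;
# here they are random unit vectors for illustration).
directions = {h: rng.standard_normal(head_dim) for h in (1, 3)}
directions = {h: d / np.linalg.norm(d) for h, d in directions.items()}

def intervene(head_activations: np.ndarray, alpha: float = 5.0) -> np.ndarray:
    """Shift selected heads' activations along their learned directions.

    head_activations: array of shape (n_heads, head_dim) for one token.
    alpha: intervention strength; tuning it trades off truthfulness
           against helpfulness, as discussed in the abstract.
    """
    out = head_activations.copy()
    for h, d in directions.items():
        out[h] += alpha * d  # shift this head along its learned direction
    return out

# Example: apply the intervention to one token's head activations.
acts = rng.standard_normal((n_heads, head_dim))
shifted = intervene(acts, alpha=5.0)
```

In this sketch, heads without a learned direction pass through unchanged; only the selected heads are shifted, reflecting the paper's point that the intervention touches a limited number of attention heads.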