Truth is Universal: Robust Detection of Lies in LLMs

May-27-2025, 21:43:54 GMT–Neural Information Processing Systems

Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities. In particular, LLMs are capable of "lying", that is, knowingly outputting false statements. Hence, it is of interest and importance to develop methods to detect when LLMs lie. Indeed, several authors have trained classifiers on internal model activations to detect LLM lies. However, other researchers have shown that these classifiers may fail to generalise, for example to negated statements.

llm, robust detection, universal

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)