
Large Language Models Are More Persuasive Than Incentivized Human Persuaders

Schoenegger, Philipp, Salvi, Francesco, Liu, Jiacheng, Nan, Xiaoli, Debnath, Ramit, Fasolo, Barbara, Leivada, Evelina, Recchia, Gabriel, Günther, Fritz, Zarifhonarvar, Ali, Kwon, Joe, Islam, Zahoor Ul, Dehnert, Marco, Lee, Daryl Y. H., Reinecke, Madeline G., Kamper, David G., Kobaş, Mert, Sandford, Adam, Kgomo, Jonas, Hewitt, Luke, Kapoor, Shreya, Oktar, Kerem, Kucuk, Eyup Engin, Feng, Bo, Jones, Cameron R., Gainsburg, Izzy, Olschewski, Sebastian, Heinzelmann, Nora, Cruz, Francisco, Tappin, Ben M., Ma, Tao, Park, Peter S., Onyonka, Rayan, Hjorth, Arthur, Slattery, Peter, Zeng, Qingcheng, Finke, Lennart, Grossmann, Igor, Salatiello, Alessandro, Karger, Ezra

arXiv.org Artificial Intelligence

We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans who have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.


EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Zhang, Lin, Dong, Wenshuo, Zhang, Zhuoran, Yang, Shu, Hu, Lijie, Liu, Ninghao, Zhou, Pan, Wang, Di

arXiv.org Artificial Intelligence

Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is affected by either the zero-gradient problem or saturation effects, where edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP). EAP-GP introduces an integration path that starts from the input and adaptively follows the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements of up to 17.7%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits.
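The saturation effect the abstract describes can be illustrated with a toy example. Below is a minimal NumPy sketch (not the paper's implementation): for a saturating function, the plain edge-attribution-patching score, (corrupted − clean) dotted with the gradient at the clean input, collapses toward zero because the gradient vanishes in the saturated region, while a path-integrated score (here a simple straight-line integration in the spirit of integrated gradients, rather than EAP-GP's adaptive GradPath) recovers the true change in output. The function `f` and all inputs are illustrative assumptions, not from the paper.

```python
import numpy as np

W = np.array([4.0, 4.0])  # toy weights for a saturating "model output"

def f(x):
    # sigmoid over a weighted sum: saturates for large |x @ W|
    return 1.0 / (1.0 + np.exp(-(x @ W)))

def grad_f(x):
    s = f(x)
    return s * (1.0 - s) * W  # sigmoid derivative times weights

clean = np.array([2.0, 2.0])    # deep in the saturated region of f
corrupt = np.array([0.0, 0.0])  # corrupted (baseline) input

# Plain gradient-based score: near zero because grad_f(clean) is tiny.
eap_score = (corrupt - clean) @ grad_f(clean)

# Path-integrated score: average gradients along the straight line from
# corrupt to clean, then take the same dot product. By construction this
# approximates f(clean) - f(corrupt), avoiding the saturation problem.
steps = np.linspace(0.0, 1.0, 200)
avg_grad = np.mean(
    [grad_f(corrupt + t * (clean - corrupt)) for t in steps], axis=0
)
ig_score = (clean - corrupt) @ avg_grad
```

Here `eap_score` is essentially zero despite the output changing by about 0.5 between the two inputs, whereas `ig_score` tracks that change; EAP-GP's contribution is to choose the integration path adaptively from gradient differences instead of using a fixed straight line.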