Patel, Oam
Towards Interpretable Soft Prompts
Patel, Oam, Wang, Jason, Nayak, Nikhil Shivakumar, Srinivas, Suraj, Lakkaraju, Himabindu
Soft prompts have been popularized as a cheap and easy way to improve task-specific LLM performance beyond few-shot prompts. Despite their origin as an automated prompting method, however, soft prompts and other trainable prompts remain a black-box method with no immediately interpretable connections to prompting. We create a novel theoretical framework for evaluating the interpretability of trainable prompts based on two desiderata: faithfulness and scrutability. We find that existing methods do not naturally satisfy our proposed interpretability criterion. Instead, our framework inspires a new direction of trainable prompting methods that explicitly optimizes for interpretability. To this end, we formulate and test new interpretability-oriented objective functions for two state-of-the-art prompt tuners: Hard Prompts Made Easy (PEZ) and RLPrompt. Our experiments with GPT-2 demonstrate a fundamental trade-off between interpretability and the task-performance of the trainable prompt, explicating the hardness of the soft prompt interpretability problem and revealing odd behavior that arises when one optimizes for an interpretability proxy.
Humanity's Last Exam
Phan, Long, Gatti, Alice, Han, Ziwen, Li, Nathaniel, Hu, Josephina, Zhang, Hugh, Zhang, Chen Bo Calvin, Shaaban, Mohamed, Ling, John, Shi, Sean, Choi, Michael, Agrawal, Anish, Chopra, Arnav, Khoja, Adam, Kim, Ryan, Ren, Richard, Hausenloy, Jason, Zhang, Oliver, Mazeika, Mantas, Nguyen, Tung, Anderson, Daron, Shah, Imad Ali, Doroshenko, Mikhail, Stokes, Alun Cennyth, Mahmood, Mobeen, Lee, Jaeho, Pokutnyi, Oleksandr, Iskra, Oleg, Wang, Jessica P., Gerbicz, Robert, Levin, John-Clark, Popov, Serguei, Feng, Fiona, Feng, Steven Y., Zhao, Haoran, Yu, Michael, Gangal, Varun, Zou, Chelsea, Wang, Zihan, Kazakov, Mstyslav, Galgon, Geoff, Schmitt, Johannes, Sanchez, Alvaro, Lee, Yongki, Yeadon, Will, Sauers, Scott, Roth, Marc, Agu, Chidozie, Riis, Søren, Giska, Fabian, Utpala, Saiteja, Cheatom, Antrell, Giboney, Zachary, Goshu, Gashaw M., Crowson, Sarah-Jane, Naiya, Mohinder Maheshbhai, Burns, Noah, Finke, Lennart, Cheng, Zerui, Park, Hyunwoo, Fournier-Facio, Francesco, Zampese, Jennifer, Wydallis, John, Wydallis, John B., Hoerr, Ryan G., Nandor, Mark, Gehrunger, Tim, Cai, Jiaqi, McCarty, Ben, Nam, Jungbae, Taylor, Edwin, Jin, Jun, Loume, Gautier Abou, Cao, Hangrui, Garretson, Alexis C, Sileo, Damien, Ren, Qiuyu, Cojoc, Doru, Arkhipov, Pavel, Qazi, Usman, Bacho, Aras, Li, Lianghui, Motwani, Sumeet, de Witt, Christian Schroeder, Kopylov, Alexei, Veith, Johannes, Singer, Eric, Rissone, Paolo, Jin, Jaehyeok, Shi, Jack Wei Lun, Willcocks, Chris G., Prabhu, Ameya, Tang, Longke, Zhou, Kevin, Santos, Emily de Oliveira, Maksimov, Andrey Pupasov, Vendrow, Edward, Zenitani, Kengo, Robinson, Joshua, Mikov, Aleksandar, Guillod, Julien, Li, Yuqi, Pageler, Ben, Vendrow, Joshua, Kuchkin, Vladyslav, Marion, Pierre, Efremov, Denis, Lynch, Jayson, Liang, Kaiqu, Gritsevskiy, Andrew, Martinez, Dakotah, Crispino, Nick, Zvonkine, Dimitri, Fraga, Natanael Wildner, Soori, Saeed, Press, Ori, Tang, Henry, Salazar, Julian, Green, Sean R., Brüssel, Lina, Twayana, Moon, Dieuleveut, Aymeric, Rogers, T. Ryan, Zhang, Wenjin, Finocchio, Ross, Li, Bikun, Yang, Jinzhou, Rao, Arun, Loiseau, Gabriel, Kalinin, Mikhail, Lukas, Marco, Manolescu, Ciprian, Stambaugh, Nate, Mishra, Subrata, Kamdoum, Ariel Ghislain Kemogne, Hogg, Tad, Jin, Alvin, Bosio, Carlo, Sun, Gongbo, Coppola, Brian P, Heidinger, Haline, Sayous, Rafael, Ivanov, Stefan, Cavanagh, Joseph M, Shen, Jiawei, Imperial, Joseph Marvin, Schwaller, Philippe, Senthilkuma, Shaipranesh, Bran, Andres M, Algaba, Andres, Verbeken, Brecht, Houte, Kelsey Van den, Van Der Sypt, Lynn, Noever, David, Schut, Lisa, Sucholutsky, Ilia, Zheltonozhskii, Evgenii, Yuan, Qiaochu, Lim, Derek, Stanley, Richard, Sivarajan, Shankar, Yang, Tong, Maar, John, Wykowski, Julian, Oller, Martí, Sandlin, Jennifer, Sahu, Anmol, Ardito, Cesare Giulio, Hu, Yuzheng, Dias, Felipe Meneguitti, Kreiman, Tobias, Rawal, Kaivalya, Vilchis, Tobias Garcia, Zu, Yuexuan, Lackner, Martin, Koppel, James, Nguyen, Jeremy, Antonenko, Daniil S., Chern, Steffi, Zhao, Bingchen, Arsene, Pierrot, Ivanov, Sergey, Poświata, Rafał, Wang, Chenguang, Li, Daofeng, Crisostomi, Donato, Dehghan, Ali, Achilleos, Andrea, Ambay, John Arnold, Myklebust, Benjamin, Sen, Archan, Perrella, David, Kaparov, Nurdin, Inlow, Mark H, Zang, Allen, Ramakrishnan, Kalyan, Orel, Daniil, Poritski, Vladislav, Ben-David, Shalev, Berger, Zachary, Whitfill, Parker, Foster, Michael, Munro, Daniel, Ho, Linh, Hava, Dan Bar, Kuchkin, Aleksey, Lauff, Robert, Holmes, David, Sommerhage, Frank, Zhang, Anji, Moat, Richard, Schneider, Keith, Pyda, Daniel, Kazibwe, Zakayo, Singh, Mukhwinder, Clarke, Don, Kim, Dae Hyun, Fish, Sara, Elser, Veit, Vilchis, Victor Efren Guadarrama, Klose, Immo, Demian, Christoph, Anantheswaran, Ujjwala, Zweiger, Adam, Albani, Guglielmo, Li, Jeffery, Daans, Nicolas, Radionov, Maksim, Rozhoň, Václav, Ginis, Vincent, Ma, Ziqiao, Stump, Christian, Platnick, Jacob, Nevirkovets, Volodymyr, Basler, Luke, Piccardo, Marco, Cohen, Niv, Singh, Virendra, Tkadlec, Josef, Rosu, Paul, Goldfarb, Alan, Padlewski, Piotr, Barzowski, Stanislaw, Montgomery, Kyle, Menezes, Aline, Patel, Arkil, Wang, Zixuan, Tucker-Foltz, Jamie, Stade, Jack, Grabb, Declan, Goertzen, Tom, Kazemi, Fereshteh, Milbauer, Jeremiah, Shukla, Abhishek, Elgnainy, Hossam, Labrador, Yan Carlos Leyva, He, Hao, Zhang, Ling, Givré, Alan, Wolff, Hew, Demir, Gözdenur, Aziz, Muhammad Fayez, Kaddar, Younesse, Ängquist, Ivar, Chen, Yanxu, Thornley, Elliott, Zhang, Robin, Pan, Jiayi, Terpin, Antonio, Muennighoff, Niklas, Schoelkopf, Hailey, Zheng, Eric, Carmi, Avishy, Shah, Jainam, Brown, Ethan D. L., Zhu, Kelin, Bartolo, Max, Wheeler, Richard, Ho, Andrew, Barkan, Shaul, Wang, Jiaqi, Stehberger, Martin, Kretov, Egor, Bradshaw, Peter, Heimonen, JP, Sridhar, Kaustubh, Hossain, Zaki, Akov, Ido, Makarychev, Yury, Tam, Joanna, Hoang, Hieu, Cunningham, David M., Goryachev, Vladimir, Patramanis, Demosthenes, Krause, Michael, Redenti, Andrew, Aldous, David, Lai, Jesyin, Coleman, Shannon, Xu, Jiangnan, Lee, Sangwon, Magoulas, Ilias, Zhao, Sandy, Tang, Ning, Cohen, Michael K., Carroll, Micah, Paradise, Orr, Kirchner, Jan Hendrik, Steinerberger, Stefan, Ovchynnikov, Maksym, Matos, Jason O., Shenoy, Adithya, Wang, Michael, Nie, Yuzhou, Giordano, Paolo, Petersen, Philipp, Sztyber-Betley, Anna, Faraboschi, Paolo, Riblet, Robin, Crozier, Jonathan, Halasyamani, Shiv, Pinto, Antonella, Verma, Shreyas, Joshi, Prashant, Meril, Eli, Yong, Zheng-Xin, Tee, Allison, Andréoletti, Jérémy, Weller, Orion, Singhal, Raghav, Zhang, Gang, Ivanov, Alexander, Khoury, Seri, Gustafsson, Nils, Mostaghimi, Hamid, Thaman, Kunvar, Chen, Qijia, Khánh, Tran Quoc, Loader, Jacob, Cavalleri, Stefano, Szlyk, Hannah, Brown, Zachary, Narayan, Himanshu, Roberts, Jonathan, Alley, William, Sun, Kunyang, Stendall, Ryan, Lamparth, Max, Reuel, Anka, Wang, Ting, Xu, Hanmeng, Hernández-Cámara, Pablo, Martin, Freddie, Preu, Thomas, Korbak, Tomek, Abramovitch, Marcus, Williamson, Dominic, Bosio, Ida, Chen, Ziye, Bálint, Biró, Lo, Eve J. Y., Nunes, Maria Inês S., Jiang, Yibo, Bari, M Saiful, Kassani, Peyman, Wang, Zihao, Ansarinejad, Behzad, Sun, Yewen, Durand, Stephane, Douville, Guillaume, Tordera, Daniel, Balabanian, George, Anderson, Earth, Kvistad, Lynna, Moyano, Alejandro José, Milliron, Hsiaoyun, Sakor, Ahmad, Eron, Murat, McAlister, Isaac C., O., Andrew Favre D., Shah, Shailesh, Zhou, Xiaoxiang, Kamalov, Firuz, Clark, Ronald, Abdoli, Sherwin, Santens, Tim, Wang, Harrison K, Chen, Evan, Tomasiello, Alessandro, De Luca, G. Bruno, Looi, Shi-Zhuo, Le, Vinh-Kha, Kolt, Noam, Mündler, Niels, Semler, Avi, Rodman, Emma, Drori, Jacob, Fossum, Carl J, Gloor, Luk, Jagota, Milind, Pradeep, Ronak, Fan, Honglu, Shah, Tej, Eicher, Jonathan, Chen, Michael, Thaman, Kushal, Merrill, William, Firsching, Moritz, Harris, Carter, Ciobâcă, Stefan, Gross, Jason, Pandey, Rohan, Gusev, Ilya, Jones, Adam, Agnihotri, Shashank, Zhelnov, Pavel, Usawasutsakorn, Siranut, Mofayezi, Mohammadreza, Piperski, Alexander, Carauleanu, Marc, Zhang, David K., Dobarskyi, Kostiantyn, Ler, Dylan, Leventov, Roman, Soroko, Ignat, Jansen, Thorben, Creighton, Scott, Lauer, Pascal, Duersch, Joshua, Taamazyan, Vage, Bezzi, Dario, Morak, Wiktor, Ma, Wenjie, Held, William, Huy, Tran Đuc, Xian, Ruicheng, Zebaze, Armel Randy, Mohamed, Mohanad, Leser, Julian Noah, Yuan, Michelle X, Yacar, Laila, Lengler, Johannes, Olszewska, Katarzyna, Shahrtash, Hossein, Oliveira, Edson, Jackson, Joseph W., Gonzalez, Daniel Espinosa, Zou, Andy, Chidambaram, Muthu, Manik, Timothy, Haffenden, Hector, Stander, Dashiell, Dasouqi, Ali, Shen, Alexander, Duc, Emilien, Golshani, Bita, Stap, David, Uzhou, Mikalai, Zhidkovskaya, Alina Borisovna, Lewark, Lukas, Rodriguez, Miguel Orbegozo, Vincze, Mátyás, Wehr, Dustin, Tang, Colin, Phillips, Shaun, Samuele, Fortuna, Muzhen, Jiang, Ekström, Fredrik, Hammon, Angela, Patel, Oam, Farhidi, Faraz, Medley, George, Mohammadzadeh, Forough, Peñaflor, Madellene, Kassahun, Haile, Friedrich, Alena, Sparrow, Claire, Perez, Rayner Hernandez, Sakal, Taom, Dhamane, Omkar, Mirabadi, Ali Khajegili, Hallman, Eric, Okutsu, Kenchi, Battaglia, Mike, Maghsoudimehrabani, Mohammad, Amit, Alon, Hulbert, Dave, Pereira, Roberto, Weber, Simon, Handoko, null, Peristyy, Anton, Malina, Stephen, Albanie, Samuel, Cai, Will, Mehkary, Mustafa, Aly, Rami, Reidegeld, Frank, Dick, Anna-Katharina, Friday, Cary, Sidhu, Jasdeep, Shapourian, Hassan, Kim, Wanyoung, Costa, Mariana, Gurdogan, Hubeyb, Weber, Brian, Kumar, Harsh, Jiang, Tong, Agarwal, Arunim, Ceconello, Chiara, Vaz, Warren S., Zhuang, Chao, Park, Haon, Tawfeek, Andrew R., Aggarwal, Daattavya, Kirchhof, Michael, Dai, Linjie, Kim, Evan, Ferret, Johan, Wang, Yuzhou, Yan, Minghao, Burdzy, Krzysztof, Zhang, Lixin, Franca, Antonio, Pham, Diana T., Loh, Kang Yong, Robinson, Joshua, Jackson, Abram, Gul, Shreen, Chhablani, Gunjan, Du, Zhehang, Cosma, Adrian, Colino, Jesus, White, Colin, Votava, Jacob, Vinnikov, Vladimir, Delaney, Ethan, Spelda, Petr, Stritecky, Vit, Shahid, Syed M., Mourrat, Jean-Christophe, Vetoshkin, Lavr, Sponselee, Koen, Bacho, Renas, de la Rosa, Florencia, Li, Xiuyu, Malod, Guillaume, Lang, Leon, Laurendeau, Julien, Kazakov, Dmitry, Adesanya, Fatimah, Portier, Julien, Hollom, Lawrence, Souza, Victor, Zhou, Yuchen Anna, Degorre, Julien, Yalın, Yiğit, Obikoya, Gbenga Daniel, Arnaboldi, Luca, Rai, null, Bigi, Filippo, Boscá, M. C., Shumar, Oleg, Bacho, Kaniuar, Clavier, Pierre, Recchia, Gabriel, Popescu, Mara, Shulga, Nikita, Tanwie, Ngefor Mildred, Peskoff, Denis, Lux, Thomas C. H., Rank, Ben, Ni, Colin, Brooks, Matthew, Yakimchyk, Alesia, Huanxu, null, Liu, null, Häggström, Olle, Verkama, Emil, Gundlach, Hans, Brito-Santana, Leonor, Amaro, Brian, Vajipey, Vivek, Grover, Rynaa, Fan, Yiyang, Silva, Gabriel Poesia Reis e, Xin, Linwei, Kratish, Yosi, Łucki, Jakub, Li, Wen-Ding, Gopi, Sivakanth, Caciolai, Andrea, Xu, Justin, Scaria, Kevin Joseph, Vargus, Freddie, Habibi, Farzad, Long, null, Lian, null, Rodolà, Emanuele, Robins, Jules, Cheng, Vincent, Fruhauff, Tony, Raynor, Brad, Qi, Hao, Jiang, Xi, Segev, Ben, Fan, Jingxuan, Martinson, Sarah, Wang, Erik Y., Hausknecht, Kaylie, Brenner, Michael P., Mao, Mao, Zhang, Xinyu, Avagian, David, Scipio, Eshawn Jessica, Ragoler, Alon, Tan, Justin, Sims, Blake, Plecnik, Rebeka, Kirtland, Aaron, Bodur, Omer Faruk, Shinde, D. P., Adoul, Zahra, Zekry, Mohamed, Karakoc, Ali, Santos, Tania C. B., Shamseldeen, Samir, Karim, Loukmane, Liakhovitskaia, Anna, Resman, Nate, Farina, Nicholas, Gonzalez, Juan Carlos, Maayan, Gabe, Hoback, Sarah, Pena, Rodrigo De Oliveira, Sherman, Glen, Kelley, Elizabeth, Mariji, Hodjat, Pouriamanesh, Rasoul, Wu, Wentao, Mendoza, Sandra, Alarab, Ismail, Cole, Joshua, Ferreira, Danyelle, Johnson, Bryan, Safdari, Mohammad, Dai, Liangti, Arthornthurasuk, Siriphan, Pronin, Alexey, Fan, Jing, Ramirez-Trinidad, Angel, Cartwright, Ashley, Pottmaier, Daphiny, Taheri, Omid, Outevsky, David, Stepanic, Stanley, Perry, Samuel, Askew, Luke, Rodríguez, Raúl Adrián Huerta, Minissi, Ali M. R., Ali, Sam, Lorena, Ricardo, Iyer, Krishnamurthy, Fasiludeen, Arshad Anil, Salauddin, Sk Md, Islam, Murat, Gonzalez, Juan, Ducey, Josh, Somrak, Maja, Mavroudis, Vasilios, Vergo, Eric, Qin, Juehang, Borbás, Benjámin, Chu, Eric, Lindsey, Jack, Radhakrishnan, Anil, Jallon, Antoine, McInnis, I. M. J., Kumar, Pawan, Goswami, Laxman Prasad, Bugas, Daniel, Heydari, Nasser, Jeanplong, Ferenc, Apronti, Archimedes, Galal, Abdallah, Ze-An, Ng, Singh, Ankit, Xavier, Joan of Arc, Agarwal, Kanu Priya, Berkani, Mohammed, Junior, Benedito Alves de Oliveira, Malishev, Dmitry, Remy, Nicolas, Hartman, Taylor D., Tarver, Tim, Mensah, Stephen, Gimenez, Javier, Montecillo, Roselynn Grace, Campbell, Russell, Sharma, Asankhaya, Meer, Khalida, Alapont, Xavier, Patil, Deepakkumar, Maheshwari, Rajat, Dendane, Abdelkader, Shukla, Priti, Bogdanov, Sergei, Möller, Sören, Siddiqi, Muhammad Rehan, Saxena, Prajvi, Gupta, Himanshu, Enyekwe, Innocent, P, Ragavendran V, EL-Wasif, Zienab, Maksapetyan, Aleksandr, Rossbach, Vivien, Harjadi, Chris, Bahaloohoreh, Mohsen, Bian, Song, Lai, John, Uro, Justine Leon, Bateman, Greg, Sayed, Mohamed, Menshawy, Ahmed, Duclosel, Darling, Jain, Yashaswini, Aaron, Ashley, Tiryakioglu, Murat, Siddh, Sheeshram, Krenek, Keith, Hoover, Alex, McGowan, Joseph, Patwardhan, Tejal, Yue, Summer, Wang, Alexandr, Hendrycks, Dan
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Designing a Dashboard for Transparency and Control of Conversational AI
Chen, Yida, Wu, Aoyu, DePodesta, Trevor, Yeh, Catherine, Li, Kenneth, Marin, Nicholas Castillo, Patel, Oam, Riecke, Jan, Raval, Shivam, Seow, Olivia, Wattenberg, Martin, Viégas, Fernanda
Conversational LLMs function as black box systems, leaving users guessing about why they see the output they do. This lack of transparency is potentially problematic, especially given concerns around bias and truthfulness. To address this issue, we present an end-to-end prototype-connecting interpretability techniques with user experience design-that seeks to make chatbots more transparent. We begin by showing evidence that a prominent open-source LLM has a "user model": examining the internal state of the system, we can extract data related to a user's age, gender, educational level, and socioeconomic status. Next, we describe the design of a dashboard that accompanies the chatbot interface, displaying this user model in real time. The dashboard can also be used to control the user model and the system's behavior. Finally, we discuss a study in which users conversed with the instrumented system. Our results suggest that users appreciate seeing internal states, which helped them expose biased behavior and increased their sense of control. Participants also made valuable suggestions that point to future directions for both design and machine learning research. The project page and video demo of our TalkTuner system are available at https://bit.ly/talktuner-project-page
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Li, Nathaniel, Pan, Alexander, Gopal, Anjali, Yue, Summer, Berrios, Daniel, Gatti, Alice, Li, Justin D., Dombrowski, Ann-Kathrin, Goel, Shashwat, Phan, Long, Mukobi, Gabriel, Helm-Burger, Nathan, Lababidi, Rassin, Justen, Lennart, Liu, Andrew B., Chen, Michael, Barrass, Isabelle, Zhang, Oliver, Zhu, Xiaoyuan, Tamirisa, Rishub, Bharathi, Bhrugu, Khoja, Adam, Zhao, Zhenqi, Herbert-Voss, Ariel, Breuer, Cort B., Marks, Samuel, Patel, Oam, Zou, Andy, Mazeika, Mantas, Wang, Zifan, Oswal, Palash, Lin, Weiran, Hunt, Adam A., Tienken-Harder, Justin, Shih, Kevin Y., Talley, Kemper, Guan, John, Kaplan, Russell, Steneker, Ian, Campbell, David, Jokubaitis, Brad, Levinson, Alex, Wang, Jean, Qian, William, Karmakar, Kallol Krishna, Basart, Steven, Fitz, Stephen, Levine, Mindy, Kumaraguru, Ponnurangam, Tupakula, Uday, Varadharajan, Vijay, Wang, Ruoyu, Shoshitaishvili, Yan, Ba, Jimmy, Esvelt, Kevin M., Wang, Alexandr, Hendrycks, Dan
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Casper, Stephen, Schulze, Lennart, Patel, Oam, Hadfield-Menell, Dylan
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness, however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them. LAT leverages the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. We use it to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness to novel attacks and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Li, Kenneth, Patel, Oam, Viégas, Fernanda, Pfister, Hanspeter, Wattenberg, Martin
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a trade-off between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.