Cai, Will
Improving LLM Safety Alignment with Dual-Objective Optimization
Zhao, Xuandong, Cai, Will, Shi, Tianneng, Huang, David, Lin, Licong, Mei, Song, Song, Dawn
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts, as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment method that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks, across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts during training and with the internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment
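For context on the gradient-based analysis mentioned in the abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the paper's proposed method; the function name `dpo_loss`, its argument names, and the default `beta` are assumptions chosen for illustration. It shows that one shared scalar weight scales the gradient on both the chosen (refusal) and rejected (harmful) responses, which is the coupling a dual-objective approach would separate.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and a frozen
    reference model. Returns (loss, grad_weight), where grad_weight is
    sigmoid(-margin): the single factor that multiplies the gradient of
    both the chosen and the rejected log-probability.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # loss = -log(sigmoid(margin)) = softplus(-margin), computed stably.
    if margin >= 0:
        e = math.exp(-margin)
        loss = math.log1p(e)
        grad_weight = e / (1.0 + e)
    else:
        e = math.exp(margin)
        loss = -margin + math.log1p(e)
        grad_weight = 1.0 / (1.0 + e)
    return loss, grad_weight


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    loss, weight = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.1)
    print(f"loss={loss:.4f} shared gradient weight={weight:.4f}")
```

Because the weight shrinks as the margin grows, the push toward the chosen response and the push away from the rejected response fade together in this formulation.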
Humanity's Last Exam
Phan, Long, Gatti, Alice, Han, Ziwen, Li, Nathaniel, Hu, Josephina, Zhang, Hugh, Zhang, Chen Bo Calvin, Shaaban, Mohamed, Ling, John, Shi, Sean, Choi, Michael, Agrawal, Anish, Chopra, Arnav, Khoja, Adam, Kim, Ryan, Ren, Richard, Hausenloy, Jason, Zhang, Oliver, Mazeika, Mantas, Nguyen, Tung, Anderson, Daron, Shah, Imad Ali, Doroshenko, Mikhail, Stokes, Alun Cennyth, Mahmood, Mobeen, Lee, Jaeho, Pokutnyi, Oleksandr, Iskra, Oleg, Wang, Jessica P., Gerbicz, Robert, Levin, John-Clark, Popov, Serguei, Feng, Fiona, Feng, Steven Y., Zhao, Haoran, Yu, Michael, Gangal, Varun, Zou, Chelsea, Wang, Zihan, Kazakov, Mstyslav, Galgon, Geoff, Schmitt, Johannes, Sanchez, Alvaro, Lee, Yongki, Yeadon, Will, Sauers, Scott, Roth, Marc, Agu, Chidozie, Riis, Søren, Giska, Fabian, Utpala, Saiteja, Cheatom, Antrell, Giboney, Zachary, Goshu, Gashaw M., Crowson, Sarah-Jane, Naiya, Mohinder Maheshbhai, Burns, Noah, Finke, Lennart, Cheng, Zerui, Park, Hyunwoo, Fournier-Facio, Francesco, Zampese, Jennifer, Wydallis, John, Wydallis, John B., Hoerr, Ryan G., Nandor, Mark, Gehrunger, Tim, Cai, Jiaqi, McCarty, Ben, Nam, Jungbae, Taylor, Edwin, Jin, Jun, Loume, Gautier Abou, Cao, Hangrui, Garretson, Alexis C, Sileo, Damien, Ren, Qiuyu, Cojoc, Doru, Arkhipov, Pavel, Qazi, Usman, Bacho, Aras, Li, Lianghui, Motwani, Sumeet, de Witt, Christian Schroeder, Kopylov, Alexei, Veith, Johannes, Singer, Eric, Rissone, Paolo, Jin, Jaehyeok, Shi, Jack Wei Lun, Willcocks, Chris G., Prabhu, Ameya, Tang, Longke, Zhou, Kevin, Santos, Emily de Oliveira, Maksimov, Andrey Pupasov, Vendrow, Edward, Zenitani, Kengo, Robinson, Joshua, Mikov, Aleksandar, Guillod, Julien, Li, Yuqi, Pageler, Ben, Vendrow, Joshua, Kuchkin, Vladyslav, Marion, Pierre, Efremov, Denis, Lynch, Jayson, Liang, Kaiqu, Gritsevskiy, Andrew, Martinez, Dakotah, Crispino, Nick, Zvonkine, Dimitri, Fraga, Natanael Wildner, Soori, Saeed, Press, Ori, Tang, Henry, Salazar, Julian, Green, Sean R., Brüssel, Lina, Twayana, Moon, Dieuleveut, Aymeric, Rogers, T. 
Ryan, Zhang, Wenjin, Finocchio, Ross, Li, Bikun, Yang, Jinzhou, Rao, Arun, Loiseau, Gabriel, Kalinin, Mikhail, Lukas, Marco, Manolescu, Ciprian, Stambaugh, Nate, Mishra, Subrata, Kamdoum, Ariel Ghislain Kemogne, Hogg, Tad, Jin, Alvin, Bosio, Carlo, Sun, Gongbo, Coppola, Brian P, Heidinger, Haline, Sayous, Rafael, Ivanov, Stefan, Cavanagh, Joseph M, Shen, Jiawei, Imperial, Joseph Marvin, Schwaller, Philippe, Senthilkuma, Shaipranesh, Bran, Andres M, Algaba, Andres, Verbeken, Brecht, Houte, Kelsey Van den, Van Der Sypt, Lynn, Noever, David, Schut, Lisa, Sucholutsky, Ilia, Zheltonozhskii, Evgenii, Yuan, Qiaochu, Lim, Derek, Stanley, Richard, Sivarajan, Shankar, Yang, Tong, Maar, John, Wykowski, Julian, Oller, Martí, Sandlin, Jennifer, Sahu, Anmol, Ardito, Cesare Giulio, Hu, Yuzheng, Dias, Felipe Meneguitti, Kreiman, Tobias, Rawal, Kaivalya, Vilchis, Tobias Garcia, Zu, Yuexuan, Lackner, Martin, Koppel, James, Nguyen, Jeremy, Antonenko, Daniil S., Chern, Steffi, Zhao, Bingchen, Arsene, Pierrot, Ivanov, Sergey, Poświata, Rafał, Wang, Chenguang, Li, Daofeng, Crisostomi, Donato, Dehghan, Ali, Achilleos, Andrea, Ambay, John Arnold, Myklebust, Benjamin, Sen, Archan, Perrella, David, Kaparov, Nurdin, Inlow, Mark H, Zang, Allen, Ramakrishnan, Kalyan, Orel, Daniil, Poritski, Vladislav, Ben-David, Shalev, Berger, Zachary, Whitfill, Parker, Foster, Michael, Munro, Daniel, Ho, Linh, Hava, Dan Bar, Kuchkin, Aleksey, Lauff, Robert, Holmes, David, Sommerhage, Frank, Zhang, Anji, Moat, Richard, Schneider, Keith, Pyda, Daniel, Kazibwe, Zakayo, Singh, Mukhwinder, Clarke, Don, Kim, Dae Hyun, Fish, Sara, Elser, Veit, Vilchis, Victor Efren Guadarrama, Klose, Immo, Demian, Christoph, Anantheswaran, Ujjwala, Zweiger, Adam, Albani, Guglielmo, Li, Jeffery, Daans, Nicolas, Radionov, Maksim, Rozhoň, Václav, Ginis, Vincent, Ma, Ziqiao, Stump, Christian, Platnick, Jacob, Nevirkovets, Volodymyr, Basler, Luke, Piccardo, Marco, Cohen, Niv, Singh, Virendra, Tkadlec, Josef, Rosu, Paul, Goldfarb, Alan, 
Padlewski, Piotr, Barzowski, Stanislaw, Montgomery, Kyle, Menezes, Aline, Patel, Arkil, Wang, Zixuan, Tucker-Foltz, Jamie, Stade, Jack, Grabb, Declan, Goertzen, Tom, Kazemi, Fereshteh, Milbauer, Jeremiah, Shukla, Abhishek, Elgnainy, Hossam, Labrador, Yan Carlos Leyva, He, Hao, Zhang, Ling, Givré, Alan, Wolff, Hew, Demir, Gözdenur, Aziz, Muhammad Fayez, Kaddar, Younesse, Ängquist, Ivar, Chen, Yanxu, Thornley, Elliott, Zhang, Robin, Pan, Jiayi, Terpin, Antonio, Muennighoff, Niklas, Schoelkopf, Hailey, Zheng, Eric, Carmi, Avishy, Shah, Jainam, Brown, Ethan D. L., Zhu, Kelin, Bartolo, Max, Wheeler, Richard, Ho, Andrew, Barkan, Shaul, Wang, Jiaqi, Stehberger, Martin, Kretov, Egor, Bradshaw, Peter, Heimonen, JP, Sridhar, Kaustubh, Hossain, Zaki, Akov, Ido, Makarychev, Yury, Tam, Joanna, Hoang, Hieu, Cunningham, David M., Goryachev, Vladimir, Patramanis, Demosthenes, Krause, Michael, Redenti, Andrew, Aldous, David, Lai, Jesyin, Coleman, Shannon, Xu, Jiangnan, Lee, Sangwon, Magoulas, Ilias, Zhao, Sandy, Tang, Ning, Cohen, Michael K., Carroll, Micah, Paradise, Orr, Kirchner, Jan Hendrik, Steinerberger, Stefan, Ovchynnikov, Maksym, Matos, Jason O., Shenoy, Adithya, Wang, Michael, Nie, Yuzhou, Giordano, Paolo, Petersen, Philipp, Sztyber-Betley, Anna, Faraboschi, Paolo, Riblet, Robin, Crozier, Jonathan, Halasyamani, Shiv, Pinto, Antonella, Verma, Shreyas, Joshi, Prashant, Meril, Eli, Yong, Zheng-Xin, Tee, Allison, Andréoletti, Jérémy, Weller, Orion, Singhal, Raghav, Zhang, Gang, Ivanov, Alexander, Khoury, Seri, Gustafsson, Nils, Mostaghimi, Hamid, Thaman, Kunvar, Chen, Qijia, Khánh, Tran Quoc, Loader, Jacob, Cavalleri, Stefano, Szlyk, Hannah, Brown, Zachary, Narayan, Himanshu, Roberts, Jonathan, Alley, William, Sun, Kunyang, Stendall, Ryan, Lamparth, Max, Reuel, Anka, Wang, Ting, Xu, Hanmeng, Hernández-Cámara, Pablo, Martin, Freddie, Preu, Thomas, Korbak, Tomek, Abramovitch, Marcus, Williamson, Dominic, Bosio, Ida, Chen, Ziye, Bálint, Biró, Lo, Eve J. 
Y., Nunes, Maria Inês S., Jiang, Yibo, Bari, M Saiful, Kassani, Peyman, Wang, Zihao, Ansarinejad, Behzad, Sun, Yewen, Durand, Stephane, Douville, Guillaume, Tordera, Daniel, Balabanian, George, Anderson, Earth, Kvistad, Lynna, Moyano, Alejandro José, Milliron, Hsiaoyun, Sakor, Ahmad, Eron, Murat, McAlister, Isaac C., O., Andrew Favre D., Shah, Shailesh, Zhou, Xiaoxiang, Kamalov, Firuz, Clark, Ronald, Abdoli, Sherwin, Santens, Tim, Wang, Harrison K, Chen, Evan, Tomasiello, Alessandro, De Luca, G. Bruno, Looi, Shi-Zhuo, Le, Vinh-Kha, Kolt, Noam, Mündler, Niels, Semler, Avi, Rodman, Emma, Drori, Jacob, Fossum, Carl J, Gloor, Luk, Jagota, Milind, Pradeep, Ronak, Fan, Honglu, Shah, Tej, Eicher, Jonathan, Chen, Michael, Thaman, Kushal, Merrill, William, Firsching, Moritz, Harris, Carter, Ciobâcă, Stefan, Gross, Jason, Pandey, Rohan, Gusev, Ilya, Jones, Adam, Agnihotri, Shashank, Zhelnov, Pavel, Usawasutsakorn, Siranut, Mofayezi, Mohammadreza, Piperski, Alexander, Carauleanu, Marc, Zhang, David K., Dobarskyi, Kostiantyn, Ler, Dylan, Leventov, Roman, Soroko, Ignat, Jansen, Thorben, Creighton, Scott, Lauer, Pascal, Duersch, Joshua, Taamazyan, Vage, Bezzi, Dario, Morak, Wiktor, Ma, Wenjie, Held, William, Huy, Tran Đuc, Xian, Ruicheng, Zebaze, Armel Randy, Mohamed, Mohanad, Leser, Julian Noah, Yuan, Michelle X, Yacar, Laila, Lengler, Johannes, Olszewska, Katarzyna, Shahrtash, Hossein, Oliveira, Edson, Jackson, Joseph W., Gonzalez, Daniel Espinosa, Zou, Andy, Chidambaram, Muthu, Manik, Timothy, Haffenden, Hector, Stander, Dashiell, Dasouqi, Ali, Shen, Alexander, Duc, Emilien, Golshani, Bita, Stap, David, Uzhou, Mikalai, Zhidkovskaya, Alina Borisovna, Lewark, Lukas, Rodriguez, Miguel Orbegozo, Vincze, Mátyás, Wehr, Dustin, Tang, Colin, Phillips, Shaun, Samuele, Fortuna, Muzhen, Jiang, Ekström, Fredrik, Hammon, Angela, Patel, Oam, Farhidi, Faraz, Medley, George, Mohammadzadeh, Forough, Peñaflor, Madellene, Kassahun, Haile, Friedrich, Alena, Sparrow, Claire, Perez, Rayner 
Hernandez, Sakal, Taom, Dhamane, Omkar, Mirabadi, Ali Khajegili, Hallman, Eric, Okutsu, Kenchi, Battaglia, Mike, Maghsoudimehrabani, Mohammad, Amit, Alon, Hulbert, Dave, Pereira, Roberto, Weber, Simon, Handoko, null, Peristyy, Anton, Malina, Stephen, Albanie, Samuel, Cai, Will, Mehkary, Mustafa, Aly, Rami, Reidegeld, Frank, Dick, Anna-Katharina, Friday, Cary, Sidhu, Jasdeep, Shapourian, Hassan, Kim, Wanyoung, Costa, Mariana, Gurdogan, Hubeyb, Weber, Brian, Kumar, Harsh, Jiang, Tong, Agarwal, Arunim, Ceconello, Chiara, Vaz, Warren S., Zhuang, Chao, Park, Haon, Tawfeek, Andrew R., Aggarwal, Daattavya, Kirchhof, Michael, Dai, Linjie, Kim, Evan, Ferret, Johan, Wang, Yuzhou, Yan, Minghao, Burdzy, Krzysztof, Zhang, Lixin, Franca, Antonio, Pham, Diana T., Loh, Kang Yong, Robinson, Joshua, Jackson, Abram, Gul, Shreen, Chhablani, Gunjan, Du, Zhehang, Cosma, Adrian, Colino, Jesus, White, Colin, Votava, Jacob, Vinnikov, Vladimir, Delaney, Ethan, Spelda, Petr, Stritecky, Vit, Shahid, Syed M., Mourrat, Jean-Christophe, Vetoshkin, Lavr, Sponselee, Koen, Bacho, Renas, de la Rosa, Florencia, Li, Xiuyu, Malod, Guillaume, Lang, Leon, Laurendeau, Julien, Kazakov, Dmitry, Adesanya, Fatimah, Portier, Julien, Hollom, Lawrence, Souza, Victor, Zhou, Yuchen Anna, Degorre, Julien, Yalın, Yiğit, Obikoya, Gbenga Daniel, Arnaboldi, Luca, Rai, null, Bigi, Filippo, Boscá, M. C., Shumar, Oleg, Bacho, Kaniuar, Clavier, Pierre, Recchia, Gabriel, Popescu, Mara, Shulga, Nikita, Tanwie, Ngefor Mildred, Peskoff, Denis, Lux, Thomas C. 
H., Rank, Ben, Ni, Colin, Brooks, Matthew, Yakimchyk, Alesia, Huanxu, null, Liu, null, Häggström, Olle, Verkama, Emil, Gundlach, Hans, Brito-Santana, Leonor, Amaro, Brian, Vajipey, Vivek, Grover, Rynaa, Fan, Yiyang, Silva, Gabriel Poesia Reis e, Xin, Linwei, Kratish, Yosi, Łucki, Jakub, Li, Wen-Ding, Gopi, Sivakanth, Caciolai, Andrea, Xu, Justin, Scaria, Kevin Joseph, Vargus, Freddie, Habibi, Farzad, Long, null, Lian, null, Rodolà, Emanuele, Robins, Jules, Cheng, Vincent, Fruhauff, Tony, Raynor, Brad, Qi, Hao, Jiang, Xi, Segev, Ben, Fan, Jingxuan, Martinson, Sarah, Wang, Erik Y., Hausknecht, Kaylie, Brenner, Michael P., Mao, Mao, Zhang, Xinyu, Avagian, David, Scipio, Eshawn Jessica, Ragoler, Alon, Tan, Justin, Sims, Blake, Plecnik, Rebeka, Kirtland, Aaron, Bodur, Omer Faruk, Shinde, D. P., Adoul, Zahra, Zekry, Mohamed, Karakoc, Ali, Santos, Tania C. B., Shamseldeen, Samir, Karim, Loukmane, Liakhovitskaia, Anna, Resman, Nate, Farina, Nicholas, Gonzalez, Juan Carlos, Maayan, Gabe, Hoback, Sarah, Pena, Rodrigo De Oliveira, Sherman, Glen, Kelley, Elizabeth, Mariji, Hodjat, Pouriamanesh, Rasoul, Wu, Wentao, Mendoza, Sandra, Alarab, Ismail, Cole, Joshua, Ferreira, Danyelle, Johnson, Bryan, Safdari, Mohammad, Dai, Liangti, Arthornthurasuk, Siriphan, Pronin, Alexey, Fan, Jing, Ramirez-Trinidad, Angel, Cartwright, Ashley, Pottmaier, Daphiny, Taheri, Omid, Outevsky, David, Stepanic, Stanley, Perry, Samuel, Askew, Luke, Rodríguez, Raúl Adrián Huerta, Minissi, Ali M. R., Ali, Sam, Lorena, Ricardo, Iyer, Krishnamurthy, Fasiludeen, Arshad Anil, Salauddin, Sk Md, Islam, Murat, Gonzalez, Juan, Ducey, Josh, Somrak, Maja, Mavroudis, Vasilios, Vergo, Eric, Qin, Juehang, Borbás, Benjámin, Chu, Eric, Lindsey, Jack, Radhakrishnan, Anil, Jallon, Antoine, McInnis, I. M. 
J., Kumar, Pawan, Goswami, Laxman Prasad, Bugas, Daniel, Heydari, Nasser, Jeanplong, Ferenc, Apronti, Archimedes, Galal, Abdallah, Ze-An, Ng, Singh, Ankit, Xavier, Joan of Arc, Agarwal, Kanu Priya, Berkani, Mohammed, Junior, Benedito Alves de Oliveira, Malishev, Dmitry, Remy, Nicolas, Hartman, Taylor D., Tarver, Tim, Mensah, Stephen, Gimenez, Javier, Montecillo, Roselynn Grace, Campbell, Russell, Sharma, Asankhaya, Meer, Khalida, Alapont, Xavier, Patil, Deepakkumar, Maheshwari, Rajat, Dendane, Abdelkader, Shukla, Priti, Bogdanov, Sergei, Möller, Sören, Siddiqi, Muhammad Rehan, Saxena, Prajvi, Gupta, Himanshu, Enyekwe, Innocent, P, Ragavendran V, EL-Wasif, Zienab, Maksapetyan, Aleksandr, Rossbach, Vivien, Harjadi, Chris, Bahaloohoreh, Mohsen, Bian, Song, Lai, John, Uro, Justine Leon, Bateman, Greg, Sayed, Mohamed, Menshawy, Ahmed, Duclosel, Darling, Jain, Yashaswini, Aaron, Ashley, Tiryakioglu, Murat, Siddh, Sheeshram, Krenek, Keith, Hoover, Alex, McGowan, Joseph, Patwardhan, Tejal, Yue, Summer, Wang, Alexandr, Hendrycks, Dan
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To ground research and policymaking in a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.