Huang, Ben
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Ghosh, Shaona, Frase, Heather, Williams, Adina, Luger, Sarah, Röttger, Paul, Barez, Fazl, McGregor, Sean, Fricklas, Kenneth, Kumar, Mala, Feuillade-Montixi, Quentin, Bollacker, Kurt, Friedrich, Felix, Tsang, Ryan, Vidgen, Bertie, Parrish, Alicia, Knotz, Chris, Presani, Eleonora, Bennion, Jonathan, Boston, Marisa Ferrara, Kuniavsky, Mike, Hutiri, Wiebke, Ezick, James, Salem, Malek Ben, Sahay, Rajat, Goswami, Sujata, Gohar, Usman, Huang, Ben, Sarin, Supheakmungkol, Alhajjar, Elie, Chen, Canyu, Eng, Roman, Manjusha, Kashyap Ramanandula, Mehta, Virendra, Long, Eileen, Emani, Murali, Vidra, Natan, Rukundo, Benjamin, Shahbazi, Abolfazl, Chen, Kongtao, Ghosh, Rajat, Thangarasa, Vithursan, Peigné, Pierre, Singh, Abhinav, Bartolo, Max, Krishna, Satyapriya, Akhtar, Mubashara, Gold, Rafael, Coleman, Cody, Oala, Luis, Tashev, Vassil, Imperial, Joseph Marvin, Russ, Amy, Kunapuli, Sasidhar, Miailhe, Nicolas, Delaunay, Julien, Radharapu, Bhaktipriya, Shinde, Rajat, Tuesday, Dutta, Debojyoti, Grabb, Declan, Gangavarapu, Ananya, Sahay, Saurav, Gangavarapu, Agasthya, Schramowski, Patrick, Singam, Stephen, David, Tom, Han, Xudong, Mammen, Priyanka Mary, Prabhakar, Tarunima, Kovatchev, Venelin, Ahmed, Ahmed, Manyeki, Kelvin N., Madireddy, Sandeep, Khomh, Foutse, Zhdanov, Fedor, Baumann, Joachim, Vasan, Nina, Yang, Xianjun, Mougn, Carlos, Varghese, Jibin Rajan, Chinoy, Hussain, Jitendar, Seshakrishna, Maskey, Manil, Hardgrove, Claire V., Li, Tianhao, Gupta, Aakash, Joswin, Emil, Mai, Yifan, Kumar, Shachi H, Patlak, Cigdem, Lu, Kevin, Alessi, Vincent, Balija, Sree Bhargavi, Gu, Chenhe, Sullivan, Robert, Gealy, James, Lavrisa, Matt, Goel, James, Mattson, Peter, Liang, Percy, Vanschoren, Joaquin
The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior across 12 hazard categories: violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical and organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation. Beyond unveiling the benchmark, this report identifies limitations of our method and of safety benchmarking generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multi-turn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.
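The abstract does not detail the entropy computation. As a minimal sketch, assuming the evaluation aggregates verdicts from an ensemble of safety evaluators, Shannon entropy over the verdict distribution can flag responses whose grading is uncertain; the function name and the safe/unsafe label set below are illustrative assumptions, not taken from the paper:

```python
import math
from collections import Counter

def verdict_entropy(verdicts):
    """Shannon entropy (in bits) of an ensemble of safety verdicts.

    Hypothetical sketch: assumes each evaluator in an ensemble labels a
    system response (e.g., 'safe' / 'unsafe'); high entropy marks
    responses where the evaluators disagree and grading is uncertain.
    """
    total = len(verdicts)
    counts = Counter(verdicts)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# A unanimous ensemble has entropy 0; an evenly split one has entropy 1.
print(verdict_entropy(["safe", "safe", "safe"]))              # 0.0
print(verdict_entropy(["safe", "unsafe", "safe", "unsafe"]))  # 1.0
```

Under this reading, thresholding the entropy is one plausible way to route low-confidence gradings to further review rather than trusting a single evaluator verdict.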
Results of the Big ANN: NeurIPS'23 competition
Simhadri, Harsha Vardhan, Aumüller, Martin, Ingber, Amir, Douze, Matthijs, Williams, George, Manohar, Magdalen Dobson, Baranchuk, Dmitry, Liberty, Edo, Liu, Frank, Landrum, Ben, Karjikar, Mazin, Dhulipala, Laxman, Chen, Meng, Chen, Yue, Ma, Rui, Zhang, Kai, Cai, Yuzheng, Shi, Jiayang, Chen, Yizhuo, Zheng, Weiguo, Wan, Zihao, Yin, Jie, Huang, Ben
The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state of the art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search (Simhadri et al., 2021), this competition addressed filtered, out-of-distribution, sparse, and streaming variants of ANN search. Participants developed and submitted innovative solutions that were evaluated on new standard datasets under constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into current advancements and future directions in the field of approximate nearest neighbor search.
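For context, benchmarks of this kind typically score a submission by its recall against exact ground-truth neighbors at a fixed query-throughput budget. The sketch below illustrates the standard recall@k metric; the array names and toy data are our own, not the competition harness:

```python
import numpy as np

def recall_at_k(approx_ids, true_ids, k):
    """Mean recall@k over a query set.

    approx_ids, true_ids: (n_queries, >=k) integer arrays holding the
    neighbor ids returned by the index under test and the exact
    ground-truth neighbors, respectively.
    """
    hits = [len(set(a[:k]) & set(t[:k])) for a, t in zip(approx_ids, true_ids)]
    return float(np.mean(hits)) / k

# Toy check: the approximate index misses one true neighbor on query 0.
truth  = np.array([[0, 1, 2], [3, 4, 5]])
approx = np.array([[0, 2, 9], [3, 4, 5]])
print(recall_at_k(approx, truth, k=3))  # ~0.833
```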