Vasan, Nina
Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Lamparth, Max, Grabb, Declan, Franks, Amy, Gershan, Scott, Kunstman, Kaitlyn N., Lulla, Aaron, Roots, Monika Drummond, Sharma, Manu, Shrivastava, Aryan, Vasan, Nina, Waickman, Colleen
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This dataset, created without any LM assistance, is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all 203 base questions with five answer options each have had the decision-irrelevant demographic patient information removed and replaced with variables (e.g., AGE), and are available for male, female, or non-binary-coded patients. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating eleven off-the-shelf and four mental health fine-tuned LMs on category-specific task accuracy, on the impact of patient demographic information on decision-making, and on how consistently free-form responses deviate from human-annotated samples.
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Ghosh, Shaona, Frase, Heather, Williams, Adina, Luger, Sarah, Röttger, Paul, Barez, Fazl, McGregor, Sean, Fricklas, Kenneth, Kumar, Mala, Feuillade--Montixi, Quentin, Bollacker, Kurt, Friedrich, Felix, Tsang, Ryan, Vidgen, Bertie, Parrish, Alicia, Knotz, Chris, Presani, Eleonora, Bennion, Jonathan, Boston, Marisa Ferrara, Kuniavsky, Mike, Hutiri, Wiebke, Ezick, James, Salem, Malek Ben, Sahay, Rajat, Goswami, Sujata, Gohar, Usman, Huang, Ben, Sarin, Supheakmungkol, Alhajjar, Elie, Chen, Canyu, Eng, Roman, Manjusha, Kashyap Ramanandula, Mehta, Virendra, Long, Eileen, Emani, Murali, Vidra, Natan, Rukundo, Benjamin, Shahbazi, Abolfazl, Chen, Kongtao, Ghosh, Rajat, Thangarasa, Vithursan, Peigné, Pierre, Singh, Abhinav, Bartolo, Max, Krishna, Satyapriya, Akhtar, Mubashara, Gold, Rafael, Coleman, Cody, Oala, Luis, Tashev, Vassil, Imperial, Joseph Marvin, Russ, Amy, Kunapuli, Sasidhar, Miailhe, Nicolas, Delaunay, Julien, Radharapu, Bhaktipriya, Shinde, Rajat, Tuesday, Dutta, Debojyoti, Grabb, Declan, Gangavarapu, Ananya, Sahay, Saurav, Gangavarapu, Agasthya, Schramowski, Patrick, Singam, Stephen, David, Tom, Han, Xudong, Mammen, Priyanka Mary, Prabhakar, Tarunima, Kovatchev, Venelin, Ahmed, Ahmed, Manyeki, Kelvin N., Madireddy, Sandeep, Khomh, Foutse, Zhdanov, Fedor, Baumann, Joachim, Vasan, Nina, Yang, Xianjun, Mougn, Carlos, Varghese, Jibin Rajan, Chinoy, Hussain, Jitendar, Seshakrishna, Maskey, Manil, Hardgrove, Claire V., Li, Tianhao, Gupta, Aakash, Joswin, Emil, Mai, Yifan, Kumar, Shachi H, Patlak, Cigdem, Lu, Kevin, Alessi, Vincent, Balija, Sree Bhargavi, Gu, Chenhe, Sullivan, Robert, Gealy, James, Lavrisa, Matt, Goel, James, Mattson, Peter, Liang, Percy, Vanschoren, Joaquin
The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories, including violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical as well as organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation. In addition to unveiling the benchmark, this report also identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.
Risks from Language Models for Automated Mental Healthcare: Ethics and Structure for Implementation
Grabb, Declan, Lamparth, Max, Vasan, Nina
Amidst the growing interest in developing task-autonomous AI for automated mental health care, this paper addresses the ethical and practical challenges associated with the issue and proposes a structured framework that delineates levels of autonomy, outlines ethical requirements, and defines beneficial default behaviors for AI agents in the context of mental health support. We also evaluate ten state-of-the-art language models using 16 mental health-related questions designed to reflect various mental health conditions, such as psychosis, mania, depression, suicidal thoughts, and homicidal tendencies. The question design and response evaluations were conducted by mental health clinicians (M.D.s). We find that existing language models are insufficient to match the standard provided by human professionals who can navigate nuances and appreciate context. This is due to a range of issues, including overly cautious or sycophantic responses and the absence of necessary safeguards. Alarmingly, we find that most of the tested models could cause harm if accessed in mental health emergencies, failing to protect users and potentially exacerbating existing symptoms. We explore solutions to enhance the safety of current models. Before the release of increasingly task-autonomous AI systems in mental health, it is crucial to ensure that these models can reliably detect and manage symptoms of common psychiatric disorders to prevent harm to users. This involves aligning with the ethical framework and default behaviors outlined in our study. We contend that model developers are responsible for refining their systems per these guidelines to safeguard against the risks posed by current AI technologies to user mental health and safety. Trigger warning: Contains and discusses examples of sensitive mental health topics, including suicide and self-harm.