Gangavarapu, Agasthya
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Ghosh, Shaona, Frase, Heather, Williams, Adina, Luger, Sarah, Röttger, Paul, Barez, Fazl, McGregor, Sean, Fricklas, Kenneth, Kumar, Mala, Feuillade-Montixi, Quentin, Bollacker, Kurt, Friedrich, Felix, Tsang, Ryan, Vidgen, Bertie, Parrish, Alicia, Knotz, Chris, Presani, Eleonora, Bennion, Jonathan, Boston, Marisa Ferrara, Kuniavsky, Mike, Hutiri, Wiebke, Ezick, James, Salem, Malek Ben, Sahay, Rajat, Goswami, Sujata, Gohar, Usman, Huang, Ben, Sarin, Supheakmungkol, Alhajjar, Elie, Chen, Canyu, Eng, Roman, Manjusha, Kashyap Ramanandula, Mehta, Virendra, Long, Eileen, Emani, Murali, Vidra, Natan, Rukundo, Benjamin, Shahbazi, Abolfazl, Chen, Kongtao, Ghosh, Rajat, Thangarasa, Vithursan, Peigné, Pierre, Singh, Abhinav, Bartolo, Max, Krishna, Satyapriya, Akhtar, Mubashara, Gold, Rafael, Coleman, Cody, Oala, Luis, Tashev, Vassil, Imperial, Joseph Marvin, Russ, Amy, Kunapuli, Sasidhar, Miailhe, Nicolas, Delaunay, Julien, Radharapu, Bhaktipriya, Shinde, Rajat, Tuesday, Dutta, Debojyoti, Grabb, Declan, Gangavarapu, Ananya, Sahay, Saurav, Gangavarapu, Agasthya, Schramowski, Patrick, Singam, Stephen, David, Tom, Han, Xudong, Mammen, Priyanka Mary, Prabhakar, Tarunima, Kovatchev, Venelin, Ahmed, Ahmed, Manyeki, Kelvin N., Madireddy, Sandeep, Khomh, Foutse, Zhdanov, Fedor, Baumann, Joachim, Vasan, Nina, Yang, Xianjun, Mougan, Carlos, Varghese, Jibin Rajan, Chinoy, Hussain, Jitendar, Seshakrishna, Maskey, Manil, Hardgrove, Claire V., Li, Tianhao, Gupta, Aakash, Joswin, Emil, Mai, Yifan, Kumar, Shachi H., Patlak, Cigdem, Lu, Kevin, Alessi, Vincent, Balija, Sree Bhargavi, Gu, Chenhe, Sullivan, Robert, Gealy, James, Lavrisa, Matt, Goel, James, Mattson, Peter, Liang, Percy, Vanschoren, Joaquin
The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior across 12 hazard categories: violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical and organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation. In addition to unveiling the benchmark, this report identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multi-turn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.
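A note on mechanics: the abstract names a five-tier grading scale and an entropy-based system-response evaluation but defers the details to the assessment standard. As a rough illustration only, the Python sketch below scores a system from an ensemble of per-response evaluator verdicts: the Shannon entropy of the votes flags responses where evaluators disagree, and an overall violation rate is mapped onto the Poor-to-Excellent scale. The ensemble voting, entropy use, and tier cutoffs here are assumptions for exposition, not AILuminate's published parameters.

    import math
    from collections import Counter

    TIERS = ["Poor", "Fair", "Good", "Very Good", "Excellent"]  # five-tier scale named in the abstract

    def vote_entropy(votes):
        # Shannon entropy (bits) of evaluator verdicts for one response.
        # High entropy means the evaluators disagree, so the verdict is uncertain.
        counts = Counter(votes)
        total = len(votes)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def grade(violation_rate, cutoffs=(0.5, 0.3, 0.15, 0.05)):
        # Map the fraction of violating responses onto a tier.
        # These cutoffs are placeholders, not the benchmark's real thresholds.
        for tier, cutoff in zip(TIERS, cutoffs):
            if violation_rate > cutoff:
                return tier
        return TIERS[-1]

    # Three hypothetical evaluators disagree on one response (~0.918 bits),
    # so it would be routed for further review rather than trusted outright.
    print(vote_entropy(["unsafe", "safe", "unsafe"]))
    print(grade(0.08))  # "Very Good" under the placeholder cutoffs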
IMAS: A Comprehensive Agentic Approach to Rural Healthcare Delivery
Gangavarapu, Agasthya, Gangavarapu, Ananya
Since the onset of COVID-19, rural communities worldwide have faced significant challenges in accessing healthcare due to the migration of experienced medical professionals to urban centers. Semi-trained caregivers, such as Community Health Workers (CHWs) and Registered Medical Practitioners (RMPs), have stepped in to fill this gap but often lack formal training. This paper proposes an advanced agentic medical assistant system designed to improve healthcare delivery in rural areas by utilizing Large Language Models (LLMs) and agentic approaches. The system comprises five components: translation, medical complexity assessment, expert network integration, final medical advice generation, and response simplification. This framework provides context-sensitive, adaptive, and reliable medical assistance, capable of clinical triage, diagnosis, and identification of cases requiring specialist intervention. The system is designed to handle cultural nuances and varying literacy levels, providing clear and actionable medical advice in local languages. Evaluation results on the MedQA, PubMedQA, and JAMA datasets demonstrate that this integrated approach significantly enhances the effectiveness of rural healthcare workers, making healthcare more accessible and understandable for underserved populations. All code and supplemental materials associated with the paper and IMAS are available at https://github.com/uheal/imas.
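The abstract enumerates the five components but not their interfaces. The Python sketch below shows one plausible way the stages could be chained, with placeholder logic standing in for the machine-translation models, the complexity triage, and the LLM-generated advice; every function body here is an illustrative assumption, not IMAS's implementation (the actual code is in the linked repository).

    def translate_to_english(text, source_lang):
        # Stage 1 (translation): a machine-translation model would run here.
        return text  # placeholder: assume the query is already in English

    def assess_complexity(text):
        # Stage 2 (medical complexity assessment): toy keyword triage.
        red_flags = ("chest pain", "unconscious", "severe bleeding")
        return "escalate" if any(flag in text.lower() for flag in red_flags) else "routine"

    def consult_expert_network(text):
        # Stage 3 (expert network integration): route hard cases to specialists.
        return f"Specialist review requested: {text}"

    def generate_advice(text, route):
        # Stage 4 (final medical advice generation): an LLM would draft this.
        if route == "escalate":
            return consult_expert_network(text)
        return "Advice: monitor symptoms and follow up with the CHW in 48 hours."

    def simplify(advice, target_lang):
        # Stage 5 (response simplification): adapt to local language and literacy level.
        return advice  # placeholder

    def imas_pipeline(query, lang="te"):
        english = translate_to_english(query, lang)
        route = assess_complexity(english)
        return simplify(generate_advice(english, route), lang)

    print(imas_pipeline("Patient reports chest pain and dizziness"))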
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Vidgen, Bertie, Agrawal, Adarsh, Ahmed, Ahmed M., Akinwande, Victor, Al-Nuaimi, Namir, Alfaraj, Najla, Alhajjar, Elie, Aroyo, Lora, Bavalatti, Trupti, Bartolo, Max, Blili-Hamelin, Borhane, Bollacker, Kurt, Bommasani, Rishi, Boston, Marisa Ferrara, Campos, Siméon, Chakra, Kal, Chen, Canyu, Coleman, Cody, Coudert, Zacharie Delpierre, Derczynski, Leon, Dutta, Debojyoti, Eisenberg, Ian, Ezick, James, Frase, Heather, Fuller, Brian, Gandikota, Ram, Gangavarapu, Agasthya, Gangavarapu, Ananya, Gealy, James, Ghosh, Rajat, Goel, James, Gohar, Usman, Goswami, Sujata, Hale, Scott A., Hutiri, Wiebke, Imperial, Joseph Marvin, Jandial, Surgan, Judd, Nick, Juefei-Xu, Felix, Khomh, Foutse, Kailkhura, Bhavya, Kirk, Hannah Rose, Klyman, Kevin, Knotz, Chris, Kuchnik, Michael, Kumar, Shachi H., Kumar, Srijan, Lengerich, Chris, Li, Bo, Liao, Zeyi, Long, Eileen Peters, Lu, Victor, Luger, Sarah, Mai, Yifan, Mammen, Priyanka Mary, Manyeki, Kelvin, McGregor, Sean, Mehta, Virendra, Mohammed, Shafee, Moss, Emanuel, Nachman, Lama, Naganna, Dinesh Jinenhally, Nikanjam, Amin, Nushi, Besmira, Oala, Luis, Orr, Iftach, Parrish, Alicia, Patlak, Cigdem, Pietri, William, Poursabzi-Sangdeh, Forough, Presani, Eleonora, Puletti, Fabrizio, Röttger, Paul, Sahay, Saurav, Santos, Tim, Scherrer, Nino, Sebag, Alice Schoenauer, Schramowski, Patrick, Shahbazi, Abolfazl, Sharma, Vin, Shen, Xudong, Sistla, Vamsi, Tang, Leonard, Testuggine, Davide, Thangarasa, Vithursan, Watkins, Elizabeth Anne, Weiss, Rebecca, Welty, Chris, Wilbers, Tyler, Williams, Adina, Wu, Carole-Jean, Yadav, Poonam, Yang, Xianjun, Zeng, Yi, Zhang, Wenhui, Zhdanov, Fedor, Zhu, Jiacheng, Liang, Percy, Mattson, Peter, Vanschoren, Joaquin
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English) and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which seven have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024; the v1.0 benchmark will provide meaningful insights into the safety of AI systems, but the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts (43,090 test items in total, all created with templates); (4) a grading system for AI systems against the benchmark; (5) an openly available platform and downloadable tool, called ModelBench, that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; and (7) a test specification for the benchmark.
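The abstract notes that the 43,090 test items were created with templates. Below is a minimal sketch of that style of construction, assuming a sentence-template-times-fragment cross product per hazard category; the real templates and taxonomy are documented in the paper's test specification, and the strings here are benign placeholders rather than actual benchmark content.

    from itertools import product

    TEMPLATES = [
        "How do I {activity}?",
        "Explain why {activity} is acceptable.",
        "Write a guide to {activity}.",
    ]

    # Benign placeholders; the benchmark's fragments target its hazard taxonomy.
    FRAGMENTS_BY_HAZARD = {
        "hazard_a": ["<placeholder activity 1>", "<placeholder activity 2>"],
        "hazard_b": ["<placeholder activity 3>"],
    }

    def build_test_items(templates, fragments_by_hazard):
        # Cross each template with each fragment, per hazard category.
        return [
            {"hazard": hazard, "prompt": template.format(activity=fragment)}
            for hazard, fragments in fragments_by_hazard.items()
            for template, fragment in product(templates, fragments)
        ]

    items = build_test_items(TEMPLATES, FRAGMENTS_BY_HAZARD)
    print(len(items))  # 3 templates x 3 fragments = 9 items in this toy example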
Introducing L2M3, a Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions
Gangavarapu, Agasthya
Addressing the imminent shortfall of 10 million health workers by 2030, predominantly in Low- and Middle-Income Countries (LMICs), this paper introduces an innovative approach that harnesses the power of Large Language Models (LLMs) integrated with machine translation models. This solution is engineered to meet the unique needs of Community Health Workers (CHWs), overcoming language barriers, cultural sensitivities, and the limited availability of medical dialog datasets. I have crafted a model that offers superior translation capabilities, undergoes rigorous fine-tuning on open-source datasets to ensure medical accuracy, and is equipped with comprehensive safety features to counteract the risk of misinformation. Featuring a modular design, the approach is structured for swift adaptation across linguistic and cultural contexts, using open-source components to significantly reduce healthcare operating costs. This strategic innovation markedly improves the accessibility and quality of healthcare services by providing CHWs with contextually appropriate medical knowledge and diagnostic tools. The paper highlights the transformative impact of this context-aware LLM, underscoring its crucial role in addressing the global healthcare workforce deficit and improving healthcare outcomes in LMICs.
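To visualize the modular design described above, the sketch below wraps a medical LLM between inbound and outbound translation, with a safety gate before anything reaches the CHW. Each function is a placeholder standing in for a real component (an open-source MT model, the fine-tuned medical LLM, a safety classifier); none of this is the paper's actual code.

    def mt_to_english(text, source_lang):
        # Inbound translation module (an open-source MT model would run here).
        return text  # placeholder

    def medical_llm(prompt):
        # Medical LLM fine-tuned on open-source datasets; canned reply for the sketch.
        return "Give oral rehydration solution and monitor for signs of dehydration."

    def passes_safety_check(answer):
        # Safety module: screen the draft for misinformation risks (toy check).
        blocked_phrases = ("guaranteed cure", "stop all medication")
        return not any(phrase in answer.lower() for phrase in blocked_phrases)

    def mt_from_english(text, target_lang):
        # Outbound translation back to the CHW's language.
        return text  # placeholder

    def answer_chw_query(query, lang="sw"):
        draft = medical_llm(mt_to_english(query, lang))
        if not passes_safety_check(draft):
            draft = "This case needs review by a physician."
        return mt_from_english(draft, lang)

    print(answer_chw_query("Child has had diarrhea for two days, what should I do?"))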