AITopics | Singh, Shrutika

Singh, Shrutika

It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

Singh, Shrutika, Alyakin, Anton, Alber, Daniel Alexander, Stryker, Jaden, Tong, Ai Phuong S, Sangwon, Karl, Goff, Nicolas, de la Paz, Mathew, Hernandez-Rovira, Miguel, Park, Ki Yun, Leuthardt, Eric Claude, Oermann, Eric Karl

arXiv.org Artificial Intelligence, Mar-13-2025

The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10^-5), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings of medical MCQ benchmarks, which overestimate the capabilities of LLMs in medicine, and, more broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
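The masking study described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the stem was masked or how many answer options each question had, so everything here is an assumption made for illustration: word-level random masking, the `[MASK]` token, and the hypothetical helper names `mask_stem` and `margin_over_chance`. It is not the authors' implementation.

```python
import random


def mask_stem(stem: str, fraction: float, mask_token: str = "[MASK]", seed: int = 0) -> str:
    """Replace about `fraction` of the whitespace-separated words in `stem`.

    Assumption: masking is word-level and uniformly random; the paper may
    have masked the stem differently.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    words = stem.split()
    n_mask = round(fraction * len(words))
    rng = random.Random(seed)  # fixed seed so the masking is reproducible
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    return " ".join(mask_token if i in masked_idx else w for i, w in enumerate(words))


def margin_over_chance(accuracy: float, n_options: int) -> float:
    """Accuracy minus the expected accuracy of uniform random guessing."""
    if n_options < 1:
        raise ValueError("n_options must be at least 1")
    return accuracy - 1.0 / n_options


if __name__ == "__main__":
    stem = "A patient presents with sudden severe headache and neck stiffness"
    for fraction in (0.0, 0.5, 1.0):
        print(f"{fraction:.0%}: {mask_stem(stem, fraction)}")
```

At full masking the model sees only the answer options, so any accuracy above `1 / n_options` cannot come from the question stem itself; `margin_over_chance` expresses that gap for a chosen number of options.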

large language model, machine learning, natural language, (18 more...)

2503.13508

Country: North America > United States > New York (0.16)

Genre: Research Report > New Finding (0.88)

Industry:

Health & Medicine > Therapeutic Area (0.69)
Education > Educational Setting > Higher Education (0.50)
Health & Medicine > Diagnostic Medicine > Imaging (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.50)

Repurposing the scientific literature with vision-language models

Alyakin, Anton, Stryker, Jaden, Alber, Daniel Alexander, Sangwon, Karl L., Duderstadt, Brandon, Save, Akshay, Kurland, David, Frome, Spencer, Singh, Shrutika, Zhang, Jeff, Yang, Eunice, Park, Ki Yun, Orillac, Cordelia, Valliani, Aly A., Neifert, Sean, Liu, Albert, Patel, Aneek, Livia, Christopher, Lau, Darryl, Laufer, Ilya, Rozman, Peter A., Hidalgo, Eveline Teresa, Riina, Howard, Feng, Rui, Hollon, Todd, Aphinyanaphongs, Yindalon, Golfinos, John G., Snyder, Laura, Leuthardt, Eric, Kondziolka, Douglas, Oermann, Eric Karl

arXiv.org Artificial Intelligence, Feb-26-2025

Research in AI for Science often focuses on using AI technologies to augment components of the scientific process, or in some cases, the entire scientific method; how about AI for scientific publications? Peer-reviewed journals are foundational repositories of specialized knowledge, written in discipline-specific language that differs from general Internet content used to train most large language models (LLMs) and vision-language models (VLMs). We hypothesized that by combining a family of scientific journals with generative AI models, we could invent novel tools for scientific communication, education, and clinical care. We converted 23,000 articles from Neurosurgery Publications into a multimodal database - NeuroPubs - of 134 million words and 78,000 image-caption pairs to develop six datasets for building AI models. We showed that the content of NeuroPubs uniquely represents neurosurgery-specific clinical contexts compared with broader datasets and PubMed. For publishing, we employed generalist VLMs to automatically generate graphical abstracts from articles. Editorial board members rated 70% of these as ready for publication without further edits. For education, we generated 89,587 test questions in the style of the ABNS written board exam, which trainee and faculty neurosurgeons found indistinguishable from genuine examples 54% of the time. We used these questions alongside a curriculum learning process to track knowledge acquisition while training our 34 billion-parameter VLM (CNS-Obsidian). In a blinded, randomized controlled trial, we demonstrated the non-inferiority of CNS-Obsidian to GPT-4o (p = 0.1154) as a diagnostic copilot for a neurosurgical service. Our findings lay a novel foundation for AI with Science and establish a framework to elevate scientific communication using state-of-the-art generative artificial intelligence while maintaining rigorous quality standards.

large language model, machine learning, natural language, (22 more...)

2502.19546

Country: North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)

Genre:

Research Report > Strength High (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.87)
