 Generative AI


Comparative Evaluation of Generative AI Models for Chest Radiograph Report Generation in the Emergency Department

arXiv.org Artificial Intelligence

Purpose: To benchmark open-source or commercial medical image-specific VLMs against real-world radiologist-written reports. Methods: This retrospective study included adult patients who presented to the emergency department between January 2022 and April 2025 and underwent same-day CXR and CT for febrile or respiratory symptoms. Reports from five VLMs (AIRead, Lingshu, MAIRA-2, MedGemma, and MedVersa) and radiologist-written reports were randomly presented and blindly evaluated by three thoracic radiologists using four criteria: RADPEER, clinical acceptability, hallucination, and language clarity. Comparative performance was assessed using generalized linear mixed models, with radiologist-written reports treated as the reference. Finding-level analyses were also performed with CT as the reference. Results: A total of 478 patients (median age, 67 years [interquartile range, 50-78]; 282 men [59.0%]) were included. AIRead demonstrated the lowest RADPEER 3b rate (5.3% [76/1434] vs. radiologists 13.9% [200/1434]; P<.001), whereas other VLMs showed higher disagreement rates (16.8-43.0%; P<.05). Clinical acceptability was highest with AIRead (84.5% [1212/1434] vs. radiologists 74.3% [1065/1434]; P<.001), while other VLMs performed worse (41.1-71.4%; P<.05). Hallucinations were rare with AIRead, comparable to radiologists (0.3% [4/1425] vs. 0.1% [1/1425]; P=.21), but frequent with other models (5.4-17.4%; P<.05). Language clarity was higher with AIRead (82.9% [1189/1434]), Lingshu (88.0% [1262/1434]), and MedVersa (88.4% [1268/1434]) compared with radiologists (78.1% [1120/1434]; P<.05). Sensitivity varied substantially across VLMs for the common findings: AIRead, 15.5-86.7%; Lingshu, 2.4-86.7%; MAIRA-2, 6.0-72.0%; MedGemma, 4.8-76.7%; and MedVersa, 20.2-69.3%. Conclusion: Medical VLMs for CXR report generation exhibited variable performance in report quality and diagnostic measures.
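The percentages in this abstract can be checked against the bracketed counts. The sketch below is only a reader-side sanity check, not part of the study: the dictionary keys and the helper name `percent` are made up for illustration, and the reading of 1434 as 478 patients times three readers is an inference from the numbers, not something the abstract states.

```python
# Counts and percentages copied from the abstract: (numerator, denominator, reported %).
reported = {
    "AIRead RADPEER 3b": (76, 1434, 5.3),
    "Radiologist RADPEER 3b": (200, 1434, 13.9),
    "AIRead clinically acceptable": (1212, 1434, 84.5),
    "Radiologist clinically acceptable": (1065, 1434, 74.3),
    "AIRead hallucination": (4, 1425, 0.3),
    "Radiologist hallucination": (1, 1425, 0.1),
}


def percent(numerator, denominator):
    """Return the proportion as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Recompute every percentage and collect any that disagree with the abstract.
recomputed = {name: percent(n, d) for name, (n, d, _) in reported.items()}
mismatches = [name for name, (_, _, p) in reported.items() if recomputed[name] != p]

# Assumption: each report source was rated once per patient by each of three readers.
ratings_per_source = 478 * 3

for name, value in recomputed.items():
    print(f"{name}: {value}%")
print("mismatches:", mismatches)
print("ratings per source:", ratings_per_source)
```

Under that assumption the denominator of 1434 is accounted for; the abstract does not explain why the hallucination denominator is 1425.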


SafeCiM: Investigating Resilience of Hybrid Floating-Point Compute-in-Memory Deep Learning Accelerators

arXiv.org Artificial Intelligence

Deep Neural Networks (DNNs) continue to grow in complexity with Large Language Models (LLMs) incorporating vast numbers of parameters. Handling these parameters efficiently in traditional accelerators is limited by data-transmission bottlenecks, motivating Compute-in-Memory (CiM) architectures that integrate computation within or near memory to reduce data movement. Recent work has explored CiM designs using Floating-Point (FP) and Integer (INT) operations. FP computations typically deliver higher output quality due to their wider dynamic range and precision, benefiting precision-sensitive Generative AI applications. These include models such as LLMs, thus driving advancements in FP-CiM accelerators. However, the vulnerability of FP-CiM to hardware faults remains underexplored, posing a major reliability concern in mission-critical settings. To address this gap, we systematically analyze hardware fault effects in FP-CiM by introducing bit-flip faults at key computational stages, including digital multipliers, CiM memory cells, and digital adder trees. Experiments with Convolutional Neural Networks (CNNs) such as AlexNet and state-of-the-art LLMs including LLaMA-3.2-1B and Qwen-0.3B-Base reveal how faults at each stage affect inference accuracy. Notably, a single adder fault can reduce LLM accuracy to 0%. Based on these insights, we propose a fault-resilient design, SafeCiM, that mitigates fault impact far better than a naive FP-CiM with a pre-alignment stage. For example, with 4096 MAC units, SafeCiM reduces accuracy degradation by up to 49x for a single adder fault compared to the baseline FP-CiM architecture.
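To make the bit-flip fault model concrete, here is a minimal software sketch of flipping one bit of an IEEE 754 single-precision value. It is not the paper's fault-injection framework; the function name `flip_bit_fp32` and the choice of FP32 are assumptions made for illustration. It shows why one flipped exponent bit in an accumulated sum can be far more damaging than a flipped mantissa bit.

```python
import struct


def flip_bit_fp32(value, bit):
    """Flip one bit of a float32 encoding (0 = lowest mantissa bit, 31 = sign bit)."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask flips exactly one bit, then reinterpret as float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


# 1.0 is encoded as 0x3F800000: sign 0, exponent 127, mantissa 0.
mantissa_fault = flip_bit_fp32(1.0, 0)    # tiny relative error
exponent_fault = flip_bit_fp32(1.0, 30)   # exponent becomes all ones
sign_fault = flip_bit_fp32(1.0, 31)       # value changes sign

print(mantissa_fault, exponent_fault, sign_fault)
```

A fault in the top exponent bit turns 1.0 into infinity, which then propagates through every later addition, whereas a fault in the lowest mantissa bit changes the value by about one part in eight million.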


Extracting memorized pieces of (copyrighted) books from open-weight language models

arXiv.org Artificial Intelligence

Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression in their training data. Drawing on both machine learning and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we extend a recent probabilistic extraction technique to measure memorization of 50 books in 17 open-weight LLMs. Through thousands of experiments, we show that the extent of memorization varies both by model and by book. With respect to our specific extraction methodology, we find that most LLMs do not memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B entirely memorizes some books, like the first Harry Potter book and 1984. In fact, the first Harry Potter is so memorized that, using a seed prompt consisting of just the first few tokens of the first chapter, we can deterministically generate the entire book near-verbatim. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.


XFlowMP: Task-Conditioned Motion Fields for Generative Robot Planning with Schrodinger Bridges

arXiv.org Artificial Intelligence

Generative robotic motion planning requires not only the synthesis of smooth and collision-free trajectories but also feasibility across diverse tasks and dynamic constraints. Prior planning methods, both traditional and generative, often struggle to incorporate high-level semantics with low-level constraints, especially the nexus between task configurations and motion controllability. In this work, we present XFlowMP, a task-conditioned generative motion planner that models robot trajectory evolution as entropic flows bridging stochastic noises and expert demonstrations via Schrodinger bridges given the inquiry task configuration. Specifically, our method leverages Schrodinger bridges as a conditional flow matching coupled with a score function to learn motion fields with high-order dynamics while encoding start-goal configurations, enabling the generation of collision-free and dynamically-feasible motions. Through evaluations, XFlowMP achieves up to 53.79% lower maximum mean discrepancy, 36.36% smoother motions, and 39.88% lower energy consumption compared with the next-best baseline on the RobotPointMass benchmark, while also reducing short-horizon planning time by 11.72%. On long-horizon motions in the LASA Handwriting dataset, our method maintains the trajectories with 1.26% lower maximum mean discrepancy, 3.96% smoother motions, and 31.97% lower energy consumption. We further demonstrate the practicality of our method on the Kinova Gen3 manipulator, executing planning motions and confirming its robustness in real-world settings.


How to glimpse a pre-AI internet

Popular Science

Slop Evader isn't meant as a solution, but it gives a temporary reprieve. A sizable portion of the internet has devolved into an AI-contaminated wasteland. While an easy solution remains elusive, a browser extension called Slop Evader offers a glimpse at what the internet used to be only a few short years ago. While always prone to innumerable hazards, the online ecosystem is degrading largely due to the misuse of generative artificial intelligence content.


The State of AI: Welcome to the economic singularity

MIT Technology Review

Any far-reaching new technology is always uneven in its adoption, but few have been more uneven than generative AI. That makes it hard to assess its likely impact on individual businesses, let alone on productivity across the economy as a whole. At one extreme, AI coding assistants have revolutionized the work of software developers. Mark Zuckerberg recently predicted that half of Meta's code would be written by AI within a year.


James Cameron says AI actors are 'horrifying to me'

The Guardian

'Generative AI can't create something new' James Cameron. Avatar director, known for his advocacy of new technology, told interviewer generative AI performance puts 'all human experience into a blender'. Avatar director James Cameron has called AI actors "horrifying" and said what generative AI technology creates is "an average". Cameron was speaking to CBS on Sunday Morning in the run-up to the release of the third Avatar film, subtitled Fire and Ash, and was asked about the pioneering technology he used in his film-making. After praising motion-capture performance as "a celebration of the actor-director moment", Cameron expressed his disdain for artificial intelligence.


Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness

arXiv.org Artificial Intelligence

This paper introduces BRIES, a novel compound AI architecture designed to detect and measure the effectiveness of persuasion attacks across information environments. We present a system with specialized agents: a Twister that generates adversarial content employing targeted persuasion tactics, a Detector that identifies attack types with configurable parameters, a Defender that creates resilient content through content inoculation, and an Assessor that employs causal inference to evaluate inoculation effectiveness. Experimenting with the SemEval 2023 Task 3 taxonomy across the synthetic persuasion dataset, we demonstrate significant variations in detection performance across language agents. Our comparative analysis reveals significant performance disparities, with GPT-4 achieving superior detection accuracy on complex persuasion techniques, while open-source models like Llama3 and Mistral demonstrated notable weaknesses in identifying subtle rhetorical techniques, suggesting that different architectures encode and process persuasive language patterns in fundamentally different ways. We show that prompt engineering dramatically affects detection efficacy, with temperature settings and confidence scoring producing model-specific variations; Gemma and GPT-4 perform optimally at lower temperatures while Llama3 and Mistral show improved capabilities at higher temperatures. Our causal analysis provides novel insights into socio-emotional-cognitive signatures of persuasion attacks, revealing that different attack types target specific cognitive dimensions. This research advances generative AI safety and cognitive security by quantifying LLM-specific vulnerabilities to persuasion attacks and delivers a framework for enhancing human cognitive resilience through structured interventions before exposure to harmful content.


ORION: Teaching Language Models to Reason Efficiently in the Language of Thought

arXiv.org Artificial Intelligence

Large Reasoning Models (LRMs) achieve state-of-the-art performance in mathematics, code generation, and task planning. Inspired by the Language of Thought Hypothesis -- which posits that human reasoning operates over a symbolic, compositional mental language called Mentalese -- we introduce a cognitively motivated framework that trains models to reason in a similar compact style. Mentalese encodes abstract reasoning as ultra-compressed, structured tokens, enabling models to solve complex problems with far fewer steps. When applied to Mentalese-aligned models, SLPO achieves much larger compression rates by enabling compressed reasoning that preserves the benefits of detailed thinking without the computational overhead, allowing us to present the best-performing models at each compression level along the performance-efficiency Pareto frontier. Across mathematical benchmarks -- including AIME 2024 & 2025, Minerva-Math, OlympiadBench, Math500, and AMC -- our ORION models generate reasoning traces with 4-16× fewer tokens, achieve up to 5× lower inference latency, and reduce training costs by 7-9× relative to the base DeepSeek R1 Distilled model, while maintaining 90-98% of the baseline accuracy. ORION models also surpass Claude and ChatGPT-4o by up to 5% in accuracy while maintaining 2× compression. Our findings demonstrate Mentalese-style compressed reasoning offers a breakthrough toward human-like cognitive efficiency, opening new possibilities for real-time, cost-effective reasoning without sacrificing accuracy. The dotted curve indicates the Pareto frontier, which illustrates the trade-off between higher compression rates and loss in accuracy. Our proposed method, combining Mentalese alignment with SLPO, consistently lies on this frontier, identifying an optimal operating point that achieves a balance between accuracy and efficiency. Work done during internship at Hippocratic AI.
Recent advances such as OpenAI o1 (OpenAI et al., 2024b) and DeepSeek R1 (DeepSeek-AI et al., 2025) have reshaped how we think about language model reasoning. By letting models "think before they answer," these systems dramatically improved credibility and performance--achievements that were once thought impossible for LLMs (Wu et al., 2024). Explicit reasoning has thus emerged as a central focus of LLM research (Xu et al., 2025).


Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection

arXiv.org Artificial Intelligence

Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective strategy to address this limitation by increasing dataset diversity and improving model generalization without requiring additional field data. However, most augmentation techniques used to date rely on effective but relatively simple transformations, leaving open the question of whether deep generative models can provide additional benefits. In this study, we evaluate the potential of deep generative models for data augmentation in marine mammal call detection, including Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models. Using Southern Resident Killer Whale (Orcinus orca) vocalizations from two long-term hydrophone deployments in the Salish Sea, we compare these approaches against traditional augmentation methods such as time-shifting and vocalization masking. While all generative approaches improved classification performance relative to the baseline, diffusion-based augmentation yielded the highest recall (0.87) and overall F1-score (0.75). A hybrid strategy combining generative-based synthesis with traditional methods achieved the best overall performance with an F1-score of 0.81. We hope this study encourages further exploration of deep generative models as complementary augmentation strategies to advance acoustic monitoring of threatened marine mammal populations.
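The relationship between the recall and F1 figures above can be worked through with the standard definition F1 = 2PR / (P + R). The sketch below assumes that the reported recall (0.87) and F1-score (0.75) describe the same diffusion-augmented classifier; the abstract does not report precision, so the value computed here is only what those two numbers would imply, and the function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and recall."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no positive precision is consistent with these values")
    return f1 * recall / denominator


# Figures from the abstract; precision is derived, not reported.
precision = implied_precision(f1=0.75, recall=0.87)
print(f"implied precision: {precision:.3f}")
print(f"round-trip F1: {f1_score(precision, 0.87):.2f}")
```

If the assumption holds, a recall of 0.87 paired with an F1 of 0.75 points to a noticeably lower precision, meaning the diffusion-augmented detector would catch most calls at the cost of more false positives.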