Goto

Collaborating Authors

 Generative AI


Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

arXiv.org Artificial Intelligence

Recent advancements in reasoning-enhanced large language models (LLMs), such as DeepSeek-R1 and OpenAI-o3, have demonstrated significant progress. However, their application in professional medical contexts remains underexplored, particularly in evaluating the quality of their reasoning processes alongside final outputs. Here, we introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references derived from clinical case reports. Spanning 13 body systems and 10 specialties, it includes both common and rare diseases. To comprehensively evaluate LLM performance, we propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. To assess reasoning quality, we present the Reasoning Evaluator, a novel automated system that objectively scores free-text reasoning responses based on efficiency, actuality, and completeness using dynamic cross-referencing and evidence checks. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc. Our results show that current LLMs achieve over 85% accuracy in relatively simple diagnostic tasks when provided with sufficient examination results. However, performance declines in more complex tasks, such as examination recommendation and treatment planning. While reasoning outputs are generally reliable, with factuality scores exceeding 90%, critical reasoning steps are frequently missed. These findings underscore both the progress and limitations of clinical LLMs. Notably, open-source models like DeepSeek-R1 are narrowing the gap with proprietary systems, highlighting their potential to drive accessible and equitable advancements in healthcare.


Synthetic Data Generation for Minimum-Exposure Navigation in a Time-Varying Environment using Generative AI Models

arXiv.org Artificial Intelligence

We study the problem of synthetic generation of samples of environmental features for autonomous vehicle navigation. These features are described by a spatiotemporally varying scalar field that we refer to as a threat field. The threat field is known to have some underlying dynamics subject to process noise. Some "real-world" data of observations of various threat fields are also available. The assumption is that the volume of ``real-world'' data is relatively small. The objective is to synthesize samples that are statistically similar to the data. The proposed solution is a generative artificial intelligence model that we refer to as a split variational recurrent neural network (S-VRNN). The S-VRNN merges the capabilities of a variational autoencoder, which is a widely used generative model, and a recurrent neural network, which is used to learn temporal dependencies in data. The main innovation in this work is that we split the latent space of the S-VRNN into two subspaces. The latent variables in one subspace are learned using the ``real-world'' data, whereas those in the other subspace are learned using the data as well as the known underlying system dynamics. Through numerical experiments we demonstrate that the proposed S-VRNN can synthesize data that are statistically similar to the training data even in the case of very small volume of ``real-world'' training data.


Task-Oriented Connectivity for Networked Robotics with Generative AI and Semantic Communications

arXiv.org Artificial Intelligence

--The convergence of robotics, advanced communication networks, and artificial intelligence (AI) holds the promise of transforming industries through fully automated and intelligent operations. In this work, we introduce a novel co-working framework for robots that unifies goal-oriented semantic communication (SemCom) with a Generative AI (GenAI)-agent under a semantic-aware network. Meanwhile, the GenAI-agent leverages generative AI models to interpret high-level task instructions, allocate resources, and adapt to dynamic changes in both network and robotic environments. This agent-driven paradigm ushers in a new level of autonomy and intelligence, enabling complex tasks of networked robots to be conducted with minimal human intervention. We validate our approach through a multi-robot anomaly detection use-case simulation, where robots detect, compress, and transmit relevant information for classification. Simulation results confirm that SemCom significantly reduces data traffic while preserving critical semantic details, and the GenAI-agent ensures task coordination and network adaptation. This synergy provides a robust, efficient, and scalable solution for modern industrial environments.


Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help

arXiv.org Artificial Intelligence

Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.


Actionable AI: Enabling Non Experts to Understand and Configure AI Systems

arXiv.org Artificial Intelligence

Interaction between humans and AI systems raises the question of how people understand AI systems. This has been addressed with explainable AI, the interpretability arising from users' domain expertise, or collaborating with AI in a stable environment. In the absence of these elements, we discuss designing Actionable AI, which allows non-experts to configure black-box agents. In this paper, we experiment with an AI-powered cartpole game and observe 22 pairs of participants to configure it via direct manipulation. Our findings suggest that, in uncertain conditions, non-experts were able to achieve good levels of performance. By influencing the behaviour of the agent, they exhibited an operational understanding of it, which proved sufficient to reach their goals. Based on this, we derive implications for designing Actionable AI systems. In conclusion, we propose Actionable AI as a way to open access to AI-based agents, giving end users the agency to influence such agents towards their own goals.


Agent models: Internalizing Chain-of-Action Generation into Reasoning models

arXiv.org Artificial Intelligence

Traditional agentic workflows rely on external prompts to manage interactions with tools and the environment, which limits the autonomy of reasoning models. We position Large Agent Models (LAMs) that internalize the generation of Chain-of-Action (CoA), enabling the model to autonomously decide when and how to use external tools. Our proposed AutoCoA framework combines supervised fine-tuning (SFT) and reinforcement learning (RL), allowing the model to seamlessly switch between reasoning and action while efficiently managing environment interactions. Main components include step-level action triggering, trajectory-level CoA optimization, and an internal world model to reduce realenvironment interaction costs. Evaluations on open-domain QA tasks demonstrate that AutoCoA-trained agent models significantly outperform ReAct-based workflows in task completion, especially in tasks that require long-term reasoning and multi-step actions. Code and dataset are available at https://github.com/ OpenAI has outlined five progressive stages on the path to Artificial General Intelligence (AGI). The first stage, characterized as Chatbot, is exemplified by Large Language Models (LLMs) like GPT-3.5 and GPT-4 OpenAI (2023). The second stage, termed Reasoner, introduces Large Reasoning Models (LRMs) such as o1 OpenAI (2024) and o3. Recently, OpenAI released Operator OpenAI (2025a) and Deep Research OpenAI (2025b), signaling the arrival of the third stage: Agent. These systems reportedly combine reasoning with autonomous tool usage, enabling independent execution of multi-round workflows by interacting with the real-world environment. It is believed that the technology behind Operator and Deep Research is not merely integrating existing LLMs or LRMs with agentic workflows (e.g., ReAct Yao et al. (2022), Reflexion Shinn et al. (2023)). Instead, it represents a further upgrade in model capabilities: the new models are capable of long-term planning, tool manipulation, and environmental interaction.


Generative AI as Digital Media

arXiv.org Artificial Intelligence

Generative AI is frequently portrayed as revolutionary or even apocalyptic, prompting calls for novel regulatory approaches. This essay argues that such views are misguided. Instead, generative AI should be understood as an evolutionary step in the broader algorithmic media landscape, alongside search engines and social media. Like these platforms, generative AI centralizes information control, relies on complex algorithms to shape content, and extensively uses user data, thus perpetuating common problems: unchecked corporate power, echo chambers, and weakened traditional gatekeepers. Regulation should therefore share a consistent objective: ensuring media institutions remain trustworthy. Without trust, public discourse risks fragmenting into isolated communities dominated by comforting, tribal beliefs -- a threat intensified by generative AI's capacity to bypass gatekeepers and personalize truth. Current governance frameworks, such as the EU's AI Act and the US Executive Order 14110, emphasize reactive risk mitigation, addressing measurable threats like national security, public health, and algorithmic bias. While effective for novel technological risks, this reactive approach fails to adequately address broader issues of trust and legitimacy inherent to digital media. Proactive regulation fostering transparency, accountability, and public confidence is essential. Viewing generative AI exclusively as revolutionary risks repeating past regulatory failures that left social media and search engines insufficiently regulated. Instead, regulation must proactively shape an algorithmic media environment serving the public good, supporting quality information and robust civic discourse.


Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

arXiv.org Artificial Intelligence

Recent advances in generative artificial intelligence (AI), including large language models (LLMs), have opened new possibilities in education and in particular on scaling up personalization. One form of personalization that generative AI powers is interactive learning via tutoring dialogues between AI-powered tutors and students. These interactions have the potential to tailor instruction to each student's needs and progress, while offering personalized feedback, all in real time, in a scalable way. Given the widespread success of human tutors for improving student outcomes [29], many recent works have developed LLM-based tutors, showing promise across various educational domains [15, 25, 30, 32, 33, 39, 42, 50]. Many LLM-based tutors are even deployed in practice, such as Khan Academy's Khanmigo [21] and Carnegie Learning's LiveHint [4]. Several preliminary studies have shown that interacting with LLMs can increase student learning [52], although some have shown that students can develop an over-reliance on LLMs which negatively impacts their learning [23]. Many prior works have focused on improving LLMs' ability to follow effective tutoring principles, adapting them for the tutoring task that they are not pre-trained for. One approach, explored in [46], analyzes the decision-making process underlying human tutor utterances, showing that integrating expert decisions enhances LLM-based tutoring. Another study, [28], examines tutor moves in interactions with an LLM-powered simulated student agent, demonstrating that move annotation data contributes to better tutoring performance.


GenAI for Simulation Model in Model-Based Systems Engineering

arXiv.org Artificial Intelligence

Generative AI (GenAI) has demonstrated remarkable capabilities in code generation, and its integration into complex product modeling and simulation code generation can significantly enhance the efficiency of the system design phase in Model-Based Systems Engineering (MBSE). In this study, we introduce a generative system design methodology framework for MBSE, offering a practical approach for the intelligent generation of simulation models for system physical properties. First, we employ inference techniques, generative models, and integrated modeling and simulation languages to construct simulation models for system physical properties based on product design documents. Subsequently, we fine-tune the language model used for simulation model generation on an existing library of simulation models and additional datasets generated through generative modeling. Finally, we introduce evaluation metrics for the generated simulation models for system physical properties. Our proposed approach to simulation model generation presents the innovative concept of scalable templates for simulation models. Using these templates, GenAI generates simulation models for system physical properties through code completion. The experimental results demonstrate that, for mainstream open-source Transformer-based models, the quality of the simulation model is significantly improved using the simulation model generation method proposed in this paper.


InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

arXiv.org Artificial Intelligence

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing generalpurpose AI assistants. We design InterFeedback, an interactive framework, which can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench that evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results indicate that even the state-of-the-art LMM, OpenAI-o1, struggles to refine its responses based on human feedback, achieving an average score of less than 50%. Our findings point to the need for methods that can enhance LMMs' capabilities to interpret and benefit from feedback. In this paper, we are curious about the question "Can Large Multimodal Models evolve through Interactive Human Feedback?" It is central to developing general-purpose AI assistants with Large Multimodal Models (LMMs). While these models show exceptional performance on tackling multimodal tasks directly, their ability to interact with humans remains largely unknown. We argue that an LMM functioning as the general assistant should possess two capabilities: 1) exceptional problem-solving ability and 2) the ability to improve itself through feedback (e.g., human feedback, execution results).