Overview
Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models
Luo, Wenjie, Li, Ruocheng, Zhu, Shanshan, Perry, Julian
Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of "deliberative thinking." They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaVA-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source LVLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning. The remarkable advancements in large language models (LLMs) [1], [2] and vision-language models (LVLMs) have revolutionized various aspects of artificial intelligence, demonstrating unprecedented capabilities in understanding, generating, and processing information across modalities [3]. These models excel in tasks ranging from complex question answering to creative content generation, largely due to their extensive pre-training on vast amounts of data.
A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering
Yi, Ziruo, Liu, Jinyu, Xiao, Ting, Albert, Mark V.
Radiology visual question answering (RVQA) provides precise answers to questions about chest X-ray images, alleviating radiologists' workload. While recent methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have shown promising progress in RVQA, they still face challenges in factual accuracy, hallucinations, and cross-modal misalignment. We introduce a multi-agent system (MAS) designed to support complex reasoning in RVQA, with specialized agents for context understanding, multimodal reasoning, and answer validation. We evaluate our system on a challenging RVQA set curated via model disagreement filtering, comprising consistently hard cases across multiple MLLMs. Extensive experiments demonstrate the superiority and effectiveness of our system over strong MLLM baselines, with a case study illustrating its reliability and interpretability. This work highlights the potential of multi-agent approaches to support explainable and trustworthy clinical AI applications that require complex reasoning.
Synthetic medical data generation: state of the art and application to trauma mechanism classification
Doremus, Océane, Guerra-Adames, Ariel, Avalos-Fernandez, Marta, Jouhet, Vianney, Gil-Jardiné, Cédric, Lagarde, Emmanuel
Faced with the challenges of patient confidentiality and scientific reproducibility, research on machine learning for health is turning towards the conception of synthetic medical databases. This article presents a brief overview of state-of-the-art machine learning methods for generating synthetic tabular and textual data, focusing on their application to the automatic classification of trauma mechanisms, followed by our proposed methodology for generating high-quality synthetic medical records combining tabular and unstructured text data.
Towards a Manifesto for Cyber Humanities: Paradigms, Ethics, and Prospects
Adorni, Giovanni, Bellini, Emanuele
The accelerated evolution of digital infrastructures and algorithmic systems is reshaping how the humanities engage with knowledge and culture. Rooted in the traditions of Digital Humanities and Digital Humanism, the concept of "Cyber Humanities" proposes a critical reconfiguration of humanistic inquiry for the post-digital era. This Manifesto introduces a flexible framework that integrates ethical design, sustainable digital practices, and participatory knowledge systems grounded in human-centered approaches. By means of a Decalogue of foundational principles, the Manifesto invites the scientific community to critically examine and reimagine the algorithmic infrastructures that influence culture, creativity, and collective memory. Rather than being a simple extension of existing practices, "Cyber Humanities" should be understood as a foundational paradigm for humanistic inquiry in a computationally mediated world. Keywords: Cyber Humanities, Digital Humanities, Transdisciplinary Epistemology, Algorithmic Reflexivity, Human-centered AI, Ethics-by-Design, Knowledge Ecosystems, Digital Sovereignty, Cognitive Infrastructures
Pulse Shape Discrimination Algorithms: Survey and Benchmark
Liu, Haoran, Zhan, Yihan, Liu, Mingzhe, Liu, Yanhua, Li, Peng, Zuo, Zhuo, Liu, Bingqi, Liu, Runxi
This review presents a comprehensive survey and benchmark of pulse shape discrimination (PSD) algorithms for radiation detection, classifying nearly sixty methods into statistical (time-domain, frequency-domain, neural network-based) and prior-knowledge (machine learning, deep learning) paradigms. We implement and evaluate all algorithms on two standardized datasets: an unlabeled set from a 241Am-9Be source and a time-of-flight labeled set from a 238Pu-9Be source, using metrics including Figure of Merit (FOM), F1-score, ROC-AUC, and inter-method correlations. Our analysis reveals that deep learning models, particularly Multi-Layer Perceptrons (MLPs) and hybrid approaches combining statistical features with neural regression, often outperform traditional methods. We discuss architectural suitabilities, the limitations of FOM, alternative evaluation metrics, and performance across energy thresholds. Accompanying this work, we release an open-source toolbox in Python and MATLAB, along with the datasets, to promote reproducibility and advance PSD research.
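The abstract names the Figure of Merit (FOM) without defining it. As a rough illustration, the sketch below uses the conventional definition, peak separation divided by the sum of the two peaks' full widths at half maximum, and assumes each class's discrimination scores are approximately Gaussian so that FWHM is about 2.355 times the standard deviation. The function name and the toy scores are hypothetical and are not taken from the authors' toolbox.

```python
import math
from statistics import mean, pstdev

# For a Gaussian peak, FWHM = 2 * sqrt(2 * ln 2) * sigma (about 2.355 * sigma).
FWHM_PER_SIGMA = 2.0 * math.sqrt(2.0 * math.log(2.0))


def figure_of_merit(gamma_scores, neutron_scores):
    """Conventional FOM: peak separation / (FWHM_gamma + FWHM_neutron).

    Assumes both score distributions are roughly Gaussian (an assumption
    of this sketch, not a claim about the surveyed algorithms).
    """
    separation = abs(mean(neutron_scores) - mean(gamma_scores))
    fwhm_sum = FWHM_PER_SIGMA * (pstdev(gamma_scores) + pstdev(neutron_scores))
    if fwhm_sum == 0.0:
        raise ValueError("both score sets have zero spread; FOM is undefined")
    return separation / fwhm_sum


# Hypothetical discrimination scores for two well-separated classes.
gamma = [0.10, 0.12, 0.08, 0.10]
neutron = [0.30, 0.32, 0.28, 0.30]
fom = figure_of_merit(gamma, neutron)
print(f"FOM = {fom:.2f}")
```

A larger FOM indicates better-separated peaks. Because the Gaussian assumption can fail for real pulse data, this is one plausible reason to also report label-based metrics such as F1-score and ROC-AUC, as the abstract does.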
AnnoSense: A Framework for Physiological Emotion Data Collection in Everyday Settings for AI
Singh, Pragya, Gupta, Ankush, Kumar, Mohan, Singh, Pushpendra
Emotional and mental well-being are vital components of quality of life, and with the rise of smart devices like smartphones, wearables, and artificial intelligence (AI), new opportunities for monitoring emotions in everyday settings have emerged. However, for AI algorithms to be effective, they require high-quality data and accurate annotations. As the focus shifts towards collecting emotion data in real-world environments to capture more authentic emotional experiences, the process of gathering emotion annotations has become increasingly complex. This work explores the challenges of everyday emotion data collection from the perspectives of key stakeholders. We collected 75 survey responses, conducted 32 interviews with the public, and held 3 focus group discussions (FGDs) with 12 mental health professionals. The insights gained from a total of 119 stakeholders informed the development of our framework, AnnoSense, designed to support everyday emotion data collection for AI. This framework was then evaluated by 25 emotion AI experts for its clarity, usefulness, and adaptability. Lastly, we discuss the potential next steps and implications of AnnoSense for future research in emotion AI, highlighting its potential to enhance the collection and analysis of emotion data in real-world contexts.
Diffusion models for inverse problems
Chung, Hyungjin, Kim, Jeongsol, Ye, Jong Chul
The use of diffusion priors to solve inverse problems in imaging has matured significantly over the years. In this chapter, we review the various approaches that have been proposed. We categorize the approaches into the more classic explicit approximation approaches and others, which include variational inference, sequential Monte Carlo, and decoupled data consistency. We cover the extension to more challenging situations, including blind cases, high-dimensional data, and problems under data scarcity and distribution mismatch. More recent approaches that aim to leverage multimodal information through texts are covered. Through this chapter, we aim to (i) distill the common mathematical threads that connect these algorithms, (ii) systematically contrast their assumptions and performance trade-offs across representative inverse problems, and (iii) spotlight the open theoretical and practical challenges by clarifying the landscape of diffusion-model-based inverse problem solvers.
Central Limit Theorems for Transition Probabilities of Controlled Markov Chains
Su, Ziwei, Banerjee, Imon, Klabjan, Diego
We develop a central limit theorem (CLT) for the non-parametric estimator of the transition matrices in controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. We then build upon it to derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. Goodness-of-fit tests are derived as a corollary, which enable us to test whether the logged data is stochastic. These results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests for transition probabilities.
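The abstract does not spell out the non-parametric estimator. A minimal sketch, assuming it is the usual empirical-frequency estimator (the count of observed transitions from a state-action pair to a next state, divided by the number of visits to that pair), could look like the following. The function name and the toy log are hypothetical.

```python
from collections import defaultdict


def estimate_transitions(trajectory):
    """Empirical transition probabilities from logged (s, a, s') triples.

    Returns a dict mapping (state, action) to {next_state: probability}.
    Pairs never visited in the log are absent from the result.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for state, action, next_state in trajectory:
        counts[(state, action)][next_state] += 1

    estimates = {}
    for pair, next_counts in counts.items():
        visits = sum(next_counts.values())
        estimates[pair] = {s: n / visits for s, n in next_counts.items()}
    return estimates


# Hypothetical logged data: three visits to (0, "a"), one to (1, "b").
logged = [(0, "a", 1), (0, "a", 1), (0, "a", 0), (1, "b", 0)]
p_hat = estimate_transitions(logged)
print(p_hat[(0, "a")])
```

The unvisited pair `(1, "a")` has no estimate at all in this toy log, which gives some intuition for why conditions on the logging policy matter for asymptotic normality.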
Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction
Standard transformer-based language models, while powerful for general text, often struggle with the fine-grained syntax and entity relationships in complex technical and engineering documents. To address this, we propose the Contextual Graph Transformer (CGT), a hybrid neural architecture that combines Graph Neural Networks (GNNs) and Transformers for domain-specific question answering. CGT constructs a dynamic graph over input tokens using sequential, skip-gram, and semantic similarity edges, which is processed by GATv2Conv layers for local structure learning. These enriched embeddings are then passed to a Transformer encoder to capture global dependencies. Unlike generic large models, technical domains often require specialized language models with stronger contextualization and structure awareness. CGT offers a parameter-efficient solution for such use cases. Integrated into a Retrieval-Augmented Generation (RAG) pipeline, CGT outperforms baselines like GPT-2 and BERT, achieving 24.7% higher accuracy than GPT-2 with 62.4% fewer parameters. This gain stems from CGT's ability to jointly model structural token interactions and long-range semantic coherence. The model is trained from scratch using a two-phase approach: pretraining on general text followed by fine-tuning on domain-specific manuals. This highlights CGT's adaptability to technical language, enabling better grounding, entity tracking, and retrieval-augmented responses in real-world applications.
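To make the graph-construction step more concrete, here is a minimal sketch of building the three edge types named in the abstract over a token sequence. The skip distance, the cosine-similarity threshold, and the toy embeddings are all assumptions chosen for illustration; the abstract does not give the authors' actual settings, and the GATv2Conv and Transformer stages are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_token_graph(embeddings, skip=2, sim_threshold=0.9):
    """Return sorted (i, j, kind) edges with i < j over token positions.

    kind is "sequential" (adjacent tokens), "skip" (distance 2..skip),
    or "semantic" (cosine similarity >= sim_threshold). Both parameters
    are hypothetical defaults, not values from the paper.
    """
    n = len(embeddings)
    edges = set()
    for i in range(n):
        if i + 1 < n:
            edges.add((i, i + 1, "sequential"))
        for d in range(2, skip + 1):
            if i + d < n:
                edges.add((i, i + d, "skip"))
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= sim_threshold:
                edges.add((i, j, "semantic"))
    return sorted(edges)


# Toy 2-D embeddings for four tokens; tokens 0 and 3 are near-parallel.
toy = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.1]]
graph = build_token_graph(toy)
for edge in graph:
    print(edge)
```

In this toy input the semantic edge between tokens 0 and 3 links positions that neither the sequential nor the skip edges connect, which is the kind of longer-range structure such an edge type is presumably meant to capture.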
CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
Pham, Tung-Thuy, Luong, Duy-Quan, Duong, Minh-Quan, Nguyen, Trung-Hieu, Nguyen, Thu-Trang, Nguyen, Son, Vo, Hieu Dinh
Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task using ready-to-use well-trained models. However, systematically evaluating methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models across multiple modalities and domains. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.