AITopics | multimodal input

multimodal input

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Neural Information Processing Systems | Apr-29-2026, 13:58:00 GMT

The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult.

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.86)

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Bellagente, Marco, Brack, Manuel, Teufel, Hannah, Friedrich, Felix

Neural Information Processing Systems | Feb-16-2026, 19:26:57 GMT

The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users.

artificial intelligence, machine learning, usion, (16 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.86)

Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus

Siam, Md Kamrul, Faruk, Md Jobair Hossain, Cheng, Jerry Q., Gu, Huanying

arXiv.org Artificial Intelligence | Oct-21-2025

This study presents a novel multi-model fusion framework leveraging two state-of-the-art large language models (LLMs), ChatGPT and Claude, to enhance the reliability of chest X-ray interpretation on the CheXpert dataset. From the full CheXpert corpus of 224,316 chest radiographs, we randomly selected 234 radiologist-annotated studies to evaluate unimodal performance using image-only prompts. In this setting, ChatGPT and Claude achieved diagnostic accuracies of 62.8% and 76.9%, respectively. A similarity-based consensus approach, using a 95% output similarity threshold, improved accuracy to 77.6%. To assess the impact of multimodal inputs, we then generated synthetic clinical notes following the MIMIC-CXR template and evaluated a separate subset of 50 randomly selected cases paired with both images and synthetic text. On this multimodal cohort, performance improved to 84% for ChatGPT and 76% for Claude, while consensus accuracy reached 91.3%. Across both experimental conditions, agreement-based fusion consistently outperformed individual models. These findings highlight the utility of integrating complementary modalities and using output-level consensus to improve the trustworthiness and clinical utility of AI-assisted radiological diagnosis, offering a practical path to reduce diagnostic errors with minimal computational overhead.
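
As a rough illustration of output-level consensus, the sketch below accepts a finding only when two model outputs are at least 95% similar. The similarity measure (Python's `difflib.SequenceMatcher`), the function names, and the choice to return `None` on disagreement are assumptions made for illustration; the abstract does not say how similarity is computed or how disagreements are resolved.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after case and whitespace normalization."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def consensus(output_a: str, output_b: str, threshold: float = 0.95) -> Optional[str]:
    """Return the shared answer when the two outputs agree closely enough.

    Returning None on disagreement is an assumption of this sketch; a real
    pipeline might instead defer to a human reader or to a third model.
    """
    if similarity(output_a, output_b) >= threshold:
        return output_a
    return None


if __name__ == "__main__":
    # Hypothetical model outputs, not taken from the study.
    print(consensus("Pleural effusion", "pleural  effusion"))
    print(consensus("Cardiomegaly", "No finding"))
```

Keeping the fusion step at the output level means neither model has to be retrained or modified, which fits the abstract's point about minimal computational overhead.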

accuracy, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2510.16057

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Forecasting Clicks in Digital Advertising: Multimodal Inputs and Interpretable Outputs

Gangopadhyay, Briti, Wang, Zhao, Takamatsu, Shingo

arXiv.org Artificial Intelligence | Sep-15-2025

Forecasting click volume is a key task in digital advertising, influencing both revenue and campaign strategy. Traditional time series models rely solely on numerical data, often overlooking rich contextual information embedded in textual elements, such as keyword updates. We present a multimodal forecasting framework that combines click data with textual logs from real-world ad campaigns and generates human-interpretable explanations alongside numeric predictions. Reinforcement learning is used to improve comprehension of textual information and enhance fusion of modalities. Experiments on a large-scale industry dataset show that our method outperforms baselines in both accuracy and reasoning quality.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.09683

Country: Asia > Japan (0.15)

Genre: Research Report (0.40)

Industry: Marketing (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

KeyMPs: One-Shot Vision-Language Guided Motion Generation by Sequencing DMPs for Occlusion-Rich Tasks

Anarossi, Edgar, Kwon, Yuhwan, Tahara, Hirotaka, Tanaka, Shohei, Shirai, Keisuke, Hamaya, Masashi, Beltran-Hernandez, Cristian C., Hashimoto, Atsushi, Matsubara, Takamitsu

arXiv.org Artificial Intelligence | Aug-5-2025

Dynamic Movement Primitives (DMPs) provide a flexible framework wherein smooth robotic motions are encoded into modular parameters. However, they face challenges in integrating multimodal inputs commonly used in robotics like vision and language into their framework. To fully maximize DMPs' potential, enabling them to handle multimodal inputs is essential. In addition, we also aim to extend DMPs' capability to handle object-focused tasks requiring one-shot complex motion generation, as observation occlusion could easily happen mid-execution in such tasks (e.g., knife occlusion in cake icing, hand occlusion in dough kneading, etc.). A promising approach is to leverage Vision-Language Models (VLMs), which process multimodal data and can grasp high-level concepts. However, they typically lack enough knowledge and capabilities to directly infer low-level motion details and instead only serve as a bridge between high-level instructions and low-level control. To address this limitation, we propose Keyword Labeled Primitive Selection and Keypoint Pairs Generation Guided Movement Primitives (KeyMPs), a framework that combines VLMs with sequencing of DMPs. KeyMPs use VLMs' high-level reasoning capability to select a reference primitive through keyword labeled primitive selection and VLMs' spatial awareness to generate spatial scaling parameters used for sequencing DMPs by generalizing the overall motion through keypoint pairs generation, which together enable one-shot vision-language guided motion generation that aligns with the intent expressed in the multimodal input. We validate our approach through experiments on two occlusion-rich tasks: object cutting, conducted in both simulated and real-world environments, and cake icing, performed in simulation. These evaluations demonstrate superior performance over other DMP-based methods that integrate VLM support.
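
For readers unfamiliar with DMPs, the following is a minimal sketch of a textbook one-dimensional discrete DMP rollout, not the KeyMPs method itself. It integrates a critically damped spring toward a goal, optionally shaped by a forcing term that fades with a decaying phase variable. The function name, the gain values, and the plain Euler integration are illustrative assumptions.

```python
from typing import Callable, List, Optional


def rollout_dmp(
    y0: float,
    goal: float,
    forcing: Optional[Callable[[float], float]] = None,
    tau: float = 1.0,
    alpha: float = 25.0,
    alpha_x: float = 4.0,
    dt: float = 0.001,
    duration: float = 2.0,
) -> List[float]:
    """Integrate a 1-D discrete DMP with Euler steps and return the positions."""
    beta = alpha / 4.0  # critical damping for the spring-damper system
    y, z, x = y0, 0.0, 1.0  # position, scaled velocity, phase
    trajectory = [y]
    for _ in range(int(round(duration / dt))):
        # The forcing term is gated by the phase x, so it vanishes over time
        # and the motion is guaranteed to settle at the goal.
        f = forcing(x) * x * (goal - y0) if forcing is not None else 0.0
        dz = (alpha * (beta * (goal - y) - z) + f) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory


if __name__ == "__main__":
    plain = rollout_dmp(y0=0.0, goal=1.0)
    shaped = rollout_dmp(y0=0.0, goal=1.0, forcing=lambda phase: 50.0)
    print(round(plain[-1], 3), round(shaped[-1], 3))
```

Changing `y0` and `goal` rescales the same motion to a new start and end point, which is the kind of modular parameter the abstract refers to.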

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ACCESS.2025.3588975

2504.10011

Country: Asia > Japan > Honshū (0.46)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

Yang, Qize, Yao, Shimin, Chen, Weixuan, Fu, Shenghao, Bai, Detao, Zhao, Jiaxing, Sun, Boyuan, Yin, Bowen, Wei, Xihan, Zhou, Jingren

arXiv.org Artificial Intelligence | Jun-27-2025

With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2506.21277

Country: North America > United States > Oregon > Jackson County > Central Point (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Graph-MLLM: Harnessing Multimodal Large Language Models for Multimodal Graph Learning

Liu, Jiajin, Fan, Dongzhe, Shen, Jiacheng, Ji, Chuanhao, Zha, Daochen, Tan, Qiaoyu

arXiv.org Artificial Intelligence | Jun-13-2025

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in representing and understanding diverse modalities. However, they typically focus on modality alignment in a pairwise manner while overlooking structural relationships across data points. Integrating multimodality with structured graph information (i.e., multimodal graphs, MMGs) is essential for real-world applications such as social networks, healthcare, and recommendation systems. Existing MMG learning methods fall into three paradigms based on how they leverage MLLMs: Encoder, Aligner, and Predictor. MLLM-as-Encoder focuses on enhancing graph neural networks (GNNs) via multimodal feature fusion; MLLM-as-Aligner aligns multimodal attributes in language or hidden space to enable LLM-based graph reasoning; MLLM-as-Predictor treats MLLMs as standalone reasoners with in-context learning or fine-tuning. Despite their advances, the MMG field lacks a unified benchmark to fairly evaluate across these approaches, making it unclear what progress has been made. To bridge this gap, we present Graph-MLLM, a comprehensive benchmark for multimodal graph learning by systematically evaluating these three paradigms across six datasets with different domains. Through extensive experiments, we observe that jointly considering the visual and textual attributes of the nodes benefits graph learning, even when using pre-trained text-to-image alignment models (e.g., CLIP) as encoders. We also find that converting visual attributes into textual descriptions further improves performance compared to directly using visual inputs. Moreover, we observe that fine-tuning MLLMs on specific MMGs can achieve state-of-the-art results in most scenarios, even without explicit graph structure information. We hope that our open-sourced library will facilitate rapid, equitable evaluation and inspire further innovative research in this field.

information, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2506.10282

Country: North America > United States (0.46)

Genre:

Research Report > Promising Solution (0.46)
Research Report > New Finding (0.46)

Industry:

Media (0.70)
Information Technology > Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs

Neural Information Processing Systems | May-27-2025, 20:34:35 GMT

Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning. They are the de facto building block for Large Multimodal Models (LMMs), yet, we still lack a proper understanding of their success. In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation with the attempt to understand their generalization beyond textual inputs. Our work provides the following findings. Perceptual tokens (1) are easily distinguishable from textual ones inside LLMs, with significantly different representations (e.g. Yet, (2) both perceptual and textual tokens activate similar LLM weights.

implicit multimodal alignment, llm, multimodal input, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Beyond Words: Multimodal LLM Knows When to Speak

Liao, Zikai, Ouyang, Yi, Lee, Yi-Lun, Yu, Chen-Ping, Tsai, Yi-Hsuan, Yin, Zhaozheng

arXiv.org Artificial Intelligence | May-21-2025

While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.14654

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > New York > Suffolk County > Stony Brook (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)

Mozualization: Crafting Music and Visual Representation with Multimodal AI

Xu, Wanfang, Zhao, Lixiang, Song, Haiwen, Song, Xinheng, Lu, Zhaolin, Liu, Yu, Chen, Min, Lim, Eng Gee, Yu, Lingyun

arXiv.org Artificial Intelligence | Apr-22-2025

In this work, we introduce Mozualization, a music generation and editing tool that creates multi-style embedded music by integrating diverse inputs, such as keywords, images, and sound clips (e.g., segments from various pieces of music or even a playful cat's meow). Our work is inspired by the ways people express their emotions -- writing mood-descriptive poems or articles, creating drawings with warm or cool tones, or listening to sad or uplifting music. Building on this concept, we developed a tool that transforms these emotional expressions into a cohesive and expressive song, allowing users to seamlessly incorporate their unique preferences and inspirations. To evaluate the tool and, more importantly, gather insights for its improvement, we conducted a user study involving nine music enthusiasts. The study assessed user experience, engagement, and the impact of interacting with and listening to the generated music.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2504.13891

Country:

Asia (0.74)
North America > United States > Louisiana (0.14)

Genre:

Research Report (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Data Science (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
