AITopics

Neural Information Processing Systems, Feb-12-2026, 09:51:15 GMT

5d570ed1708bbe19cb60f7a7aff60575-Paper-Conference.pdf

large language model, machine learning, systematic failure, (18 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
Asia > Middle East > Israel (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)

Neural Information Processing Systems, Dec-25-2025, 13:06:28 GMT

Mass-Producing Failures of Multimodal Systems with Language Models

Deployed multimodal models can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures: generalizable, natural-language descriptions that describe categories of individual failures. To uncover systematic failures, MultiMon scrapes for examples of erroneous agreement: inputs that produce the same output, but should not. It then prompts a language model to identify common categories and describe them in natural language. We use MultiMon to find 14 systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder, each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many books"). Because CLIP is the backbone for most state-of-the-art multimodal models, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion, and others. MultiMon can also steer towards failures relevant to specific use cases, such as self-driving cars. We see MultiMon as a step towards evaluation that autonomously explores the long tail of potential system failures.
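The erroneous-agreement idea described in this abstract can be illustrated with a small sketch. It is not MultiMon's implementation: it assumes that "same output" is approximated by high cosine similarity under the encoder being audited, and that "should not" is approximated by low similarity under a separate reference embedding. The function name `erroneous_agreements`, the thresholds `hi` and `lo`, and the toy two-dimensional vectors are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def erroneous_agreements(texts, target_emb, reference_emb, hi=0.95, lo=0.8):
    """Return text pairs that the audited encoder treats as near-identical
    (similarity >= hi) while the reference embedding says they differ
    (similarity <= lo). Thresholds are illustrative assumptions."""
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        agrees = cosine(target_emb[i], target_emb[j]) >= hi
        differs = cosine(reference_emb[i], reference_emb[j]) <= lo
        if agrees and differs:
            pairs.append((texts[i], texts[j]))
    return pairs


# Toy data: hand-written vectors stand in for real encoder outputs.
texts = [
    "a shelf with a few books",
    "a shelf with many books",
    "a dog on a beach",
]
target_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]    # audited encoder (assumed)
reference_emb = [[1.0, 0.0], [0.5, 0.85], [0.0, 1.0]]  # reference embedding (assumed)

pairs = erroneous_agreements(texts, target_emb, reference_emb)
print(pairs)
```

In a fuller pipeline, the flagged pairs would be passed to a language model that groups them into named categories, which is the step the abstract describes next.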

mass-producing failure, multimodal system, name change, (5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.88)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.60)

arXiv.org Artificial Intelligence, Nov-17-2025

Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

Zhao, Juntu, Zhang, Jialing, Li, Chongxuan, Wang, Dequan

Recent closed-source multimodal systems have made great advances, but their hidden language for understanding the world remains opaque because of their black-box architectures. In this paper, we use the systems' preference bias to study their hidden language: During the process of compressing the input images (typically containing multiple concepts) into texts and then reconstructing them into images, the systems' inherent preference bias introduces specific shifts in the outputs, disrupting the original input concept co-occurrence. We employ the multi-round "telephone game" to strategically leverage this bias. By observing the co-occurrence frequencies of concepts in telephone games, we quantitatively investigate the concept connection strength in the understanding of multimodal systems, i.e., "hidden language." We also contribute Telescope, a dataset of 10,000+ concept pairs, as the database of our telephone game framework. Our telephone game is test-time scalable: By iteratively running telephone games, we can construct a global map of concept connections in multimodal systems' understanding. Here we can identify preference bias inherited from training, assess generalization capability advancement, and discover more stable pathways for fragile concept connections. Furthermore, we use Reasoning-LLMs to uncover unexpected concept relationships that transcend textual and visual similarities, inferring how multimodal systems understand and simulate the world. This study offers a new perspective on the hidden language of multimodal systems and lays the foundation for future research on the interpretability and controllability of multimodal systems.
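As a rough illustration of measuring concept co-occurrence across telephone-game rounds, the sketch below counts how often each concept pair appears together. It assumes each round has already been reduced to a set of detected concepts; the function name `cooccurrence_frequency` and the toy rounds are hypothetical and are not taken from the paper or the Telescope dataset.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_frequency(rounds):
    """Fraction of rounds in which each unordered concept pair co-occurs.

    `rounds` is a list of concept sets, one per image-to-text-to-image round.
    """
    counts = Counter()
    for concepts in rounds:
        # Sort so that each unordered pair maps to a single key.
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    total = len(rounds)
    return {pair: c / total for pair, c in counts.items()}


# Toy game: "lamp" is dropped after the first round, "sofa" after the third.
rounds = [
    {"cat", "sofa", "lamp"},
    {"cat", "sofa"},
    {"cat", "sofa"},
    {"cat"},
]
freq = cooccurrence_frequency(rounds)
print(freq)
```

Under this reading, a pair that survives many rounds (here "cat" and "sofa") would count as a stronger connection than one that disappears early.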

large language model, machine learning, multimodal system, (18 more...)

2511.1069

Country: Asia > China (0.29)

Genre: Research Report (1.00)

Industry: Transportation (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Vision (0.90)

Neural Information Processing Systems, Oct-8-2025, 18:35:48 GMT

Mass-Producing Failures of Multimodal Systems with Language Models

Tong, Shengbang, Jones, Erik

Deployed multimodal systems can fail in ways that evaluators did not anticipate.

Yaqoot, Yasheerah, Mustafa, Muhammad Ahsan, Sautenkov, Oleg, Tsetserukou, Dzmitry

UAV-VLRR: Vision-Language Informed NMPC for Rapid Response in UAV Search and Rescue

arXiv.org Artificial Intelligence, Mar-4-2025

Emergency search and rescue (SAR) operations often require rapid and precise target identification in complex environments where traditional manual drone control is inefficient. This system consists of two aspects: 1) A multimodal system which harnesses the power of a Visual Language Model (VLM) and the natural language processing capabilities of ChatGPT-4o (LLM) for scene interpretation. This work aims to improve response times in emergency SAR operations by giving the operator a more intuitive and natural way to plan the SAR mission while allowing the drone to carry out that mission rapidly and safely. When tested, our approach was on average 33.75% faster than an off-the-shelf autopilot and 54.6% faster than a human pilot. Search and rescue (SAR) operations in disaster-stricken and hazardous environments require fast and efficient situational assessment to locate survivors and critical infrastructure.

arxiv preprint arxiv, multimodal system, target point, (14 more...)

2503.02465

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.05)
Asia > Russia (0.05)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology > Robotics & Automation (0.47)
Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems, Jan-18-2025, 16:09:25 GMT

Mass-Producing Failures of Multimodal Systems with Language Models

Deployed multimodal models can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures: generalizable, natural-language descriptions that describe categories of individual failures. To uncover systematic failures, MultiMon scrapes for examples of erroneous agreement: inputs that produce the same output, but should not. It then prompts a language model to identify common categories and describe them in natural language. We use MultiMon to find 14 systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder, each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many books"). Because CLIP is the backbone for most state-of-the-art multimodal models, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion, and others.

mass-producing failure, multimodal system, systematic failure, (3 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

arXiv.org Artificial Intelligence, Jul-13-2024

Predictive Dynamic Fusion

Cao, Bing, Xia, Yinan, Ding, Yi, Zhang, Changqing, Hu, Qinghua

Multimodal fusion is crucial in joint decision-making systems for rendering holistic judgments. Since multimodal data changes in open environments, dynamic fusion has emerged and achieved remarkable progress in numerous applications. However, most existing dynamic multimodal fusion methods lack theoretical guarantees and easily fall into suboptimal problems, yielding unreliability and instability. To address this issue, we propose a Predictive Dynamic Fusion (PDF) framework for multimodal learning. We proceed to reveal the multimodal fusion from a generalization perspective and theoretically derive the predictable Collaborative Belief (Co-Belief) with Mono- and Holo-Confidence, which provably reduces the upper bound of generalization error. Accordingly, we further propose a relative calibration strategy to calibrate the predicted Co-Belief for potential uncertainty. Extensive experiments on multiple benchmarks confirm our superiority. Our code is available at https://github.com/Yinan-Xia/PDF.

fusion, generalization error, modality, (16 more...)

2406.04802

Country:

Europe > Austria > Vienna (0.14)
Asia > China > Tianjin Province > Tianjin (0.04)
North America > United States > California (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Data Science (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

Elabd, Mazen, Jaf, Sardar

A Simple Attention-Based Mechanism for Bimodal Emotion Classification

arXiv.org Artificial Intelligence, Jun-28-2024

Big data contain rich information for machine learning algorithms to utilize when learning important features during classification tasks. Human beings express their emotions using certain words, speech (tone, pitch, speed) or facial expressions. Artificial Intelligence approaches to emotion classification are largely based on learning from textual information. However, public datasets containing text and speech data provide sufficient resources to train machine learning algorithms for the task of emotion classification. In this paper, we present novel bimodal deep learning-based architectures enhanced with an attention mechanism, trained and tested on text and speech data for emotion classification. We report details of different deep learning-based architectures and show the performance of each architecture, including rigorous error analyses. Our findings suggest that deep learning-based architectures trained on different types of data (text and speech) outperform architectures trained only on text or speech. Our proposed attention-based bimodal architecture outperforms several state-of-the-art systems in emotion classification.

architecture, emotion, emotion classification, (14 more...)

2407.00134

Country:

Europe > United Kingdom > England > Tyne and Wear > Sunderland (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Hadizadeh, Hadi, Yeganli, S. Faegheh, Rashidi, Bahador, Bajić, Ivan V.

Mutual Information Analysis in Multimodal Learning Systems

arXiv.org Artificial Intelligence, May-20-2024

In recent years, there has been a significant increase in applications of multimodal signal processing and analysis, largely driven by the increased availability of multimodal datasets and the rapid progress in multimodal learning systems. Well-known examples include autonomous vehicles, audiovisual generative systems, vision-language systems, and so on. Such systems integrate multiple signal modalities: text, speech, images, video, LiDAR, etc., to perform various tasks. A key issue for understanding such systems is the relationship between various modalities and how it impacts task performance. In this paper, we employ the concept of mutual information (MI) to gain insight into this issue. Taking advantage of the recent progress in entropy modeling and estimation, we develop a system called InfoMeter to estimate MI between modalities in a multimodal learning system. We then apply InfoMeter to analyze a multimodal 3D object detection system over a large-scale dataset for autonomous driving. Our experiments on this system suggest that a lower MI between modalities is beneficial for detection accuracy. This new insight may facilitate improvements in the development of future multimodal learning systems.
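For readers unfamiliar with mutual information, the following sketch computes the standard plug-in estimate I(X;Y) = sum over (x, y) of p(x,y) log2[p(x,y) / (p(x) p(y))] from paired discrete samples. It is not InfoMeter, which the abstract says relies on entropy modeling; the function name `mutual_information` and the toy "modalities" are hypothetical and only illustrate the quantity being estimated.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) written with counts: (c/n) * n*n / (cx * cy)
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi


# Two toy "modalities" that always agree share one full bit of information.
redundant = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
# Two that vary independently share none.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(redundant, independent)
```

The plug-in estimate is only practical for low-dimensional discrete variables; for high-dimensional features such as camera and LiDAR representations, a learned estimator like the one the abstract describes is needed instead.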

camera and lidar modality, dataset, modality, (15 more...)

2405.12456

Country:

North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.14)
North America > United States > California > Santa Clara County > San Jose (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Burnaby (0.04)

Genre: Research Report (0.65)

Industry:

Transportation > Ground > Road (0.35)
Information Technology > Robotics & Automation (0.35)
Automobiles & Trucks (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.87)