
Collaborating Authors: Zhao, Xian


Mind with Eyes: from Language Reasoning to Multimodal Reasoning

arXiv.org Artificial Intelligence

Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state updates within the reasoning process, enabling more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we provide insights into future research directions from two perspectives: (i) from visual-language reasoning to omnimodal reasoning and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advancements in multimodal reasoning research.


An Experimental Study of Semantic Continuity for Deep Learning Models

arXiv.org Artificial Intelligence

Deep learning models can achieve state-of-the-art performance across a wide range of computer vision tasks. From supervised and unsupervised learning to the now-popular self-supervised learning, new training paradigms have progressively improved the efficiency of utilizing training data. However, the existence of issues such as adversarial examples suggests that current training paradigms still do not make sufficient use of datasets. Adversarial images, which appear nearly identical to the original images, can cause significant changes in model output. In this paper, we find that many common non-semantic perturbations can also lead to semantic-level interference in model outputs, as illustrated in Figure 1. This phenomenon indicates that the representations learned by deep learning models are discontinuous in semantic space. Ideally, derived samples carrying the same semantic information should lie in the neighborhood of the original samples, but they are often mapped far from the original samples in the model's output space.
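The discontinuity described above can be illustrated with a toy check (a hypothetical sketch, not the paper's experimental protocol): compare a model's output on a sample and on a semantically equivalent perturbed copy. Here `toy_model` is an invented stand-in with a hard decision threshold, which makes nearby inputs map far apart in output space.

```python
import numpy as np

def toy_model(x):
    # Hypothetical stand-in for a deep network: a hard threshold creates a
    # sharp decision boundary, so two nearly identical inputs on opposite
    # sides of it are mapped to distant outputs.
    score = float(np.sum(x))
    return np.array([1.0, 0.0]) if score > 1.0 else np.array([0.0, 1.0])

def semantic_gap(model, x, perturbation):
    """Output-space distance between a sample and a semantically equivalent variant."""
    return float(np.linalg.norm(model(x) - model(x + perturbation)))

x = np.array([0.5, 0.49])       # sits just below the decision boundary
delta = np.array([0.0, 0.02])   # tiny, non-semantic shift (e.g. brightness-like)
gap = semantic_gap(toy_model, x, delta)
print(gap)  # large gap despite a perturbation that preserves the semantics
```

A semantically continuous model would keep `gap` small for any perturbation that preserves meaning; the large value here is exactly the failure mode the paper attributes to learned representations.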


Benign Adversarial Attack: Tricking Algorithm for Goodness

arXiv.org Artificial Intelligence

In spite of their successful application in many fields, machine learning algorithms today suffer from notorious problems such as vulnerability to adversarial examples. Rather than falling into the cat-and-mouse game between adversarial attack and defense, this paper provides an alternative perspective on adversarial examples and explores whether we can exploit them in benign applications. We first propose a novel taxonomy of visual information along two axes: task-relevance and semantic-orientation. The emergence of adversarial examples is attributed to the algorithm's utilization of task-relevant non-semantic information. Although largely ignored in classical machine learning mechanisms, task-relevant non-semantic information has three interesting characteristics: it is (1) exclusive to algorithms, (2) reflective of common weaknesses, and (3) usable as features. Inspired by this, we present a new idea called benign adversarial attack, which exploits adversarial examples for good in three directions: (1) adversarial Turing tests, (2) rejecting malicious algorithms, and (3) adversarial data augmentation. Each direction is presented with motivation elaboration, justification analysis, and prototype applications to showcase its potential.
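The adversarial examples this abstract builds on are typically generated by perturbing an input in the direction that increases the model's loss. As a minimal sketch (the standard fast gradient sign method on an invented linear classifier, not this paper's specific procedure), the following crafts a perturbation that flips a correct prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear "classifier": weights w, binary label y in {0, 1}.
w = np.array([2.0, -1.0])
x = np.array([1.0, 0.5])   # clean sample, correctly classified as y = 1
y = 1.0

def loss_grad_x(w, x, y):
    # Gradient of the binary cross-entropy loss w.r.t. the input for a
    # linear model: dL/dx = (sigmoid(w . x) - y) * w
    return (sigmoid(np.dot(w, x)) - y) * w

# FGSM-style step: move the input along the sign of the loss gradient.
eps = 0.9
x_adv = x + eps * np.sign(loss_grad_x(w, x, y))

print(sigmoid(np.dot(w, x)))      # clean confidence for class 1 (above 0.5)
print(sigmoid(np.dot(w, x_adv)))  # confidence collapses after the attack
```

In the adversarial-data-augmentation direction, such crafted samples would be fed back into training rather than used to fool a deployed model; the generation step itself is the same.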