Jiang, Deqiang
Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models
Wang, Yubo, Liu, Chaohu, Qu, Yanqiu, Cao, Haoyu, Jiang, Deqiang, Xu, Linli
Large vision-language models (LVLMs) integrate visual information into large language models, showcasing remarkable multi-modal conversational capabilities. However, the visual modules introduce new robustness challenges for LVLMs, as attackers can craft adversarial images that are visually clean but mislead the model into generating incorrect answers. In general, LVLMs rely on vision encoders to transform images into visual tokens, which are crucial for the language models to perceive image contents effectively. We are therefore curious about one question: can LVLMs still generate correct responses when the encoded visual tokens are attacked and the visual information is disrupted? To this end, we propose a non-targeted attack method, referred to as VT-Attack (Visual Tokens Attack), which constructs adversarial examples from multiple perspectives, with the goal of comprehensively disrupting the feature representations, the inherent relationships, and the semantic properties of the visual tokens output by image encoders. Requiring access only to the image encoder, the proposed attack generates adversarial examples that transfer across diverse LVLMs built on the same image encoder and generalize across different tasks. Extensive experiments validate the superior attack performance of VT-Attack over baseline methods, demonstrating its effectiveness in attacking LVLMs with image encoders, which in turn provides guidance on the robustness of LVLMs, particularly with respect to the stability of the visual feature space.
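The abstract describes the attack only at a high level. As a rough, non-authoritative sketch of the feature-disruption idea (just one of the perspectives VT-Attack combines), the following PyTorch snippet runs a PGD-style non-targeted attack that pushes the encoder's visual tokens away from those of the clean image; the `encoder` interface, the cosine-similarity loss, and the hyperparameters are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def feature_disruption_attack(encoder, image, eps=8 / 255, alpha=1 / 255, steps=100):
    """PGD-style non-targeted attack on encoded visual tokens.

    encoder: assumed to map a (B, C, H, W) image batch to (B, N, D) visual tokens.
    Returns an adversarial image whose tokens are maximally dissimilar
    (in cosine similarity) from the clean image's tokens.
    """
    encoder.eval()
    with torch.no_grad():
        clean_tokens = encoder(image)  # reference features to push away from

    delta = torch.empty_like(image).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_tokens = encoder((image + delta).clamp(0, 1))
        # Non-targeted objective: drive per-token cosine similarity down.
        loss = F.cosine_similarity(adv_tokens, clean_tokens, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend on similarity
            delta.clamp_(-eps, eps)             # stay inside the L_inf budget
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```

Because this objective touches only the image encoder's output tokens, any LVLM built on the same encoder inherits the perturbation, which is the intuition behind the transferability reported in the abstract.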
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Fu, Chaoyou, Zhang, Renrui, Wang, Zihan, Huang, Yubo, Zhang, Zhengye, Qiu, Longtian, Ye, Gaoxiang, Shen, Yunhang, Zhang, Mengdan, Chen, Peixian, Zhao, Sirui, Lin, Shaohui, Jiang, Deqiang, Yin, Di, Gao, Peng, Li, Ke, Li, Hongsheng, Sun, Xing
The surge of interest in Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM built from the ground up for multi-modality. In light of its superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, which comprehensively covers four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx, which reveals the gap between open-source efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini showcase different answering styles and preferences, they exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them in domain generalizability. Specifically, GPT-4V tends to elaborate with detailed explanations and intermediate steps, whereas Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates the potential of Gemini to be a strong challenger to GPT-4V. Our early investigation of Gemini also reveals some common issues of MLLMs, indicating that a considerable distance still remains towards artificial general intelligence. Our project for tracking the progress of MLLMs is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
Cao, Haoyu, Bao, Changcun, Liu, Chaohu, Chen, Huang, Yin, Kun, Liu, Hao, Liu, Yinsong, Jiang, Deqiang, Sun, Xing
We propose a novel end-to-end document understanding model called SeRum (SElective Region Understanding Model) for extracting meaningful information from document images, with applications in document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process over the visual tokens of interest, using a content-aware token merge module. This mechanism enables the model to pay more attention to regions of interest generated by the query decoder, improving the model's effectiveness and accelerating the decoding of the generative scheme. We also design several pre-training tasks to enhance the model's understanding and local awareness. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advancement towards efficient and effective end-to-end document understanding.
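The abstract names a content-aware token merge module but gives no internal details. As an illustrative guess at how such a module might behave (not SeRum's actual design), the sketch below keeps the visual tokens with the highest relevance scores and merges the rest into a single summary token, shrinking the sequence the decoder must attend over; the `scores` input and `keep_ratio` parameter are hypothetical.

```python
import torch

def content_aware_token_merge(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring visual tokens and merge the remainder into one
    weighted-average token, so the decoder focuses on regions of interest.

    tokens: (B, N, D) visual tokens; scores: (B, N) per-token relevance.
    Returns (B, k+1, D) where k = int(N * keep_ratio).
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    top_idx = scores.topk(k, dim=1).indices                       # (B, k)
    kept = tokens.gather(1, top_idx.unsqueeze(-1).expand(-1, -1, D))

    # Merge discarded tokens into one summary token instead of dropping them,
    # so coarse context is preserved at negligible cost.
    discard_mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    discard_mask.scatter_(1, top_idx, False)
    w = torch.softmax(scores.masked_fill(~discard_mask, float("-inf")), dim=1)
    merged = (w.unsqueeze(-1) * tokens).sum(dim=1, keepdim=True)  # (B, 1, D)
    return torch.cat([kept, merged], dim=1)                       # (B, k+1, D)
```

Reducing N tokens to k+1 shrinks the decoder's cross-attention cost roughly in proportion, which is consistent with the faster generative decoding the abstract claims.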