Gu, Chaochen
Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs
Zhang, Xiaofeng, Quan, Yihao, Gu, Chaochen, Shen, Chen, Yuan, Xiaosong, Yan, Shaotian, Cheng, Hao, Wu, Kaijie, Ye, Jieping
The hallucination problem in multimodal large language models (MLLMs) remains common. Although image tokens occupy most of the input sequence of MLLMs, little research has explored the relationship between image tokens and hallucinations. In this paper, we analyze the distribution of attention scores for image tokens across each layer and head of the model, revealing an intriguing and common phenomenon: most hallucinations are closely linked to the pattern of attention sinks in the self-attention matrix of image tokens, where shallow layers exhibit dense attention sinks and deeper layers show sparse ones. We further analyze the attention heads of different layers and find that heads with high-density attention sinks in the image part play a positive role in alleviating hallucinations. We therefore propose a training-free method named Enhancing Attention Heads (EAH), designed to strengthen the convergence of image-token attention sinks in the shallow layers. EAH identifies the attention head that shows the vision sink in a shallow layer, extracts its attention matrix, and broadcasts this map to the other heads in the layer, thereby pushing the layer to pay more attention to the image itself. In extensive experiments, EAH shows significant hallucination-mitigating performance on different MLLMs and metrics, demonstrating its effectiveness and generality.
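To make the broadcast step concrete, the following is a minimal sketch in PyTorch, assuming a per-layer attention tensor of shape (num_heads, seq_len, seq_len). The criterion used here to pick the vision-sink head (total attention mass on image-token key columns) is an assumption, and the function name is hypothetical; the paper's exact selection rule may differ.

import torch

def eah_broadcast(attn, img_slice):
    """Minimal EAH-style sketch for one shallow layer.

    attn: attention tensor of shape (num_heads, seq_len, seq_len),
          rows = query positions, columns = key positions.
    img_slice: slice covering the image-token positions in the sequence.
    """
    # Assumed density proxy: attention mass each head places on image-token keys.
    img_mass = attn[:, :, img_slice].sum(dim=(1, 2))    # (num_heads,)
    sink_head = int(img_mass.argmax())                   # densest vision-sink head
    # Broadcast that head's attention map to every head in the layer.
    return attn[sink_head].unsqueeze(0).expand_as(attn).clone()

Being training-free, such a replacement would be applied on the fly to a chosen shallow layer during decoding, with no change to the model weights.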
From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models
Zhang, Xiaofeng, Shen, Chen, Yuan, Xiaosong, Yan, Shaotian, Xie, Liang, Wang, Wenxiao, Gu, Chaochen, Tang, Hao, Ye, Jieping
Multimodal large language models have recently proliferated in endless variety. Most popular Large Vision Language Models (LVLMs) depend on sequential visual representation, where images are converted into hundreds or thousands of tokens before being fed into the Large Language Model (LLM) together with language prompts. This black-box design hinders the interpretability of vision-language models, especially on complex reasoning tasks. To explore the interaction between image and text in complex reasoning, we introduce the information-flow method to visualize the interaction mechanism. Analyzing the dynamics of the information flow, we find that it tends to converge in the shallow layers, and further investigation reveals substantial redundancy among image tokens there. We therefore introduce a truncation strategy that aggregates image tokens within these shallow layers. Experiments across multiple models validate this approach, yielding consistent improvements.
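The abstract does not spell out the truncation strategy, so the sketch below is only one plausible reading, assuming PyTorch hidden states of shape (seq_len, dim) and mean-pooling of adjacent image tokens as the aggregation; the function name and group size are hypothetical.

import torch

def aggregate_image_tokens(hidden, img_start, img_len, group=4):
    """Assumed truncation step: after a shallow layer, pool each group of
    adjacent image tokens into one token, shrinking the redundant
    image segment of the sequence.
    """
    img = hidden[img_start:img_start + img_len]                # (img_len, dim)
    pad = (-img_len) % group                                   # pad to a multiple of group
    if pad:
        img = torch.cat([img, img[-1:].expand(pad, -1)], dim=0)
    pooled = img.view(-1, group, img.size(-1)).mean(dim=1)     # (img_len/group, dim)
    # Reassemble the sequence with the compressed image span.
    return torch.cat([hidden[:img_start], pooled,
                      hidden[img_start + img_len:]], dim=0)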
Enlighten-Your-Voice: When Multimodal Meets Zero-shot Low-light Image Enhancement
Zhang, Xiaofeng, Xu, Zishan, Tang, Hao, Gu, Chaochen, Chen, Wei, Zhu, Shanying, Guan, Xinping
Low-light image enhancement is a crucial visual task, and many unsupervised methods tend to overlook the degradation of visible information in low-light scenes, which adversely affects the fusion of complementary information and hinders the generation of satisfactory results. To address this, our study introduces "Enlighten-Your-Voice", a multimodal enhancement framework that enriches user interaction through voice and textual commands, a shift in user engagement as much as a technical advance. Our model is equipped with a Dual Collaborative Attention Module (DCAM) that caters to distinct content and color discrepancies, enabling nuanced enhancements. Complementarily, we introduce a Semantic Feature Fusion (SFM) plug-and-play module that combines semantic context with the low-light enhancement operations, sharpening the algorithm's efficacy. Crucially, "Enlighten-Your-Voice" shows remarkable generalization in unsupervised zero-shot scenarios. The source code can be accessed from https://github.com/zhangbaijin/Enlighten-Your-Voice
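The abstract does not give the internals of DCAM or SFM. As a hedged illustration only, here is one plausible shape for a plug-and-play semantic fusion block in PyTorch, where semantic features gate the enhancement features channel-wise; the class name and layout are assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class SemanticFeatureFusion(nn.Module):
    """Assumed plug-and-play fusion block in the spirit of SFM:
    semantic features produce a channel-wise gate over the
    enhancement features, and a 1x1 conv mixes the result back in."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.mix = nn.Conv2d(channels, channels, 1)

    def forward(self, enh_feat, sem_feat):
        # enh_feat, sem_feat: (B, C, H, W) enhancement / semantic feature maps
        return enh_feat + self.mix(enh_feat * self.gate(sem_feat))

Being residual and shape-preserving, a block like this could be dropped between existing stages of an enhancement network without retraining the rest, which is what "plug-and-play" suggests.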
Enlighten Anything: When Segment Anything Model Meets Low-Light Image Enhancement
Zhao, Qihan, Zhang, Xiaofeng, Tang, Hao, Gu, Chaochen, Zhu, Shanying
Image restoration is a low-level visual task, and most CNN-based methods are designed as black boxes, lacking transparency and intrinsic aesthetics. Many unsupervised approaches ignore the degradation of visible information in low-light scenes, which seriously affects the aggregation of complementary information and prevents fusion algorithms from producing satisfactory results under extreme conditions. In this paper, we propose Enlighten Anything, which enhances low-light images and fuses them with the semantic intent of SAM segmentation to obtain fused images with good visual perception. This greatly improves the generalization ability of unsupervised learning, and experiments on the LOL dataset show that our method improves PSNR by 3 dB and SSIM by 8 points over the baseline. The zero-shot capability of SAM thus provides a powerful aid for unsupervised low-light enhancement. The source code of Enlighten Anything can be obtained from https://github.com/zhangbaijin/enlighten-anything
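As a rough, assumed illustration of the mask-guided idea (the actual pipeline fuses semantic features rather than raw pixels, and the function name here is hypothetical):

import numpy as np

def fuse_sam_semantics(enhanced, masks, alpha=0.2):
    """Assumed pixel-level stand-in for SAM-guided fusion: collapse the
    SAM masks into one normalized semantic map and blend it with the
    enhanced low-light image.

    enhanced: (H, W, 3) float image in [0, 1] from the enhancement branch.
    masks: list of (H, W) boolean SAM segmentation masks.
    """
    sem = np.zeros(enhanced.shape[:2], dtype=np.float32)
    for i, m in enumerate(masks, start=1):   # label each segmented region
        sem[m] = i
    sem /= max(sem.max(), 1.0)               # normalize labels to [0, 1]
    return (1 - alpha) * enhanced + alpha * sem[..., None]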
An Improved Algorithm of Robot Path Planning in Complex Environment Based on Double DQN
Zhang, Fei, Gu, Chaochen, Yang, Feng
According to our experiments, Deep Q Network (DQN) has several limitations when applied to path planning in environments containing many dilemmas: the reward function can be hard to model, and successful experience transitions are difficult to find in experience replay. In this context, this paper proposes an improved Double DQN (DDQN) that draws on A* and the Rapidly-Exploring Random Tree (RRT) to address these problems. To obtain rich experiences in experience replay, the robot's initial position in each training round is redefined based on an RRT strategy. In addition, the reward for free positions is specially designed, following the definition of position cost in A*, to accelerate the learning process. Simulation results validate the efficiency of the improved DDQN: the robot successfully learns obstacle avoidance and optimal path planning in environments where plain DQN or DDQN fails.
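A minimal sketch of the two ideas under assumed grid-world conventions (Manhattan distance standing in for the A*-style position cost, and resets drawn from cells an exploring tree has reached); all names and constants here are illustrative, not taken from the paper.

import random

def shaped_reward(pos, goal, hit_obstacle, reached_goal):
    """Assumed A*-inspired reward: free positions get a penalty that
    grows with the heuristic cost to the goal, guiding the agent
    before it ever reaches the goal state."""
    if hit_obstacle:
        return -1.0
    if reached_goal:
        return 1.0
    h = abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])   # heuristic position cost
    return -0.01 * h                                     # cheaper cells nearer the goal

def rrt_style_start(free_cells, tree_nodes):
    """Assumed RRT-inspired reset: begin each training round from a cell
    already touched by the exploring tree, so replay memory gathers
    successful transitions more often."""
    return random.choice(tree_nodes) if tree_nodes else random.choice(free_cells)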