DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
Ma, Xinyu, Ding, Ziyang, Luo, Zhicong, Chen, Chi, Guo, Zonghao, Wong, Derek F., Feng, Xiaoyi, Sun, Maosong
–arXiv.org Artificial Intelligence
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. T o bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. T o address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. T o benchmark performance, we introduce KVG-Bench, a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08% accuracy improvements on KVG-Bench and exhibiting +4.60% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, codes, and models are released at https://github.com/thunlp/
arXiv.org Artificial Intelligence
Mar-18-2025
- Country:
- Europe > Switzerland
- North America > United States
- Colorado (0.14)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Aerospace & Defense (0.68)
- Transportation (0.68)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science (1.00)
- Machine Learning > Neural Networks
- Deep Learning (0.47)
- Natural Language > Large Language Model (1.00)
- Representation & Reasoning (1.00)
- Vision (1.00)
- Information Technology > Artificial Intelligence