DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun

arXiv.org Artificial Intelligence 

Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench, a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving a +8.08% accuracy improvement on KVG-Bench and exhibiting +4.60% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, codes, and models are released at https://github.com/thunlp/
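To make the second training stage more concrete, below is a minimal Python sketch of the kind of rule-based reward that R1-style reinforcement learning for a grounding task typically uses: a format bonus for producing explicit reasoning before the answer, plus an accuracy term based on the IoU between the predicted and ground-truth boxes. This is an illustrative assumption about the reward design, not the paper's released implementation; the function names, tag format, and threshold are hypothetical.

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def kvg_reward(response: str, gt_box, iou_threshold: float = 0.5) -> float:
    """Hypothetical reward: format bonus for explicit reasoning + IoU-based grounding accuracy."""
    # Format term: the model should reason inside <think> tags before giving the box.
    format_ok = bool(re.search(r"<think>.*</think>", response, re.DOTALL))
    format_reward = 0.5 if format_ok else 0.0

    # Accuracy term: extract the predicted box and compare it with the ground truth.
    match = re.search(
        r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", response
    )
    if match is None:
        return format_reward
    pred_box = tuple(float(v) for v in match.groups())
    accuracy_reward = 1.0 if iou(pred_box, gt_box) >= iou_threshold else 0.0
    return format_reward + accuracy_reward

# Example: a well-formed response that grounds the target within the IoU threshold.
resp = "<think>The nose shape and winglets suggest the left aircraft.</think> [120, 45, 380, 210]"
print(kvg_reward(resp, gt_box=(118, 40, 385, 215)))  # -> 1.5
```

A reward of this shape rewards the perception-cognition synergy the abstract describes: the model only scores fully when it both articulates its domain reasoning and localizes the correct fine-grained entity.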