Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, Thao Minh Le
–arXiv.org Artificial Intelligence
Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with their corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation pipeline that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks. The code is available at: https://github.com/lqh52/PromViL.
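The bottom-up alignment idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the tree layout, the cosine-similarity grounding, and the `alpha`-weighted mixing of a parent phrase's embedding with its children's grounded region features are all illustrative assumptions, standing in for whatever alignment module PromViL actually uses.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def best_region(vec, regions):
    # Index of the region feature most similar to the phrase embedding.
    scores = [cosine(vec, r) for r in regions]
    return scores.index(max(scores))

def progressive_align(node, regions, alpha=0.5):
    """Ground a nested phrase tree bottom-up.

    Each node is {"phrase": str, "vec": [float], "children": [nodes]}.
    A parent's embedding is mixed with the averaged features of the
    regions its children grounded to, so lower-level alignments inform
    higher-level ones (hypothetical simplification of the paper's idea).
    """
    results = {}
    children = node.get("children", [])
    context = [0.0] * len(regions[0])
    for child in children:
        results.update(progressive_align(child, regions, alpha))
        r = regions[results[child["phrase"]]]
        context = [c + x for c, x in zip(context, r)]
    vec = list(node["vec"])
    if children:
        context = [c / len(children) for c in context]
        vec = [alpha * v + (1 - alpha) * c for v, c in zip(vec, context)]
    results[node["phrase"]] = best_region(vec, regions)
    return results
```

For example, with three orthogonal region features, the sub-phrases "dog" and "red ball" ground to their own regions first, and the composite phrase "dog chasing a red ball" is then scored with that lower-level context folded in.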
Dec-19-2024
- Genre:
- Research Report (0.69)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science > Problem Solving (0.46)
- Machine Learning (1.00)
- Natural Language > Large Language Model (0.31)
- Vision (1.00)