PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

Wahed, Muntasir, Nguyen, Kiet A., Juvekar, Adheesh Sunil, Li, Xinzhuo, Zhou, Xiaona, Shah, Vedant, Yu, Tianjiao, Yanardag, Pinar, Lourentzou, Ismini

Dec-19-2024–arXiv.org Artificial Intelligence

Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.

large language model, machine learning, segmentation

Country:
- North America > United States
  - Virginia (0.04)
  - Illinois > Champaign County
    - Urbana (0.04)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Vision > Image Understanding (0.49)
  - Cognitive Science > Problem Solving (0.48)
  - Natural Language > Large Language Model (0.48)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)
