Large Language Models are Visual Reasoning Coordinators

Jan-20-2025, 00:27:13 GMT–Neural Information Processing Systems

Visual reasoning requires multimodal perception and commonsense cognition of the world. Several recently proposed vision-language models (VLMs) show excellent commonsense reasoning ability across various domains. However, how to harness the collective power of these complementary VLMs has rarely been explored. Existing methods such as ensembling still struggle to aggregate these models with the desired higher-order communication. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning.

coordinate multiple vlm, language model, visual reasoning coordinator

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.63)