Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

Neural Information Processing Systems 

In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintaining low computation cost.