 Large Language Model




Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

Neural Information Processing Systems

In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintaining low computation cost.






1e89c12621c0315373f20f0aeabe5dbe-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

There are two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, preserving stylistic and contextual essence, and 2) an extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom's taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and find that the mimicking strategy can effectively alleviate issues of overestimation from benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results.
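
As a rough illustration of the two updating strategies described above, the following sketch builds the prompts one might send to a generator model. It is not the authors' implementation: the `Sample` class, the function names, the prompt wording, and the choice of the six levels of the revised Bloom's taxonomy are all assumptions made for this example, and no model is actually called.

```python
from dataclasses import dataclass

# Assumption: the six levels of the revised Bloom's taxonomy, ordered from
# lowest to highest cognitive demand. The abstract does not say which
# version or which levels the authors adapt.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


@dataclass
class Sample:
    """A single benchmark item (hypothetical structure)."""
    question: str
    answer: str


def mimic_prompt(sample: Sample) -> str:
    """Build a hypothetical prompt asking for a similar sample that keeps
    the style and context of the original (mimicking strategy)."""
    return (
        "Write a new question with the same style, topic, and difficulty "
        "as the one below, then give its answer.\n"
        f"Question: {sample.question}\nAnswer: {sample.answer}"
    )


def extend_prompts(sample: Sample, levels=BLOOM_LEVELS) -> dict[str, str]:
    """Build one hypothetical prompt per cognitive level, each asking for a
    variant of the sample targeting that level (extending strategy)."""
    return {
        level: (
            f"Rewrite the question below so that it tests the '{level}' "
            "level of Bloom's taxonomy, then give its answer.\n"
            f"Question: {sample.question}\nAnswer: {sample.answer}"
        )
        for level in levels
    }
```

Under these assumptions, a benchmark would be refreshed by sending `mimic_prompt(sample)` to a generator for each original item, and falling back to the prompts from `extend_prompts(sample)` where the mimicked items are not sufficient.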