Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Biao Gong

Neural Information Processing Systems 

This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales.