Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Fan Yin

Neural Information Processing Systems 

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the model's perception capability to understand image inputs and conduct subsequent reasoning over different queries. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings, alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training remains a challenge for the unique visual perception and reasoning capabilities of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach that specifically targets image comprehension.
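To make the self-training idea concrete, the sketch below shows one way an LVLM's own generations on unlabeled images could be collected as preference pairs for later fine-tuning: a response produced under a careful descriptive prompt is treated as preferred, and a response produced under a degraded condition as dis-preferred. The helper `lvlm_generate`, the prompt strings, and the pairing scheme are illustrative assumptions for this sketch, not the paper's exact procedure.

```python
from dataclasses import dataclass
from typing import List

def lvlm_generate(image_path: str, prompt: str) -> str:
    # Hypothetical stand-in for an LVLM inference call; plug in a real
    # model's generation function here.
    raise NotImplementedError("replace with your LVLM's generate() call")

@dataclass
class PreferencePair:
    image: str
    prompt: str
    preferred: str      # the model's own response to a careful descriptive prompt
    dispreferred: str   # the model's own response under a degraded condition

# Illustrative prompts (assumptions, not taken from the paper).
DESCRIBE_PROMPT = "Describe the image in detail, step by step."
DEGRADED_PROMPT = "Briefly describe the image, guessing freely about unseen details."

def build_self_training_data(image_paths: List[str]) -> List[PreferencePair]:
    """Collect the model's own generations on unlabeled images as preference pairs."""
    pairs = []
    for path in image_paths:
        preferred = lvlm_generate(path, DESCRIBE_PROMPT)
        dispreferred = lvlm_generate(path, DEGRADED_PROMPT)
        pairs.append(PreferencePair(path, DESCRIBE_PROMPT, preferred, dispreferred))
    return pairs
```

Such pairs could then feed a standard preference-based fine-tuning stage (e.g., DPO-style training), which is the general pattern self-training methods of this kind follow.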
