Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Fan Yin
Neural Information Processing Systems
Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the model's perception capability to understand image inputs and conduct subsequent reasoning for different queries. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings, alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training remains challenging for the unique visual perception and reasoning capabilities of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach designed specifically for image comprehension.
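Although the abstract does not detail the STIC pipeline, the generic self-training idea it describes can be sketched as follows. This is a minimal illustration, assuming the loop consists of (1) having the LVLM generate its own responses to an image-comprehension prompt over unlabeled images and (2) fine-tuning on those self-generated outputs; `model_generate`, `fine_tune`, and the prompt text are hypothetical placeholders, not the paper's actual method or API.

```python
# Minimal sketch of self-training on image comprehension: the model
# labels its own data, which is then used for fine-tuning.
# All names here are hypothetical placeholders, not the authors' pipeline.

from typing import Callable, List, Tuple

def self_train_on_image_comprehension(
    model_generate: Callable[[str, str], str],        # (image_path, prompt) -> text
    fine_tune: Callable[[List[Tuple[str, str, str]]], None],
    unlabeled_images: List[str],
) -> None:
    """Self-construct an image-description dataset, then fine-tune on it."""
    prompt = "Describe the image in detail."
    dataset: List[Tuple[str, str, str]] = []
    for image_path in unlabeled_images:
        # The model generates its own supervision: no human annotation needed.
        description = model_generate(image_path, prompt)
        dataset.append((image_path, prompt, description))
    # Update the LVLM on its self-generated image-comprehension data.
    fine_tune(dataset)
```

In practice such loops often filter or rank the self-generated outputs before fine-tuning (e.g., preferring well-prompted responses over degraded ones); the abstract does not specify which variant STIC uses.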