Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
Cui, Chenhang, Zhang, An, Zhou, Yiyang, Chen, Zhaorun, Deng, Gelei, Yao, Huaxiu, Chua, Tat-Seng
arXiv.org Artificial Intelligence
The recent advancements in large language models (LLMs) and pre-trained vision models have accelerated the development of vision-language large models (VLLMs), enhancing the interaction between visual and linguistic modalities. Despite their notable success across various domains, VLLMs face challenges in modality alignment, which can lead to issues like hallucinations and unsafe content generation. Current alignment techniques often rely on coarse feedback and external datasets, limiting scalability and performance. In this paper, we propose FiSAO (Fine-Grained Self-Alignment Optimization), a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment without the need for additional data. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference-tuning methods that require additional data. Through both theoretical analysis and experimental validation, we demonstrate that FiSAO effectively addresses the misalignment problem in VLLMs, marking the first instance of token-level rewards being applied to such models.

The advent of large language models (LLMs) (Brown et al., 2020; Touvron et al., 2023; Yang et al., 2024) and pre-trained vision models (Radford et al., 2021a; Liu et al., 2023c) has propelled vision-language large models (VLLMs) by advancing connections between visual and linguistic modalities through linear projection (Li et al., 2023b) or Q-Former (Dai et al., 2023b). These VLLMs have demonstrated notable capabilities across diverse domains such as medical applications (Liu et al., 2023b), autonomous driving (Zhou et al., 2023a), and embodied intelligence (Peng et al., 2023). However, challenges remain in precisely aligning the vision and language modalities for integrated inference due to their independent pre-training (Jang et al., 2023; Liu et al., 2024a). This pre-training process often results in incompatible modality-specific representations, hindering the formation of a coherent, aligned representation space during joint training (Jang et al., 2023). Misalignment between modalities can lead to safety risks such as biased or inappropriate content generation (Gong et al., 2023; Tu et al., 2023) and hallucinations, where outputs are not grounded in the visual input (Wang et al., 2023). These risks are particularly concerning in tasks like visual question answering (Cui et al., 2023; Fan et al., 2024), OCR (Shi et al., 2023), and image captioning (Gunjal et al., 2024), where precise alignment is critical. To address these misalignment issues, recent works have explored strategies such as instruction tuning (Liu et al., 2023a; Chen et al., 2024b), preference tuning (Yu et al., 2023a), and post-processing methods (Zhou et al., 2023b; Yin et al., 2023).
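As a rough illustration of the core idea of using the model's own vision encoder as a fine-grained verifier, the sketch below scores each generated token by its similarity to the image's patch features; the module names (`vision_encoder`, `projector`) and the max-over-patches scoring rule are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Minimal sketch of token-level rewards from a frozen vision encoder,
# in the spirit of FiSAO. All module names and the scoring rule are
# illustrative assumptions, not the authors' actual implementation.
import torch
import torch.nn.functional as F


def token_level_rewards(image, generated_token_embeds, vision_encoder, projector):
    """Score each generated token by its agreement with the vision encoder.

    image:                  (3, H, W) input image tensor
    generated_token_embeds: (T, d_text) embeddings of the generated tokens
    vision_encoder:         frozen pre-trained vision encoder (e.g., a CLIP ViT)
    projector:              maps text embeddings into the vision feature space
    """
    with torch.no_grad():
        # (P, d_vis): patch-level features from the model's own vision encoder
        patch_feats = vision_encoder(image.unsqueeze(0)).squeeze(0)
        patch_feats = F.normalize(patch_feats, dim=-1)

    # Project token embeddings into the shared space and normalize.
    token_feats = F.normalize(projector(generated_token_embeds), dim=-1)

    # Each token's reward is its best cosine similarity against any image
    # patch, so tokens grounded in the image score higher than ungrounded
    # (hallucinated) ones; these per-token scores serve as fine-grained
    # feedback for preference optimization.
    sim = token_feats @ patch_feats.T  # (T, P)
    return sim.max(dim=-1).values      # (T,)
```

Such per-token scores could then be aggregated or thresholded to build preference signals without any external annotation, which is the sense in which the method is self-aligning.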
Nov-18-2024