GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Li, Junjie, Ma, Jianghong, Zhang, Xiaofeng, Li, Yuhang, Shi, Jianyang
arXiv.org Artificial Intelligence
Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-art performance.
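The abstract names the three loss terms but does not define them. As background only, the sketch below shows a generic symmetric image-text contrastive loss (CLIP-style InfoNCE), the family that a loss called "Image-Text Contrast" would plausibly belong to. It is not the authors' OITC formulation. The function names, the temperature value of 0.07, and the assumption of one matching text per image are all illustrative choices that do not come from the paper.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    # Mean cross-entropy where row i's correct class is column i.
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    Hypothetical helper for illustration: image_embs[i] is assumed to
    match text_embs[i]; every other pairing in the batch is a negative.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Matched pairs should score a lower loss than deliberately swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

An object-focused variant would presumably condition the image embedding on a target object before computing the similarities, but how GiVE does this is specified in the paper, not here.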
Oct-26-2024
- Country:
- North America > Canada
- Europe
- United Kingdom > Scotland
- City of Glasgow > Glasgow (0.04)
- Switzerland > Zürich
- Zürich (0.04)
- France > Île-de-France
- Asia > China
- Heilongjiang Province > Harbin (0.04)
- Guangdong Province > Shenzhen (0.04)
- Genre:
- Research Report (1.00)
- Instructional Material > Course Syllabus & Notes (0.34)