GiVE: Guiding Visual Encoder to Perceive Overlooked Information

Li, Junjie; Ma, Jianghong; Zhang, Xiaofeng; Li, Yuhang; Shi, Jianyang

Oct-26-2024, arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) have advanced AI in applications such as text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with two modules: an Attention-Guided Adapter (AG-Adapter) and an Object-focused Visual Semantic Learning module. These modules incorporate three novel loss terms: the Object-focused Image-Text Contrast (OITC) loss, the Object-focused Image-Image Contrast (OIIC) loss, and the Object-focused Image Discrimination (OID) loss, which improve object consideration, retrieval accuracy, and comprehensiveness. Our contributions are dynamic adjustment of visual focus, novel loss functions that enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show that our approach achieves state-of-the-art performance.
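The abstract names an image-text contrast loss but does not give its formula. As a rough illustration of the general family such a loss belongs to, the sketch below implements a generic symmetric image-text contrastive (InfoNCE-style) loss in plain Python. It is not the paper's OITC loss: the function name `contrastive_loss`, the `temperature` value, and the assumption that row k of each input is a matched image-text pair are all hypothetical choices made for this example.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Generic symmetric image-text contrastive loss (illustrative only).
    # Assumption: image_embs[k] and text_embs[k] form a matched pair.
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarity between every image and text.
    sim = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    # Image-to-text: each image should pick out its own text.
    i2t = sum(_cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image: each text should pick out its own image.
    t2i = sum(
        _cross_entropy_row([sim[r][k] for r in range(n)], k)
        for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, `aligned` comes out far smaller than `swapped`, since the loss is low only when each image embedding is closest to its own paired text embedding. An object-focused variant would presumably restrict or condition the embeddings on particular objects, but how GiVE does so is not stated in the abstract.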

large language model, machine learning, natural language

Country:
- North America > Canada
  - British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - United Kingdom > Scotland
    - City of Glasgow > Glasgow (0.04)
  - Switzerland > Zürich
    - Zürich (0.04)
  - France > Île-de-France
    - Paris > Paris (0.04)
- Asia > China
  - Heilongjiang Province > Harbin (0.04)
  - Guangdong Province > Shenzhen (0.04)

Genre:
- Research Report (1.00)
- Instructional Material > Course Syllabus & Notes (0.34)

Industry:
- Automobiles & Trucks > Manufacturer (0.46)
- Transportation
  - Passenger (0.67)
  - Ground > Road (0.67)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language > Large Language Model (1.00)
    - Machine Learning > Neural Networks (0.93)