Grounding Language Models for Visual Entity Recognition

Xiao, Zilin, Gong, Ming, Cascante-Bonilla, Paola, Zhang, Xingyao, Wu, Jie, Ordonez, Vicente

Feb-28-2024–arXiv.org Artificial Intelligence

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model with retrieval-augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling at queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label space by training contrastively on hard negative pairs in parallel with a sequence-to-sequence objective, without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits of the recently proposed Oven-Wiki benchmark: accuracy on the Entity seen split rises from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a double-digit margin.
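The idea of retrieved candidates "removing invalid decoding paths" can be illustrated with a prefix trie over candidate entity names. The sketch below is not the AutoVER implementation. It assumes word-level tokens (a real model would use subword token IDs), a hypothetical `toy_scores` function standing in for the language model's next-token scores, greedy decoding, and a made-up candidate list.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(candidates: Sequence[Sequence[str]]) -> dict:
    """Build a prefix trie over tokenized candidate names.

    A child keyed by END marks the point where a full candidate ends.
    """
    root: dict = {}
    for tokens in candidates:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def constrained_greedy_decode(
    score_fn: Callable[[List[str]], Dict[str, float]],
    trie: dict,
) -> List[str]:
    """Greedy decoding where only tokens that continue a candidate are allowed."""
    prefix: List[str] = []
    node = trie
    while True:
        allowed = list(node.keys())
        if not allowed:  # empty candidate list: nothing can be generated
            return prefix
        scores = score_fn(prefix)
        # Tokens outside `allowed` are never considered, which is the
        # "removal of invalid decoding paths" in this toy setting.
        best = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


# Made-up retrieved candidates (assumption, for illustration only).
CANDIDATES = [["Boeing", "747"], ["Boeing", "777"], ["Airbus", "A380"]]


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for model scores; deliberately prefers an invalid token."""
    return {"Cessna": 5.0, "Boeing": 2.0, "Airbus": 1.0,
            "777": 3.0, "747": 1.5, "A380": 0.5, END: 0.0}


result = constrained_greedy_decode(toy_scores, build_trie(CANDIDATES))
print(" ".join(result))
```

Although the toy scorer ranks "Cessna" highest, that token starts no retrieved candidate, so the decoder can only emit one of the three listed names.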

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Information Retrieval (0.87)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)
