Grounding Language Models for Visual Entity Recognition
Xiao, Zilin, Gong, Ming, Cascante-Bonilla, Paola, Zhang, Xingyao, Wu, Jie, Ordonez, Vicente
arXiv.org Artificial Intelligence
We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model with retrieval-augmented constrained generation. This mitigates low performance on out-of-domain entities while the model still excels on queries that require visually-situated reasoning. AutoVER learns to distinguish similar entities within a vast label space by training contrastively on hard negative pairs in parallel with a sequence-to-sequence objective, without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits of the recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises from 32.7% to 61.5%, and the method also improves on the unseen and query splits by substantial double-digit margins.
Feb-28-2024
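The abstract's idea of candidates "removing invalid decoding paths" can be illustrated with a small prefix-trie sketch. This is not the authors' implementation: a real system would mask logits over the model's token IDs, whereas this toy version uses string tokens and a hand-written score table. The names `build_trie`, `constrained_greedy_decode`, `toy_scores`, and the `<eos>` marker are all hypothetical, introduced only for illustration.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(candidates: Sequence[Sequence[str]]) -> dict:
    """Build a prefix trie over tokenized candidate entity names."""
    root: dict = {}
    for tokens in candidates:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    score_fn: Callable[[List[str]], Dict[str, float]],
    trie: dict,
    max_len: int = 16,
) -> List[str]:
    """Greedy decoding restricted to paths that exist in the trie."""
    prefix: List[str] = []
    node = trie
    for _ in range(max_len):
        allowed = list(node.keys())  # only continuations of some candidate
        if not allowed:
            break
        scores = score_fn(prefix)
        # Tokens outside `allowed` are never considered, however high they score.
        best = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        if best == END:
            break
        prefix.append(best)
        node = node[best]
    return prefix


# Toy stand-in for a language model: maps a prefix to next-token scores.
def toy_scores(prefix: List[str]) -> Dict[str, float]:
    table = {
        (): {"poodle": 0.9, "golden": 0.6, "labrador": 0.3},
        ("golden",): {"dog": 0.8, "retriever": 0.7, "gate": 0.2},
        ("golden", "retriever"): {"puppy": 0.9, END: 0.5},
    }
    return table.get(tuple(prefix), {})


# Assumed retrieved candidates (made up for this example).
candidates = [
    ["golden", "retriever"],
    ["golden", "gate", "bridge"],
    ["labrador", "retriever"],
]
result = constrained_greedy_decode(toy_scores, build_trie(candidates))
print(result)  # ['golden', 'retriever']
```

In the toy table the unconstrained favourites ("poodle", "dog", "puppy") score highest at every step, yet the decoder can only emit a name that appears in the candidate list, which is the behaviour the abstract describes.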