LocateBench: Evaluating the Locating Ability of Vision Language Models
Ting-Rui Chiang, Joshua Robinson, Xinyan Velocity Yu, Dani Yogatama
arXiv.org Artificial Intelligence
The ability to locate an object in an image according to natural language instructions is crucial for many real-world applications. In this work we propose LocateBench, a high-quality benchmark dedicated to evaluating this ability. We experiment with multiple prompting approaches, and measure the accuracy of several large vision language models. We find that even the accuracy of the strongest model, GPT-4o, lags behind human accuracy by more than 10%.
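The abstract describes measuring model accuracy on a locate-the-object task. As a minimal sketch of how such an accuracy metric can be computed (the `Example` fields, `evaluate_accuracy` helper, and toy data below are illustrative assumptions, not the paper's actual benchmark format):

```python
# Hypothetical sketch of the accuracy metric described in the abstract:
# a model picks one candidate object per instruction, and accuracy is the
# fraction of examples where the pick matches the annotated answer.
# All names here (Example, evaluate_accuracy) are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Example:
    image_id: str          # identifier of the image
    instruction: str       # natural-language description of the target object
    choices: List[str]     # candidate objects/regions in the image
    answer_index: int      # index of the correct candidate

def evaluate_accuracy(model: Callable[[Example], int],
                      dataset: List[Example]) -> float:
    """Fraction of examples where the model selects the correct candidate."""
    if not dataset:
        return 0.0
    correct = sum(1 for ex in dataset if model(ex) == ex.answer_index)
    return correct / len(dataset)

# Toy usage with a trivial baseline that always picks the first candidate.
dataset = [
    Example("img1", "the red mug on the left", ["box A", "box B"], 0),
    Example("img2", "the dog near the door", ["box A", "box B"], 1),
]
print(evaluate_accuracy(lambda ex: 0, dataset))  # 0.5
```

A real evaluation would replace the lambda with a call to a vision-language model under one of the prompting approaches the paper compares.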
October 17, 2024