BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, Cucchiara, Rita
arXiv.org Artificial Intelligence
Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard, as they either do not take the corresponding image into account or lack the capability to encode fine-grained details and penalize hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multimodal pseudo-captions built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
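The abstract describes a pipeline in which visual features are mapped into dense vectors and combined with text-token embeddings into a multimodal "pseudo-caption" that the candidate caption is scored against, without reference captions. The sketch below is a hedged, toy illustration of that idea only; the mapping network, the pooling, and the cosine-similarity scoring are all simplifying assumptions, not the paper's actual architecture, and the random arrays stand in for real CLIP-style embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def map_visual_features(patch_feats: np.ndarray, w: np.ndarray) -> np.ndarray:
    # Hypothetical mapping module: project image patch features into
    # dense "pseudo-token" vectors living in the text embedding space.
    return patch_feats @ w

def build_pseudo_caption(text_embs: np.ndarray, visual_tokens: np.ndarray) -> np.ndarray:
    # Toy multimodal pseudo-caption: mapped visual tokens interleaved
    # (here simply prepended) with text-token embeddings.
    return np.concatenate([visual_tokens, text_embs], axis=0)

def bridge_like_score(caption_embs: np.ndarray,
                      patch_feats: np.ndarray,
                      w: np.ndarray) -> float:
    # Reference-free score (an assumption, not the paper's exact formula):
    # cosine similarity between the mean-pooled candidate caption and the
    # mean-pooled multimodal pseudo-caption.
    pseudo = build_pseudo_caption(caption_embs, map_visual_features(patch_feats, w))
    a = caption_embs.mean(axis=0)
    b = pseudo.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Stand-in data: 5 caption tokens (16-dim) and 9 image patches (32-dim).
caption = rng.normal(size=(5, 16))
patches = rng.normal(size=(9, 32))
w = rng.normal(size=(32, 16)) * 0.1  # hypothetical learned projection
s = bridge_like_score(caption, patches, w)
print(f"score = {s:.3f}")
```

Because the score is a cosine similarity, it is bounded in [-1, 1]; a real implementation would learn the mapping module and use trained vision and text encoders rather than random stand-ins.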
Jul-29-2024