GroundCap: A Visually Grounded Image Captioning Dataset
Oliveira, Daniel A. P., Teodoro, Lourenço, de Matos, David Martins
–arXiv.org Artificial Intelligence
Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to their corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.

Introduction

One of the primary goals of research combining computer vision and natural language processing is to enable machines to understand and communicate about visual scenes. This objective encompasses numerous tasks, including recognizing objects, describing their attributes and relationships, and providing contextually relevant descriptions of scenes [1]. While significant progress has been made in image classification, object detection, and image captioning, a critical aspect of human visual communication remains under-explored: the ability to ground language to specific elements within an image. Consider a scenario where two people are discussing a crowded street scene. One might say, "Look at that car," to which the other might respond, "Which one?" The first person would likely point to the specific car they are referring to while simultaneously describing it in more detail.
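The abstract describes captions in which tagged spans carry persistent object IDs and actions are explicitly linked to the objects performing them. The sketch below illustrates how such a scheme could be parsed and validated; the tag names (`<o>`, `<a>`) and attributes (`id`, `targets`) are hypothetical placeholders for illustration, not the dataset's actual markup.

```python
import re

# Assumed illustrative format (not GroundCap's real syntax):
#   <o id="person-1">the man</o>                      -- grounded object mention
#   <a id="walk-1" targets="person-1">walks</a>       -- action linked to objects
TAG_RE = re.compile(
    r'<(?P<kind>[oa]) id="(?P<id>[^"]+)"(?: targets="(?P<targets>[^"]+)")?>'
    r'(?P<text>[^<]*)</(?P=kind)>'
)

def parse_grounded_caption(caption):
    """Extract object/action groundings and check that action links resolve."""
    objects, actions = {}, []
    for m in TAG_RE.finditer(caption):
        if m.group("kind") == "o":
            # Persistent IDs: repeated mentions of one object reuse the same ID,
            # so "A man" and "he" below both accumulate under "person-1".
            objects.setdefault(m.group("id"), []).append(m.group("text"))
        else:
            actions.append({
                "id": m.group("id"),
                "text": m.group("text"),
                "targets": (m.group("targets") or "").split(),
            })
    # Action-object linking: every action must reference an object ID
    # that actually appears in the caption.
    for act in actions:
        for t in act["targets"]:
            assert t in objects, f"action {act['id']} references unknown object {t}"
    return objects, actions

caption = ('<o id="person-1">A man</o> <a id="walk-1" targets="person-1">walks</a> '
           'toward <o id="car-1">a red car</o>; <o id="person-1">he</o> opens the door.')
objs, acts = parse_grounded_caption(caption)
```

Keeping the ID space shared between object tags and action `targets` is what makes the captions verifiable: a checker can reject any caption whose action links point at objects never grounded in the image.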
Feb-19-2025