Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation

Cornia, Marcella, Baraldi, Lorenzo, Fiameni, Giuseppe, Cucchiara, Rita

Nov-24-2021, arXiv.org Artificial Intelligence

While captioning models have obtained compelling results in describing natural images, they still do not cover the entire long-tail distribution of real-world concepts. In this paper, we address the task of generating human-like descriptions with in-the-wild concepts by training on web-scale, automatically collected datasets. To this end, we propose a model that can exploit noisy image-caption pairs while maintaining the descriptive style of traditional human-annotated datasets like COCO. Our model separates content from style through the use of keywords and stylistic tokens; it employs a single prompt language modeling objective and is simpler than other recent proposals. Experimentally, our model consistently outperforms existing methods in terms of caption quality and the ability to describe long-tail concepts, including in zero-shot settings. According to the CIDEr metric, we obtain a new state of the art on both COCO and nocaps when using external data.
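As a rough illustration of the content-style separation described above, the sketch below builds a training sequence from a stylistic token, a list of keywords, and a caption. The token names (`<style:human>`, `<style:web>`, `<sep>`), the `build_prompt` helper, the whitespace tokenization, and the choice to mask the loss on the prompt are all assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical special tokens; the abstract does not name the actual ones.
STYLE_TOKENS = {"human": "<style:human>", "web": "<style:web>"}
SEP = "<sep>"


def build_prompt(style: str, keywords: List[str], caption: str) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask) for one training example.

    The prompt (style token + keywords + separator) conditions the model.
    loss_mask is 1 only on caption tokens, so a language-modeling loss
    weighted by it would be computed on the caption alone (an assumption).
    """
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style: {style!r}")
    prompt = [STYLE_TOKENS[style], *keywords, SEP]
    target = caption.split()  # whitespace tokenization, for illustration only
    tokens = prompt + target
    loss_mask = [0] * len(prompt) + [1] * len(target)
    return tokens, loss_mask


tokens, mask = build_prompt("human", ["dog", "frisbee"], "a dog catches a frisbee")
```

Under these assumptions, a noisy web caption and a COCO-style caption for the same image would share the keyword part of the prompt (content) and differ only in the stylistic token, which is what would let the style be selected at generation time.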

captioner, universal captioner, vinvl

Country:
- Pacific Ocean > North Pacific Ocean
  - San Francisco Bay > Golden Gate (0.04)
- North America > United States
  - New York (0.04)
  - California > San Francisco County
    - San Francisco (0.04)
- Europe
  - Italy (0.04)
  - Poland (0.04)
- Asia > Middle East
  - UAE > Dubai Emirate > Dubai (0.04)

Genre:
- Research Report (0.50)

Industry:
- Leisure & Entertainment > Sports (1.00)
- Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (1.00)
- Media (0.93)
- Automobiles & Trucks > Manufacturer (0.92)
- Transportation
  - Passenger (1.00)
  - Ground > Road (1.00)
  - Air (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)