Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation
Cornia, Marcella, Baraldi, Lorenzo, Fiameni, Giuseppe, Cucchiara, Rita
arXiv.org Artificial Intelligence
While captioning models have obtained compelling results in describing natural images, they still do not cover the entire long-tail distribution of real-world concepts. In this paper, we address the task of generating human-like descriptions of in-the-wild concepts by training on web-scale, automatically collected datasets. To this end, we propose a model that can exploit noisy image-caption pairs while maintaining the descriptive style of traditional human-annotated datasets like COCO. Our model separates content from style by means of keywords and stylistic tokens. It is trained with a single prompt language modeling objective, which makes it simpler than other recent proposals. Experimentally, our model consistently outperforms existing methods in terms of caption quality and the capability of describing long-tail concepts, including in zero-shot settings. According to the CIDEr metric, we obtain a new state of the art on both COCO and nocaps when using external data.
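The content-style separation described in the abstract can be illustrated with a minimal sketch. It assumes that a style token and a list of content keywords are placed in front of the caption as a prompt, and that the language modeling loss is computed only on the caption tokens. The token names (`<style:human>`, `<style:web>`, `<sep>`, `<eos>`), the helper functions, and the whitespace tokenizer are all hypothetical and chosen for illustration; the abstract does not specify the actual prompt format or loss masking used by the authors.

```python
from typing import List, Tuple

# Hypothetical special tokens; the real vocabulary is not given in the abstract.
STYLE_TOKENS = {"human": "<style:human>", "web": "<style:web>"}
SEP = "<sep>"
EOS = "<eos>"


def build_prompt(keywords: List[str], style: str) -> List[str]:
    """Assemble the prompt: one style token, then content keywords, then a separator."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style: {style!r}")
    return [STYLE_TOKENS[style], *keywords, SEP]


def build_training_example(
    keywords: List[str], style: str, caption: str
) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask) for one image-caption pair.

    The mask is 0 on prompt positions and 1 on caption positions, so a
    language modeling loss weighted by it only scores the caption.
    This masking choice is an assumption of the sketch.
    """
    prompt = build_prompt(keywords, style)
    target = caption.lower().split() + [EOS]  # toy whitespace tokenizer
    tokens = prompt + target
    loss_mask = [0] * len(prompt) + [1] * len(target)
    return tokens, loss_mask


# Training: noisy web pairs get the "web" style, curated pairs the "human" style.
tokens, mask = build_training_example(
    ["dog", "frisbee"], "human", "A dog catches a frisbee"
)

# Inference: requesting the "human" style for any image aims at
# human-annotated-style captions, whatever the source of the keywords.
inference_prompt = build_prompt(["okapi", "zoo"], "human")
```

Under these assumptions, the same keywords can be paired with either style token, which is what lets style be chosen independently of content at generation time.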
Nov-24-2021
- Country:
  - Pacific Ocean > North Pacific Ocean
    - San Francisco Bay > Golden Gate (0.04)
  - North America > United States
    - New York (0.04)
    - California > San Francisco County
      - San Francisco (0.04)
  - Europe
  - Asia > Middle East
    - UAE > Dubai Emirate > Dubai (0.04)
- Genre:
  - Research Report (0.50)
- Industry:
  - Leisure & Entertainment > Sports (1.00)
  - Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (1.00)
  - Media (0.93)
  - Automobiles & Trucks > Manufacturer (0.92)
  - Transportation
- Technology: