FlexCap: Describe Anything in Images in Controllable Detail
Neural Information Processing Systems
We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input bounding boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications. First, it achieves strong performance on the dense captioning task of the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments further illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog.
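To make the length-conditioning interface concrete, below is a minimal sketch of how a FlexCap-style model might be invoked: the same bounding box is captioned at several target lengths, trading a terse object label for a richer description. The `FlexCapStyleModel` class, its `caption` method, and all parameter names here are hypothetical illustrations of the idea, not the authors' released API.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical interface: all names below are illustrative, not the paper's API.
@dataclass
class RegionQuery:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates
    target_length: int                      # desired caption length in tokens

class FlexCapStyleModel:
    """Sketch of a length-conditioned region captioner.

    Per the paper, FlexCap conditions generation on the box coordinates and a
    target length; here we only stub the control flow of such an interface.
    """

    def caption(self, image, query: RegionQuery) -> str:
        # A real model would encode `image`, embed `query.box`, prepend a
        # length-conditioning token for `query.target_length`, and decode text.
        raise NotImplementedError("stub: plug in a trained model here")

# Usage: caption the same region at increasing levels of detail.
# model = FlexCapStyleModel()
# for n in (1, 4, 12):
#     print(model.caption(img, RegionQuery(box=(40, 60, 220, 300), target_length=n)))
```

Sweeping `target_length` as in the commented usage is one way such a model could move from a concise label (e.g., one token) to a detailed caption of the same region.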