RONA: Pragmatically Diverse Image Captioning with Coherence Relations
Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
arXiv.org Artificial Intelligence
Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance pragmatic diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. To address this challenge, we propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLMs) that leverages Coherence Relations as an axis for variation. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment than MLLM baselines across multiple domains. Our code is available at: https://github.com/aashish2000/RONA
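The idea of using coherence relations as an axis for variation can be illustrated with a minimal sketch that builds one prompt per relation for the same image and central message. This is not the RONA implementation: the relation labels and glosses, the `central_message` parameter, the prompt wording, and the names `CaptionPrompt` and `build_prompts` are all assumptions made for illustration, since the abstract does not enumerate the relations or the prompt template. Sending the prompts to an MLLM together with the image is left out.

```python
from dataclasses import dataclass

# Hypothetical relation labels and glosses, for illustration only.
# The abstract does not list the coherence relations RONA actually uses.
COHERENCE_RELATIONS = {
    "restatement": "restate what is directly visible in the image",
    "extension": "add information that goes beyond what the image shows",
}


@dataclass
class CaptionPrompt:
    relation: str  # coherence relation this prompt asks for
    text: str      # prompt text to pair with the image


def build_prompts(central_message, relations=COHERENCE_RELATIONS):
    """Return one prompt per coherence relation for the same central message."""
    prompts = []
    for name, gloss in relations.items():
        # Assumed wording; the real prompt template may differ.
        text = (
            "Write a caption for the attached image that conveys this "
            f"central message: '{central_message}'. "
            f"Relate the caption to the image through {name}: {gloss}."
        )
        prompts.append(CaptionPrompt(relation=name, text=text))
    return prompts


if __name__ == "__main__":
    for prompt in build_prompts("A flooded street after the storm"):
        print(f"[{prompt.relation}] {prompt.text}")
```

Each prompt keeps the message fixed and changes only the requested relation between caption and image, so the resulting captions would differ pragmatically and not only in syntax or word choice.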
Mar-13-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- Europe
- Germany > North Rhine-Westphalia
- Upper Bavaria > Munich (0.04)
- Monaco (0.04)
- United Kingdom > England
- Greater London > London (0.04)
- North America
- Dominican Republic (0.04)
- United States
- California > San Mateo County
- Menlo Park (0.04)
- New York > New York County
- New York City (0.04)
- Pennsylvania (0.04)
- Genre:
- Research Report (0.64)
- Technology: