A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
