Towards Multimodal Vision-Language Models Generating Non-Generic Text

Open in new window