Grounding Language Models to Images for Multimodal Inputs and Outputs
