Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
