Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions