One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
