One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks