On the Benefits of Early Fusion in Multimodal Representation Learning