CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?