Improving fine-grained understanding in image-text pre-training