Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness
