Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
