Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization