NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization
Xiyuan Wei, Chih-Jen Lin, Tianbao Yang
–arXiv.org Artificial Intelligence
Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge in training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators that are updated once per epoch in a blockwise-coordinate manner to track the evolving encoders; because each estimate is refreshed so infrequently, it lags behind the continually updated encoders. To overcome this limitation, we propose NeuCLIP, a novel optimization framework based on two key ideas: (i) reformulating the contrastive loss for each sample, via convex analysis, as a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) transforming the resulting minimization over n auxiliary variables (where n is the dataset size), via variational analysis, into a minimization over a compact neural network that predicts the log-normalizers. We design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network. By employing a tailored architecture and acceleration techniques for the auxiliary network, NeuCLIP achieves more accurate normalizer estimation than previous methods. Extensive experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms prior approaches.

Since its introduction, Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as the de facto standard for vision-language representation learning.
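The convex reformulation in idea (i) can be illustrated with the standard variational identity for the log-sum-exp function, which per-sample auxiliary log-normalizer variables instantiate: for similarity scores z, log Σ_j exp(z_j) = min_a { a + Σ_j exp(z_j − a) − 1 }, and the minimizer a* is exactly the log-normalizer. The sketch below (illustrative only, not the authors' implementation; all names are made up here) verifies this numerically by minimizing the surrogate over the auxiliary variable with plain gradient descent:

```python
import numpy as np

def lse(z):
    """Direct log-sum-exp: the log-normalizer of the contrastive loss."""
    return np.log(np.sum(np.exp(z)))

def surrogate(a, z):
    """Variational upper bound f(a) = a + sum_j exp(z_j - a) - 1.
    f is convex in a and is minimized at a* = lse(z), where f(a*) = lse(z)."""
    return a + np.sum(np.exp(z - a)) - 1.0

rng = np.random.default_rng(0)
z = rng.normal(size=8)  # toy similarity scores for one anchor sample

# Minimize f over the scalar auxiliary variable a by gradient descent;
# df/da = 1 - sum_j exp(z_j - a).
a = 0.0
for _ in range(200):
    grad = 1.0 - np.sum(np.exp(z - a))
    a -= 0.1 * grad

# a converges to the true log-normalizer, and the surrogate value matches it.
print(lse(z), a, surrogate(a, z))
```

In this per-sample form one would need n such variables, one per training example; idea (ii), as described in the abstract, replaces them with a compact auxiliary network that maps each sample to its predicted log-normalizer, trained alternately with the CLIP encoders.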
Nov-12-2025
- Country:
- Asia > Taiwan (0.04)
- Europe > Switzerland (0.04)
- North America > United States
- Texas > Brazos County > College Station (0.04)
- Genre:
- Research Report (0.64)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (1.00)
- Natural Language (1.00)
- Representation & Reasoning > Optimization (1.00)
- Vision (1.00)