NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization

Xiyuan Wei, Chih-Jen Lin, Tianbao Yang

arXiv.org Artificial Intelligence 

Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge for training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators, which are updated once per epoch in a blockwise coordinate manner to keep track of the evolving encoders; such infrequent updates, however, limit the accuracy of the estimated normalizers. To overcome this limitation, we propose NeuCLIP, a novel and elegant optimization framework based on two key ideas: (i) reformulating the contrastive loss for each sample via convex analysis into a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) transforming the resulting minimization over n auxiliary variables (where n is the dataset size) via variational analysis into a minimization over a compact neural network that predicts the log-normalizers. We design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network. By employing a tailored architecture and acceleration techniques for the auxiliary network, NeuCLIP achieves more accurate normalizer estimation, leading to improved performance compared with previous methods. Extensive experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms previous methods.

Since its introduction, Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as the de facto standard for vision-language representation learning.
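To make idea (i) above concrete, the reformulation is presumably an instance of the standard variational identity for the logarithm, $\log z = \min_{c \in \mathbb{R}} \{ z e^{-c} + c - 1 \}$, attained at $c = \log z$. The sketch below applies it to a generic per-sample contrastive loss; the notation (temperature $\tau$, similarity scores $s_{ij}$, normalizer $Z_i$) is illustrative and not taken from the paper itself.

\[
  -\log \frac{e^{s_{ii}/\tau}}{\sum_{j} e^{s_{ij}/\tau}}
  \;=\; -\frac{s_{ii}}{\tau} + \log Z_i
  \;=\; \min_{c_i \in \mathbb{R}} \Big\{ -\frac{s_{ii}}{\tau} + e^{-c_i} Z_i + c_i - 1 \Big\},
  \qquad Z_i = \sum_{j} e^{s_{ij}/\tau}.
\]

Under this identity, minimizing the contrastive loss over the encoders becomes a joint minimization over the encoders and one auxiliary log-normalizer variable $c_i$ per sample; idea (ii) then replaces the $n$ free variables $c_i$ with the output of a small network evaluated on each sample.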
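The alternating optimization itself can be sketched as follows. This is a minimal, hypothetical PyTorch illustration built on the variational loss above, not the authors' implementation: the auxiliary-network architecture (`AuxNormalizerNet`), the use of in-batch normalizers, the optimizers, and all hyperparameters are assumptions.

```python
# Minimal sketch of the alternating scheme described in the abstract.
# All module names and design choices below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxNormalizerNet(nn.Module):
    """Small MLP mapping an image embedding to a predicted log-normalizer (hypothetical design)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).squeeze(-1)  # (B,) predicted log-normalizers

def per_sample_loss(sim: torch.Tensor, log_norm: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # Variational upper bound on the per-sample contrastive loss using auxiliary log-normalizer c:
    #   -s_ii/tau + exp(-c) * sum_j exp(s_ij/tau) + c - 1  >=  -s_ii/tau + logsumexp_j(s_ij/tau)
    pos = sim.diagonal() / tau
    z = torch.exp(sim / tau).sum(dim=1)          # in-batch approximation of the true normalizer
    return (-pos + torch.exp(-log_norm) * z + log_norm - 1.0).mean()

# --- toy training step (random features stand in for the CLIP encoders) ---
dim, batch = 128, 32
image_proj = nn.Linear(512, dim)                 # placeholder for the image encoder head
text_proj = nn.Linear(512, dim)                  # placeholder for the text encoder head
aux = AuxNormalizerNet(dim)
opt_model = torch.optim.AdamW(list(image_proj.parameters()) + list(text_proj.parameters()), lr=1e-4)
opt_aux = torch.optim.AdamW(aux.parameters(), lr=1e-3)

images, texts = torch.randn(batch, 512), torch.randn(batch, 512)
zi = F.normalize(image_proj(images), dim=-1)
zt = F.normalize(text_proj(texts), dim=-1)
sim = zi @ zt.t()                                # (B, B) image-text similarity matrix

# Step 1: update the auxiliary network with the encoders held fixed.
loss_aux = per_sample_loss(sim.detach(), aux(zi.detach()))
opt_aux.zero_grad(); loss_aux.backward(); opt_aux.step()

# Step 2: update the encoders using the (frozen) predicted log-normalizers.
loss_model = per_sample_loss(sim, aux(zi).detach())
opt_model.zero_grad(); loss_model.backward(); opt_model.step()
```

Each iteration alternates between refitting the auxiliary predictions with the encoders frozen and updating the encoders with the predicted log-normalizers frozen, mirroring the joint training of the CLIP model and the auxiliary network described in the abstract.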