InfoNCE Loss




Test-Time Distribution Normalization for Contrastively Learned Visual-language Models

Neural Information Processing Systems

Advances in visual-language contrastive learning have made it possible to carry out many downstream applications efficiently and accurately by simply taking the dot product between image and text representations. One of the most representative recent approaches, CLIP, has quickly garnered widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes into account both positive and negative samples to help learn a much more robust representation space. This paper, however, reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information at test time. Intuitively, since the model has been optimized with the InfoNCE loss, test-time procedures should ideally also be aligned with it. The question is how to recover some semblance of negative-sample information during inference in a computationally efficient way. We propose Distribution Normalization (DN), where we approximate the mean representation of a batch of test samples and use this mean as an analogue of the negative samples in the InfoNCE loss. DN requires no retraining or fine-tuning and can be effortlessly applied during inference. Extensive experiments on a wide variety of downstream tasks show a clear advantage of DN over the dot product, on top of other existing test-time augmentation methods.
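As a rough illustration of the idea in this abstract, the sketch below estimates each modality's mean representation over a batch of test samples and centers both sides before taking the dot product. The exact normalization DN uses may differ; the function name and this particular centering scheme are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def dn_scores(img_emb, txt_emb):
    """Distribution-Normalization-style scoring (illustrative sketch).

    img_emb: (N_img, d) image representations from a test batch.
    txt_emb: (N_txt, d) text representations from a test batch.
    Instead of the plain dot product img_emb @ txt_emb.T, subtract each
    modality's batch mean, a stand-in for negative-sample statistics.
    """
    mu_img = img_emb.mean(axis=0)
    mu_txt = txt_emb.mean(axis=0)
    return (img_emb - mu_img) @ (txt_emb - mu_txt).T
```

Because only batch means are needed, this adds no retraining and negligible compute at inference, consistent with the abstract's claim that DN can be applied effortlessly at test time.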


A Gradient Accumulation Method for Dense Retriever under Memory Constraint

Neural Information Processing Systems

The InfoNCE loss is commonly used to train dense retrievers for information retrieval tasks. It is well known that a large batch is essential for stable and effective training with the InfoNCE loss, which requires significant hardware resources. This dependence on large batches creates a bottleneck for both applying and researching dense retrievers. Recently, memory reduction methods have been broadly adopted to resolve the hardware bottleneck by decomposing the forward and backward passes or by using a memory bank. However, current methods still suffer from slow and unstable training. To address these issues, we propose Contrastive Accumulation (ContAccum), a stable and efficient memory reduction method for dense retriever training that uses a dual memory bank structure to leverage previously generated query and passage representations. Experiments on five widely used information retrieval datasets indicate that ContAccum can surpass not only existing memory reduction methods but also high-resource scenarios. Moreover, theoretical analysis and experimental results confirm that ContAccum provides more stable dual-encoder training than current memory bank utilization methods.
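The dual memory bank idea can be sketched as follows: cached query and passage representations from earlier micro-batches enlarge the negative pool of the InfoNCE loss without enlarging the live batch. This is a simplified, forward-only illustration (no gradient bookkeeping or accumulation schedule), and all names here are hypothetical rather than taken from the paper.

```python
import numpy as np
from collections import deque

def info_nce(q, p, bank_p, tau=0.05):
    """InfoNCE over in-batch passages plus banked passages as extra negatives."""
    cand = np.concatenate([p, bank_p]) if len(bank_p) else p
    logits = q @ cand.T / tau                            # (B, B + M)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_prob[idx, idx].mean()  # positives sit on the diagonal

class DualMemoryBank:
    """Keeps the most recent query/passage representations (FIFO)."""
    def __init__(self, size):
        self.q = deque(maxlen=size)
        self.p = deque(maxlen=size)

    def update(self, q_batch, p_batch):
        self.q.extend(q_batch)
        self.p.extend(p_batch)

    def passages(self):
        return np.array(self.p) if self.p else np.empty((0, 0))
```

Over successive micro-batches one would score against `bank.passages()` and then call `bank.update(...)`, so that accumulated representations stand in for the negatives a single large batch would have provided.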


FANoise: Singular Value-Adaptive Noise Modulation for Robust Multimodal Representation Learning

Li, Jiaoyang, Fang, Jun, Gao, Tianhao, Zhang, Xiaohui, Liu, Zhiyuan, Liu, Chao, Liu, Pengzhang, Jiang, Qixia

arXiv.org Artificial Intelligence

Representation learning is fundamental to modern machine learning, powering applications such as text retrieval and multimodal understanding. However, learning robust and generalizable representations remains challenging. While prior work has demonstrated that active noise injection, a form of data augmentation, can enhance encoding performance, most existing methods rely on heuristic or static noise, overlooking the dynamic nature of feature distributions during training. In this work, we systematically study the role of noise in representation learning from both gradient-based and feature-distribution perspectives, using the InfoNCE loss as a representative example. Focusing on multimodal representation learning, we propose FANoise, a novel feature-adaptive noise injection strategy. By leveraging the dynamics of contrastive learning, FANoise effectively mitigates the negative impacts of noise while preserving its benefits. Under this theoretically grounded framework, comprehensive experiments demonstrate that FANoise consistently improves overall performance on multimodal tasks across various base VLMs.
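Since the abstract does not spell out the modulation rule, the following is only one plausible sketch of singular value-adaptive noise: take the SVD of the batch feature matrix and scale isotropic Gaussian noise per singular direction in proportion to that direction's normalized singular value. The proportional scaling and every name here are assumptions for illustration, not FANoise's actual definition.

```python
import numpy as np

def sv_adaptive_noise(feats, scale=0.1, rng=None):
    """Inject noise whose per-direction magnitude follows the singular values.

    feats: (batch, dim) feature matrix from one training step.
    In this illustrative variant, directions carrying more variance
    (larger singular values) receive proportionally larger perturbations.
    """
    rng = np.random.default_rng() if rng is None else rng
    U, s, Vt = np.linalg.svd(feats, full_matrices=False)  # s: (k,)
    z = rng.normal(size=(feats.shape[0], s.size))         # noise in singular basis
    per_dir = scale * s / s.max()                         # adaptive modulation
    return feats + (z * per_dir) @ Vt
```

Because the modulation is recomputed from each batch's SVD, the noise adapts as the feature distribution evolves during training, which is the dynamic behavior the abstract emphasizes.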



Compressed Video Contrastive Learning

Neural Information Processing Systems

Existing state-of-the-art methods [Han et al., 2020b; Tao et al., 2020; Huo et al., 2021] mainly focus [...]. More details can be found in Table 1. This clearly hinders large-scale video self-supervised training.


ifm

Joshua Robinson

Neural Information Processing Systems

In this section we give proofs for all the results in Sec. 2, which explores the phenomenon of feature [...]. We invite the reader to consult Sec. [...]. For this purpose we found this strong notion of distinguishing to suffice. The encoder must learn color features in order to identify this positive pair. [...] This section gives detailed derivations of two simple but key facts used in the development of IFM. The first result derives an analytic expression for the gradient of the InfoNCE loss with respect to a positive sample in latent space, and the second result computes the gradient with respect to an arbitrary negative sample.
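For reference, the two gradient facts described above can be reconstructed from the standard InfoNCE definition; this is the textbook form, not text recovered from the garbled supplement. With anchor $u$, positive $v^+$, negatives $\{v_i^-\}_{i=1}^N$, and temperature $\tau$:

```latex
\mathcal{L} = -\log \frac{\exp(u^\top v^+ / \tau)}{Z},
\qquad
Z = \exp(u^\top v^+ / \tau) + \sum_{i=1}^{N} \exp(u^\top v_i^- / \tau).

% Gradient with respect to the positive sample:
\frac{\partial \mathcal{L}}{\partial v^+}
  = -\frac{1}{\tau}\,\bigl(1 - p^+\bigr)\, u,
\qquad
p^+ = \frac{\exp(u^\top v^+ / \tau)}{Z}.

% Gradient with respect to an arbitrary negative sample v_j^-:
\frac{\partial \mathcal{L}}{\partial v_j^-}
  = \frac{1}{\tau}\, p_j\, u,
\qquad
p_j = \frac{\exp(u^\top v_j^- / \tau)}{Z}.
```

Note the signs: the positive sample is pulled toward $u$ (negative gradient direction) with strength $(1 - p^+)/\tau$, while each negative is pushed away with strength proportional to its softmax weight $p_j$.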


ifm

Joshua Robinson

Neural Information Processing Systems

Representations trained with contrastive learning are adept at solving various vision tasks including classification, object detection, instance segmentation, and more [5, 15, 44].


WithAnyone: Towards Controllable and ID Consistent Image Generation

Xu, Hengyuan, Cheng, Wei, Xing, Peng, Fang, Yixiao, Wu, Shuhan, Wang, Rui, Zeng, Xianfang, Jiang, Daxin, Yu, Gang, Ma, Xingjun, Jiang, Yu-Gang

arXiv.org Artificial Intelligence

Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.