
Collaborating Authors

 Qiu, Zi-Hao


To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO

arXiv.org Artificial Intelligence

The temperature parameter plays a profound role during training and/or inference with large foundation models (LFMs) such as large language models (LLMs) and CLIP models. In particular, it adjusts the logits in the softmax function in LLMs, which is crucial for next-token generation, and it scales the similarities in the contrastive loss for training CLIP models. A significant question remains: is it viable to learn a neural network to predict a personalized temperature of any input data for enhancing LFMs? In this paper, we present a principled framework for learning a small yet generalizable temperature prediction network (TempNet) to improve LFMs. Our solution is composed of a novel learning framework with a robust loss underpinned by constrained distributionally robust optimization (DRO), and a properly designed TempNet with theoretical inspiration. TempNet can be trained together with a large foundation model from scratch or learned separately given a pretrained foundation model. It is not only useful for predicting personalized temperatures to promote the training of LFMs but also generalizable and transferable to new tasks. Our experiments on LLMs and CLIP models demonstrate that TempNet greatly improves the performance of existing solutions or models, e.g., Table 1. The code to reproduce the experimental results in this paper can be found at https://github.com/zhqiu/TempNet.
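As a rough illustration of the idea described above (not the paper's actual architecture), a temperature network maps each input's representation to a strictly positive scalar that rescales the logits before the softmax. The module and parameter names below are hypothetical; a minimal sketch in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTempNet(nn.Module):
    """Hypothetical sketch: predict a per-example temperature from a feature vector."""
    def __init__(self, feat_dim, hidden_dim=64, tau_min=0.05, tau_max=5.0):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        self.tau_min, self.tau_max = tau_min, tau_max

    def forward(self, features):
        # Squash the raw output into a bounded, strictly positive temperature range.
        raw = torch.sigmoid(self.mlp(features))            # (batch, 1), values in (0, 1)
        return self.tau_min + (self.tau_max - self.tau_min) * raw

# Usage: scale next-token logits by a personalized temperature before the softmax.
feat_dim, vocab_size = 128, 1000
features = torch.randn(4, feat_dim)        # e.g., hidden states of 4 inputs
logits = torch.randn(4, vocab_size)        # unnormalized next-token scores
tau = ToyTempNet(feat_dim)(features)       # (4, 1) personalized temperatures
probs = F.softmax(logits / tau, dim=-1)    # temperature-adjusted distribution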


LibAUC: A Deep Learning Library for X-Risk Optimization

arXiv.org Artificial Intelligence

DL platforms such as PyTorch [36] have dramatically reduced the efforts of developers and researchers for implementing different DL methods without worrying about low-level computations (e.g., automatic differentiation, tensor operations, etc.). Based on these platforms, plenty of DL libraries have been developed for different purposes, which can be organized into different categories, including (i) supporting specific tasks [15, 35], e.g., TF-Ranking for LTR [35], VISSL for self-supervised learning (SSL) [15]; (ii) supporting specific data, e.g., DGL and DIG for graphs [31, 55]; (iii) supporting specific models [13, 58, 59], e.g., Transformers for transformer models [59]. However, it has been observed that these existing platforms and libraries have encountered some unique challenges when solving some classical and emerging problems in AI, including classification for imbalanced data (CID), learning to rank (LTR), and contrastive learning of representations (CLR). In particular, prior works have observed that large mini-batch sizes are necessary to attain good performance for these problems [4, 5, 7, 37, 43, 46], which restricts the capabilities of these AI models in the real world.

The motivation of developing LibAUC is to address the convergence issues of existing libraries for solving these problems. In particular, existing libraries may not converge or may require very large mini-batch sizes in order to attain good performance for these problems, due to the usage of the standard mini-batch technique in the empirical risk minimization (ERM) framework. Our library is for deep X-risk optimization (DXO), which has achieved great success in solving a variety of tasks for CID, LTR and CLR. The contributions of this paper include: (1) it introduces a new mini-batch based pipeline for implementing DXO algorithms, which differs from the existing DL pipeline in the design of controlled data samplers and dynamic mini-batch losses; (2) it provides extensive benchmarking experiments for ablation studies and comparison with existing libraries. The LibAUC library features scalable performance for millions of items to be contrasted, faster and better convergence than existing libraries for optimizing X-risks, seamless PyTorch deployment, and versatile APIs for various loss optimization. Our library is available to the open source community at https://libauc.org/.
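The pipeline sketched above, i.e., controlled data samplers combined with dynamic (state-carrying) mini-batch losses, can be illustrated in plain PyTorch. The class and argument names below are made up for illustration and are not LibAUC's actual API; consult https://libauc.org/ for the real interfaces.

import random
import torch

class ControlledSampler:
    # Illustrative controlled sampler: every mini-batch contains a fixed number of
    # positive examples (a simplification; LibAUC's real samplers are more general).
    def __init__(self, pos_idx, neg_idx, batch_size, n_pos_per_batch):
        self.pos_idx, self.neg_idx = list(pos_idx), list(neg_idx)
        self.batch_size, self.n_pos = batch_size, n_pos_per_batch

    def sample(self):
        pos = random.sample(self.pos_idx, self.n_pos)
        neg = random.sample(self.neg_idx, self.batch_size - self.n_pos)
        return pos + neg

class DynamicMiniBatchLoss:
    # Illustrative "dynamic" loss: keeps running per-sample statistics across
    # iterations instead of treating each mini-batch independently (schematic only).
    def __init__(self, n_samples, gamma=0.9):
        self.u = torch.zeros(n_samples)   # one running estimate per training sample
        self.gamma = gamma

    def __call__(self, scores, labels, idx):
        per_sample = torch.nn.functional.softplus(-(2 * labels - 1) * scores)
        # Update running estimates only for the samples seen in this mini-batch.
        self.u[idx] = self.gamma * self.u[idx] + (1 - self.gamma) * per_sample.detach()
        # Reweight the batch loss by the running statistics (not LibAUC's actual formula).
        return (per_sample * (1 + self.u[idx])).mean()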


Blockwise Stochastic Variance-Reduced Methods with Parallel Speedup for Multi-Block Bilevel Optimization

arXiv.org Artificial Intelligence

In this paper, we consider non-convex multi-block bilevel optimization (MBBO) problems, which involve $m\gg 1$ lower-level problems and have important applications in machine learning. Designing a stochastic gradient and controlling its variance is more intricate due to the hierarchical sampling of blocks and data and the unique challenge of estimating the hyper-gradient. We aim to achieve three nice properties for our algorithm: (a) matching the state-of-the-art complexity of standard BO problems with a single block; (b) achieving parallel speedup by sampling $I$ blocks and sampling $B$ samples for each sampled block per iteration; (c) avoiding the computation of the inverse of a high-dimensional Hessian matrix estimator. However, it is non-trivial to achieve all of these properties, as existing works only achieve one or two of them. To address the involved challenges for achieving (a)-(c), we propose two stochastic algorithms that use advanced blockwise variance-reduction techniques for tracking the Hessian matrices (for low-dimensional problems) or the Hessian-vector products (for high-dimensional problems), and prove an iteration complexity of the order $O(m\epsilon^{-3})$ that is further reduced by parallel-speedup factors depending on the number of sampled blocks $I$ and the per-block batch size $B$.
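A schematic of the blockwise variance-reduction idea, under the assumption of moving-average (STORM-style) tracking; the paper's actual estimators and hyper-gradient formulas are more involved, and all names below are illustrative:

import torch

# Keep one running estimate per lower-level block (e.g., of a Hessian-vector product)
# and refresh only the I blocks sampled at the current iteration, each with B samples.
m, d, I, B, beta = 100, 10, 8, 16, 0.1
tracked = torch.zeros(m, d)               # per-block variance-reduced estimates

def minibatch_estimate(block_id, batch_size):
    # Placeholder for a stochastic estimate computed on `batch_size` samples of this block.
    return torch.randn(d) / batch_size ** 0.5

for it in range(1000):
    blocks = torch.randperm(m)[:I]        # hierarchical sampling: first blocks, then data
    for i in blocks.tolist():
        est = minibatch_estimate(i, B)
        # Moving-average update; blocks not sampled this round keep their old estimates.
        tracked[i] = (1 - beta) * tracked[i] + beta * est
    # `tracked` would then feed the hyper-gradient estimator for the upper-level update.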


Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization

arXiv.org Artificial Intelligence

In this paper, we aim to optimize a contrastive loss with individualized temperatures in a principled and systematic manner for self-supervised learning. The common practice of using a global temperature parameter $\tau$ ignores the fact that "not all semantics are created equal": different anchor data may have different numbers of samples with similar semantics, especially when the data exhibits long tails. First, we propose a new robust contrastive loss inspired by distributionally robust optimization (DRO), which provides an intuition about the effect of $\tau$ and a mechanism for automatic temperature individualization. Then, we propose an efficient stochastic algorithm for optimizing the robust contrastive loss with a provable convergence guarantee, without using large mini-batch sizes. Theoretical and experimental results show that our algorithm automatically learns a suitable $\tau$ for each sample. Specifically, samples with frequent semantics use large temperatures to keep local semantic structures, while samples with rare semantics use small temperatures to induce more separable features. Our method not only outperforms prior strong baselines (e.g., SimCLR, CLIP) on unimodal and bimodal datasets, with larger improvements on imbalanced data, but is also less sensitive to hyper-parameters. To the best of our knowledge, this is the first methodical approach to optimizing a contrastive loss with individualized temperatures.
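A minimal sketch of a contrastive loss with one temperature per anchor, written in the style of a DRO dual (schematic, not the paper's exact objective; the function name and the margin parameter rho are assumptions):

import torch
import torch.nn.functional as F

def per_anchor_contrastive_loss(z, tau, rho=0.1):
    # z:   (2n, d) L2-normalized features, where z[2k] and z[2k+1] form a positive pair.
    # tau: (2n,)   strictly positive per-anchor temperatures.
    sim = z @ z.t()                                   # cosine similarities
    n2 = z.shape[0]
    pos = torch.arange(n2) ^ 1                        # index of each sample's positive
    self_mask = torch.eye(n2, dtype=torch.bool)
    other_sim = sim.masked_fill(self_mask, float('-inf'))
    # DRO-style dual: tau_i * logsumexp over the other samples + tau_i * rho - positive similarity.
    lse = torch.logsumexp(other_sim / tau[:, None], dim=1)
    return (tau * lse + tau * rho - sim[torch.arange(n2), pos]).mean()

# Usage with learnable raw temperatures (softplus keeps them positive):
z = F.normalize(torch.randn(8, 32), dim=1)
raw_tau = torch.zeros(8, requires_grad=True)
loss = per_anchor_contrastive_loss(z, F.softplus(raw_tau) + 0.05)
loss.backward()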


Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence

arXiv.org Artificial Intelligence

NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the total number of items. To improve effectiveness for deep learning, we further propose practical strategies using an initial warm-up and a stop-gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms have been proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
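For intuition, a smoothed-rank NDCG surrogate for a single query can be written as below (in the spirit of sigmoid-based surrogates; this sketch does not reproduce the paper's compositional formulation or its convergence guarantees, and the temperature parameter temp is an assumption):

import torch

def smooth_ndcg_surrogate(scores, rels, temp=1.0):
    # scores: (n,) predicted scores; rels: (n,) nonnegative relevance labels.
    diff = (scores[None, :] - scores[:, None]) / temp
    # Smooth rank of item i: 1 + sum_{j != i} sigmoid((s_j - s_i) / temp);
    # subtract 0.5 to remove the j = i term, since sigmoid(0) = 0.5.
    smooth_rank = 1.0 + torch.sigmoid(diff).sum(dim=1) - 0.5
    gains = 2.0 ** rels - 1.0
    dcg = (gains / torch.log2(1.0 + smooth_rank)).sum()
    # Ideal DCG places items in decreasing order of relevance at hard ranks 1..n.
    ideal_gains, _ = torch.sort(gains, descending=True)
    idcg = (ideal_gains / torch.log2(torch.arange(2.0, len(rels) + 2.0))).sum()
    return 1.0 - dcg / idcg.clamp_min(1e-8)   # minimizing this maximizes the NDCG surrogate

scores = torch.randn(5, requires_grad=True)
rels = torch.tensor([3.0, 1.0, 0.0, 2.0, 0.0])
loss = smooth_ndcg_surrogate(scores, rels)
loss.backward()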