
Collaborating Authors

Yuan, Zhuoning


CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

arXiv.org Artificial Intelligence

Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.
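Frameworks of this kind typically fine-tune a contrastive VLM on composition-aware data and then patch the result back toward the pretrained weights so that object recognition is preserved. Below is a minimal PyTorch-style sketch of such weight patching only; the function name and the mixing coefficient alpha are illustrative placeholders, not the paper's API or tuned values.

import torch

def patch_weights(pretrained_state, finetuned_state, alpha=0.6):
    # Convex combination of two checkpoints, parameter by parameter.
    # alpha = 0 recovers the pretrained model; alpha = 1 keeps the
    # compositionality-fine-tuned model. 0.6 is a placeholder value.
    return {
        name: (1 - alpha) * pretrained_state[name] + alpha * finetuned_state[name]
        for name in pretrained_state
    }

# Usage sketch (model construction omitted):
# patched = patch_weights(pretrained.state_dict(), finetuned.state_dict())
# model.load_state_dict(patched)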


LibAUC: A Deep Learning Library for X-Risk Optimization

arXiv.org Artificial Intelligence

DL platforms such as PyTorch [36] have dramatically reduced the efforts of developers and researchers for implementing different DL methods without worrying about low-level computations (e.g., automatic differentiation, tensor operations, etc.). Based on these platforms, plenty of DL libraries have been developed for different purposes, which can be organized into different categories, including (i) supporting specific tasks [15, 35], e.g., TF-Ranking for LTR [35], VISSL for self-supervised learning (SSL) [15]; (ii) supporting specific data, e.g., DGL and DIG for graphs [31, 55]; (iii) supporting specific models [13, 58, 59], e.g., Transformers for transformer models [59]. However, it has been observed that these existing platforms and libraries encounter some unique challenges when solving some classical and emerging problems in AI, including classification for imbalanced data (CID), learning to rank (LTR), and contrastive learning of representations (CLR). In particular, prior works have observed that large mini-batch sizes are necessary to attain good performance for these problems [4, 5, 7, 37, 43, 46], which restricts the capabilities of these AI models in the real world.

The motivation of developing LibAUC is to address the convergence issues of existing libraries for solving these problems. In particular, existing libraries may not converge, or may require very large mini-batch sizes in order to attain good performance for these problems, due to the usage of the standard mini-batch technique in the empirical risk minimization (ERM) framework. Our library is for deep X-risk optimization (DXO), which has achieved great success in solving a variety of tasks for CID, LTR, and CLR. The contributions of this paper include: (1) it introduces a new mini-batch based pipeline for implementing DXO algorithms, which differs from the existing DL pipeline in the design of controlled data samplers and dynamic mini-batch losses; (2) it provides extensive benchmarking experiments for ablation studies and comparison with existing libraries. The LibAUC library features scalable performance for millions of items to be contrasted, faster and better convergence than existing libraries for optimizing X-risks, seamless PyTorch deployment, and versatile APIs for various loss optimization. Our library is available to the open source community.
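To make the "controlled data sampler" idea concrete, here is a generic PyTorch-style sketch of a sampler that fixes the positive:negative ratio within every mini-batch, which is the kind of control a mini-batch based DXO pipeline relies on. This is an illustration only; LibAUC's own sampler classes and their APIs differ.

import random
from torch.utils.data import Sampler

class ControlledRatioSampler(Sampler):
    # Yields indices so that every mini-batch contains a fixed fraction of
    # positives (oversampling them if needed), instead of the uniform
    # sampling used in standard ERM pipelines.
    def __init__(self, labels, batch_size, pos_ratio=0.5):
        self.pos = [i for i, y in enumerate(labels) if y == 1]
        self.neg = [i for i, y in enumerate(labels) if y != 1]
        self.n_pos = max(1, int(batch_size * pos_ratio))
        self.n_neg = batch_size - self.n_pos

    def __iter__(self):
        for _ in range(len(self.neg) // self.n_neg):
            batch = random.choices(self.pos, k=self.n_pos)   # with replacement
            batch += random.sample(self.neg, self.n_neg)     # without replacement
            random.shuffle(batch)
            yield from batch

    def __len__(self):
        return (len(self.neg) // self.n_neg) * (self.n_pos + self.n_neg)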


Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization

arXiv.org Artificial Intelligence

In this paper, we aim to optimize a contrastive loss with individualized temperatures in a principled and systematic manner for self-supervised learning. The common practice of using a global temperature parameter $\tau$ ignores the fact that "not all semantics are created equal": different anchor data may have different numbers of samples with similar semantics, especially when the data exhibits long tails. First, we propose a new robust contrastive loss inspired by distributionally robust optimization (DRO), which provides intuition about the effect of $\tau$ and a mechanism for automatic temperature individualization. Then, we propose an efficient stochastic algorithm for optimizing the robust contrastive loss with a provable convergence guarantee, without using large mini-batch sizes. Theoretical and experimental results show that our algorithm automatically learns a suitable $\tau$ for each sample. Specifically, samples with frequent semantics use large temperatures to keep local semantic structures, while samples with rare semantics use small temperatures to induce more separable features. Our method not only outperforms prior strong baselines (e.g., SimCLR, CLIP) on unimodal and bimodal datasets, with larger improvements on imbalanced data, but is also less sensitive to hyper-parameters. To the best of our knowledge, this is the first methodical approach to optimizing a contrastive loss with individualized temperatures.
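As a rough illustration of temperature individualization, the sketch below implements a DRO-flavored contrastive loss in which each anchor $i$ carries its own temperature $\tau_i$ scaling a log-mean-exp over its candidates, plus a radius penalty $\tau_i \rho$. This is a simplified autograd version under assumed names (rho and tau_min are hypothetical hyper-parameters); the paper's algorithm updates the temperatures with a dedicated stochastic scheme rather than plain backpropagation.

import math
import torch
import torch.nn.functional as F

def robust_contrastive_loss(anchors, candidates, tau, rho=0.3, tau_min=0.05):
    # anchors, candidates: (B, d) embeddings where candidates[i] is the
    # positive for anchors[i]; tau: (B,) per-anchor learnable temperatures.
    tau = tau.clamp_min(tau_min)                    # keep temperatures valid
    sim = F.normalize(anchors, dim=1) @ F.normalize(candidates, dim=1).T
    pos = sim.diag().unsqueeze(1)                   # similarity to own positive
    logits = (sim - pos) / tau.unsqueeze(1)         # (s_ij - s_i+) / tau_i
    # tau_i * log-mean-exp over candidates, plus the DRO penalty tau_i * rho
    loss = tau * (torch.logsumexp(logits, dim=1) - math.log(sim.size(1))) + rho * tau
    return loss.mean()

Small tau_i sharpens the distribution over negatives (more separable features for rare semantics); large tau_i flattens it (preserving local structure for frequent semantics), matching the behavior described above.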


Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning

arXiv.org Artificial Intelligence

In recent years, model-agnostic meta-learning (MAML) has become a popular research area. However, the stochastic optimization of MAML is still underdeveloped. Existing MAML algorithms rely on the "episode" idea by sampling a few tasks and data points to update the meta-model at each iteration. Nonetheless, these algorithms either fail to guarantee convergence with a constant mini-batch size or require processing a large number of tasks at every iteration, which is unsuitable for continual learning or cross-device federated learning where only a small number of tasks are available per iteration or per round. To address these issues, this paper proposes memory-based stochastic algorithms for MAML that converge with vanishing error. The proposed algorithms require sampling a constant number of tasks and data samples per iteration, making them suitable for the continual learning scenario. Moreover, we introduce a communication-efficient memory-based MAML algorithm for personalized federated learning in cross-device (with client sampling) and cross-silo (without client sampling) settings. Our theoretical analysis improves the optimization theory for MAML, and our empirical results corroborate our theoretical findings. Interested readers can access our code at https://github.com/bokun-wang/moml.
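The following sketch conveys the memory idea in a simplified, first-order form: a constant number of tasks is sampled per iteration, each task's inner variable is warm-started from a per-task memory rather than recomputed from scratch, and the memory is refreshed after adaptation. The names and the single-tensor parameterization are illustrative; the paper's algorithms include further details (e.g., error control for the meta-gradient) omitted here.

import torch

def moml_step(meta_w, tasks, memory, inner_lr=0.01, outer_lr=0.001, inner_steps=1):
    # tasks: constant-size list of (task_id, support_loss, query_loss), where
    # each loss function maps a parameter tensor to a scalar loss.
    # memory: dict task_id -> last inner solution for that task.
    meta_grad = torch.zeros_like(meta_w)
    for task_id, support_loss, query_loss in tasks:
        w = memory.get(task_id, meta_w).detach().clone().requires_grad_(True)
        for _ in range(inner_steps):                 # adapt on the support set
            g = torch.autograd.grad(support_loss(w), w)[0]
            w = (w - inner_lr * g).detach().requires_grad_(True)
        memory[task_id] = w.detach()                 # refresh the task's memory
        meta_grad += torch.autograd.grad(query_loss(w), w)[0]  # first-order meta-gradient
    return meta_w - outer_lr * meta_grad / len(tasks)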


Federated Deep AUC Maximization for Heterogeneous Data with a Constant Communication Complexity

arXiv.org Machine Learning

Deep AUC (area under the ROC curve) Maximization (DAM) has attracted much attention recently due to its great potential for imbalanced data classification. However, research on Federated Deep AUC Maximization (FDAM) is still limited. Compared with standard federated learning (FL) approaches that focus on decomposable minimization objectives, FDAM is more complicated because its minimization objective is non-decomposable over individual examples. In this paper, we propose improved FDAM algorithms for heterogeneous data by solving the popular non-convex strongly-concave min-max formulation of DAM in a distributed fashion. A striking result of this paper is that the communication complexity of the proposed algorithm is a constant, independent of both the number of machines and the accuracy level, which improves an existing result by orders of magnitude. Of independent interest, the proposed algorithm can also be applied to a class of non-convex strongly-concave min-max problems. The experiments demonstrate the effectiveness of our FDAM algorithm on benchmark datasets and on medical chest X-ray images from different organizations. Our experiments show that FDAM using data from multiple hospitals can improve the AUC score on testing data from a single hospital for detecting life-threatening diseases based on chest radiographs.
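At a high level, algorithms of this kind run several local stochastic gradient descent-ascent steps on the min-max objective at each machine and then average both the primal and the dual variables at the server, so communication happens once per round rather than once per step. The sketch below shows only this skeleton under assumed names; the paper's actual algorithms add correction terms for data heterogeneity that this illustration omits.

def fdam_round(clients, w, alpha, local_steps=32, lr_w=0.1, lr_a=0.1):
    # clients: list of (grad_w, grad_alpha) stochastic gradient oracles,
    # one pair per machine; w, alpha: current primal/dual tensors.
    local_w, local_a = [], []
    for grad_w, grad_alpha in clients:
        w_k, a_k = w.clone(), alpha.clone()
        for _ in range(local_steps):                 # communication-free local steps
            w_k = w_k - lr_w * grad_w(w_k, a_k)      # descent on the primal
            a_k = a_k + lr_a * grad_alpha(w_k, a_k)  # ascent on the dual
        local_w.append(w_k)
        local_a.append(a_k)
    # one communication round: average both primal and dual variables
    return sum(local_w) / len(local_w), sum(local_a) / len(local_a)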


Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification

arXiv.org Machine Learning

Deep AUC Maximization (DAM) is a paradigm for learning a deep neural network by maximizing the AUC score of the model on a dataset. Most previous works on AUC maximization focus on the optimization perspective by designing efficient stochastic algorithms, while studies of the generalization performance of DAM on difficult tasks are missing. In this work, we aim to make DAM more practical for interesting real-world applications (e.g., medical image classification). First, we propose a new margin-based surrogate loss function for the AUC score (named the AUC margin loss). It is more robust than the commonly used AUC square loss, while enjoying the same advantage in terms of large-scale stochastic optimization. Second, we conduct empirical studies of our DAM method on difficult medical image classification tasks, namely classification of chest X-ray images for identifying life-threatening diseases and classification of skin lesion images for identifying melanoma. Our DAM method has achieved great success on these difficult tasks, i.e., 1st place on the Stanford CheXpert competition (as of the paper submission date) and a top 1% rank (33 out of 3,314 teams) on the Kaggle 2020 Melanoma classification competition. We also conduct extensive ablation studies to demonstrate the advantages of the new AUC margin loss over the AUC square loss on benchmark datasets. To the best of our knowledge, this is the first work to make DAM succeed on large-scale medical image datasets.
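The AUC margin loss admits a compact mini-batch implementation: squared deviations of positive and negative scores from learnable centers a and b, plus a margin term handled by a maximization variable alpha. The sketch below uses batch means in place of expectations; the variable names follow common notation and the margin value is a placeholder, so treat it as an illustration rather than the paper's reference code.

import torch

def auc_margin_loss(scores, labels, a, b, alpha, margin=1.0):
    # scores: model outputs h(x); labels in {0, 1}; a, b, alpha: scalar tensors.
    # a, b are minimized jointly with the model; alpha is maximized (e.g., by
    # gradient ascent in a primal-dual optimizer).
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return (((pos - a) ** 2).mean()                       # pull positives toward a
            + ((neg - b) ** 2).mean()                     # pull negatives toward b
            + 2 * alpha * (margin + neg.mean() - pos.mean())  # margin violation
            - alpha ** 2)                                 # keeps the max over alpha finite

Replacing the square loss's fixed offset with the margin term above is what makes the loss robust to well-separated easy examples while keeping the objective amenable to the same min-max stochastic optimization.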


Fast Objective & Duality Gap Convergence for Nonconvex-Strongly-Concave Min-Max Problems

arXiv.org Machine Learning

This paper focuses on stochastic methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to their potential applications in deep learning (e.g., deep AUC maximization). However, most of the existing algorithms are slow in practice, and their analysis revolves around convergence to a nearly stationary point. We consider leveraging the Polyak-Łojasiewicz (PL) condition to design faster stochastic algorithms with stronger convergence guarantees. Although the PL condition has been utilized for designing many stochastic minimization algorithms, its application to non-convex min-max optimization remains rare. In this paper, we propose and analyze proximal epoch-based methods, and establish fast convergence in terms of both the primal objective gap and the duality gap. Our analysis is interesting in three aspects: (i) it is based on a novel Lyapunov function that consists of the primal objective gap and the duality gap of a regularized function; (ii) it only requires a weaker PL condition for establishing the primal objective convergence than that required for the duality gap convergence; (iii) it yields the optimal dependence on the accuracy level $\epsilon$, i.e., $O(1/\epsilon)$. We also make explicit the dependence on the problem parameters and explore regions of the weak convexity parameter that lead to improved dependence on condition numbers. Experiments on deep AUC maximization demonstrate the effectiveness of our methods: our method (MaxAUC) achieved an AUC of 0.922 on the private testing set of the CheXpert competition.
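For reference, the PL condition leaned on here is standard: a differentiable function $f$ with minimum value $f^*$ satisfies the PL condition with parameter $\mu > 0$ if (stated in its textbook form, which may differ in constants from the paper's variant)

$$\|\nabla f(\mathbf{x})\|^2 \;\ge\; 2\mu\,\big(f(\mathbf{x}) - f^*\big) \qquad \text{for all } \mathbf{x}.$$

Under this condition, gradient-type methods can drive the objective gap itself to zero at fast rates even without convexity, which is the leverage the epoch-based restarting scheme extends to the min-max setting.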


Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks

arXiv.org Machine Learning

In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model. Although distributed learning techniques have been investigated extensively in deep learning, they are not directly applicable to stochastic AUC maximization with deep neural networks due to its striking differences from standard loss minimization problems (e.g., cross-entropy). Towards addressing this challenge, we propose and analyze a communication-efficient distributed optimization algorithm based on a non-convex concave reformulation of the AUC maximization problem, in which the communication of both the primal variable and the dual variable between each worker and the parameter server only occurs after multiple steps of gradient-based updates in each worker. Compared with the naive parallel version of an existing algorithm that computes stochastic gradients at individual machines and averages them for updating the model parameters, our algorithm requires far fewer communication rounds and still achieves a linear speedup in theory. To the best of our knowledge, this is the first work that solves the non-convex concave min-max problem for AUC maximization with deep neural networks in a communication-efficient distributed manner while maintaining the linear speedup property in theory. Our experiments on several benchmark datasets show the effectiveness of our algorithm and confirm our theory.
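A minimal sketch of this communication pattern, using torch.distributed for the periodic averaging (it assumes an initialized process group; the function name, constants, and gradient oracles are illustrative, not the paper's exact algorithm):

import torch.distributed as dist

def local_primal_dual_step(w, alpha, grad_w, grad_alpha, step, sync_every=16, lr=0.05):
    # Each worker updates (w, alpha) locally; all-reduce runs only every
    # `sync_every` steps, so communication rounds drop by that factor.
    w = w - lr * grad_w(w, alpha)               # local descent on the primal
    alpha = alpha + lr * grad_alpha(w, alpha)   # local ascent on the dual
    if step % sync_every == 0:                  # periodic communication round
        for t in (w, alpha):
            dist.all_reduce(t)                  # sum across workers
            t /= dist.get_world_size()          # then average
    return w, alpha

Note that both the primal and the dual variables are averaged at each synchronization, mirroring the abstract's description; syncing only the primal would break the min-max coupling.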


Stochastic AUC Maximization with Deep Neural Networks

arXiv.org Machine Learning

Stochastic AUC maximization has garnered increasing interest due to its better fit for imbalanced data classification. However, existing works are limited to stochastic AUC maximization with a linear predictive model, which restricts predictive power when dealing with extremely complex data. In this paper, we consider the stochastic AUC maximization problem with a deep neural network as the predictive model. Building on the saddle-point reformulation of a surrogate loss of AUC, the problem can be cast into a non-convex concave min-max problem. The main contribution of this paper is to make stochastic AUC maximization more practical for deep neural networks and big data, with theoretical insights as well. In particular, we propose to exploit the Polyak-Łojasiewicz (PL) condition, which has been proved and observed in deep learning, enabling us to develop new stochastic algorithms with an even faster convergence rate and a more practical step size scheme. An AdaGrad-style algorithm is also analyzed under the PL condition with an adaptive convergence rate. Our experimental results demonstrate the effectiveness of the proposed algorithms.
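The saddle-point reformulation referenced here (due to earlier work on stochastic AUC maximization with the square surrogate loss) turns the problem into a min-max over the network weights $\mathbf{w}$, two auxiliary scalars $a, b$, and a dual scalar $\alpha$. With $p$ the fraction of positives and $h_{\mathbf{w}}$ the network, it reads (up to notational differences from the paper):

$$\min_{\mathbf{w},a,b}\;\max_{\alpha}\;\; \mathbb{E}\Big[(1-p)\,(h_{\mathbf{w}}(\mathbf{x})-a)^2\,\mathbb{I}[y=1] + p\,(h_{\mathbf{w}}(\mathbf{x})-b)^2\,\mathbb{I}[y=-1] + 2(1+\alpha)\big(p\,h_{\mathbf{w}}(\mathbf{x})\,\mathbb{I}[y=-1] - (1-p)\,h_{\mathbf{w}}(\mathbf{x})\,\mathbb{I}[y=1]\big) - p(1-p)\,\alpha^2\Big].$$

The objective is non-convex in $\mathbf{w}$ (through the network) but concave, indeed strongly concave, in $\alpha$, which is exactly the min-max structure the proposed algorithms exploit.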


Why Does Stagewise Training Accelerate Convergence of Testing Error Over SGD?

arXiv.org Machine Learning

The stagewise training strategy is commonly used for learning neural networks: a stochastic algorithm (e.g., SGD) starts with a relatively large step size (aka learning rate) and geometrically decreases the step size after a number of iterations. It has been observed that stagewise SGD converges much faster than vanilla SGD with a polynomially decaying step size, in terms of both training error and testing error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for this faster convergence. In particular, we consider the stagewise training strategy for minimizing empirical risk that satisfies the Polyak-Łojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and "nice-behaviored" non-convex loss functions that are close to a convex function (namely weakly convex functions), we establish faster convergence of stagewise training than vanilla SGD under the same condition, on both training error and testing error. Indeed, the proposed algorithm has additional favorable features that come with theoretical guarantees for the considered non-convex optimization problems, including the use of explicit algorithmic regularization at each stage, the use of the stagewise averaged solution for restarting, and returning the last stagewise averaged solution as the final solution. To differentiate it from the commonly used stagewise SGD, we refer to our algorithm as the stagewise regularized training algorithm, or Start. Of independent interest, the proved testing error bounds for a family of non-convex loss functions are dimensionality and norm independent.
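A condensed sketch of the Start scheme described above, with its three distinguishing features marked in comments. It assumes w is a PyTorch tensor and grad is a stochastic gradient oracle; the stage lengths, step sizes, and regularization strength gamma are placeholders, whereas the theory prescribes specific choices.

def start(grad, w, stages=5, lr0=0.1, steps0=100, gamma=1.0):
    # grad: stochastic gradient oracle for the empirical risk; w: initial tensor.
    lr, steps = lr0, steps0
    for _ in range(stages):
        w_ref, w_avg = w.clone(), 0.0
        for _ in range(steps):
            # feature 1: explicit algorithmic regularization toward the
            # stage reference point w_ref
            w = w - lr * (grad(w) + gamma * (w - w_ref))
            w_avg = w_avg + w / steps
        w = w_avg                       # feature 2: restart from the stagewise average
        lr, steps = lr / 2, steps * 2   # geometric step-size decrease per stage
    return w                            # feature 3: return the last stagewise average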