Pham, Tung
Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization
Luo, Haocheng, Truong, Tuan, Pham, Tung, Harandi, Mehrtash, Phung, Dinh, Le, Trung
Sharpness-Aware Minimization (SAM) has attracted significant attention for its effectiveness in improving generalization across various tasks. However, its underlying principles remain poorly understood. In this work, we analyze SAM's training dynamics using the maximum eigenvalue of the Hessian as a measure of sharpness, and propose a third-order stochastic differential equation (SDE), which reveals that the dynamics are driven by a complex mixture of second- and third-order terms. We show that alignment between the perturbation vector and the top eigenvector is crucial for SAM's effectiveness in regularizing sharpness, but find that this alignment is often inadequate in practice, limiting SAM's efficiency. Building on these insights, we introduce Eigen-SAM, an algorithm that explicitly aims to regularize the top Hessian eigenvalue by aligning the perturbation vector with the leading eigenvector. We validate the effectiveness of our theory and the practical advantages of our proposed approach through comprehensive experiments. Code is available at https://github.com/RitianLuo/EigenSAM.
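As a rough illustration of the mechanism described above, the PyTorch sketch below estimates the leading Hessian eigenvector by power iteration over Hessian-vector products and aligns a SAM-style perturbation with it. This is a minimal sketch under our own assumptions (the function names, the projection-based alignment rule, and the number of power iterations are illustrative), not the released Eigen-SAM implementation.

    import torch

    def hessian_vector_product(loss, params, vec):
        # Hessian-vector product via double backprop (Pearlmutter's trick).
        grads = torch.autograd.grad(loss, params, create_graph=True)
        dot = sum((g * v).sum() for g, v in zip(grads, vec))
        return torch.autograd.grad(dot, params, retain_graph=True)

    def top_eigenvector(loss, params, iters=10):
        # Power iteration on Hessian-vector products to estimate the
        # leading Hessian eigenvector (iteration count is illustrative).
        v = [torch.randn_like(p) for p in params]
        for _ in range(iters):
            hv = hessian_vector_product(loss, params, v)
            norm = torch.sqrt(sum((h ** 2).sum() for h in hv)) + 1e-12
            v = [h / norm for h in hv]
        return v

    def aligned_perturbation(grads, eigvec, rho=0.05):
        # Illustrative alignment rule (an assumption, not necessarily the
        # exact Eigen-SAM update): project the SAM ascent direction onto
        # the leading eigenvector, then rescale to the radius rho.
        coef = sum((g * v).sum() for g, v in zip(grads, eigvec))
        pert = [coef * v for v in eigvec]
        norm = torch.sqrt(sum((p ** 2).sum() for p in pert)) + 1e-12
        return [rho * p / norm for p in pert]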
On Barycenter Computation: Semi-Unbalanced Optimal Transport-based Method on Gaussians
Nguyen, Ngoc-Hai, Le, Dung, Nguyen, Hoang-Phi, Pham, Tung, Ho, Nhat
We explore a robust version of the barycenter problem among $n$ centered Gaussian probability measures, termed the Semi-Unbalanced Optimal Transport (SUOT)-based Barycenter, wherein the marginal constraint on the barycenter is kept exact while the constraints on the other measures are relaxed via the Kullback-Leibler divergence. We develop optimization algorithms on the Bures-Wasserstein manifold, named the Exact Geodesic Gradient Descent and Hybrid Gradient Descent algorithms. While the Exact Geodesic Gradient Descent method computes the exact closed form of the first-order derivative of the barycenter objective along a geodesic on the Bures manifold, the Hybrid Gradient Descent method uses the optimizer components obtained from solving the SUOT problem to replace the outlier measures before applying Riemannian Gradient Descent. We establish theoretical convergence guarantees for both methods and show that the Exact Geodesic Gradient Descent algorithm attains a dimension-free convergence rate. Finally, we conduct experiments comparing the standard Wasserstein barycenter with ours and perform an ablation study.
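For context, the baseline against which the robust SUOT barycenter is compared admits a well-known closed form on covariance matrices. The NumPy/SciPy sketch below, assuming centered Gaussians, computes the squared Bures-Wasserstein distance and runs the standard fixed-point iteration for the balanced Wasserstein barycenter; it is only this baseline, not the Exact Geodesic or Hybrid Gradient Descent methods of the paper.

    import numpy as np
    from scipy.linalg import sqrtm

    def bures_wasserstein_sq(S1, S2):
        # Squared 2-Wasserstein distance between N(0, S1) and N(0, S2):
        # tr(S1) + tr(S2) - 2 tr((S1^{1/2} S2 S1^{1/2})^{1/2}).
        r1 = sqrtm(S1).real
        cross = sqrtm(r1 @ S2 @ r1).real
        return np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross)

    def gaussian_barycenter(covs, weights, iters=100):
        # Fixed-point iteration for the balanced Wasserstein barycenter of
        # centered Gaussians (the baseline the SUOT barycenter is compared to).
        S = np.mean(covs, axis=0)  # initial guess
        for _ in range(iters):
            r = sqrtm(S).real
            r_inv = np.linalg.inv(r)
            T = sum(w * sqrtm(r @ C @ r).real for w, C in zip(weights, covs))
            S = r_inv @ T @ T @ r_inv
        return S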
Diversity-Aware Agnostic Ensemble of Sharpness Minimizers
Bui, Anh, Vo, Vy, Pham, Tung, Phung, Dinh, Le, Trung
There has long been ample theoretical and empirical evidence supporting the success of ensemble learning. Deep ensembles in particular take advantage of training randomness and the expressivity of individual neural networks to gain prediction diversity, ultimately leading to better generalization, robustness, and uncertainty estimation. With respect to generalization, it has been found that pursuing wider local minima results in models that are more robust to shifts between training and testing sets. A natural research question arising from these two approaches is whether a boost in generalization can be achieved by integrating ensemble learning with loss sharpness minimization. Our work investigates this connection and proposes DASH, a learning algorithm that promotes diversity and flatness within deep ensembles. More concretely, DASH encourages base learners to move divergently towards low-loss regions of minimal sharpness. We provide a theoretical backbone for our method along with extensive empirical evidence demonstrating an improvement in ensemble generalizability.
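A minimal sketch of the flatness side of this idea is given below: each ensemble member takes its own SAM-style two-step update (ascend to a perturbed point, then descend with the gradient computed there), and predictions are averaged at test time. The diversity-promoting mechanism that makes DASH move members divergently is not reproduced and is only marked with a placeholder comment; all names and hyperparameters are illustrative assumptions.

    import torch

    def train_flat_ensemble(models, loss_fn, loader, lr=0.1, rho=0.05, epochs=1):
        # Each member takes a SAM-style update; DASH's diversity term is omitted.
        opts = [torch.optim.SGD(m.parameters(), lr=lr) for m in models]
        for _ in range(epochs):
            for x, y in loader:
                for m, opt in zip(models, opts):
                    # 1) Ascend to the worst-case perturbed weights.
                    loss = loss_fn(m(x), y)
                    grads = torch.autograd.grad(loss, list(m.parameters()))
                    scale = rho / (torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12)
                    with torch.no_grad():
                        for p, g in zip(m.parameters(), grads):
                            p.add_(scale * g)
                    # 2) Descend with the gradient evaluated at the perturbed point.
                    opt.zero_grad()
                    loss_fn(m(x), y).backward()
                    with torch.no_grad():
                        for p, g in zip(m.parameters(), grads):
                            p.sub_(scale * g)   # restore the original weights first
                    opt.step()
                    # (DASH would additionally add a diversity-promoting term here.)
        return models

    def ensemble_predict(models, x):
        # Average the members' predictive distributions.
        return torch.stack([torch.softmax(m(x), dim=-1) for m in models]).mean(dim=0)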
Robust Contrastive Learning With Theory Guarantee
Tran, Ngoc N., Tran, Lam, Phan, Hoang, Bui, Anh, Pham, Tung, Tran, Toan, Phung, Dinh, Le, Trung
Contrastive learning (CL) allows us to create meaningful features without any label information. In the first phase, CL approaches learn feature representations, which are then fed to a linear classifier trained on labeled data in the second phase. While existing theoretical works have studied the connection between the supervised loss in the second phase and the unsupervised loss in the first phase to explain why the unsupervised loss can support the supervised loss, there has been no theoretical examination of the connection between the unsupervised loss in the first phase and the robust supervised loss in the second phase, which could shed light on how to establish an effective unsupervised loss in the first phase. To fill this gap, our paper develops rigorous theories to identify which components of the unsupervised loss can aid the robust supervised loss. Finally, we conduct experiments to verify our findings. All code used in this work is available at https://anonymous.4open.science/r/rosa.
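As an example of what the phase-one unsupervised loss can look like, the snippet below implements the standard InfoNCE/NT-Xent contrastive loss in PyTorch. It is given purely for orientation, assuming a generic two-view contrastive setup; the unsupervised loss analyzed in the paper may contain additional components.

    import torch
    import torch.nn.functional as F

    def info_nce_loss(z1, z2, temperature=0.5):
        # Standard InfoNCE / NT-Xent loss between two augmented views.
        # z1, z2: (N, d) features of the same images under two augmentations.
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        z = torch.cat([z1, z2], dim=0)                 # (2N, d)
        sim = z @ z.t() / temperature                  # scaled cosine similarities
        n = z1.shape[0]
        mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(mask, float('-inf'))     # drop self-pairs
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
        return F.cross_entropy(sim, targets)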
Understanding the Robustness of Randomized Feature Defense Against Query-Based Adversarial Attacks
Nguyen, Quang H., Lao, Yingjie, Pham, Tung, Wong, Kok-Seng, Doan, Khoa D.
Recent works have shown that deep neural networks are vulnerable to adversarial examples: samples close to the original image that nevertheless cause the model to misclassify. Even with access only to the model's output, an attacker can employ black-box attacks to generate such adversarial examples. In this work, we propose a simple and lightweight defense against black-box attacks that adds random noise to hidden features at intermediate layers of the model at inference time. Our theoretical analysis confirms that this method effectively enhances the model's resilience against both score-based and decision-based black-box attacks. Importantly, our defense does not necessitate adversarial training and has minimal impact on accuracy, rendering it applicable to any pre-trained model. Our analysis also reveals the significance of selectively adding noise to different parts of the model based on the gradient of the adversarial objective function, which can vary during the attack. We demonstrate the robustness of our defense against multiple black-box attacks through extensive empirical experiments involving diverse models with various architectures.
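A minimal sketch of the defense's core operation, assuming a PyTorch model: forward hooks add Gaussian noise to the outputs of chosen intermediate layers at inference time. The choice of layers and the noise scale are illustrative assumptions rather than the paper's tuned settings.

    import torch

    def add_feature_noise(model, layer_names, sigma=0.05):
        # Register forward hooks that add Gaussian noise to the outputs of the
        # named intermediate layers; sigma is an illustrative, tunable scale.
        handles = []
        modules = dict(model.named_modules())
        for name in layer_names:
            def hook(module, inputs, output, s=sigma):
                return output + s * torch.randn_like(output)
            handles.append(modules[name].register_forward_hook(hook))
        return handles  # call h.remove() on each handle to disable the defense

For a torchvision ResNet, for instance, one might pass layer_names=["layer3", "layer4"]; which parts of the model benefit most from noise is what the gradient-based analysis in the paper examines.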
RSAM: Learning on manifolds with Riemannian Sharpness-aware Minimization
Truong, Tuan, Nguyen, Hoang-Phi, Pham, Tung, Tran, Minh-Tuan, Harandi, Mehrtash, Phung, Dinh, Le, Trung
Understanding the geometry of the loss landscape shows promise for enhancing a model's generalization ability. In this work, we draw upon prior works that apply geometric principles to optimization and present a novel approach to improve robustness and generalization for constrained optimization problems. Specifically, this paper generalizes the Sharpness-Aware Minimization (SAM) optimizer to Riemannian manifolds. In doing so, we first extend the concept of sharpness and introduce a novel notion of sharpness on manifolds. To support this notion, we present a theoretical analysis characterizing generalization capabilities with respect to manifold sharpness, which yields a tighter bound on the generalization gap than previously known. Motivated by this analysis, we introduce our algorithm, Riemannian Sharpness-Aware Minimization (RSAM). To demonstrate RSAM's ability to enhance generalization, we evaluate and contrast our algorithm on a broad set of problems, such as image classification and contrastive learning across different datasets, including CIFAR100, CIFAR10, and FGVCAircraft. Our code is publicly available at \url{https://t.ly/RiemannianSAM}.
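To make the manifold perturbation concrete, the toy sketch below restricts a SAM-style ascent step to the unit sphere: the Euclidean gradient is projected onto the tangent space, a step of radius rho is taken there, and the result is retracted back onto the sphere by normalization. This only illustrates the general recipe on one simple manifold and is not the RSAM update used in the paper.

    import torch

    def sphere_sam_perturbation(w, grad, rho=0.05):
        # SAM-style ascent restricted to the unit sphere (illustrative only).
        w = w / w.norm()
        tangent_grad = grad - (grad @ w) * w          # project onto the tangent space at w
        direction = tangent_grad / (tangent_grad.norm() + 1e-12)
        w_perturbed = w + rho * direction             # step in the tangent space
        return w_perturbed / w_perturbed.norm()       # retract back onto the sphere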
Entropic Gromov-Wasserstein between Gaussian Distributions
Le, Khang, Le, Dung, Nguyen, Huy, Do, Dat, Pham, Tung, Ho, Nhat
We study the entropic Gromov-Wasserstein distance and its unbalanced version between (unbalanced) Gaussian distributions with different dimensions. When the metric is the inner product, which we refer to as inner product Gromov-Wasserstein (IGW), we demonstrate that the optimal transportation plans of entropic IGW and its unbalanced variant are (unbalanced) Gaussian distributions. Via an application of von Neumann's trace inequality, we obtain closed-form expressions for the entropic IGW between these Gaussian distributions. Finally, we consider an entropic inner product Gromov-Wasserstein barycenter of multiple Gaussian distributions. We prove that the barycenter is a Gaussian distribution when the entropic regularization parameter is small, and we further derive closed-form expressions for its covariance matrix.
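A numerical counterpart of this setting can be sketched with the POT library (assuming ot.gromov.entropic_gromov_wasserstein is available with the calling convention below): sample from two centered Gaussians of different dimensions, build inner-product structure matrices as in IGW, and solve the entropic problem between the empirical measures. This only approximates the population quantity for which the paper derives closed forms.

    import numpy as np
    import ot  # POT: Python Optimal Transport

    rng = np.random.default_rng(0)
    n, d1, d2 = 200, 3, 5

    # Samples from two centered Gaussians of different dimensions.
    A = rng.standard_normal((d1, d1)); Sigma1 = A @ A.T + np.eye(d1)
    B = rng.standard_normal((d2, d2)); Sigma2 = B @ B.T + np.eye(d2)
    X = rng.multivariate_normal(np.zeros(d1), Sigma1, size=n)
    Y = rng.multivariate_normal(np.zeros(d2), Sigma2, size=n)

    # Inner-product "structure" matrices, matching the IGW setting.
    C1, C2 = X @ X.T, Y @ Y.T
    p, q = ot.unif(n), ot.unif(n)

    # Entropic Gromov-Wasserstein plan between the empirical measures
    # (the regularization strength is an illustrative choice).
    plan = ot.gromov.entropic_gromov_wasserstein(C1, C2, p, q, 'square_loss',
                                                 epsilon=5.0)
    print(plan.shape, plan.sum())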
An Efficient Mini-batch Method via Partial Transportation
Nguyen, Khai, Nguyen, Dang, Pham, Tung, Ho, Nhat
Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite its practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal at the mini-batch level but do not exist in the optimal transportation plan between the original measures. To address this issue, we propose a novel mini-batch method that uses partial optimal transport (POT) between mini-batch empirical measures, which we refer to as mini-batch partial optimal transport (m-POT). Leveraging insights from partial transportation, we explain the source of misspecified mappings in m-OT and motivate why limiting the amount of mass transported among mini-batches via POT can alleviate the incorrect mappings. Finally, we carry out extensive experiments on various applications to compare m-POT with m-OT and a recently proposed mini-batch method, mini-batch unbalanced optimal transport (m-UOT). We observe that m-POT is better than m-OT in deep domain adaptation applications while having comparable performance with m-UOT. On other applications, such as deep generative models, gradient flows, and color transfer, m-POT yields more favorable performance than both m-OT and m-UOT.
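The following sketch illustrates the m-POT recipe with the POT library: draw mini-batches from each measure, compute a partial transport plan that moves only a fraction of the mini-batch mass, and average the resulting costs. The batch size, the transported mass fraction, and the aggregation by simple averaging are illustrative assumptions and need not match the paper's exact procedure.

    import numpy as np
    import ot  # POT: Python Optimal Transport

    def minibatch_partial_ot_cost(X, Y, batch_size=64, n_batches=50,
                                  mass=0.75, rng=None):
        # m-POT sketch: average the partial transport cost over random
        # mini-batch pairs, moving only a fraction `mass` of the mini-batch
        # mass so that spurious (misspecified) matches can be dropped.
        rng = rng or np.random.default_rng(0)
        a = np.full(batch_size, 1.0 / batch_size)
        b = np.full(batch_size, 1.0 / batch_size)
        total = 0.0
        for _ in range(n_batches):
            xb = X[rng.choice(len(X), batch_size, replace=False)]
            yb = Y[rng.choice(len(Y), batch_size, replace=False)]
            M = ot.dist(xb, yb)                               # squared Euclidean costs
            plan = ot.partial.partial_wasserstein(a, b, M, m=mass)
            total += np.sum(plan * M)
        return total / n_batches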
On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity
Le, Khang, Nguyen, Huy, Pham, Tung, Ho, Nhat
Recent advances in the computation of optimal transport (OT) [34, 11, 24, 12, 1] have brought new applications of optimal transport in machine learning and data science to the fore. Examples of these applications include generative models [2, 33, 16, 10], unsupervised learning [17, 18], computer vision [32, 25], and other applications [29, 27, 6]. However, due to the marginal constraints on transportation plans, optimal transport is only defined between balanced measures, namely, measures with equal masses. When measures are unbalanced, i.e., they can have different masses, there are two popular approaches for defining divergences between them. The first approach is unbalanced optimal transport [9, 8]. The main idea of unbalanced optimal transport is to regularize the objective function of optimal transport by penalizing certain divergences between the marginals of the transportation plan and the given measures. Despite its favorable computational complexity [28] and practical applications [31, 14, 19, 3], the optimal transportation plan from unbalanced optimal transport is often non-trivial to interpret in practice. The second approach for defining a divergence between unbalanced measures is partial optimal transport (POT) [5, 13], which was originally used to analyze partial differential equations.
On Robust Optimal Transport: Computational Complexity, Low-rank Approximation, and Barycenter Computation
Le, Khang, Nguyen, Huy, Nguyen, Quang, Ho, Nhat, Pham, Tung, Bui, Hung
Recent advances in computation for the optimal transport (OT) problem [12, 3, 13, 7, 20, 23, 17, 18] have led to a surge of interest in using this tool in various domains of machine learning and statistics. The range of its applications is broad, including deep generative models [4, 14, 32], scalable Bayes [29, 30], mixture and hierarchical models [21], and other applications [28, 25, 10, 15, 33, 31, 8]. The goal of optimal transport is to find a minimal cost of moving masses between (supports of) probability distributions. It is known that the estimation of the transport cost is not robust when there are outliers. To deal with this issue, [34] proposed a trimmed version of optimal transport. In particular, they search for truncated probability distributions such that the optimal transport cost between them is minimized. However, their trimmed optimal transport is non-trivial to compute, which hinders its usage in practical applications. Another line of work proposed using unbalanced optimal transport (UOT) to address the sensitivity of optimal transport to outliers [5, 26]. More specifically, the idea is to assign as little mass as possible to outliers by relaxing the marginal constraints of OT through a penalty function such as the KL divergence.
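The KL-relaxation idea in the last sentence can be seen in a few lines with the POT library: compared with balanced OT, the unbalanced Sinkhorn plan is free to send very little mass from an outlier point. The regularization values and the cost normalization below are illustrative choices.

    import numpy as np
    import ot  # POT: Python Optimal Transport

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 2))
    X[0] += 20.0                           # make the first source point an outlier
    Y = rng.standard_normal((50, 2)) + 1.0

    a, b = ot.unif(50), ot.unif(50)
    M = ot.dist(X, Y)
    M = M / M.max()                        # normalize costs for numerical stability

    # Balanced OT must move the outlier's full mass; KL-relaxed UOT need not.
    plan_ot = ot.emd(a, b, M)
    plan_uot = ot.unbalanced.sinkhorn_unbalanced(a, b, M, reg=0.05, reg_m=1.0)

    print("mass sent from the outlier (OT): ", plan_ot[0].sum())
    print("mass sent from the outlier (UOT):", plan_uot[0].sum())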