Goto

Collaborating Authors

 Wang, Bohan


Mitigating Graph Covariate Shift via Score-based Out-of-distribution Augmentation

arXiv.org Artificial Intelligence

Distribution shifts between training and testing datasets significantly impair the model performance on graph learning. A commonly-taken causal view in graph invariant learning suggests that stable predictive features of graphs are causally associated with labels, whereas varying environmental features lead to distribution shifts. In particular, covariate shifts caused by unseen environments in test graphs underscore the critical need for out-of-distribution (OOD) generalization. Existing graph augmentation methods designed to address the covariate shift often disentangle the stable and environmental features in the input space, and selectively perturb or mixup the environmental features. However, such perturbationbased methods heavily rely on an accurate separation of stable and environmental features, and their exploration ability is confined to existing environmental features in the training distribution. To overcome these limitations, we introduce a novel approach using score-based graph generation strategies that synthesize unseen environmental features while preserving the validity and stable features of overall graph patterns. Our comprehensive empirical evaluations demonstrate the enhanced effectiveness of our method in improving graph OOD generalization. Deep learning algorithms have become predominant in the analysis of graph-structured data. However, a common limitation of existing methods is the assumption that both training and testing graphs are independently and identically distributed (i.i.d.). This assumption often falls short in real-world scenarios, where shifts in data distribution frequently occur, leading to significant degradation in model performance.


On the Convergence of Adam under Non-uniform Smoothness: Separability from SGDM and Beyond

arXiv.org Artificial Intelligence

This paper aims to clearly distinguish between Stochastic Gradient Descent with Momentum (SGDM) and Adam in terms of their convergence rates. We demonstrate that Adam achieves a faster convergence compared to SGDM under the condition of non-uniformly bounded smoothness. Our findings reveal that: (1) in deterministic environments, Adam can attain the known lower bound for the convergence rate of deterministic first-order optimizers, whereas the convergence rate of Gradient Descent with Momentum (GDM) has higher order dependence on the initial function value; (2) in stochastic setting, Adam's convergence rate upper bound matches the lower bounds of stochastic first-order optimizers, considering both the initial function value and the final error, whereas there are instances where SGDM fails to converge with any learning rate. These insights distinctly differentiate Adam and SGDM regarding their convergence rates. Additionally, by introducing a novel stopping-time based technique, we further prove that if we consider the minimum gradient norm during iterations, the corresponding convergence rate can match the lower bounds across all problem hyperparameters. The technique can also help proving that Adam with a specific hyperparameter scheduler is parameter-agnostic, which hence can be of independent interest.


Continuous-time Riemannian SGD and SVRG Flows on Wasserstein Probabilistic Space

arXiv.org Artificial Intelligence

Recently, optimization on the Riemannian manifold has provided new insights to the optimization community. In this regard, the manifold taken as the probability measure metric space equipped with the second-order Wasserstein distance is of particular interest, since optimization on it can be linked to practical sampling processes. In general, the oracle (continuous) optimization method on Wasserstein space is Riemannian gradient flow (i.e., Langevin dynamics when minimizing KL divergence). In this paper, we aim to enrich the continuous optimization methods in the Wasserstein space by extending the gradient flow into the stochastic gradient descent (SGD) flow and stochastic variance reduction gradient (SVRG) flow. The two flows on Euclidean space are standard stochastic optimization methods, while their Riemannian counterparts are not explored yet. By leveraging the structures in Wasserstein space, we construct a stochastic differential equation (SDE) to approximate the discrete dynamics of desired stochastic methods in the corresponded random vector space. Then, the flows of probability measures are naturally obtained by applying Fokker-Planck equation to such SDE. Furthermore, the convergence rates of the proposed Riemannian stochastic flows are proven, and they match the results in Euclidean space.


Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study

arXiv.org Machine Learning

Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory still remains elusive. In this work, we empirically show that momentum gradient descent with a large learning rate and learning rate warmup displays large catapults, driving the iterates towards flatter minima than those found by gradient descent. We then provide empirical evidence and theoretical intuition that the large catapult is caused by momentum "amplifying" the self-stabilization effect (Damian et al., 2023).B.1


On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training

arXiv.org Machine Learning

Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application of predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application by an improvement of around 2 points on the downstream tasks when using ImageNet as the pre-training dataset.


Closing the Gap Between the Upper Bound and the Lower Bound of Adam's Iteration Complexity

arXiv.org Artificial Intelligence

Recently, Arjevani et al. [1] established a lower bound of iteration complexity for the first-order optimization under an $L$-smooth condition and a bounded noise variance assumption. However, a thorough review of existing literature on Adam's convergence reveals a noticeable gap: none of them meet the above lower bound. In this paper, we close the gap by deriving a new convergence guarantee of Adam, with only an $L$-smooth condition and a bounded noise variance assumption. Our results remain valid across a broad spectrum of hyperparameters. Especially with properly chosen hyperparameters, we derive an upper bound of the iteration complexity of Adam and show that it meets the lower bound for first-order optimizers. To the best of our knowledge, this is the first to establish such a tight upper bound for Adam's convergence. Our proof utilizes novel techniques to handle the entanglement between momentum and adaptive learning rate and to convert the first-order term in the Descent Lemma to the gradient norm, which may be of independent interest.


Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions

arXiv.org Artificial Intelligence

We provide a simple convergence proof for AdaGrad optimizing non-convex objectives under only affine noise variance and bounded smoothness assumptions. The proof is essentially based on a novel auxiliary function $\xi$ that helps eliminate the complexity of handling the correlation between the numerator and denominator of AdaGrad's update. Leveraging simple proofs, we are able to obtain tighter results than existing results \citep{faw2022power} and extend the analysis to several new and important cases. Specifically, for the over-parameterized regime, we show that AdaGrad needs only $\mathcal{O}(\frac{1}{\varepsilon^2})$ iterations to ensure the gradient norm smaller than $\varepsilon$, which matches the rate of SGD and significantly tighter than existing rates $\mathcal{O}(\frac{1}{\varepsilon^4})$ for AdaGrad. We then discard the bounded smoothness assumption and consider a realistic assumption on smoothness called $(L_0,L_1)$-smooth condition, which allows local smoothness to grow with the gradient norm. Again based on the auxiliary function $\xi$, we prove that AdaGrad succeeds in converging under $(L_0,L_1)$-smooth condition as long as the learning rate is lower than a threshold. Interestingly, we further show that the requirement on learning rate under the $(L_0,L_1)$-smooth condition is necessary via proof by contradiction, in contrast with the case of uniform smoothness conditions where convergence is guaranteed regardless of learning rate choices. Together, our analyses broaden the understanding of AdaGrad and demonstrate the power of the new auxiliary function in the investigations of AdaGrad.


How Can Large Language Models Help Humans in Design and Manufacturing?

arXiv.org Artificial Intelligence

Advances in computational design and manufacturing (CDaM) have already permeated and transformed numerous industries, including aerospace, architecture, electronics, dental, and digital media, among others. Nevertheless, the full potential of the CDaM workflow is still limited by a number of barriers, such as the extensive domainspecific knowledge that is often required to use CDaM software packages or integrate CDaM solutions into existing workflows. Generative AI tools such as Large Language Models (LLMs) have the potential to remove these barriers, by expediting the CDaM process and providing an intuitive, unified, and user-friendly interface that connects each stage of the pipeline. However, to date, generative AI and LLMs have predominantly been applied to non-engineering domains. In this study, we show how these tools can also be used to develop new design and manufacturing workflows.


Fast Conditional Mixing of MCMC Algorithms for Non-log-concave Distributions

arXiv.org Artificial Intelligence

MCMC algorithms offer empirically efficient tools for sampling from a target distribution $\pi(x) \propto \exp(-V(x))$. However, on the theory side, MCMC algorithms suffer from slow mixing rate when $\pi(x)$ is non-log-concave. Our work examines this gap and shows that when Poincar\'e-style inequality holds on a subset $\mathcal{X}$ of the state space, the conditional distribution of MCMC iterates over $\mathcal{X}$ mixes fast to the true conditional distribution. This fast mixing guarantee can hold in cases when global mixing is provably slow. We formalize the statement and quantify the conditional mixing rate. We further show that conditional mixing can have interesting implications for sampling from mixtures of Gaussians, parameter estimation for Gaussian mixture models and Gibbs-sampling with well-connected local minima.


When and Why Momentum Accelerates SGD:An Empirical Study

arXiv.org Artificial Intelligence

Momentum has become a crucial component in deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of ''when'', we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the \emph{effective learning rates} $\eta_{ef}$, a notion unifying the influence of momentum coefficient $\mu$ and batch size $b$ over learning rate $\eta$. In the comparison of SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when $\eta_{ef}$ is small, SGDM and SGD experience almost the same empirical training losses; when $\eta_{ef}$ surpasses a certain threshold, SGDM begins to perform better than SGD. Furthermore, we observe that the advantage of SGDM over SGD becomes more pronounced with a larger batch size. For the question of ``why'', we find that the momentum acceleration is closely related to \emph{abrupt sharpening} which is to describe a sudden jump of the directional Hessian along the update direction. Specifically, the misalignment between SGD and SGDM happens at the same moment that SGD experiences abrupt sharpening and converges slower. Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening. Together, this study unveils the interplay between momentum, learning rates, and batch sizes, thus improving our understanding of momentum acceleration.