Mahoney, Michael
Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior
Subramanian, Shashank, Harrington, Peter, Keutzer, Kurt, Bhimji, Wahid, Morozov, Dmitriy, Mahoney, Michael, Gholami, Amir
Pre-trained machine learning (ML) models have shown great performance for a wide range of applications, in particular in natural language processing (NLP) and computer vision (CV). Here, we study how pre-training could be used for scientific machine learning (SciML) applications, specifically in the context of transfer learning. We study the transfer behavior of these models as (i) the pre-trained model size is scaled, (ii) the downstream training dataset size is scaled, (iii) the physics parameters are systematically pushed out of distribution, and (iv) how a single model pre-trained on a mixture of different physics problems can be adapted to various downstream applications. We find that-when fine-tuned appropriately-transfer learning can help reach desired accuracy levels with orders of magnitude fewer downstream examples (across different tasks that can even be out-of-distribution) than training from scratch, with consistent behavior across a wide range of downstream examples. We also find that fine-tuning these models yields more performance gains as model size increases, compared to training from scratch on new downstream tasks. These results hold for a broad range of PDE learning tasks. All in all, our results demonstrate the potential of the "pre-train and fine-tune" paradigm for SciML problems, demonstrating a path towards building SciML foundation models. We open-source our code for reproducibility.
GACT: Activation Compressed Training for Generic Network Architectures
Liu, Xiaoxuan, Zheng, Lianmin, Wang, Dequan, Cen, Yukuo, Chen, Weize, Han, Xu, Chen, Jianfei, Liu, Zhiyuan, Tang, Jie, Gonzalez, Joey, Mahoney, Michael, Cheung, Alvin
Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint. This paper presents GACT, an ACT framework to support a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge on operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss. We implement GACT as a PyTorch library at https://github.com/LiuXiaoxuanPKU/GACT-ICML.
LocalNewton: Reducing Communication Bottleneck for Distributed Learning
Gupta, Vipul, Ghosh, Avishek, Derezinski, Michal, Khanna, Rajiv, Ramchandran, Kannan, Mahoney, Michael
To address the communication bottleneck problem in distributed optimization within a master-worker framework, we propose LocalNewton, a distributed second-order algorithm with local averaging. In LocalNewton, the worker machines update their model in every iteration by finding a suitable second-order descent direction using only the data and model stored in their own local memory. We let the workers run multiple such iterations locally and communicate the models to the master node only once every few (say L) iterations. LocalNewton is highly practical since it requires only one hyperparameter, the number L of local iterations. We use novel matrix concentration-based techniques to obtain theoretical guarantees for LocalNewton, and we validate them with detailed empirical evaluation. To enhance practicability, we devise an adaptive scheme to choose L, and we show that this reduces the number of local iterations in worker machines between two model synchronizations as the training proceeds, successively refining the model quality at the master. Via extensive experiments using several real-world datasets with AWS Lambda workers and an AWS EC2 master, we show that LocalNewton requires fewer than 60% of the communication rounds (between master and workers) and less than 40% of the end-to-end running time, compared to state-of-the-art algorithms, to reach the same training~loss.
ANODEV2: A Coupled Neural ODE Evolution Framework
Zhang, Tianjun, Yao, Zhewei, Gholami, Amir, Keutzer, Kurt, Gonzalez, Joseph, Biros, George, Mahoney, Michael
It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, which allow more general discretization schemes with adaptive time stepping. Here, we propose ANODEV2, which is an extension of this approach that also allows evolution of the neural network parameters, in a coupled ODE-based formulation. The Neural ODE method introduced earlier is in fact a special case of this new more general framework. We present the formulation of ANODEV2, derive optimality conditions, and implement a coupled reaction-diffusion-advection version of this framework in PyTorch. We present empirical results using several different configurations of ANODEV2, testing them on multiple models on CIFAR-10. We report results showing that this coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, as compared to the baseline models as well as the recently-proposed Neural ODE approach.
Trust Region Based Adversarial Attack on Neural Networks
Yao, Zhewei, Gholami, Amir, Xu, Peng, Keutzer, Kurt, Mahoney, Michael
Deep Neural Networks are quite vulnerable to adversarial perturbations. Current state-of-the-art adversarial attack methods typically require very time consuming hyper-parameter tuning, or require many iterations to solve an optimization based adversarial attack. To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently. We propose several attacks based on variants of the trust region optimization method. We test the proposed methods on Cifar-10 and ImageNet datasets using several different models including AlexNet, ResNet-50, VGG-16, and DenseNet-121 models. Our methods achieve comparable results with the Carlini-Wagner (CW) attack, but with significant speed up of up to $37\times$, for the VGG-16 model on a Titan Xp GPU. For the case of ResNet-50 on ImageNet, we can bring down its classification accuracy to less than 0.1\% with at most $1.5\%$ relative $L_\infty$ (or $L_2$) perturbation requiring only $1.02$ seconds as compared to $27.04$ seconds for the CW attack. We have open sourced our method which can be accessed at [1].
Large batch size training of neural networks with adversarial training and second-order information
Yao, Zhewei, Gholami, Amir, Keutzer, Kurt, Mahoney, Michael
Stochastic Gradient Descent (SGD) methods using randomly selected batches are widely-used to train neural network (NN) models. Performing design exploration to find the best NN for a particular task often requires extensive training with different models on a large dataset, which is very computationally expensive. The most straightforward method to accelerate this computation is to distribute the batch of SGD over multiple processors. To keep the distributed processors fully utilized requires commensurately growing the batch size; however, large batch training often times leads to degradation in accuracy, poor generalization, and even poor robustness to adversarial attacks. Existing solutions for large batch training either significantly degrade accuracy or require massive hyper-parameter tuning. To address this issue, we propose a novel large batch training method which combines recent results in adversarial training (to regularize against `sharp minima') and second order optimization (to use curvature information to change batch size adaptively during training). We extensively evaluate our method on Cifar-10/100, SVHN, TinyImageNet, and ImageNet datasets, using multiple NNs, including residual networks as well as smaller networks for mobile applications such as SqueezeNext. Our new approach exceeds the performance of the existing solutions in terms of both accuracy and the number of SGD iterations (up to 1\% and $5\times$, respectively). We emphasize that this is achieved without any additional hyper-parameter tuning to tailor our proposed method in any of these experiments.
A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares
Raskutti, Garvesh, Mahoney, Michael
We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, sketching algorithms use a sketching matrix, $S\in\mathbb{R}^{r \times n}$ with $r \ll n$. Then, rather than solving the LS problem using the full data $(X,Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input $X$ and $Y$, and instead it has been assumed that the data $(X,Y)$ are fixed and worst-case (WC). Prior results show that, when using sketching matrices such as random projections and leverage-score sampling algorithms, with $p < r \ll n$, the WC error is the same as solving the original problem, up to a small constant. From a statistical perspective, we typically consider the mean-squared error performance of randomized sketching algorithms, when data $(X, Y)$ are generated according to a statistical model $Y = X \beta + \epsilon$, where $\epsilon$ is a noise process. We provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling sketching algorithms. Among other results, we show that the RE can be upper bounded when $p < r \ll n$ while the PE typically requires the sample size $r$ to be substantially larger. Lower bounds developed in subsequent results show that our upper bounds on PE can not be improved.
Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels
Avron, Haim, Sindhwani, Vikas, Yang, Jiyan, Mahoney, Michael
We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large datasets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead, where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to random point sets as in the Monte Carlo approach. We derive a new discrepancy measure called box discrepancy based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem.
Statistical and Algorithmic Perspectives on Randomized Sketching for Ordinary Least-Squares -- ICML
Raskutti, Garvesh, Mahoney, Michael
We consider statistical and algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. Prior results show that, from an \emph{algorithmic perspective}, when using sketching matrices constructed from random projections and leverage-score sampling, if the number of samples $r$ much smaller than the original sample size $n$, then the worst-case (WC) error is the same as solving the original problem, up to a very small relative error. From a \emph{statistical perspective}, one typically considers the mean-squared error performance of randomized sketching algorithms, when data are generated according to a statistical linear model. In this paper, we provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling algorithms. Among other results, we show that the RE can be upper bounded when $r$ is much smaller than $n$, while the PE typically requires the number of samples $r$ to be substantially larger. Lower bounds developed in subsequent work show that our upper bounds on PE can not be improved.