AITopics | Mahoney, Michael

Collaborating Authors

Mahoney, Michael

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior

Subramanian, Shashank, Harrington, Peter, Keutzer, Kurt, Bhimji, Wahid, Morozov, Dmitriy, Mahoney, Michael, Gholami, Amir

arXiv.org Artificial IntelligenceMay-31-2023

Pre-trained machine learning (ML) models have shown great performance for a wide range of applications, in particular in natural language processing (NLP) and computer vision (CV). Here, we study how pre-training could be used for scientific machine learning (SciML) applications, specifically in the context of transfer learning. We study the transfer behavior of these models as (i) the pre-trained model size is scaled, (ii) the downstream training dataset size is scaled, (iii) the physics parameters are systematically pushed out of distribution, and (iv) how a single model pre-trained on a mixture of different physics problems can be adapted to various downstream applications. We find that-when fine-tuned appropriately-transfer learning can help reach desired accuracy levels with orders of magnitude fewer downstream examples (across different tasks that can even be out-of-distribution) than training from scratch, with consistent behavior across a wide range of downstream examples. We also find that fine-tuning these models yields more performance gains as model size increases, compared to training from scratch on new downstream tasks. These results hold for a broad range of PDE learning tasks. All in all, our results demonstrate the potential of the "pre-train and fine-tune" paradigm for SciML problems, demonstrating a path towards building SciML foundation models. We open-source our code for reproducibility.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2306.00258

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.86)

Industry: Energy > Oil & Gas > Upstream (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Model-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

GACT: Activation Compressed Training for Generic Network Architectures

Liu, Xiaoxuan, Zheng, Lianmin, Wang, Dequan, Cen, Yukuo, Chen, Weize, Han, Xu, Chen, Jianfei, Liu, Zhiyuan, Tang, Jie, Gonzalez, Joey, Mahoney, Michael, Cheung, Alvin

arXiv.org Artificial IntelligenceSep-3-2022

Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint. This paper presents GACT, an ACT framework to support a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge on operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss. We implement GACT as a PyTorch library at https://github.com/LiuXiaoxuanPKU/GACT-ICML.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2206.11357

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

LocalNewton: Reducing Communication Bottleneck for Distributed Learning

Gupta, Vipul, Ghosh, Avishek, Derezinski, Michal, Khanna, Rajiv, Ramchandran, Kannan, Mahoney, Michael

arXiv.org Machine LearningMay-15-2021

To address the communication bottleneck problem in distributed optimization within a master-worker framework, we propose LocalNewton, a distributed second-order algorithm with local averaging. In LocalNewton, the worker machines update their model in every iteration by finding a suitable second-order descent direction using only the data and model stored in their own local memory. We let the workers run multiple such iterations locally and communicate the models to the master node only once every few (say L) iterations. LocalNewton is highly practical since it requires only one hyperparameter, the number L of local iterations. We use novel matrix concentration-based techniques to obtain theoretical guarantees for LocalNewton, and we validate them with detailed empirical evaluation. To enhance practicability, we devise an adaptive scheme to choose L, and we show that this reduces the number of local iterations in worker machines between two model synchronizations as the training proceeds, successively refining the model quality at the master. Via extensive experiments using several real-world datasets with AWS Lambda workers and an AWS EC2 master, we show that LocalNewton requires fewer than 60% of the communication rounds (between master and workers) and less than 40% of the end-to-end running time, compared to state-of-the-art algorithms, to reach the same training~loss.

artificial intelligence, localnewton, optimization problem, (17 more...)

arXiv.org Machine Learning

2105.0732

Country: North America > United States > California (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Communications (0.69)

Add feedback

ANODEV2: A Coupled Neural ODE Evolution Framework

Zhang, Tianjun, Yao, Zhewei, Gholami, Amir, Keutzer, Kurt, Gonzalez, Joseph, Biros, George, Mahoney, Michael

arXiv.org Machine LearningJun-9-2019

It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, which allow more general discretization schemes with adaptive time stepping. Here, we propose ANODEV2, which is an extension of this approach that also allows evolution of the neural network parameters, in a coupled ODE-based formulation. The Neural ODE method introduced earlier is in fact a special case of this new more general framework. We present the formulation of ANODEV2, derive optimality conditions, and implement a coupled reaction-diffusion-advection version of this framework in PyTorch. We present empirical results using several different configurations of ANODEV2, testing them on multiple models on CIFAR-10. We report results showing that this coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, as compared to the baseline models as well as the recently-proposed Neural ODE approach.

deep learning, model parameter, neural network, (19 more...)

arXiv.org Machine Learning

1906.04596

Country:

North America > United States > Texas (0.14)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Trust Region Based Adversarial Attack on Neural Networks

Yao, Zhewei, Gholami, Amir, Xu, Peng, Keutzer, Kurt, Mahoney, Michael

arXiv.org Machine LearningDec-15-2018

Deep Neural Networks are quite vulnerable to adversarial perturbations. Current state-of-the-art adversarial attack methods typically require very time consuming hyper-parameter tuning, or require many iterations to solve an optimization based adversarial attack. To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently. We propose several attacks based on variants of the trust region optimization method. We test the proposed methods on Cifar-10 and ImageNet datasets using several different models including AlexNet, ResNet-50, VGG-16, and DenseNet-121 models. Our methods achieve comparable results with the Carlini-Wagner (CW) attack, but with significant speed up of up to $37\times$, for the VGG-16 model on a Titan Xp GPU. For the case of ResNet-50 on ImageNet, we can bring down its classification accuracy to less than 0.1\% with at most $1.5\%$ relative $L_\infty$ (or $L_2$) perturbation requiring only $1.02$ seconds as compared to $27.04$ seconds for the CW attack. We have open sourced our method which can be accessed at [1].

deep learning, neural network, perturbation, (16 more...)

arXiv.org Machine Learning

1812.06371

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Large batch size training of neural networks with adversarial training and second-order information

Yao, Zhewei, Gholami, Amir, Keutzer, Kurt, Mahoney, Michael

arXiv.org Artificial IntelligenceOct-1-2018

Stochastic Gradient Descent (SGD) methods using randomly selected batches are widely-used to train neural network (NN) models. Performing design exploration to find the best NN for a particular task often requires extensive training with different models on a large dataset, which is very computationally expensive. The most straightforward method to accelerate this computation is to distribute the batch of SGD over multiple processors. To keep the distributed processors fully utilized requires commensurately growing the batch size; however, large batch training often times leads to degradation in accuracy, poor generalization, and even poor robustness to adversarial attacks. Existing solutions for large batch training either significantly degrade accuracy or require massive hyper-parameter tuning. To address this issue, we propose a novel large batch training method which combines recent results in adversarial training (to regularize against `sharp minima') and second order optimization (to use curvature information to change batch size adaptively during training). We extensively evaluate our method on Cifar-10/100, SVHN, TinyImageNet, and ImageNet datasets, using multiple NNs, including residual networks as well as smaller networks for mobile applications such as SqueezeNext. Our new approach exceeds the performance of the existing solutions in terms of both accuracy and the number of SGD iterations (up to 1\% and $5\times$, respectively). We emphasize that this is achieved without any additional hyper-parameter tuning to tailor our proposed method in any of these experiments.

batch size, deep learning, neural network, (17 more...)

arXiv.org Artificial Intelligence

1810.01021

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares

Raskutti, Garvesh, Mahoney, Michael

arXiv.org Machine LearningAug-25-2015

We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, sketching algorithms use a sketching matrix, $S\in\mathbb{R}^{r \times n}$ with $r \ll n$. Then, rather than solving the LS problem using the full data $(X,Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input $X$ and $Y$, and instead it has been assumed that the data $(X,Y)$ are fixed and worst-case (WC). Prior results show that, when using sketching matrices such as random projections and leverage-score sampling algorithms, with $p < r \ll n$, the WC error is the same as solving the original problem, up to a small constant. From a statistical perspective, we typically consider the mean-squared error performance of randomized sketching algorithms, when data $(X, Y)$ are generated according to a statistical model $Y = X \beta + \epsilon$, where $\epsilon$ is a noise process. We provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling sketching algorithms. Among other results, we show that the RE can be upper bounded when $p < r \ll n$ while the PE typically requires the sample size $r$ to be substantially larger. Lower bounds developed in subsequent results show that our upper bounds on PE can not be improved.

artificial intelligence, leverage score, survey article, (19 more...)

arXiv.org Machine Learning

1406.5986

Country:

North America > United States > Wisconsin (0.14)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels

Avron, Haim, Sindhwani, Vikas, Yang, Jiyan, Mahoney, Michael

arXiv.org Machine LearningAug-9-2015

We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large datasets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead, where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to random point sets as in the Monte Carlo approach. We derive a new discrepancy measure called box discrepancy based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem.

artificial intelligence, sequence, survey article, (17 more...)

arXiv.org Machine Learning

1412.8293

Country: North America > United States > California > Santa Clara County (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining (0.93)

Add feedback

Statistical and Algorithmic Perspectives on Randomized Sketching for Ordinary Least-Squares -- ICML

Raskutti, Garvesh, Mahoney, Michael

arXiv.org Machine LearningMay-25-2015

We consider statistical and algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. Prior results show that, from an \emph{algorithmic perspective}, when using sketching matrices constructed from random projections and leverage-score sampling, if the number of samples $r$ much smaller than the original sample size $n$, then the worst-case (WC) error is the same as solving the original problem, up to a very small relative error. From a \emph{statistical perspective}, one typically considers the mean-squared error performance of randomized sketching algorithms, when data are generated according to a statistical linear model. In this paper, we provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling algorithms. Among other results, we show that the RE can be upper bounded when $r$ is much smaller than $n$, while the PE typically requires the number of samples $r$ to be substantially larger. Lower bounds developed in subsequent work show that our upper bounds on PE can not be improved.

algorithm, artificial intelligence, survey article, (16 more...)

arXiv.org Machine Learning

1505.06659

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
North America > United States > California > Alameda County > Berkeley (0.14)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.67)

Add feedback