AITopics | Mackey, Lester

Collaborating Authors

Mackey, Lester

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Generalized Kernel Thinning

Dwivedi, Raaz, Mackey, Lester

arXiv.org Machine LearningOct-18-2021

The kernel thinning (KT) algorithm of Dwivedi and Mackey (2021) compresses a probability distribution more effectively than independent sampling by targeting a reproducing kernel Hilbert space (RKHS) and leveraging a less smooth square-root kernel. Here we provide four improvements. First, we show that KT applied directly to the target RKHS yields tighter, dimension-free guarantees for any kernel, any distribution, and any fixed function in the RKHS. Second, we show that, for analytic kernels like Gaussian, inverse multiquadric, and sinc, target KT admits maximum mean discrepancy (MMD) guarantees comparable to or better than those of square-root KT without making explicit use of a square-root kernel. Third, we prove that KT with a fractional power kernel yields better-than-Monte-Carlo MMD guarantees for non-smooth kernels, like Laplace and Mat\'ern, that do not have square-roots. Fourth, we establish that KT applied to a sum of the target and power kernels (a procedure we call KT+) simultaneously inherits the improved MMD guarantees of power KT and the tighter individual function guarantees of target KT. In our experiments with target KT and KT+, we witness significant improvements in integration error even in $100$ dimensions and when compressing challenging differential equation posteriors.

artificial intelligence, health & medicine, machine learning, (18 more...)

arXiv.org Machine Learning

2110.01593

Country:

North America > United States > Virginia (0.14)
North America > United States > Massachusetts (0.14)

Genre: Research Report (0.84)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Learned Benchmarks for Subseasonal Forecasting

Mouatadid, Soukayna, Orenstein, Paulo, Flaspohler, Genevieve, Oprescu, Miruna, Cohen, Judah, Wang, Franklyn, Knight, Sean, Geogdzhayeva, Maria, Levang, Sam, Fraenkel, Ernest, Mackey, Lester

arXiv.org Machine LearningSep-21-2021

We develop a subseasonal forecasting toolkit of simple learned benchmark models that outperform both operational practice and state-of-the-art machine learning and deep learning methods. Our new models include (a) Climatology++, an adaptive alternative to climatology that, for precipitation, is 9% more accurate and 250% more skillful than the United States operational Climate Forecasting System (CFSv2); (b) CFSv2++, a learned CFSv2 correction that improves temperature and precipitation accuracy by 7-8% and skill by 50-275%; and (c) Persistence++, an augmented persistence model that combines CFSv2 forecasts with lagged measurements to improve temperature and precipitation accuracy by 6-9% and skill by 40-130%. Across the contiguous U.S., our Climatology++, CFSv2++, and Persistence++ toolkit consistently outperforms standard meteorological baselines, state-of-the-art machine and deep learning methods, and the European Centre for Medium-Range Weather Forecasts ensemble. Overall, we find that augmenting traditional forecasting approaches with learned enhancements yields an effective and computationally inexpensive strategy for building the next generation of subseasonal forecasting benchmarks.

educational method, mentoring method, precipitation, (29 more...)

arXiv.org Machine Learning

2109.10399

Country:

North America > United States > Massachusetts > Middlesex County (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.50)

Industry: Government > Regional Government > North America Government > United States Government (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Social Norm Bias: Residual Harms of Fairness-Aware Algorithms

Cheng, Myra, De-Arteaga, Maria, Mackey, Lester, Kalai, Adam Tauman

arXiv.org Artificial IntelligenceAug-25-2021

Many modern learning algorithms mitigate bias by enforcing fairness across coarsely-defined groups related to a sensitive attribute like gender or race. However, the same algorithms seldom account for the within-group biases that arise due to the heterogeneity of group members. In this work, we characterize Social Norm Bias (SNoB), a subtle but consequential type of discrimination that may be exhibited by automated decision-making systems, even when these systems achieve group fairness objectives. We study this issue through the lens of gender bias in occupation classification from biographies. We quantify SNoB by measuring how an algorithm's predictions are associated with conformity to gender norms, which is measured using a machine learning approach. This framework reveals that for classification tasks related to male-dominated occupations, fairness-aware classifiers favor biographies written in ways that align with masculine gender norms. We compare SNoB across fairness intervention techniques and show that post-processing interventions do not mitigate this type of bias at all.

artificial intelligence, machine learning, obj filter flatedecode length 149, (7 more...)

arXiv.org Artificial Intelligence

2108.11056

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Near-optimal inference in adaptive linear regression

Khamaru, Koulik, Deshpande, Yash, Mackey, Lester, Wainwright, Martin J.

arXiv.org Machine LearningJul-13-2021

When data is collected in an adaptive manner, even simple methods like ordinary least squares can exhibit non-normal asymptotic behavior. As an undesirable consequence, hypothesis tests and confidence intervals based on asymptotic normality can lead to erroneous results. We propose an online debiasing estimator to correct these distributional anomalies in least squares estimation. Our proposed method takes advantage of the covariance structure present in the dataset and provides sharper estimates in directions for which more information has accrued. We establish an asymptotic normality property for our proposed online debiasing estimator under mild conditions on the data collection process, and provide asymptotically exact confidence intervals. We additionally prove a minimax lower bound for the adaptive linear regression problem, thereby providing a baseline by which to compare estimators. There are various conditions under which our proposed estimator achieves the minimax lower bound up to logarithmic factors. We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.

artificial intelligence, estimator, machine learning, (16 more...)

arXiv.org Machine Learning

2107.02266

Country:

North America > United States > California (0.14)
Europe > United Kingdom > England (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.72)

Add feedback

Online Learning with Optimism and Delay

Flaspohler, Genevieve, Orabona, Francesco, Cohen, Judah, Mouatadid, Soukayna, Oprescu, Miruna, Orenstein, Paulo, Mackey, Lester

arXiv.org Machine LearningJun-16-2021

Inspired by the demands of real-time climate and weather forecasting, we develop optimistic online learning algorithms that require no parameter tuning and have optimal regret guarantees under delayed feedback. Our algorithms -- DORM, DORM+, and AdaHedgeD -- arise from a novel reduction of delayed online learning to optimistic online learning that reveals how optimistic hints can mitigate the regret penalty caused by delay. We pair this delay-as-optimism perspective with a new analysis of optimistic learning that exposes its robustness to hinting errors and a new meta-algorithm for learning effective hinting strategies in the presence of delay. We conclude by benchmarking our algorithms on four subseasonal climate forecasting tasks, demonstrating low regret relative to state-of-the-art forecasting models.

algorithm, computer based training, educational technology, (19 more...)

arXiv.org Machine Learning

2106.06885

Country:

North America > United States > Massachusetts (0.14)
North America > Canada > Ontario > Toronto (0.14)
Europe > United Kingdom > Scotland (0.14)
Europe > United Kingdom > England (0.14)

Genre:

Overview (0.67)
Research Report (0.63)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Kernel Thinning

Dwivedi, Raaz, Mackey, Lester

arXiv.org Machine LearningMay-12-2021

We introduce kernel thinning, a simple algorithm for generating better-than-Monte-Carlo approximations to distributions $\mathbb{P}$ on $\mathbb{R}^d$. Given $n$ input points, a suitable reproducing kernel $\mathbf{k}$, and $\mathcal{O}(n^2)$ time, kernel thinning returns $\sqrt{n}$ points with comparable integration error for every function in the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{-\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for sub-exponential $\mathbb{P}$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-\frac14})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.

artificial intelligence, kernel, machine learning, (18 more...)

arXiv.org Machine Learning

2105.05842

Country:

Europe (0.28)
North America > United States > Virginia (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Initialization and Regularization of Factorized Neural Layers

Khodak, Mikhail, Tenenholtz, Neil, Mackey, Lester, Fusi, Nicolò

arXiv.org Artificial IntelligenceMay-3-2021

Factorized layers--operations parameterized by products of two or more matrices--occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head selfattention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to that of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight-decay and batch-normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with or pruning a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training. Most neural network layers consist of matrix-parameterized functions followed by simple operations such as activation or normalization.

deep learning, neural network, proceedings, (18 more...)

arXiv.org Artificial Intelligence

2105.01029

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.86)

Add feedback

Knowledge Distillation as Semiparametric Inference

Dao, Tri, Kamath, Govinda M, Syrgkanis, Vasilis, Mackey, Lester

arXiv.org Machine LearningApr-19-2021

A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate. By adapting modern semiparametric tools, we derive new guarantees for the prediction error of standard distillation and develop two enhancements--cross-fitting and loss correction--to mitigate the impact of teacher overfitting and underfitting on student performance. We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements. Knowledge distillation (KD) (Craven & Shavlik, 1996; Breiman & Shang, 1996; Bucila et al., 2006; Li et al., 2014; Ba & Caruana, 2014; Hinton et al., 2015) is a widely used model compression technique that enables the deployment of highly accurate predictive models on devices such as phones, watches, and virtual assistants (Stock et al., 2020).

neural network, probability, survey article, (15 more...)

arXiv.org Machine Learning

2104.09732

Country: North America > United States > California > Alameda County > Berkeley (0.14)

Genre: Research Report > New Finding (0.66)

Industry:

Information Technology > Security & Privacy (0.67)
Education > Assessment & Standards > Student Performance (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

Cross-validation Confidence Intervals for Test Error

Bayle, Pierre, Bayle, Alexandre, Janson, Lucas, Mackey, Lester

arXiv.org Machine LearningOct-31-2020

This work develops central limit theorems for cross-validation and consistent estimators of its asymptotic variance under weak stability conditions on the learning algorithm. Together, these results provide practical, asymptotically-exact confidence intervals for $k$-fold test error and valid, powerful hypothesis tests of whether one learning algorithm has smaller $k$-fold test error than another. These results are also the first of their kind for the popular choice of leave-one-out cross-validation. In our real-data experiments with diverse learning algorithms, the resulting intervals and tests outperform the most popular alternative methods from the literature.

artificial intelligence, health & medicine, test error, (19 more...)

arXiv.org Machine Learning

2007.12671

Country:

North America > United States > Illinois (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.69)

Add feedback

Stochastic Stein Discrepancies

Gorham, Jackson, Raj, Anant, Mackey, Lester

arXiv.org Machine LearningOct-22-2020

Stein discrepancies (SDs) monitor convergence and non-convergence in approximate inference when exact integration and sampling are intractable. However, the computation of a Stein discrepancy can be prohibitive if the Stein operator - often a sum over likelihood terms or potentials - is expensive to evaluate. To address this deficiency, we show that stochastic Stein discrepancies (SSDs) based on subsampled approximations of the Stein operator inherit the convergence control properties of standard SDs with probability 1. Along the way, we establish the convergence of Stein variational gradient descent (SVGD) on unbounded domains, resolving an open question of Liu (2017). In our experiments with biased Markov chain Monte Carlo (MCMC) hyperparameter tuning, approximate MCMC sampler selection, and stochastic SVGD, SSDs deliver comparable inferences to standard SDs with orders of magnitude fewer likelihood evaluations.

artificial intelligence, machine learning, stein discrepancy, (16 more...)

arXiv.org Machine Learning

2007.02857

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback