Goto

Collaborating Authors

 LeJeune, Daniel


Fine-grained Analysis and Faster Algorithms for Iteratively Solving Linear Systems

arXiv.org Artificial Intelligence

While effective in practice, iterative methods for solving large systems of linear equations can be significantly affected by problem-dependent condition number quantities. This makes characterizing their time complexity challenging, particularly when we wish to make comparisons between deterministic and stochastic methods, that may or may not rely on preconditioning and/or fast matrix multiplication. In this work, we consider a fine-grained notion of complexity for iterative linear solvers which we call the spectral tail condition number, $\kappa_\ell$, defined as the ratio between the $\ell$th largest and the smallest singular value of the matrix representing the system. Concretely, we prove the following main algorithmic result: Given an $n\times n$ matrix $A$ and a vector $b$, we can find $\tilde{x}$ such that $\|A\tilde{x}-b\|\leq\epsilon\|b\|$ in time $\tilde{O}(\kappa_\ell\cdot n^2\log 1/\epsilon)$ for any $\ell = O(n^{\frac1{\omega-1}})=O(n^{0.729})$, where $\omega \approx 2.372$ is the current fast matrix multiplication exponent. This guarantee is achieved by Sketch-and-Project with Nesterov's acceleration. Some of the implications of our result, and of the use of $\kappa_\ell$, include direct improvement over a fine-grained analysis of the Conjugate Gradient method, suggesting a stronger separation between deterministic and stochastic iterative solvers; and relating the complexity of iterative solvers to the ongoing algorithmic advances in fast matrix multiplication, since the bound on $\ell$ improves with $\omega$. Our main technical contributions are new sharp characterizations for the first and second moments of the random projection matrix that commonly arises in sketching algorithms, building on a combination of techniques from combinatorial sampling via determinantal point processes and Gaussian universality results from random matrix theory.


An Adaptive Tangent Feature Perspective of Neural Networks

arXiv.org Artificial Intelligence

In order to better understand feature learning in neural networks, we propose a framework for understanding linear models in tangent feature space where the features are allowed to be transformed during training. We consider linear transformations of features, resulting in a joint optimization over parameters and transformations with a bilinear interpolation constraint. We show that this optimization problem has an equivalent linearly constrained optimization with structured regularization that encourages approximately low rank solutions. Specializing to neural network structure, we gain insights into how the features and thus the kernel function change, providing additional nuance to the phenomenon of kernel alignment when the target function is poorly represented using tangent features. In addition to verifying our theoretical observations in real neural networks on a simple regression problem, we empirically show that an adaptive feature implementation of tangent feature classification has an order of magnitude lower sample complexity than the fixed tangent feature model on MNIST and CIFAR-10.


Asymptotically free sketched ridge ensembles: Risks, cross-validation, and tuning

arXiv.org Machine Learning

We employ random matrix theory to establish consistency of generalized cross validation (GCV) for estimating prediction risks of sketched ridge regression ensembles, enabling efficient and consistent tuning of regularization and sketching parameters. Our results hold for a broad class of asymptotically free sketches under very mild data assumptions. For squared prediction risk, we provide a decomposition into an unsketched equivalent implicit ridge bias and a sketching-based variance, and prove that the risk can be globally optimized by only tuning sketch size in infinite ensembles. For general subquadratic prediction risk functionals, we extend GCV to construct consistent risk estimators, and thereby obtain distributional convergence of the GCV-corrected predictions in Wasserstein-2 metric. This in particular allows construction of prediction intervals with asymptotically correct coverage conditional on the training data. We also propose an "ensemble trick" whereby the risk for unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles. We empirically validate our theoretical results using both synthetic and real large-scale datasets with practical sketches including CountSketch and subsampled randomized discrete cosine transforms.


The Common Intuition to Transfer Learning Can Win or Lose: Case Studies for Linear Regression

arXiv.org Artificial Intelligence

We study a fundamental transfer learning process from source to target linear regression tasks, including overparameterized settings where there are more learned parameters than data samples. The target task learning is addressed by using its training data together with the parameters previously computed for the source task. We define a transfer learning approach to the target task as a linear regression optimization with a regularization on the distance between the to-be-learned target parameters and the already-learned source parameters. We analytically characterize the generalization performance of our transfer learning approach and demonstrate its ability to resolve the peak in generalization errors in double descent phenomena of the minimum L2-norm solution to linear regression. Moreover, we show that for sufficiently related tasks, the optimally tuned transfer learning approach can outperform the optimally tuned ridge regression method, even when the true parameter vector conforms to an isotropic Gaussian prior distribution. Namely, we demonstrate that transfer learning can beat the minimum mean square error (MMSE) solution of the independent target task. Our results emphasize the ability of transfer learning to extend the solution space to the target task and, by that, to have an improved MMSE solution. We formulate the linear MMSE solution to our transfer learning setting and point out its key differences from the common design philosophy to transfer learning.


Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization

arXiv.org Artificial Intelligence

Machine learning models are typically evaluated by shuffling a set of labeled data, splitting it into training and test sets, and evaluating the model trained on the training set on the test set. This measures how well the model performs on the distribution the model was trained on. However, in practice a model is most commonly not applied to such in-distribution data, but rather to outof-distribution data that is almost always at least slightly different. In order to understand the performance of machine learning methods in practice, it is therefore important to understand how out-of-distribution performance relates to in-distribution performance. While there are settings in which models with similar in-distribution performance have different out-of-distribution performance (McCoy et al., 2020), a series of recent empirical studies have shown that often, the in-distribution and out-of-distribution performances of models are strongly correlated: Recht et al. (2019), Yadav and Bottou (2019), and Miller et al. (2020) constructed new test sets for the popular CIFAR-10, ImageNet, and MNIST image classification problems and for the SQuAD question answering datasets by following the original data collection and labeling process as closely as possible. For CIFAR-10 and ImageNet the performance drops significantly when evaluated on the new test set, indicating that even when following the original data collection and labeling process, a significant distribution shift can occur. In addition, for all four distribution shifts, the in-and out-of-distribution errors are strongly linearly correlated.


Self-Consuming Generative Models Go MAD

arXiv.org Artificial Intelligence

Seismic advances in generative AI algorithms for imagery, text, and other data types has led to the temptation to use synthetic data to train next-generation models. Repeating this process creates an autophagous ("self-consuming") loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis using state-of-the-art generative image models of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and in whether the samples from previousgeneration models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease.


A Blessing of Dimensionality in Membership Inference through Regularization

arXiv.org Artificial Intelligence

Is overparameterization a privacy liability? In this work, we study the effect that the number of parameters has on a classifier's vulnerability to membership inference attacks. We first demonstrate how the number of parameters of a model can induce a privacy--utility trade-off: increasing the number of parameters generally improves generalization performance at the expense of lower privacy. However, remarkably, we then show that if coupled with proper regularization, increasing the number of parameters of a model can actually simultaneously increase both its privacy and performance, thereby eliminating the privacy--utility trade-off. Theoretically, we demonstrate this curious phenomenon for logistic regression with ridge regularization in a bi-level feature ensemble setting. Pursuant to our theoretical exploration, we develop a novel leave-one-out analysis tool to precisely characterize the vulnerability of a linear classifier to the optimal membership inference attack. We empirically exhibit this "blessing of dimensionality" for neural networks on a variety of tasks using early stopping as the regularizer.


The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization

arXiv.org Machine Learning

Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.


The Implicit Regularization of Ordinary Least Squares Ensembles

arXiv.org Machine Learning

Ensemble methods (Breiman, 1996; Amit and Geman, 1997; Josse and Wager, 2016) are an oft-used strategy used successfully in a broad range of problems in machine learning and statistics, in which one combines a number of weak predictors together to obtain one powerful predictor. This is accomplished by giving each weak learner a different view of the training data. Various strategies for changing this training data view exist, among which many are simple sampling-based techniques in which each predictor is (independently) given access to a subsampling the rows (examples) and columns (features) of the training data matrix, such as bagging (Breiman, 1996; B uhlmann and Yu, 2002). Another noteworthy technique is boosting (Freund and Schapire, 1997; Breiman, 1998), in which the training data examples are reweighted adaptively according to how badly they have been misclassified while buliding the ensemble. In this work, we consider the former class of techniques--those that train each weak predictor using an independent subsampling of the training data. Ensemble methods based on independent example and feature subsampling are attractive for two reasons. First, they are computationally appealing in that they are massively parallelizable, and since each member of the ensemble uses only part of the data, they are able to overcome memory limitations faced by other methods (Louppe and Geurts, 2012). Second, ensemble methods are known to achieve lower risk due to the fact that combining several different predictors reduces variance (B uhlmann and Yu, 2002; Wager et al., 2014; Scornet et al., 2015), and empirically they have been found to perform very well. Random forests (Breiman, 2001; Athey et al., 2019; Friedberg et al., 2018), for example, ensemble methods that combine example and feature subsampling with shallow decision tress, remain among the best-performing off-the-shelf machine learning methods available (Cutler and Zhao, 2001; Fern andez-Delgado et al., 2014; Wyner et al., 2017).


Thresholding Graph Bandits with GrAPL

arXiv.org Machine Learning

Systems that recommend products, services, or other attention-targets have become indispensable in the effective curation of information. Such personalization and recommendation techniques have become ubiquitous not only in product/content recommendation and ad placements but also in a wide range of applications like drug testing, spatial sampling, environmental monitoring, and rate adaptation in communication networks; see e.g., Villar et al. (2015); Combes et al. (2014); Srinivas et al. (2010). These are often modeled as sequential decision making or bandit problems, where an algorithm needs to choose among a set of decisions (or arms) sequentially to maximize a desired performance criterion. Recently, an important variant of the bandit problem was proposed by Locatelli et al. (2016) and Gotovos et al. (2013), where the goal is to rapidly identify all arms that are above (and below) a fixed threshold. This thresholding bandit framework, which may be thought of as a version of the combinatorial pure exploration problem (Chen et al., 2014), is useful in various applications like environmental monitoring, where one might want to identify the hypoxic (low-oxygen-content) regions in a lake; like crowd-sourcing, where one might want to keep all workers whose productivity trumps the cost to hire them; or like political polling, where one wants to identify which political candidate individual voting districts prefer.