Javadi, Hamid
A Blessing of Dimensionality in Membership Inference through Regularization
Tan, Jasper, LeJeune, Daniel, Mason, Blake, Javadi, Hamid, Baraniuk, Richard G.
Is overparameterization a privacy liability? In this work, we study the effect that the number of parameters has on a classifier's vulnerability to membership inference attacks. We first demonstrate how the number of parameters of a model can induce a privacy--utility trade-off: increasing the number of parameters generally improves generalization performance at the expense of lower privacy. However, remarkably, we then show that if coupled with proper regularization, increasing the number of parameters of a model can actually simultaneously increase both its privacy and performance, thereby eliminating the privacy--utility trade-off. Theoretically, we demonstrate this curious phenomenon for logistic regression with ridge regularization in a bi-level feature ensemble setting. Pursuant to our theoretical exploration, we develop a novel leave-one-out analysis tool to precisely characterize the vulnerability of a linear classifier to the optimal membership inference attack. We empirically exhibit this "blessing of dimensionality" for neural networks on a variety of tasks using early stopping as the regularizer.
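The following is a minimal, illustrative sketch of the trade-off discussed above, not the paper's leave-one-out analysis or its optimal attack: a simple loss-thresholding membership inference attack against ridge-regularized (L2-penalized) logistic regression on synthetic data. All sizes and the regularization grid are arbitrary choices for illustration; the intended takeaway is only that stronger regularization tends to shrink the train/test loss gap that such attacks exploit.

```python
# A minimal sketch (not the paper's leave-one-out analysis): a simple
# loss-thresholding membership inference attack on ridge-regularized
# (L2-penalized) logistic regression. Stronger regularization (smaller C)
# typically shrinks the train/test loss gap and hence the attack's AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 200, 500                      # overparameterized: p > n
X = rng.normal(size=(2 * n, p))
w_true = rng.normal(size=p) / np.sqrt(p)
y = (X @ w_true + 0.5 * rng.normal(size=2 * n) > 0).astype(int)

X_train, y_train = X[:n], y[:n]      # "members"
X_out, y_out = X[n:], y[n:]          # "non-members"

def per_example_loss(model, X, y):
    # Cross-entropy loss of each example under the trained classifier.
    proba = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(proba, 1e-12, None))

for C in [1e-2, 1e-1, 1.0, 10.0]:    # C is the inverse ridge strength
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    losses = np.concatenate([per_example_loss(clf, X_train, y_train),
                             per_example_loss(clf, X_out, y_out)])
    membership = np.concatenate([np.ones(n), np.zeros(n)])
    # The attacker predicts "member" when the loss is small.
    auc = roc_auc_score(membership, -losses)
    print(f"C={C:6.2f}  attack AUC={auc:.3f}")
```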
The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization
LeJeune, Daniel, Javadi, Hamid, Baraniuk, Richard G.
Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
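As a small illustration of the "$\eta$-trick" mentioned above (a generic sketch, not the paper's adaptive-dropout correspondence): using $|w_j| = \min_{\eta_j > 0} \left( w_j^2/(2\eta_j) + \eta_j/2 \right)$, an $\ell_1$ penalty can be minimized by alternating a weighted ridge solve with the reweighting $\eta_j = |w_j|$, which drives many coefficients to zero.

```python
# A minimal sketch of the eta-trick for the l1 penalty: alternate between a
# per-coordinate weighted ridge solve and the update eta_j = |w_j|. The
# resulting iteratively reweighted quadratic optimization yields sparse w.
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 50, 5
X = rng.normal(size=(n, p))
w_true = np.zeros(p); w_true[:s] = 3.0
y = X @ w_true + 0.1 * rng.normal(size=n)

lam, eps = 1.0, 1e-8
eta = np.ones(p)
for _ in range(100):
    # Weighted ridge step: argmin_w 0.5*||y - Xw||^2 + lam * sum_j w_j^2/(2*eta_j)
    D = np.diag(lam / (eta + eps))
    w = np.linalg.solve(X.T @ X + D, X.T @ y)
    eta = np.abs(w)                      # the reweighting step

print("nonzero estimated coefficients:", np.sum(np.abs(w) > 1e-3))
print("indices of the largest coefficients:", np.argsort(-np.abs(w))[:s])
```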
Porcupine Neural Networks: Approximating Neural Network Landscapes
Feizi, Soheil, Javadi, Hamid, Zhang, Jesse, Tse, David
Neural networks have been used prominently in several machine learning and statistics applications. In general, the underlying optimization of neural networks is non-convex, which makes analyzing their performance challenging. In this paper, we take another approach to this problem by constraining the network such that the corresponding optimization landscape has good theoretical properties without significantly compromising performance. In particular, for two-layer neural networks we introduce Porcupine Neural Networks (PNNs) whose weight vectors are constrained to lie over a finite set of lines. We show that most local optima of PNN optimizations are global, while we characterize the regions where bad local optimizers may exist. Moreover, our theoretical and empirical results suggest that an unconstrained neural network can be approximated using a polynomially-large PNN.
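A toy numpy sketch of the PNN constraint (illustrative only, not the paper's training procedure or landscape analysis): each hidden unit's weight vector is forced onto a fixed line, $w_i = a_i u_i$ for a fixed unit vector $u_i$, so that only the scalars $a_i$ and the output weights are trained.

```python
# A minimal numpy sketch of the PNN constraint: hidden weight vectors lie on
# fixed lines, w_i = a_i * u_i, and gradient descent trains only the scalar
# scales a_i and the output weights v_i of a two-layer ReLU network.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 8
X = rng.normal(size=(n, d))
y = np.maximum(X[:, 0], 0) - 0.5 * np.maximum(X[:, 1], 0)   # a simple target

U = rng.normal(size=(k, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # fixed lines (unit vectors)
a = rng.normal(size=k)                          # trainable scales along each line
v = rng.normal(size=k) * 0.1                    # trainable output weights

lr = 1e-2
for _ in range(2000):
    pre = X @ (U.T * a)            # n x k pre-activations: a_i * (u_i . x)
    h = np.maximum(pre, 0)         # ReLU
    resid = h @ v - y
    # Gradients of 0.5 * mean squared error with respect to v and a.
    grad_v = h.T @ resid / n
    grad_a = ((resid[:, None] * (pre > 0) * (X @ U.T)) * v).mean(axis=0)
    v -= lr * grad_v
    a -= lr * grad_a

print("train MSE:", np.mean((np.maximum(X @ (U.T * a), 0) @ v - y) ** 2))
```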
The Implicit Regularization of Ordinary Least Squares Ensembles
LeJeune, Daniel, Javadi, Hamid, Baraniuk, Richard G.
Ensemble methods (Breiman, 1996; Amit and Geman, 1997; Josse and Wager, 2016) are an oft-used strategy, applied successfully to a broad range of problems in machine learning and statistics, in which one combines a number of weak predictors to obtain one powerful predictor. This is accomplished by giving each weak learner a different view of the training data. Various strategies for changing this training data view exist; many are simple sampling-based techniques in which each predictor is (independently) given access to a subsampling of the rows (examples) and columns (features) of the training data matrix, such as bagging (Breiman, 1996; Bühlmann and Yu, 2002). Another noteworthy technique is boosting (Freund and Schapire, 1997; Breiman, 1998), in which the training data examples are reweighted adaptively according to how badly they have been misclassified while building the ensemble. In this work, we consider the former class of techniques--those that train each weak predictor using an independent subsampling of the training data. Ensemble methods based on independent example and feature subsampling are attractive for two reasons. First, they are computationally appealing: they are massively parallelizable, and since each member of the ensemble uses only part of the data, they are able to overcome memory limitations faced by other methods (Louppe and Geurts, 2012). Second, ensemble methods are known to achieve lower risk because combining several different predictors reduces variance (Bühlmann and Yu, 2002; Wager et al., 2014; Scornet et al., 2015), and empirically they have been found to perform very well. Random forests (Breiman, 2001; Athey et al., 2019; Friedberg et al., 2018), for example, which combine example and feature subsampling with shallow decision trees, remain among the best-performing off-the-shelf machine learning methods available (Cutler and Zhao, 2001; Fernández-Delgado et al., 2014; Wyner et al., 2017).
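As a concrete (and deliberately simplified) sketch of this class of methods, the snippet below averages ordinary least squares fits, each trained on an independent random subsample of rows and columns of the training data; the subsample sizes and ensemble size are arbitrary illustrative choices, not the paper's analysis setting.

```python
# A minimal sketch of the ensemble studied here: each member is an ordinary
# least squares fit on an independent random subsample of rows (examples)
# and columns (features), and the ensemble prediction is the average.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 100, 200           # examples, features, ensemble members
X = rng.normal(size=(n, p))
beta = rng.normal(size=p) / np.sqrt(p)
y = X @ beta + 0.5 * rng.normal(size=n)
X_test = rng.normal(size=(1000, p))
y_test = X_test @ beta

preds = np.zeros(len(X_test))
for _ in range(k):
    rows = rng.choice(n, size=n // 2, replace=False)    # example subsampling
    cols = rng.choice(p, size=p // 5, replace=False)    # feature subsampling
    b_hat, *_ = np.linalg.lstsq(X[np.ix_(rows, cols)], y[rows], rcond=None)
    preds += X_test[:, cols] @ b_hat / k                # average of the OLS fits

print("ensemble test MSE:", np.mean((preds - y_test) ** 2))
print("full OLS test MSE:",
      np.mean((X_test @ np.linalg.lstsq(X, y, rcond=None)[0] - y_test) ** 2))
```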
A Hessian Based Complexity Measure for Deep Networks
Javadi, Hamid, Balestriero, Randall, Baraniuk, Richard
Deep (neural) networks have been applied productively in a wide range of supervised and unsupervised learning tasks. Unlike classical machine learning algorithms, deep networks typically operate in the overparameterized regime, where the number of parameters is larger than the number of training data points. Consequently, understanding the generalization properties and role of (explicit or implicit) regularization in these networks is of great importance. Inspired by the seminal work of Donoho and Grimes in manifold learning, we develop a new measure for the complexity of the function generated by a deep network based on the integral of the norm of the tangent Hessian. This complexity measure can be used to quantify the irregularity of the function a deep network fits to training data or as a regularization penalty for deep network learning. Indeed, we show that the oft-used heuristic of data augmentation imposes an implicit Hessian regularization during learning. We demonstrate the utility of our new complexity measure through a range of learning experiments.
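A rough numerical sketch of the kind of quantity involved (a simplification, not the paper's tangent-Hessian estimator): average the Frobenius norm of the input Hessian of a scalar network output over the data, with the Hessian approximated by central finite differences.

```python
# A simplified sketch: approximate the complexity of a scalar network output
# f(x) by the average Frobenius norm of its input Hessian over the data,
# using central finite differences. The network here is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)
d, h = 5, 32
W1, b1 = rng.normal(size=(h, d)) / np.sqrt(d), np.zeros(h)
W2 = rng.normal(size=h) / np.sqrt(h)

def f(x):                                   # a small two-layer tanh network
    return W2 @ np.tanh(W1 @ x + b1)

def hessian_frobenius(x, eps=1e-3):
    # Central-difference estimate of the full d x d input Hessian at x.
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
    return np.linalg.norm(H)

X = rng.normal(size=(100, d))
complexity = np.mean([hessian_frobenius(x) for x in X])
print("average Hessian norm over the data:", complexity)
```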
False Discovery Rate Control via Debiased Lasso
Javanmard, Adel, Javadi, Hamid
We consider the problem of variable selection in high-dimensional statistical models where the goal is to report a set of variables, out of many predictors $X_1, \dotsc, X_p$, that are relevant to a response of interest. For the linear high-dimensional model, where the number of parameters exceeds the number of samples $(p>n)$, we propose a procedure for variable selection and prove that it controls the \emph{directional} false discovery rate (FDR) below a pre-assigned significance level $q\in [0,1]$. We further analyze the statistical power of our framework and show that for designs with subgaussian rows and a common precision matrix $\Omega\in\mathbb{R}^{p\times p}$, if the minimum nonzero parameter $\theta_{\min}$ satisfies $$\sqrt{n} \theta_{\min} - \sigma \sqrt{2(\max_{i\in [p]}\Omega_{ii})\log\left(\frac{2p}{qs_0}\right)} \to \infty\,,$$ then this procedure achieves asymptotic power one. Our framework is built upon the debiasing approach and assumes the standard condition $s_0 = o(\sqrt{n}/(\log p)^2)$, where $s_0$ indicates the number of true positives among the $p$ features. Notably, this framework achieves exact directional FDR control without any assumption on the amplitude of the unknown regression parameters, and does not require any knowledge of the distribution of covariates or the noise level. We test our method in synthetic and real data experiments to assess its performance and to corroborate our theoretical results.
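A heavily simplified sketch of debiasing followed by Benjamini-Hochberg selection (not the paper's exact directional-FDR procedure): it assumes an i.i.d. standard Gaussian design so that the precision matrix is the identity, and it uses the naive plug-in $M = I$ and the true noise level $\sigma$, both of which would need to be estimated in practice.

```python
# A simplified sketch: lasso fit, a naive debiasing step with M = I (valid
# here because the design has identity covariance), z-scores assuming a
# known noise level, and Benjamini-Hochberg selection at level q.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s0, sigma, q = 300, 600, 15, 1.0, 0.1
X = rng.normal(size=(n, p))
theta = np.zeros(p); theta[:s0] = 0.5
y = X @ theta + sigma * rng.normal(size=n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_

theta_d = theta_hat + X.T @ (y - X @ theta_hat) / n     # debiased estimate (M = I)
z = np.sqrt(n) * theta_d / sigma
pvals = 2 * norm.sf(np.abs(z))

# Benjamini-Hochberg step-up procedure at level q.
order = np.argsort(pvals)
passed = np.nonzero(pvals[order] <= q * np.arange(1, p + 1) / p)[0]
selected = order[: passed[-1] + 1] if len(passed) else np.array([], dtype=int)

fdp = np.sum(theta[selected] == 0) / max(len(selected), 1)
power = np.sum(theta[selected] != 0) / s0
print(f"selected {len(selected)} variables, FDP={fdp:.3f}, power={power:.3f}")
```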
An Instability in Variational Inference for Topic Models
Ghorbani, Behrooz, Javadi, Hamid, Montanari, Andrea
Topic models are Bayesian models that are frequently used to capture the latent structure of certain corpora of documents or images. Each data element in such a corpus (for instance, each item in a collection of scientific articles) is regarded as a convex combination of a small number of vectors corresponding to `topics' or `components'. The weights are assumed to have a Dirichlet prior distribution. The standard approach to approximating the posterior is to use variational inference algorithms, and in particular a mean field approximation. We show that this approach suffers from an instability that can produce misleading conclusions. Namely, for certain regimes of the model parameters, variational inference outputs a non-trivial decomposition into topics. However, for the same parameter values, the data contain no actual information about the true decomposition, and hence the output of the algorithm is uncorrelated with the true topic decomposition. Among other consequences, the estimated posterior mean is significantly wrong, and estimated Bayesian credible regions do not achieve the nominal coverage. We discuss how this instability is remedied by more accurate mean field approximations.
Tensor Biclustering
Feizi, Soheil, Javadi, Hamid, Tse, David
Consider a dataset where data is collected on multiple features of multiple individuals at multiple time points. This type of data can be represented as a three-dimensional individual/feature/time tensor and has become increasingly prominent in various areas of science. The tensor biclustering problem is to compute a subset of individuals and a subset of features whose signal trajectories over time lie in a low-dimensional subspace, modeling similarity among the signal trajectories while allowing different scalings across different individuals or different features. We study the information-theoretic limit of this problem under a generative model. Moreover, we propose an efficient spectral algorithm to solve the tensor biclustering problem and analyze its achievability bound in an asymptotic regime. Finally, we show the efficiency of our proposed method on several synthetic and real datasets.
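A toy spectral sketch on a planted model (not the paper's algorithm or its achievability analysis): plant a rank-one trajectory on a subset of individuals and features, then recover the two index sets from the leading left singular vectors of the individual-mode and feature-mode unfoldings.

```python
# A simplified spectral sketch: plant a rank-1 signal trajectory on a subset
# of individuals and features, unfold the tensor along the individual and
# feature modes, and read the biclusters off the largest entries of the
# leading left singular vector of each unfolding.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m, k1, k2 = 60, 50, 40, 10, 8          # individuals, features, times, bicluster sizes
T = rng.normal(size=(n1, n2, m))

u = np.zeros(n1); u[:k1] = 1.0                  # planted individuals
v = np.zeros(n2); v[:k2] = 1.0                  # planted features
w = rng.normal(size=m)                          # shared signal trajectory
T += 2.0 * u[:, None, None] * v[None, :, None] * w[None, None, :]

def top_indices(unfolding, k):
    # Leading left singular vector of a mode unfolding; its largest entries
    # (in magnitude) point to the rows carrying the planted trajectory.
    left = np.linalg.svd(unfolding, full_matrices=False)[0][:, 0]
    return set(np.argsort(-np.abs(left))[:k])

J1 = top_indices(T.reshape(n1, -1), k1)                       # individual-mode unfolding
J2 = top_indices(T.transpose(1, 0, 2).reshape(n2, -1), k2)    # feature-mode unfolding

print("recovered individuals correctly:", J1 == set(range(k1)))
print("recovered features correctly:", J2 == set(range(k2)))
```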