AITopics

2102.08907

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)

Genre:

Research Report (0.50)
Instructional Material (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)

Lounici, Karim, Meziani, Katia, Riu, Benjamin

Muddling Labels for Regularization, a novel approach to generalization

arXiv.org Artificial Intelligence Feb-17-2021

Generalization is a central problem in Machine Learning. Indeed, most prediction methods require careful calibration of hyperparameters, usually carried out on a hold-out validation dataset, to achieve generalization. The main goal of this paper is to introduce a novel approach to achieve generalization without any data splitting, based on a new risk measure that directly quantifies a model's tendency to overfit. To convey the intuition and advantages of this new approach, we illustrate it in the simple linear regression model ($Y=X\beta+\xi$), where we develop a new criterion. We highlight how this criterion is a good proxy for the true generalization risk. Next, we derive different procedures which tackle several structures simultaneously (correlation, sparsity, ...). Notably, these procedures concomitantly train the model and calibrate the hyperparameters. In addition, these procedures can be implemented via classical gradient descent methods when the criterion is differentiable w.r.t. the hyperparameters. Our numerical experiments reveal that our procedures are computationally feasible and compare favorably to the popular approach (Ridge, LASSO and Elastic-Net combined with grid-search cross-validation) in terms of generalization. They also outperform the baseline on two additional tasks: estimation and support recovery of $\beta$. Moreover, our procedures do not require any expertise for the calibration of the initial parameters, which remain the same for all the datasets we experimented on.

generalization, procedure, regularization, (17 more...)

arXiv.org Artificial Intelligence

2102.08769

Country:

South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report > Promising Solution (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Analysis of feature learning in weight-tied autoencoders via the mean field lens

Nguyen, Phan-Minh

Autoencoders are among the earliest introduced nonlinear models for unsupervised learning. Although they are widely adopted beyond research, it has been a longstanding open problem to understand mathematically the feature extraction mechanism that trained nonlinear autoencoders provide. In this work, we make progress on this problem by analyzing a class of two-layer weight-tied nonlinear autoencoders in the mean field framework. Upon a suitable scaling, in the regime of a large number of neurons, the models trained with stochastic gradient descent are shown to admit a mean field limiting dynamics. This limiting description reveals an asymptotically precise picture of feature learning by these models: their training dynamics exhibit different phases that correspond to the learning of different principal subspaces of the data, with varying degrees of nonlinear shrinkage dependent on the $\ell_{2}$-regularization and stopping time. While we prove these results under an idealized assumption of (correlated) Gaussian data, experiments on real-life data demonstrate an interesting match with the theory. The autoencoder setup of interest poses a nontrivial mathematical challenge to proving these results. In this setup, the "Lipschitz" constants of the models grow with the data dimension $d$. Consequently, an adaptation of previous analyses requires a number of neurons $N$ that is at least exponential in $d$. Our main technical contribution is a new argument which proves that the required $N$ is only polynomial in $d$. We conjecture that $N\gg d$ is sufficient and that $N$ is necessarily larger than a data-dependent intrinsic dimension, a behavior that is fundamentally different from previously studied setups.

autoencoder, nullnull, xnull, (14 more...)

2102.08373

Country:

North America > United States > New York (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Zhang, Yiliang, Bu, Zhiqi

Efficient Designs of SLOPE Penalty Sequences in Finite Dimension

In linear regression, SLOPE is a new convex analysis method that generalizes the Lasso via the sorted L1 penalty: larger fitted coefficients are penalized more heavily. This magnitude-dependent regularization requires a penalty sequence $\lambda$ as input, instead of a scalar penalty as in the Lasso case, which makes the design extremely expensive in computation. In this paper, we propose two efficient algorithms to design the possibly high-dimensional SLOPE penalty in order to minimize the mean squared error. For Gaussian data matrices, we propose a first-order Projected Gradient Descent (PGD) under the Approximate Message Passing regime. For general data matrices, we present a zeroth-order Coordinate Descent (CD) to design a sub-class of SLOPE, referred to as the k-level SLOPE. Our CD allows a useful trade-off between accuracy and computation speed. We demonstrate the performance of SLOPE with our designs via extensive experiments on synthetic data and real-world datasets.
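The sorted L1 penalty mentioned in the abstract can be illustrated with a minimal sketch. The function name `slope_penalty` and the example values are assumptions made for illustration; the sketch only evaluates the penalty and does not implement the PGD or CD design algorithms the paper proposes.

```python
def slope_penalty(beta, lam):
    """Sorted-L1 penalty: sum_i lam[i] * |beta|_(i), where |beta|_(1) >= |beta|_(2) >= ..."""
    if len(beta) != len(lam):
        raise ValueError("beta and lam must have the same length")
    if any(l < 0 for l in lam) or any(lam[i] < lam[i + 1] for i in range(len(lam) - 1)):
        raise ValueError("lam must be non-negative and non-increasing")
    # Pair the largest penalty with the largest coefficient magnitude, and so on.
    magnitudes = sorted((abs(b) for b in beta), reverse=True)
    return sum(l * m for l, m in zip(lam, magnitudes))


beta = [1.0, -3.0, 2.0]
decreasing = slope_penalty(beta, [3.0, 2.0, 1.0])  # 3*3 + 2*2 + 1*1
constant = slope_penalty(beta, [0.5, 0.5, 0.5])    # reduces to 0.5 * ||beta||_1
```

With a constant sequence the penalty coincides with the Lasso penalty, which is the sense in which SLOPE generalizes the Lasso.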

efficient design, projection, slope penalty sequence, (10 more...)

2102.07211

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
North America > United States > Pennsylvania (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Mishchenko, Konstantin, Wang, Bokun, Kovalev, Dmitry, Richtárik, Peter

IntSGD: Floatless Compression of Stochastic Gradients

We propose a family of lossy integer compressions for Stochastic Gradient Descent (SGD) that do not communicate a single float. This is achieved by multiplying floating-point vectors by a number known to every device and then rounding to an integer. Our theory shows that the iteration complexity of SGD does not change up to constant factors when the vectors are scaled properly. Moreover, this holds for both convex and non-convex functions, with and without overparameterization. In contrast to other compression-based algorithms, ours preserves the convergence rate of SGD even on non-smooth problems. Finally, we show that when the data is significantly heterogeneous, it may become increasingly hard to keep the integers bounded, and we propose an alternative algorithm, IntDIANA, to solve this type of problem.
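The scale-then-round idea in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names, the fixed scaling factor `alpha`, and the choice of randomized (unbiased) rounding are assumptions, and the abstract does not say how the scaling factor should be chosen.

```python
import math
import random


def int_compress(vec, alpha, rng):
    """Scale by alpha, then round each entry to a neighbouring integer at random
    so that the rounding is unbiased in expectation (an assumed rounding rule)."""
    out = []
    for v in vec:
        scaled = alpha * v
        low = math.floor(scaled)
        # Round up with probability equal to the fractional part.
        out.append(low + (1 if rng.random() < scaled - low else 0))
    return out


def aggregate(int_vectors, alpha):
    """Sum the integer messages coordinate-wise, then undo the scaling and average."""
    n = len(int_vectors)
    sums = [sum(col) for col in zip(*int_vectors)]
    return [s / (n * alpha) for s in sums]


rng = random.Random(0)
alpha = 1000.0  # shared scaling factor, assumed known to every device
local_grads = [[0.1234, -0.5678], [0.2345, 0.6789]]  # one gradient per device
messages = [int_compress(g, alpha, rng) for g in local_grads]
avg = aggregate(messages, alpha)
exact = [sum(col) / len(local_grads) for col in zip(*local_grads)]
```

Only integers cross the network in this sketch, and each coordinate of `avg` differs from the exact average by less than `1 / alpha`.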

floatless compression, intgd-mavg, intsgd, (13 more...)

2102.08374

Country:

North America > United States > California (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)

How to Learn when Data Reacts to Your Model: Performative Gradient Descent

Izzo, Zachary, Ying, Lexing, Zou, James

Performative distribution shift captures the setting where the choice of which ML model is deployed changes the data distribution. For example, a bank which uses the number of open credit lines to determine a customer's risk of default on a loan may induce customers to open more credit lines in order to improve their chances of being approved. Because of the interactions between the model and the data distribution, finding the optimal model parameters is challenging. Works in this area have focused on finding stable points, which can be far from optimal. Here we introduce performative gradient descent (PerfGD), which is the first algorithm that provably converges to the performatively optimal point. PerfGD explicitly captures how changes in the model affect the data distribution and is simple to use. We support our findings with theory and experiments.
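The gap between a stable point and the performatively optimal point can be seen in a one-dimensional toy problem. Everything in this sketch is an assumption chosen for illustration, not the paper's algorithm or experiments: the data mean responds linearly to the deployed parameter, m(theta) = a + b*theta, the per-sample loss is l(z; theta) = -theta*z + (lam/2)*theta^2, and the sensitivity dm/dtheta is estimated by a finite difference over the two most recent deployments.

```python
def mean_response(theta, a=1.0, b=0.25):
    """Hypothetical data response: the data mean shifts linearly with the deployed theta."""
    return a + b * theta


def perf_gd(steps=500, lr=0.1, lam=1.0):
    """Gradient descent on PL(theta) = -theta * m(theta) + (lam / 2) * theta**2,
    with dm/dtheta estimated from the two most recent deployments."""
    theta_prev, theta = 0.0, 0.1
    m_prev, m_cur = mean_response(theta_prev), mean_response(theta)
    slope = (m_cur - m_prev) / (theta - theta_prev)
    for _ in range(steps):
        # Direct term (-m + lam * theta) plus the term caused by the distribution shift.
        grad = -m_cur + lam * theta - theta * slope
        theta_prev, m_prev = theta, m_cur
        theta = theta - lr * grad
        m_cur = mean_response(theta)
        if abs(theta - theta_prev) > 1e-8:  # refresh the estimate only when well conditioned
            slope = (m_cur - m_prev) / (theta - theta_prev)
    return theta


def repeated_minimization(steps=100, lam=1.0):
    """Retrain on the data induced by the last model while ignoring the shift;
    this converges to a stable point rather than the optimum."""
    theta = 0.0
    for _ in range(steps):
        theta = mean_response(theta) / lam
    return theta


theta_opt = perf_gd()
theta_stable = repeated_minimization()
```

In this toy setting the performative optimum is a / (lam - 2b) = 2, while repeated retraining settles at the stable point a / (lam - b) = 4/3.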

converge, perfgd, performative loss, (15 more...)

2102.07698

Country:

Asia > Middle East > Republic of Türkiye > Samsun Province > Samsun (0.04)
Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Banking & Finance > Credit (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

Liang, Jason, Kelly, Keith

Training Stacked Denoising Autoencoders for Representation Learning

arXiv.org Artificial Intelligence Feb-16-2021

We implement stacked denoising autoencoders, a class of neural networks that are capable of learning powerful representations of high dimensional data. We describe stochastic gradient descent for unsupervised training of autoencoders, as well as a novel genetic algorithm based approach that makes use of gradient information. We analyze the performance of both optimization algorithms and also the representation learning ability of the autoencoder when it is trained on standard image classification datasets. Autoencoders are a method for performing representation learning, an unsupervised pretraining process during which a more useful representation of the input data is automatically determined. Representation learning is important in machine learning since "the performance of machine learning methods is heavily dependent on the choice of data representation (or features) in which they are applied" [1]. For many supervised classification tasks, the high dimensionality of the input data means that the classifier requires an enormous number of training examples in order to generalize well and not overfit. Autoencoders are one such representation learning tool. An autoencoder is a neural network with a single hidden layer in which the output layer and the input layer have the same size. Then we have a neural network as shown in Figure 1. The weight matrix of the decoding stage is the transpose of the weight matrix of the encoding stage, in order to reduce the number of parameters to learn. After an autoencoder is trained, its decoding stage is discarded and the encoding stage is used to transform the training input examples as a preprocessing step.
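The tied-weight structure described above can be sketched as a single forward pass. This is a minimal illustration under stated assumptions: the sigmoid activation, the masking noise, the layer sizes, and all function names are chosen for the example and are not taken from the paper, and no training step is shown.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def encode(W, b, x):
    """Hidden code h = sigmoid(W x + b); W has shape (hidden, visible)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]


def decode(W, c, h):
    """Reconstruction sigmoid(W^T h + c): the decoder reuses W transposed."""
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h))) + c[i]) for i in range(len(c))]


def corrupt(x, drop_prob, rng):
    """Masking noise (assumed): zero each input entry independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


rng = random.Random(0)
n_visible, n_hidden = 6, 3
W = [[rng.uniform(-0.1, 0.1) for _ in range(n_visible)] for _ in range(n_hidden)]
b, c = [0.0] * n_hidden, [0.0] * n_visible

x = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
h = encode(W, b, corrupt(x, 0.3, rng))
x_hat = decode(W, c, h)
# The reconstruction is scored against the clean input, not the corrupted one.
loss = sum((xi - ri) ** 2 for xi, ri in zip(x, x_hat))
```

Because the decoder reuses `W`, the model stores one weight matrix instead of two, which is the parameter saving the text refers to.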

algorithm, autoencoder, representation, (16 more...)

arXiv.org Artificial Intelligence

2102.08012

Country: Europe > France (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Lin, Wu, Nielsen, Frank, Khan, Mohammad Emtiyaz, Schmidt, Mark

Tractable structured natural gradient descent using local parameterizations

arXiv.org Machine Learning Feb-15-2021

Natural-gradient descent on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to complicated inverse Fisher-matrix computations. We address this issue for optimization, inference, and search problems by using local-parameter coordinates. Our method generalizes an existing evolutionary-strategy method, recovers Newton and Riemannian-gradient methods as special cases, and also yields new tractable natural-gradient algorithms for learning flexible covariance structures of Gaussian and Wishart-based distributions. We show results on a range of applications in deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods via local parameterizations.

local parameterization, natural gradient descent, parameterization, (13 more...)

2102.07405

Country:

Asia > Japan (0.04)
North America > Canada > British Columbia (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)

arXiv.org Machine Learning Feb-15-2021

A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm for Bilevel Optimization

Khanduri, Prashant, Zeng, Siliang, Hong, Mingyi, Wai, Hoi-To, Wang, Zhaoran, Yang, Zhuoran

This paper proposes a new algorithm, the Momentum-assisted Single-timescale Stochastic Approximation (MSTSA), for tackling unconstrained bilevel optimization problems. We focus on bilevel problems where the lower level subproblem is strongly convex. Unlike prior works which rely on two-timescale or double-loop techniques that track the optimal solution to the lower level subproblem, we design a stochastic momentum-assisted gradient estimator for the upper level subproblem's updates. The latter allows us to gradually control the error in stochastic gradient updates due to an inaccurate solution to the lower level subproblem. We show that if the upper objective function is smooth but possibly non-convex (resp. strongly convex), MSTSA requires $\mathcal{O}(\epsilon^{-2})$ (resp. $\mathcal{O}(\epsilon^{-1})$) iterations (each using constant samples) to find an $\epsilon$-stationary (resp. $\epsilon$-optimal) solution. This achieves the best-known guarantees for stochastic bilevel problems. We validate our theoretical results by showing the efficiency of the MSTSA algorithm on hyperparameter optimization and data hyper-cleaning problems.
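The single-loop structure can be illustrated on a deterministic toy bilevel problem with a strongly convex lower level. This sketch is not MSTSA: it omits the stochastic, momentum-assisted gradient estimator entirely, and the objectives, step sizes, and function name are assumptions chosen so that the answer can be checked by hand.

```python
def single_loop_bilevel(a=2.0, alpha=0.1, beta=0.1, steps=500):
    """Toy bilevel problem (assumed for illustration):
         upper level: min_x f(x, y*(x)),  f(x, y) = 0.5 * (y - 1)**2 + 0.5 * x**2
         lower level: y*(x) = argmin_y g(x, y),  g(x, y) = 0.5 * (y - a * x)**2
    """
    x, y = 0.0, 0.0
    for _ in range(steps):
        # One gradient step on the lower level instead of solving it exactly.
        y -= beta * (y - a * x)
        # Hypergradient from the implicit function theorem, with y standing in for y*(x):
        # dF/dx = df/dx - (d2g/dxdy) * (d2g/dy2)**(-1) * df/dy = x + a * (y - 1)
        x -= alpha * (x + a * (y - 1.0))
    return x, y


x_star, y_star = single_loop_bilevel()
```

For a = 2 the exact solution is x = a / (1 + a^2) = 0.4 with y = a * x = 0.8, and the iterates approach it even though the lower level is never solved to completion inside the loop.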

algorithm, inequality, lower-level problem, (14 more...)

2102.07367

Country:

Asia > China > Hong Kong (0.04)
North America > United States > New York (0.04)
North America > United States > Minnesota (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

arXiv.org Artificial Intelligence Feb-15-2021

And/or trade-off in artificial neurons: impact on adversarial robustness

Fontana, Alessandro

Around 2013, neural networks reached human-level performance in image classification tasks, giving rise to the phenomenon of deep learning as we know it today. But accuracy is not everything, and researchers started to wonder how robust these models are, how much neural networks really "understand" and what they actually "see". In an attempt to (also) answer this question, Szegedy et al. (2013) described two interesting aspects of deep neural networks. The first aspect concerned the way in which networks store information, but it was the second aspect which immediately attracted the attention of the scientific community: given a correctly classified image, it is possible to add artificially crafted noise, imperceptible to the human eye, to create a second image which is very likely to be misclassified. This problem is intertwined with that of understanding how information is encoded in neural networks (Samek et al., 2017). In supervised learning guided by stochastic gradient descent (SGD), the features encoded in hidden neurons are learned as a byproduct of the reduction of the classification error in the output layer. The statistical properties of features encoded by intermediate neurons remain poorly understood, as does their contribution to the classification performance of the network. In recent years, the topic of adversarial examples has grown to become a field in its own right, one that is currently witnessing an arms race in which the attackers are ahead: for every defence proposed, new ways to get around it and new attacks are invented on a weekly basis (Yuan et al., 2018).

neural network, neuron, perturbation, (15 more...)

arXiv.org Artificial Intelligence

2102.07389

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)