AITopics

2311.06839

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Chen, Thomas, Ewald, Patricia Muñoz

Non-approximability of constructive global $\mathcal{L}^2$ minimizers by gradient descent in Deep Learning

arXiv.org Machine LearningNov-12-2023

We analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL) networks. In particular, we prove that the globally minimizing weights and biases for the $\mathcal{L}^2$ cost obtained constructively in [Chen-Munoz Ewald 2023] for underparametrized ReLU DL networks can generically not be approximated via the gradient descent flow. We therefore conclude that the method introduced in [Chen-Munoz Ewald 2023] is disjoint from the gradient descent method.

artificial intelligence, machine learning, minimizer, (14 more...)

2311.07065

Country:

North America > United States > Texas > Travis County > Austin (0.15)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Spain > Basque Country > Biscay Province > Bilbao (0.04)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

arXiv.org Machine LearningNov-11-2023

A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity

Li, Hongkang, Wang, Meng, Liu, Sijia, Chen, Pin-yu

Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical learning and generalization analysis is mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.

artificial intelligence, machine learning, softmax, (13 more...)

2302.06015

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > Michigan (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

arXiv.org Machine LearningNov-10-2023

A statistical perspective on algorithm unrolling models for inverse problems

Atchade, Yves, Liu, Xinru, Zhu, Qiuyun

We consider inverse problems where the conditional distribution of the observation ${\bf y}$ given the latent variable of interest ${\bf x}$ (also known as the forward model) is known, and we have access to a data set in which multiple instances of ${\bf x}$ and ${\bf y}$ are both observed. In this context, algorithm unrolling has become a very popular approach for designing state-of-the-art deep neural network architectures that effectively exploit the forward model. We analyze the statistical complexity of the gradient descent network (GDN), an algorithm unrolling architecture driven by proximal gradient descent. We show that the unrolling depth needed for the optimal statistical performance of GDNs is of order $\log(n)/\log(\varrho_n^{-1})$, where $n$ is the sample size, and $\varrho_n$ is the convergence rate of the corresponding gradient descent algorithm. We also show that when the negative log-density of the latent variable ${\bf x}$ has a simple proximal operator, then a GDN unrolled at depth $D'$ can solve the inverse problem at the parametric rate $O(D'/\sqrt{n})$. Our results thus also suggest that algorithm unrolling models are prone to overfitting as the unrolling depth $D'$ increases. We provide several examples to illustrate these results.

algorithm unrolling model, def, inverse problem, (13 more...)

2311.06395

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

arXiv.org Machine LearningNov-10-2023

Solving Kernel Ridge Regression with Gradient-Based Optimization Methods

Allerbo, Oskar

Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\ell_1$ and $\ell_\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\ell_\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. We show theoretically and empirically how the $\ell_1$ and $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively.

artificial intelligence, machine learning, regression, (17 more...)

2306.16838

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)

Campbell, Andrew, Liu, Hang, Woldemariam, Leah, Scaglione, Anna

MALCOM-PSGD: Inexact Proximal Stochastic Gradient Descent for Communication-Efficient Decentralized Machine Learning

communication-efficient decentralized machine learning, inexact proximal stochastic gradient descent, malcom-psgd

Recent research indicates that frequent model communication stands as a major bottleneck to the efficiency of decentralized machine learning (ML), particularly for large-scale and over-parameterized neural networks (NNs). In this paper, we introduce MALCOM-PSGD, a new decentralized ML algorithm that strategically integrates gradient compression techniques with model sparsification. MALCOM-PSGD leverages proximal stochastic gradient descent to handle the non-smoothness resulting from the $\ell_1$ regularization in model sparsification. Furthermore, we adapt vector source coding and dithering-based quantization for compressed gradient communication of sparsified models. Our analysis shows that decentralized proximal stochastic gradient descent with compressed communication has a convergence rate of $\mathcal{O}\left(\ln(t)/\sqrt{t}\right)$ assuming a diminishing learning rate and where $t$ denotes the number of iterations. Numerical results verify our theoretical findings and demonstrate that our method reduces communication costs by approximately $75\%$ when compared to the state-of-the-art method.

2311.0576

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Accelerated Shapley Value Approximation for Data Evaluation

Watson, Lauren, Kujawa, Zeno, Andreeva, Rayna, Yang, Hao-Tsung, Elahi, Tariq, Sarkar, Rik

Data valuation has found various applications in machine learning, such as data filtering, efficient learning and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. While popular for its various applications, Shapley value is computationally expensive even to approximate, as it requires repeated iterations of training models on different subsets of data. In this paper we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that in fact models trained on small subsets are more important in the context of data valuation. Based on this idea, we describe $\delta$-Shapley -- a strategy of only using small subsets for the approximation. Experiments show that this approach preserves approximate value and rank of data, while achieving speedup of up to 9.9x. In pre-trained networks the approach is found to bring more efficiency in terms of accurate evaluation using small subsets.

algorithm, shapley value, valuation, (15 more...)

2311.05346

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Orange County > Irvine (0.04)

Genre: Research Report (1.00)

Industry: Education (0.34)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Kook, Yunbum, Vempala, Santosh S.

Gaussian Cooling and Dikin Walks: The Interior-Point Method for Logconcave Sampling

The connections between (convex) optimization and (logconcave) sampling have been considerably enriched in the past decade with many conceptual and mathematical analogies. For instance, the Langevin algorithm can be viewed as a sampling analogue of gradient descent and has condition-number-dependent guarantees on its performance. In the early 1990s, Nesterov and Nemirovski developed the Interior-Point Method (IPM) for convex optimization based on self-concordant barriers, providing efficient algorithms for structured convex optimization, often faster than the general method. This raises the following question: can we develop an analogous IPM for structured sampling problems? In 2012, Kannan and Narayanan proposed the Dikin walk for uniformly sampling polytopes, and an improved analysis was given in 2020 by Laddha-Lee-Vempala. The Dikin walk uses a local metric defined by a self-concordant barrier for linear constraints. Here we generalize this approach by developing and adapting IPM machinery together with the Dikin walk for poly-time sampling algorithms. Our IPM-based sampling framework provides an efficient warm start and goes beyond uniform distributions and linear constraints. We illustrate the approach on important special cases, in particular giving the fastest algorithms to sample uniform, exponential, or Gaussian distributions on a truncated PSD cone. The framework is general and can be applied to other sampling algorithms.

diag, dikin walk, lemma 4, (16 more...)

2307.12943

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Transformers learn to implement preconditioned gradient descent for in-context learning

Ahn, Kwangjun, Cheng, Xiang, Daneshmand, Hadi, Sra, Suvrit

Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instances? To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix not only adapts to the input distribution but also to the variance induced by data inadequacy. For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent. Our results call for future theoretical studies on learning algorithms by training transformers.

gradient descent, matrix, transformer, (15 more...)

2306.00297

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Son, Jaehyeon, Lee, Soochan, Kim, Gunhee

When Meta-Learning Meets Online and Continual Learning: A Survey

arXiv.org Machine LearningNov-9-2023

Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as "learning to learn," meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research.

artificial intelligence, bayesian inference, machine learning, (18 more...)

2311.05241

Country:

Asia > South Korea > Seoul > Seoul (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)

Genre:

Overview (1.00)
Research Report (0.81)

Industry: Education > Educational Setting > Online (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)