AITopics | Şimşekli, Umut

Collaborating Authors

Şimşekli, Umut

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent

Lim, Soon Hoe, Wan, Yijun, Şimşekli, Umut

arXiv.org Artificial IntelligenceOct-22-2022

Recent studies have shown that gradient descent (GD) can achieve improved generalization when its dynamics exhibits a chaotic behavior. However, to obtain the desired effect, the step-size should be chosen sufficiently large, a task which is problem dependent and can be difficult in practice. In this study, we incorporate a chaotic component to GD in a controlled manner, and introduce multiscale perturbed GD (MPGD), a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system. We analyze MPGD from three different angles: (i) By building up on recent advances in rough paths theory, we show that, under appropriate assumptions, as the step-size decreases, the MPGD recursion converges weakly to a stochastic differential equation (SDE) driven by a heavy-tailed L\'evy-stable process. (ii) By making connections to recently developed generalization bounds for heavy-tailed processes, we derive a generalization bound for the limiting SDE and relate the worst-case generalization error over the trajectories of the process to the parameters of MPGD. (iii) We analyze the implicit regularization effect brought by the dynamical regularization and show that, in the weak perturbation regime, MPGD introduces terms that penalize the Hessian of the loss function. Empirical results are provided to demonstrate the advantages of MPGD.

artificial intelligence, machine learning, mpgd, (19 more...)

arXiv.org Artificial Intelligence

2205.11361

Country: Europe (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)

Add feedback

Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms

Sefidgaran, Milad, Gohari, Amin, Richard, Gaël, Şimşekli, Umut

arXiv.org Machine LearningJun-29-2022

Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2203.02474

Country: Europe (0.67)

Genre: Research Report > New Finding (0.86)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Intrinsic Dimension, Persistent Homology and Generalization in Neural Networks

Birdal, Tolga, Lou, Aaron, Guibas, Leonidas, Şimşekli, Umut

arXiv.org Artificial IntelligenceNov-25-2021

Disobeying the classical wisdom of statistical learning theory, modern deep neural networks generalize well even though they typically contain millions of parameters. Recently, it has been shown that the trajectories of iterative optimization algorithms can possess fractal structures, and their generalization error can be formally linked to the complexity of such fractals. This complexity is measured by the fractal's intrinsic dimension, a quantity usually much smaller than the number of parameters in the network. Even though this perspective provides an explanation for why overparametrized networks would not overfit, computing the intrinsic dimension (e.g., for monitoring generalization during training) is a notoriously difficult task, where existing methods typically fail even in moderate ambient dimensions. In this study, we consider this problem from the lens of topological data analysis (TDA) and develop a generic computational tool that is built on rigorous mathematical foundations. By making a novel connection between learning theory and TDA, we first illustrate that the generalization error can be equivalently bounded in terms of a notion called the 'persistent homology dimension' (PHD), where, compared with prior work, our approach does not require any additional geometrical or statistical assumptions on the training dynamics. Then, by utilizing recently established theoretical results and TDA tools, we develop an efficient algorithm to estimate PHD in the scale of modern deep neural networks and further provide visualization tools to help understand generalization in deep learning. Our experiments show that the proposed approach can efficiently compute a network's intrinsic dimension in a variety of settings, which is predictive of the generalization error.

artificial intelligence, dimension, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2111.13171

Country:

Oceania > Australia (0.14)
Europe (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Generalization Properties of Stochastic Optimizers via Trajectory Analysis

Hodgkinson, Liam, Şimşekli, Umut, Khanna, Rajiv, Mahoney, Michael W.

arXiv.org Machine LearningAug-2-2021

Despite the ubiquitous use of stochastic optimization algorithms in machine learning, the precise impact of these algorithms on generalization performance in realistic non-convex settings is still poorly understood. In this paper, we provide an encompassing theoretical framework for investigating the generalization properties of stochastic optimizers, which is based on their dynamics. We first prove a generalization bound attributable to the optimizer dynamics in terms of the celebrated Fernique-Talagrand functional applied to the trajectory of the optimizer. This data- and algorithm-dependent bound is shown to be the sharpest possible in the absence of further assumptions. We then specialize this result by exploiting the Markovian structure of stochastic optimizers, deriving generalization bounds in terms of the (data-dependent) transition kernels associated with the optimization algorithms. In line with recent work that has revealed connections between generalization and heavy-tailed behavior in stochastic optimization, we link the generalization error to the local tail behavior of the transition kernels. We illustrate that the local power-law exponent of the kernel acts as an effective dimension, which decreases as the transitions become "less Gaussian". We support our theory with empirical results from a variety of neural networks, and we show that both the Fernique-Talagrand functional and the local power-law exponent are predictive of generalization performance.

artificial intelligence, neural network, optimizer, (17 more...)

arXiv.org Machine Learning

2108.00781

Country:

North America > United States > California (0.14)
North America > United States > New York (0.14)
Europe > United Kingdom > England (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)

Add feedback

Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections

Nadjahi, Kimia, Durmus, Alain, Jacob, Pierre E., Badeau, Roland, Şimşekli, Umut

arXiv.org Machine LearningJun-29-2021

The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive nonasymptotical guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem.

artificial intelligence, neural network, projection, (19 more...)

arXiv.org Machine Learning

2106.15427

Country: North America > United States > California (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Relative Positional Encoding for Transformers with Linear Complexity

Liutkus, Antoine, Cífka, Ondřej, Wu, Shih-Lun, Şimşekli, Umut, Yang, Yi-Hsuan, Richard, Gaël

arXiv.org Machine LearningJun-10-2021

Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear-variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement to the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.

deep learning, neural network, transformer, (20 more...)

arXiv.org Machine Learning

2105.08399

Country:

Europe > France (0.46)
North America > Canada > Ontario > Toronto (0.14)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.46)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms

Camuto, Alexander, Deligiannidis, George, Erdogdu, Murat A., Gürbüzbalaban, Mert, Şimşekli, Umut, Zhu, Lingjiong

arXiv.org Machine LearningJun-9-2021

Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still theoretically not clear which properties of the data and the algorithm determine the generalization performance. In this study, we approach this problem from a dynamical systems theory perspective and represent stochastic optimization algorithms as random iterated function systems (IFS). Well studied in the dynamical systems literature, under mild assumptions, such IFSs can be shown to be ergodic with an invariant measure that is often supported on sets with a fractal structure. As our main contribution, we prove that the generalization error of a stochastic optimization algorithm can be bounded based on the `complexity' of the fractal structure that underlies its invariant measure. Leveraging results from dynamical systems theory, we show that the generalization error can be explicitly linked to the choice of the algorithm (e.g., stochastic gradient descent -- SGD), algorithm hyperparameters (e.g., step-size, batch-size), and the geometry of the problem (e.g., Hessian of the loss). We further specialize our results to specific problems (e.g., linear/logistic regression, one hidden-layered neural networks) and algorithms (e.g., SGD and preconditioned variants), and obtain analytical estimates for our bound.For modern neural networks, we develop an efficient algorithm to compute the developed bound and support our theory with various experiments on neural networks.

deep learning, dimension, neural network, (18 more...)

arXiv.org Machine Learning

2106.04881

Country:

Europe > United Kingdom > England (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks

Barsbey, Melih, Sefidgaran, Milad, Erdogdu, Murat A., Richard, Gaël, Şimşekli, Umut

arXiv.org Machine LearningJun-7-2021

Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying cause that makes the networks amenable to such simple compression schemes is still missing. In this study, we address this fundamental question and reveal that the dynamics of the training algorithm has a key role in obtaining such compressible networks. Focusing our attention on stochastic gradient descent (SGD), our main contribution is to link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently, (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution. In the case where these two phenomena occur simultaneously, we prove that the networks are guaranteed to be '$\ell_p$-compressible', and the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. We further prove generalization bounds adapted to our theoretical framework, which indeed confirm that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios introduce heavy-tails, which, in combination with overparametrization, result in compressibility.

deep learning, neural network, ompressibility, (20 more...)

arXiv.org Machine Learning

2106.03795

Country:

Europe > France (0.28)
North America > United States (0.28)
North America > Canada > Ontario > Toronto (0.14)
Asia > Middle East > Republic of Türkiye (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance

Wang, Hongjian, Gürbüzbalaban, Mert, Zhu, Lingjiong, Şimşekli, Umut, Erdogdu, Murat A.

arXiv.org Machine LearningFeb-20-2021

Recent studies have provided both empirical and theoretical evidence illustrating that heavy tails can emerge in stochastic gradient descent (SGD) in various scenarios. Such heavy tails potentially result in iterates with diverging variance, which hinders the use of conventional convergence analysis techniques that rely on the existence of the second-order moments. In this paper, we provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance, for a class of strongly convex objectives. In the case where the $p$-th moment of the noise exists for some $p\in [1,2)$, we first identify a condition on the Hessian, coined '$p$-positive (semi-)definiteness', that leads to an interesting interpolation between positive semi-definite matrices ($p=2$) and diagonally dominant matrices with non-negative diagonal entries ($p=1$). Under this condition, we then provide a convergence rate for the distance to the global optimum in $L^p$. Furthermore, we provide a generalized central limit theorem, which shows that the properly scaled Polyak-Ruppert averaging converges weakly to a multivariate $\alpha$-stable random vector. Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without necessitating any modification neither to the loss function or to the algorithm itself, as typically required in robust statistics. We demonstrate the implications of our results to applications such as linear regression and generalized linear models subject to heavy-tailed data.

artificial intelligence, machine learning, variance, (17 more...)

arXiv.org Machine Learning

2102.10346

Country:

North America > United States (0.28)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections

Camuto, Alexander, Wang, Xiaoyu, Zhu, Lingjiong, Holmes, Chris, Gürbüzbalaban, Mert, Şimşekli, Umut

arXiv.org Machine LearningFeb-13-2021

Gaussian noise injections (GNIs) are a family of simple and widely-used regularisation methods for training neural networks, where one injects additive or multiplicative Gaussian noise to the network activations at every iteration of the optimisation algorithm, which is typically chosen as stochastic gradient descent (SGD). In this paper we focus on the so-called `implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of SGD. We show that this effect induces an asymmetric heavy-tailed noise on SGD gradient updates. In order to model this modified dynamics, we first develop a Langevin-like stochastic differential equation that is driven by a general family of asymmetric heavy-tailed noise. Using this model we then formally prove that GNIs induce an `implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry. Our empirical results confirm that different types of neural networks trained with GNIs are well-modelled by the proposed dynamics and that the implicit effect of these injections induces a bias that degrades the performance of networks.

artificial intelligence, neural network, noise, (16 more...)

arXiv.org Machine Learning

2102.07006

Country:

Europe > United Kingdom > England (0.14)
North America > United States > Massachusetts (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Energy > Oil & Gas (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Add feedback