Ueda, Masahito
Symbolic Equation Solving via Reinforcement Learning
Dabelow, Lennart, Ueda, Masahito
Machine-learning methods are gradually being adopted in a great variety of social, economic, and scientific contexts, yet they are notorious for struggling with exact mathematics. A typical example is computer algebra, which includes tasks like simplifying mathematical terms, calculating formal derivatives, or finding exact solutions of algebraic equations. Traditional software packages for these purposes are commonly based on a huge database of rules for how a specific operation (e.g., differentiation) transforms a certain term (e.g., sine function) into another one (e.g., cosine function). Thus far, these rules have usually needed to be discovered and subsequently programmed by humans. Focusing on the paradigmatic example of solving linear equations in symbolic form, we demonstrate how the process of finding elementary transformation rules and step-by-step solutions can be automated using reinforcement learning with deep neural networks.
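To make the setup concrete, here is a toy sketch (my own illustration of the general idea, not the authors' environment, rules, or reward): a linear equation $ax + b = c$ is represented by its coefficients, the actions are elementary transformations, and a reward signals when the equation has been reduced to solved form. A random policy stands in for the trained agent.

    import random

    ACTIONS = ["sub_b", "div_a"]  # subtract the constant term, divide by the coefficient of x

    def step(state, action):
        a, b, c = state  # encodes the equation a*x + b = c
        if action == "sub_b":
            state = (a, 0.0, c - b)
        elif action == "div_a" and a != 0:
            state = (1.0, b / a, c / a)
        solved = state[0] == 1.0 and state[1] == 0.0  # equation now reads x = c
        reward = 1.0 if solved else -0.1              # small penalty per extra step
        return state, reward, solved

    # Random policy as a stand-in for a policy learned by reinforcement learning.
    state, done, steps = (3.0, 5.0, 11.0), False, 0   # 3x + 5 = 11
    while not done and steps < 10:
        state, _, done = step(state, random.choice(ACTIONS))
        steps += 1
    if done:
        print("solved: x =", state[2], "in", steps, "steps")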
The Probabilistic Stability of Stochastic Gradient Descent
Ziyin, Liu, Li, Botao, Galanti, Tomer, Ueda, Masahito
Characterizing and understanding the stability of Stochastic Gradient Descent (SGD) remains an open problem in deep learning. A common approach is to quantify stability via the convergence of statistical moments, especially the variance, of the parameters. We revisit the definition of stability for SGD and propose using the $\textit{convergence in probability}$ condition to define the $\textit{probabilistic stability}$ of SGD. Probabilistic stability sheds light on a fundamental question in deep learning theory: how SGD selects a meaningful solution for a neural network from an enormous number of possible solutions that may severely overfit. We show that only through the lens of probabilistic stability does SGD exhibit rich and practically relevant phases of learning, such as complete loss of stability, incorrect learning in which the model captures incorrect data correlations, convergence to low-rank saddles, and correct learning in which the model captures the correct correlations. These phase boundaries are precisely quantified by the Lyapunov exponents of the dynamics. The resulting phase diagrams imply that SGD prefers low-rank saddles in a neural network when the underlying gradient is noisy, thereby influencing the learning performance.
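A minimal numerical sketch of the distinction (the toy recursion and its parameters are my own assumptions, not the paper's setup): a linear multiplicative update $x_{t+1} = A_t x_t$, the simplest caricature of SGD linearized around a fixed point, converges to zero in probability whenever the Lyapunov exponent $E[\log|A_t|]$ is negative, even though its second moment diverges because $E[A_t^2] > 1$; moment-based stability and probabilistic stability therefore disagree.

    import numpy as np

    rng = np.random.default_rng(0)
    A_vals = np.array([0.1, 2.0])          # E[log A] < 0 while E[A^2] > 1

    n_runs, T = 10000, 200
    A = rng.choice(A_vals, size=(n_runs, T))
    x = np.cumprod(A, axis=1)              # trajectories x_t starting from x_0 = 1

    print("Lyapunov exponent E[log A]:", np.mean(np.log(A_vals)))    # about -0.80
    print("fraction of runs with |x_T| < 1e-3:", np.mean(np.abs(x[:, -1]) < 1e-3))
    print("E[A^2] =", np.mean(A_vals**2), "(> 1, so E[x_t^2] grows exponentially)")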
Law of Balance and Stationary Distribution of Stochastic Gradient Descent
Ziyin, Liu, Li, Hongchao, Ueda, Masahito
Stochastic gradient descent (SGD) is the standard algorithm for training neural networks. However, it remains poorly understood how SGD navigates the highly nonlinear and degenerate loss landscape of a neural network. In this work, we prove that the minibatch noise of SGD regularizes the parameters towards a balanced solution whenever the loss function contains a rescaling symmetry. Because the difference between a simple diffusion process and SGD dynamics is most significant when symmetries are present, our theory implies that the symmetries of the loss function constitute an essential probe of how SGD works. We then apply this result to derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width. The stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, broken ergodicity, and fluctuation inversion. These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
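A toy numerical sketch (my own setup, not the paper's model or proof): the depth-2 scalar model $f(x) = uwx$ has the rescaling symmetry $(u, w) \to (cu, w/c)$. Gradient flow conserves $u^2 - w^2$ exactly, and full-batch gradient descent stops changing it once the gradient vanishes, whereas minibatch noise keeps shrinking it, driving the factors toward the balanced solution $|u| = |w|$.

    import numpy as np

    rng = np.random.default_rng(1)
    n, lr, batch = 512, 0.1, 8
    x = rng.normal(size=n)
    y = 1.5 * x + 0.5 * rng.normal(size=n)      # label noise keeps SGD noise alive at the minimum

    u, w = 3.0, 0.1                             # deliberately unbalanced initialization
    for t in range(20000):
        idx = rng.integers(0, n, size=batch)
        err = u * w * x[idx] - y[idx]           # minibatch residuals of the model f(x) = u*w*x
        g = np.mean(err * x[idx])               # d(loss)/d(u*w) on the minibatch
        u, w = u - lr * w * g, w - lr * u * g   # (u^2 - w^2) is multiplied by (1 - lr^2 g^2) each step
    print("u*w =", u * w, "   u^2 - w^2 =", u**2 - w**2)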
What shapes the loss landscape of self-supervised learning?
Ziyin, Liu, Lubana, Ekdeep Singh, Ueda, Masahito, Tanaka, Hidenori
Prevention of complete and dimensional collapse of representations has recently become a design principle for self-supervised learning (SSL). However, questions remain in our theoretical understanding: When do those collapses occur? What are the mechanisms and causes? We answer these questions by deriving and thoroughly analyzing an analytically tractable theory of SSL loss landscapes. In this theory, we identify the causes of the dimensional collapse and study the effect of normalization and bias. Finally, we leverage the interpretability afforded by the analytical theory to understand how dimensional collapse can be beneficial and what affects the robustness of SSL against data imbalance.
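A minimal diagnostic sketch (a standard empirical check under a toy setup of my own, not the paper's analytical theory): dimensional collapse shows up as a degenerate eigenvalue spectrum of the covariance of the learned representations, which can be summarized by an effective rank.

    import numpy as np

    def effective_rank(z):
        """Effective rank of representations z with shape (n_samples, dim)."""
        cov = np.cov(z, rowvar=False)
        eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
        p = eig / eig.sum()
        return float(np.exp(-(p * np.log(p)).sum()))   # exponential of the spectral entropy

    rng = np.random.default_rng(0)
    healthy = rng.normal(size=(1000, 64))                       # full-rank features
    collapsed = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 64))  # variance in 3 directions
    print("effective rank, healthy:  ", effective_rank(healthy))
    print("effective rank, collapsed:", effective_rank(collapsed))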
Stochastic Neural Networks with Infinite Width are Deterministic
Ziyin, Liu, Zhang, Hanlin, Meng, Xiangming, Lu, Yuting, Xing, Eric, Ueda, Masahito
Applications of neural networks have achieved great success in various fields. A major extension of the standard neural networks is to make them stochastic, namely, to make the output a random function of the input. In a broad sense, stochastic neural networks include neural networks trained with dropout (Srivastava et al., 2014; Gal & Ghahramani, 2016), Bayesian networks (Mackay, 1992), variational autoencoders (VAE) (Kingma & Welling, 2013), and generative adversarial networks (Goodfellow et al., 2014). There are many reasons why one wants to make a neural network stochastic. Two main reasons are (1) regularization and (2) distribution modeling.
Interplay between depth of neural networks and locality of target functions
Mori, Takashi, Ueda, Masahito
Deep neural networks (DNNs) have achieved unparalleled success in various tasks of artificial intelligence such as image classification [1, 2] and speech recognition [3]. Empirically, DNNs often outperform other machine learning methods such as kernel methods and Gaussian processes, but little is known about the mechanism underlying this outstanding performance. To elucidate the benefits of depth, numerous studies have investigated properties of DNNs from various perspectives. Approximation theory focuses on the expressive power of DNNs [4]. Although the universal approximation theorem states that a sufficiently wide neural network with a single hidden layer can approximate any continuous function, the expressivity of a DNN grows exponentially with depth rather than width [5-8]. In statistical learning theory, the decay rate of the generalization error in the large-sample asymptotics has been analyzed. For learning generic smooth functions, shallow networks or other standard methods with linear estimators, such as kernel methods, already achieve the optimal rate [9], and hence the benefits of depth are not obvious.
A Convergent and Efficient Deep Q Network Algorithm
Wang, Zhikang T., Ueda, Masahito
Despite the empirical success of the deep Q network (DQN) reinforcement learning algorithm and its variants, DQN is still not well understood and does not guarantee convergence. In this work, we show that DQN can diverge and cease to operate in realistic settings. Although there exist gradient-based convergent methods, we show that they actually have inherent problems in learning behaviour and elucidate why they often fail in practice. To overcome these problems, we propose a convergent DQN algorithm (C-DQN) by carefully modifying DQN, and we show that the algorithm is convergent and can work with large discount factors (0.9998). It learns robustly in difficult settings and can learn several difficult games in the Atari 2600 benchmark where DQN fails, within a moderate computational budget. Our code has been publicly released and can be used to reproduce our results.
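For background, a brief sketch (standard textbook updates for a linear Q-function, not the paper's C-DQN modification; the function and variable names are mine) contrasting the semi-gradient update underlying DQN with the residual-gradient update, the kind of gradient-based convergent method referred to above.

    import numpy as np

    def td_error(w, phi_sa, phi_next_best, r, gamma):
        # Bellman error for a linear Q-function Q(s, a) = w . phi(s, a),
        # where phi_next_best are the features of the greedy next action.
        return r + gamma * w @ phi_next_best - w @ phi_sa

    def semi_gradient_step(w, phi_sa, phi_next_best, r, gamma, lr):
        # DQN-style: the bootstrapped target is treated as a constant.
        delta = td_error(w, phi_sa, phi_next_best, r, gamma)
        return w + lr * delta * phi_sa

    def residual_gradient_step(w, phi_sa, phi_next_best, r, gamma, lr):
        # True gradient of the squared Bellman error; convergent but often slow.
        delta = td_error(w, phi_sa, phi_next_best, r, gamma)
        return w + lr * delta * (phi_sa - gamma * phi_next_best)

    # Tiny usage on a single made-up transition.
    w = np.zeros(4)
    phi_sa, phi_next = np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0])
    print(semi_gradient_step(w, phi_sa, phi_next, r=1.0, gamma=0.99, lr=0.1))
    print(residual_gradient_step(w, phi_sa, phi_next, r=1.0, gamma=0.99, lr=0.1))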
SGD May Never Escape Saddle Points
Ziyin, Liu, Li, Botao, Ueda, Masahito
Stochastic gradient descent (SGD) has been deployed to solve highly non-linear and non-convex machine learning problems such as the training of deep neural networks. However, previous works on SGD often rely on highly restrictive and unrealistic assumptions about the nature of noise in SGD. In this work, we mathematically construct examples that defy previous understandings of SGD. For example, our constructions show that: (1) SGD may converge to a local maximum; (2) SGD may escape a saddle point arbitrarily slowly; (3) SGD may prefer sharp minima over flat ones; and (4) AMSGrad may converge to a local maximum. Our results suggest that the noise structure of SGD might be more important than the loss landscape in neural network training and that future research should focus on deriving the actual noise structure in deep learning.
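A deliberately artificial sketch in the spirit of point (1) (my own construction, not one of the paper's examples): for $L(x) = -x^2/2$ the point $x = 0$ is a local maximum, yet SGD driven by an unbiased but multiplicative stochastic gradient converges to it, because the per-step factors contract on average in the logarithm (negative Lyapunov exponent) even though the mean gradient points away.

    import numpy as np

    rng = np.random.default_rng(0)
    lr = 0.2
    z_vals = np.array([-3.0, 5.0])   # E[z] = 1, so the stochastic gradient is unbiased

    x = 1.0
    for t in range(2000):
        grad = -x                    # gradient of L(x) = -x**2 / 2
        g = rng.choice(z_vals) * grad
        x = x - lr * g               # equivalently x *= 1 + lr*z, with E[log|1 + lr*z|] < 0
    print("final |x| (converged to the local maximum at 0):", abs(x))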
Logarithmic landscape and power-law escape rate of SGD
Mori, Takashi, Ziyin, Liu, Liu, Kangqiao, Ueda, Masahito
Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a non-uniform transformation of the time variable. In the SDE, the gradient of the loss is replaced by that of the logarithmized loss. Consequently, we show that, near a local or global minimum, the stationary distribution $P_\mathrm{ss}(\theta)$ of the network parameters $\theta$ follows a power law with respect to the loss function $L(\theta)$, i.e., $P_\mathrm{ss}(\theta)\propto L(\theta)^{-\phi}$, with the exponent $\phi$ specified by the mini-batch size, the learning rate, and the Hessian at the minimum. We also obtain a formula for the escape rate from a local minimum, which is determined not by the loss barrier height $\Delta L=L(\theta^s)-L(\theta^*)$ between a minimum $\theta^*$ and a saddle $\theta^s$ but by the logarithmized loss barrier height $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$. Our escape-rate formula explains the empirical fact that SGD prefers flat minima with low effective dimensions.
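A quick numerical sanity check of the multiplicative-noise premise (the 1-D regression setup and its numbers are my own toy assumptions): for the mean-square loss, the per-sample gradient variance scales with the loss itself, so the ratio of the two stays roughly constant as the parameter moves away from the minimum.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100000
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)   # nearly noiseless labels, minimizer near theta = 2

    def loss_and_grad_var(theta):
        r = theta * x - y                    # residuals
        loss = 0.5 * np.mean(r**2)
        per_sample_grads = r * x             # d/dtheta of 0.5 * r_i**2
        return loss, np.var(per_sample_grads)

    for theta in [2.5, 3.0, 4.0]:
        L, v = loss_and_grad_var(theta)
        print(f"theta={theta}: loss={L:.3f}, per-sample grad variance={v:.3f}, ratio={v / L:.2f}")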
On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes
Ziyin, Liu, Liu, Kangqiao, Mori, Takashi, Ueda, Masahito
The noise in stochastic gradient descent (SGD), caused by minibatch sampling, remains poorly understood despite its enormous practical importance for training efficiency and generalization. In this work, we study the minibatch noise in SGD. Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions under which minibatch noise emerges. We first derive analytically solvable results for linear regression under various settings, which we compare with the approximations commonly used to understand SGD noise. We show that some degree of mismatch between model and data complexity is needed for minibatch sampling to produce noise, and that such a mismatch may be due to noise in the labels or in the input, the use of regularization, or underparametrization. Our results motivate a more accurate general formulation to describe minibatch noise.
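A toy check of the "no mismatch, no noise" point (my own example, not the paper's derivations): at an interpolating minimum where the model fits every sample exactly, every per-sample gradient vanishes and minibatch sampling produces no noise; adding label noise, one form of model/data mismatch, makes minibatch noise reappear.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    x = rng.normal(size=n)

    def minibatch_grad_var(y, theta, batch=16, draws=5000):
        grads = []
        for _ in range(draws):
            idx = rng.integers(0, n, size=batch)
            r = theta * x[idx] - y[idx]
            grads.append(np.mean(r * x[idx]))   # minibatch gradient of the mean of 0.5*r^2
        return np.var(grads)

    y_clean = 2.0 * x                                   # data generated by the model itself
    y_noisy = 2.0 * x + 0.3 * rng.normal(size=n)        # label noise added
    theta_clean = 2.0                                   # exact interpolating minimizer
    theta_noisy = np.sum(x * y_noisy) / np.sum(x**2)    # least-squares minimizer
    print("minibatch noise at the clean minimum:", minibatch_grad_var(y_clean, theta_clean))
    print("minibatch noise at the noisy minimum:", minibatch_grad_var(y_noisy, theta_noisy))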