Ziyin, Liu
On the Stepwise Nature of Self-Supervised Learning
Simon, James B., Knutins, Maksis, Ziyin, Liu, Geisz, Daniel, Fetterman, Abraham J., Albrecht, Joshua
We present a simple picture of the training process of joint embedding self-supervised learning methods. We find that these methods learn their high-dimensional embeddings one dimension at a time in a sequence of discrete, well-separated steps. We arrive at this conclusion via the study of a linearized model of Barlow Twins applicable to the case in which the trained network is infinitely wide. We solve the training dynamics of this model from small initialization, finding that the model learns the top eigenmodes of a certain contrastive kernel in a stepwise fashion, and obtain a closed-form expression for the final learned representations. Remarkably, we then see the same stepwise learning phenomenon when training deep ResNets using the Barlow Twins, SimCLR, and VICReg losses. Our theory suggests that, just as kernel regression can be thought of as a model of supervised learning, kernel PCA may serve as a useful model of self-supervised learning.
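A rough numerical sketch of the linearized picture described above, with illustrative hyperparameters rather than the paper's exact setup: a linear map trained by gradient descent from small initialization on an unnormalized Barlow-Twins-style objective, printing the eigenvalues of the cross-correlation matrix so that the stepwise, one-dimension-at-a-time growth can be inspected.

```python
import numpy as np

# Sketch only: a linear "network" W trained on the unnormalized loss
#   L(W) = || C - I ||_F^2,  C = W A W^T,  A = (1/N) X1 X2^T,
# where X1, X2 are two noisy views of the same data.  Hyperparameters are
# illustrative; the paper's analysis covers the infinite-width kernel regime.
rng = np.random.default_rng(0)
N, d_in, d_out = 2048, 16, 8
X = np.sqrt(2.0 * 0.7 ** np.arange(d_in))[:, None] * rng.standard_normal((d_in, N))
X1 = X + 0.1 * rng.standard_normal((d_in, N))       # augmented view 1
X2 = X + 0.1 * rng.standard_normal((d_in, N))       # augmented view 2

A = X1 @ X2.T / N                                   # cross-covariance of the views
W = 1e-4 * rng.standard_normal((d_out, d_in))       # small initialization
I, lr = np.eye(d_out), 0.05

for t in range(400):
    C = W @ A @ W.T                                 # embedding cross-correlation
    G = 2 * ((C - I) @ W @ A.T + (C - I).T @ W @ A) # gradient of ||C - I||_F^2
    W -= lr * G
    if t % 25 == 0:                                 # eigenvalues tend to reach 1 one at a time
        print(t, np.round(np.sort(np.linalg.eigvalsh((C + C.T) / 2))[::-1], 2))
```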
What shapes the loss landscape of self-supervised learning?
Ziyin, Liu, Lubana, Ekdeep Singh, Ueda, Masahito, Tanaka, Hidenori
Prevention of complete and dimensional collapse of representations has recently become a design principle for self-supervised learning (SSL). However, questions remain in our theoretical understanding: When do those collapses occur? What are the mechanisms and causes? We answer these questions by deriving and thoroughly analyzing an analytically tractable theory of SSL loss landscapes. In this theory, we identify the causes of the dimensional collapse and study the effect of normalization and bias. Finally, we leverage the interpretability afforded by the analytical theory to understand how dimensional collapse can be beneficial and what affects the robustness of SSL against data imbalance.
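Not part of the paper, but a common diagnostic that connects to the collapse discussion above: complete and dimensional collapse can be read off the spectrum of the embedding covariance, for instance via a participation-ratio effective rank.

```python
import numpy as np

def effective_rank(embeddings: np.ndarray) -> float:
    """Participation ratio of the embedding covariance spectrum.

    A generic diagnostic (not from the paper): near zero under complete collapse,
    well below the embedding dimension under dimensional collapse, and close to
    the embedding dimension for a healthy representation.
    """
    eig = np.clip(np.linalg.eigvalsh(np.cov(embeddings, rowvar=False)), 0.0, None)
    return float(eig.sum() ** 2 / (np.square(eig).sum() + 1e-12))

rng = np.random.default_rng(0)
healthy = rng.standard_normal((1000, 64))
collapsed = healthy @ np.diag([1.0] * 8 + [1e-3] * 56)     # only 8 live directions
print(effective_rank(healthy), effective_rank(collapsed))  # roughly 64 vs roughly 8
```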
Theoretically Motivated Data Augmentation and Regularization for Portfolio Construction
Ziyin, Liu, Minami, Kentaro, Imajo, Kentaro
The task we consider is portfolio construction in a speculative market, a fundamental problem in modern finance. While various empirical works now exist to explore deep learning in finance, the theory side is almost non-existent. In this work, we focus on developing a theoretical framework for understanding the use of data augmentation for deep-learning-based approaches to quantitative finance. The proposed theory clarifies the role and necessity of data augmentation for finance; moreover, our theory implies that a simple algorithm that injects random noise of strength $\sqrt{|r_{t-1}|}$ into the observed return $r_{t}$ is better than injecting no noise and better than a few other financially irrelevant data augmentation techniques.
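A minimal sketch of the augmentation the abstract describes, under the assumption that the injected noise is zero-mean Gaussian; the `noise_scale` knob and the treatment of the first return are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def augment_returns(returns, rng, noise_scale=1.0):
    """Add zero-mean noise of strength sqrt(|r_{t-1}|) to each observed return r_t.

    Assumptions for this sketch: Gaussian noise, an extra `noise_scale` knob, and
    the first return left unperturbed because it has no predecessor.
    """
    r = np.asarray(returns, dtype=float)
    strength = np.concatenate(([0.0], np.sqrt(np.abs(r[:-1]))))
    return r + noise_scale * strength * rng.standard_normal(r.shape)

rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal(250)                        # toy daily returns
augmented = np.stack([augment_returns(r, rng) for _ in range(32)])
# `augmented` would then be fed to the portfolio model as extra training samples.
```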
Exact Solutions of a Deep Linear Network
Ziyin, Liu, Li, Botao, Meng, Xiangming
This work finds the exact solutions to a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that weight decay strongly interacts with the model architecture and can create bad minima in a network with more than $1$ hidden layer, a behavior qualitatively different from that of a network with only $1$ hidden layer. As an application, we also analyze stochastic nets and show that their prediction variance vanishes as the stochasticity, the width, or the depth tends to infinity.
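A toy check of the depth/weight-decay interaction described above (a scalar deep linear chain, not the paper's general solution): with weight decay, the origin becomes a genuine local minimum once there is more than one hidden layer, whereas with one hidden layer it is a saddle whenever the decay is weak enough.

```python
import numpy as np

# Scalar deep linear net f(x) = (w_1 * ... * w_D) * x with loss
#   L = E[(y - f(x))^2] + gamma * sum_i w_i^2.
# We scan L along the symmetric ray w_1 = ... = w_D = s near the origin.
rng = np.random.default_rng(0)
x = rng.standard_normal(10000)
y = 1.5 * x + 0.1 * rng.standard_normal(10000)        # so E[xy] is about 1.5
gamma = 0.3                                           # weight decay, smaller than E[xy]

def loss_along_ray(s, depth):
    return np.mean((y - (s ** depth) * x) ** 2) + gamma * depth * s ** 2

for depth in (2, 3):                                  # 2 weights = 1 hidden layer, 3 weights = 2
    vals = [loss_along_ray(s, depth) for s in np.linspace(0.0, 0.4, 9)]
    print(f"{depth} weights:", np.round(vals, 3))
# With gamma < E[xy], the depth-2 loss decreases immediately away from the origin
# (the origin is a saddle), while the depth-3 loss first rises: the origin is a
# bad local minimum created by weight decay.
```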
Stochastic Neural Networks with Infinite Width are Deterministic
Ziyin, Liu, Zhang, Hanlin, Meng, Xiangming, Lu, Yuting, Xing, Eric, Ueda, Masahito
Applications of neural networks have achieved great success in various fields. A major extension of standard neural networks is to make them stochastic, namely, to make the output a random function of the input. In a broad sense, stochastic neural networks include neural networks trained with dropout (Srivastava et al., 2014; Gal & Ghahramani, 2016), Bayesian networks (Mackay, 1992), variational autoencoders (VAE) (Kingma & Welling, 2013), and generative adversarial networks (Goodfellow et al., 2014). There are many reasons why one might want to make a neural network stochastic; two main ones are (1) regularization and (2) distribution modeling.
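A crude, untrained illustration of the flavor of this result (the paper's analysis is far more general and concerns trained networks): for a one-hidden-layer dropout network with a mean-field, 1/width output scaling, the mask-induced variance of the prediction shrinks roughly like 1/width.

```python
import numpy as np

# Assumption-laden sketch: random (untrained) weights, ReLU features, inverted
# dropout on the hidden units, and a 1/width readout.  We estimate the variance of
# the output across dropout masks for a single fixed input.
rng = np.random.default_rng(0)
x = rng.standard_normal(20)
p_keep, n_masks = 0.5, 2000

for width in (64, 256, 1024, 4096):
    W = rng.standard_normal((width, x.size)) / np.sqrt(x.size)
    a = rng.standard_normal(width)
    h = np.maximum(W @ x, 0.0)                            # hidden features
    masks = rng.binomial(1, p_keep, size=(n_masks, width)) / p_keep
    outputs = (masks * (a * h)).sum(axis=1) / width       # mean-field readout
    print(width, outputs.var())                           # decays roughly like 1/width
```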
SGD May Never Escape Saddle Points
Ziyin, Liu, Li, Botao, Ueda, Masahito
Stochastic gradient descent (SGD) has been deployed to solve highly non-linear and non-convex machine learning problems such as the training of deep neural networks. However, previous works on SGD often rely on highly restrictive and unrealistic assumptions about the nature of noise in SGD. In this work, we mathematically construct examples that defy previous understandings of SGD. For example, our constructions show that: (1) SGD may converge to a local maximum; (2) SGD may escape a saddle point arbitrarily slowly; (3) SGD may prefer sharp minima over flat ones; and (4) AMSGrad may converge to a local maximum. Our results suggest that the noise structure of SGD might be more important than the loss landscape in neural network training and that future research should focus on deriving the actual noise structure in deep learning.
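A minimal toy in the spirit of these constructions (not the paper's own example) for claim (1): a per-sample loss whose expectation has a maximum at zero, yet whose multiplicative minibatch noise pulls large-learning-rate SGD onto that maximum.

```python
import numpy as np

# Per-sample loss l(theta; a) = 0.5 * a * theta^2 with a uniform on {1, -3}.
# The expected loss is -0.5 * theta^2, so theta = 0 is a (global) maximum, but the
# stochastic gradient a * theta vanishes there, and the fate of |theta| is decided
# by the sign of E[log|1 - lr * a|].
rng = np.random.default_rng(0)

def run_sgd(lr, steps, theta0=0.5):
    theta = theta0
    for _ in range(steps):
        a = rng.choice([1.0, -3.0])
        theta -= lr * a * theta                    # SGD step on the sampled loss
    return abs(theta)

print("small lr:", run_sgd(lr=0.01, steps=2000))   # grows: escapes the maximum
print("large lr:", run_sgd(lr=0.9, steps=500))     # shrinks: converges to the maximum
# E[log|1 - lr*a|] is +0.0098 at lr = 0.01 (escape) and -0.50 at lr = 0.9 (collapse).
```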
Logarithmic landscape and power-law escape rate of SGD
Mori, Takashi, Ziyin, Liu, Liu, Kangqiao, Ueda, Masahito
Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a non-uniform transformation of the time variable. In the SDE, the gradient of the loss is replaced by that of the logarithmized loss. Consequently, we show that, near a local or global minimum, the stationary distribution $P_\mathrm{ss}(\theta)$ of the network parameters $\theta$ follows a power law with respect to the loss function $L(\theta)$, i.e., $P_\mathrm{ss}(\theta)\propto L(\theta)^{-\phi}$, with the exponent $\phi$ specified by the mini-batch size, the learning rate, and the Hessian at the minimum. We obtain a formula for the rate of escape from a local minimum, which is determined not by the loss barrier height $\Delta L=L(\theta^s)-L(\theta^*)$ between a minimum $\theta^*$ and a saddle $\theta^s$ but by the logarithmized loss barrier height $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$. Our escape-rate formula explains the empirical fact that SGD prefers flat minima with low effective dimensions.
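One rough way to probe the predicted power law empirically (an illustrative recipe, not the paper's procedure or its exponent formula): run SGD with minibatch noise on a one-dimensional mean-square loss whose minimum loss is nonzero, histogram the stationary samples, and fit log-density against log-loss.

```python
import numpy as np

# Illustrative recipe only; lr, batch size, and data are arbitrary choices.
rng = np.random.default_rng(0)
n, lr, batch = 200, 0.15, 4
x = rng.standard_normal(n)
y = 2.0 * x + 0.5 * rng.standard_normal(n)            # label noise keeps L(theta*) > 0

def full_loss(theta):
    return 0.5 * np.mean((theta * x - y) ** 2)

theta, samples = 2.0, []
for t in range(400_000):
    idx = rng.integers(0, n, size=batch)
    theta -= lr * np.mean((theta * x[idx] - y[idx]) * x[idx])
    if t > 10_000:                                    # discard the transient
        samples.append(theta)

hist, edges = np.histogram(np.array(samples), bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
keep = hist > 0.01 * hist.max()                       # ignore poorly sampled bins
logL = np.log([full_loss(c) for c in centers[keep]])
slope, _ = np.polyfit(logL, np.log(hist[keep]), 1)
print("fitted exponent phi:", -slope)                 # compare against the theory's phi
```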
On the Distributional Properties of Adaptive Gradients
Zhiyi, Zhang, Ziyin, Liu
However, not much is known about the mathematical and statistical properties of this family of methods. This work aims at providing a series of theoretical analyses of its statistical properties justified by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge. This work suggests that the divergence of variance is not the cause of the need for warm-up of the Adam optimizer, contrary to what is believed in the current literature. In this work, we take the first step toward studying a rather fundamental problem in the study of adaptive gradients: we propose to study the distributional properties of the update in the adaptive gradient method. The most closely related previous work is [Liu et al., 2019]; the difference is that this work goes much deeper into the theoretical analysis and contradicts the results in [Liu et al., 2019]. The main contributions of this work are the following: (1) we prove that the variance of the adaptive gradient method is always finite (Proposition 1), which contradicts the result in [Liu et al., 2019], and this proof does not make any assumption regarding the distribution of the gradient.
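A Monte Carlo sketch of the quantity discussed above, under the stated assumption of i.i.d. standard-normal gradients: the variance, across gradient draws, of the magnitude of the Adam update at step t, which should stay bounded rather than diverge.

```python
import numpy as np

# Standard Adam recursion applied to i.i.d. N(0, 1) scalar gradients; hyperparameters
# are the common defaults and are illustrative.
rng = np.random.default_rng(0)
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 1e-3
n_runs, n_steps = 5000, 200

g = rng.standard_normal((n_runs, n_steps))
m = np.zeros(n_runs)
v = np.zeros(n_runs)
for t in range(1, n_steps + 1):
    m = beta1 * m + (1 - beta1) * g[:, t - 1]
    v = beta2 * v + (1 - beta2) * g[:, t - 1] ** 2
    update = lr * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    if t in (1, 2, 5, 10, 50, 100, 200):
        print(t, np.abs(update).var())      # increases with t but stays bounded
```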
On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes
Ziyin, Liu, Liu, Kangqiao, Mori, Takashi, Ueda, Masahito
The noise in stochastic gradient descent (SGD), caused by minibatch sampling, remains poorly understood despite its enormous practical importance in offering good training efficiency and generalization ability. In this work, we study the minibatch noise in SGD. Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions that cause minibatch noise to emerge. We first derive analytically solvable results for linear regression under various settings, which we compare with the approximations commonly used to understand SGD noise. We show that some degree of mismatch between model and data complexity is needed for minibatch noise to arise, and that such mismatch may come from static noise in the labels or the input, from regularization, or from underparametrization. Our results motivate a more accurate general formulation to describe minibatch noise.
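A small sketch of the "mismatch is needed" point in the simplest setting the abstract mentions, linear regression (illustrative sizes, not the paper's derivations): at an interpolating overparametrized solution every per-sample gradient vanishes, so there is no minibatch noise, while underparametrization or weight decay restores it.

```python
import numpy as np

# At a minimizer, the minibatch-gradient fluctuation is (up to the batch-size factor)
# the spread of the per-sample gradients of the data term.
rng = np.random.default_rng(0)
n = 40
X = rng.standard_normal((n, 60))                       # overparametrized design (60 > 40)
y = X[:, :5] @ rng.standard_normal(5) + 0.3 * rng.standard_normal(n)

def noise_at(w, X_used):
    per_sample = (X_used @ w - y)[:, None] * X_used    # per-sample data gradients at w
    return per_sample.var(axis=0).sum()                # trace of their covariance

w_interp = np.linalg.pinv(X) @ y                       # interpolating least-squares fit
w_under = np.linalg.pinv(X[:, :3]) @ y                 # underparametrized fit (3 features)
w_ridge = np.linalg.solve(X.T @ X + 5.0 * np.eye(60), X.T @ y)   # weight decay

print("interpolating fit :", noise_at(w_interp, X))          # essentially zero
print("underparametrized :", noise_at(w_under, X[:, :3]))    # strictly positive
print("with weight decay :", noise_at(w_ridge, X))           # strictly positive
```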
Stochastic Gradient Descent with Large Learning Rate
Liu, Kangqiao, Ziyin, Liu, Ueda, Masahito
As a simple and efficient optimization method in deep learning, stochastic gradient descent (SGD) has attracted tremendous attention. In the vanishing learning rate regime, SGD is now relatively well understood, and the majority of theoretical approaches to SGD set their assumptions in the continuous-time limit. However, the continuous-time predictions are unlikely to reflect the experimental observations well because practice often operates in the large learning rate regime, where training is faster and the generalization of models is often better. In this paper, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and relating them to experimental observations. The main contributions of this work are to derive the stable distribution for discrete-time SGD in a quadratic loss function with and without momentum. Examples of applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of mini-batch noise, the escape rate from a sharp minimum, and the stationary distribution of a few second-order methods.
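A sketch of the discrete-time point in the simplest special case, a one-dimensional quadratic loss with additive Gaussian gradient noise (the paper's results for genuine minibatch noise and momentum are more general): the exact discrete-time stationary variance lr*sigma^2 / (k*(2 - lr*k)) departs markedly from the continuous-time prediction lr*sigma^2 / (2k) once lr*k is not small.

```python
import numpy as np

# Quadratic loss 0.5 * k * theta^2 with gradient noise of variance sigma^2; the
# parameters are chosen so that lr * k is order one, i.e. the large-learning-rate regime.
rng = np.random.default_rng(0)
k, sigma, lr, steps = 1.5, 1.0, 0.9, 500_000

theta, second_moment, count = 0.0, 0.0, 0
for t in range(steps):
    theta -= lr * (k * theta + sigma * rng.standard_normal())   # noisy gradient step
    if t > 1_000:                                               # discard the transient
        second_moment += theta ** 2
        count += 1

print("empirical variance :", second_moment / count)
print("discrete-time      :", lr * sigma**2 / (k * (2 - lr * k)))
print("continuous-time    :", lr * sigma**2 / (2 * k))
```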