Collaborating Authors

Sun, Ruoyu


The Global Landscape of Neural Networks: An Overview

arXiv.org Machine Learning

One of the major concerns in neural network training is that the non-convexity of the associated loss functions may cause a bad landscape. The recent success of neural networks suggests that their loss landscape is not too bad, but what specific results do we know about the landscape? In this article, we review recent findings and results on the global landscape of neural networks. First, we point out that wide neural nets may have sub-optimal local minima under certain assumptions. Second, we discuss a few rigorous results on the geometric properties of wide networks, such as "no bad basin", and some modifications that eliminate sub-optimal local minima and/or decreasing paths to infinity. Third, we discuss visualization and empirical explorations of the landscape for practical neural nets. Finally, we briefly discuss some convergence results and their relation to landscape results.
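
The third point above, visualizing the landscape, is often done empirically by evaluating the loss along a one-dimensional slice through a chosen point in parameter space. Below is a minimal NumPy sketch of that idea on a hypothetical two-layer network with synthetic data; the architecture, data, and the random "trained" point are illustrative assumptions, not taken from the article.

```python
# Sketch: probing a neural-net loss landscape along one random direction.
# Hypothetical toy setup (2-layer net, synthetic data); not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 64, 5, 32                       # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))

def loss(params):
    W1, W2 = params
    H = np.tanh(X @ W1)                   # hidden layer
    return 0.5 * np.mean((H @ W2 - y) ** 2)

# A base point (here just random, for illustration) and a random direction.
theta = [rng.normal(size=(d, m)), rng.normal(size=(m, 1))]
direction = [rng.normal(size=p.shape) for p in theta]

# Evaluate the loss along theta + alpha * direction, the 1-D slice
# commonly used for empirical landscape visualization.
for alpha in np.linspace(-1.0, 1.0, 9):
    shifted = [p + alpha * dp for p, dp in zip(theta, direction)]
    print(f"alpha={alpha:+.2f}  loss={loss(shifted):.4f}")
```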


Global Convergence and Induced Kernels of Gradient-Based Meta-Learning with Neural Nets

arXiv.org Machine Learning

Gradient-based meta-learning (GBML) with deep neural nets (DNNs) has become a popular approach for few-shot learning. However, due to the non-convexity of DNNs and the complex bi-level optimization in GBML, the theoretical properties of GBML with DNNs remain largely unknown. In this paper, we first develop a novel theoretical analysis to answer the following question: does GBML with DNNs have global convergence guarantees? We provide a positive answer by proving that GBML with over-parameterized DNNs is guaranteed to converge to global optima at a linear rate. The second question we aim to address is: how does GBML achieve fast adaptation to new tasks using experience from past similar tasks? To answer it, we prove that GBML is equivalent to a functional gradient descent operation that explicitly propagates experience from past tasks to new ones. Finally, inspired by our theoretical analysis, we develop a new kernel-based meta-learning approach. We show that the proposed approach outperforms GBML with standard DNNs on the Omniglot dataset when the number of past tasks available for meta-training is small. The code is available at https://github.com/AI-secure/Meta-Neural-Kernel.
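
For readers unfamiliar with GBML, the bi-level structure mentioned above is easiest to see in code: an inner gradient step adapts to a sampled task, and an outer step updates the meta-parameters. The sketch below uses toy linear-regression tasks and a first-order (FOMAML-style) outer update for brevity; the task family, step sizes, and first-order simplification are assumptions, not the paper's setup.

```python
# Sketch: one meta-training loop for gradient-based meta-learning (GBML) on
# toy linear-regression tasks, using a first-order (FOMAML-style) outer update
# for brevity. Task family, step sizes, and the first-order simplification are
# illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d, n_shot, inner_lr, outer_lr = 8, 10, 0.05, 0.02
theta = np.zeros(d)                               # meta-parameters
w_base = rng.normal(size=d)                       # shared structure across tasks

def make_batch(w_task, n):
    """Data for one task: y = X @ w_task + small noise."""
    X = rng.normal(size=(n, d))
    return X, X @ w_task + 0.01 * rng.normal(size=n)

def grad(params, X, y):
    """Gradient of the mean squared loss 0.5 * ||X params - y||^2 / n."""
    return X.T @ (X @ params - y) / len(y)

for step in range(2000):
    w_task = w_base + 0.1 * rng.normal(size=d)    # sample a related task
    X_s, y_s = make_batch(w_task, n_shot)         # support set (adaptation)
    X_q, y_q = make_batch(w_task, n_shot)         # query set (meta-update)
    # Inner loop: one gradient step adapts theta to the sampled task.
    theta_adapted = theta - inner_lr * grad(theta, X_s, y_s)
    # Outer loop: first-order meta-update using the query-set gradient
    # evaluated at the adapted parameters.
    theta = theta - outer_lr * grad(theta_adapted, X_q, y_q)

# The meta-parameters should move toward the shared component w_base,
# from which one inner step adapts quickly to any sampled task.
print("||theta - w_base|| after meta-training:", np.linalg.norm(theta - w_base))
```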


Adding One Neuron Can Eliminate All Bad Local Minima

Neural Information Processing Systems

One of the main difficulties in analyzing neural networks is the non-convexity of the loss function, which may have many bad local minima. In this paper, we study the landscape of neural networks for binary classification tasks. Under mild assumptions, we prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.
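
As a rough illustration of the construction described above, the sketch below augments a binary classifier's output with a single skip-connected auxiliary neuron and adds a regularizer on its coefficient. The exponential unit, the logistic loss, and the squared-coefficient regularizer are assumptions made for illustration; consult the paper for the exact form of the augmentation.

```python
# Sketch of the "add one neuron" idea as read from the abstract: augment a
# binary classifier's output with a single auxiliary unit connected to the
# output by a skip connection, and regularize its coefficient. The exponential
# unit and the squared-coefficient regularizer are assumptions, not
# necessarily the paper's exact construction.
import numpy as np

def augmented_output(x, base_net, a, w, b):
    """Original network output plus one skip-connected auxiliary neuron."""
    return base_net(x) + a * np.exp(x @ w + b)

def augmented_loss(X, y, base_net, a, w, b, lam=1e-2):
    # Logistic loss on the augmented output, plus a regularizer that pushes
    # the auxiliary coefficient `a` toward zero at good solutions.
    margins = y * augmented_output(X, base_net, a, w, b)
    return np.mean(np.log1p(np.exp(-margins))) + lam * a ** 2

# Usage with a hypothetical base network (here: a fixed random linear model).
rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(20, d))
y = rng.choice([-1.0, 1.0], size=20)
w0 = rng.normal(size=d)
base_net = lambda x: x @ w0
print(augmented_loss(X, y, base_net, a=0.1, w=np.zeros(d), b=0.0))
```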


Max-Sliced Wasserstein Distance and its use for GANs

arXiv.org Machine Learning

Generative adversarial nets (GANs) and variational auto-encoders have significantly improved our distribution modeling capabilities, showing promise for dataset augmentation, image-to-image translation and feature learning. However, to model high-dimensional distributions, sequential training and stacked architectures are common, increasing the number of tunable hyper-parameters as well as the training time. Nonetheless, the sample complexity of the distance metrics remains one of the factors affecting GAN training. We first show that the recently proposed sliced Wasserstein distance has compelling sample complexity properties when compared to the Wasserstein distance. To further improve the sliced Wasserstein distance, we then analyze its "projection complexity" and develop the max-sliced Wasserstein distance, which enjoys compelling sample complexity while reducing projection complexity, albeit at the cost of a max estimation. We finally illustrate that the proposed distance readily trains GANs on high-dimensional images up to a resolution of 256x256.
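
To make the sliced vs. max-sliced distinction concrete, the sketch below projects two point clouds onto unit directions and compares the mean of the resulting 1-D Wasserstein distances (sliced) with their maximum. The true max-sliced distance maximizes over all directions, typically by gradient ascent; taking the max over a finite set of random directions, as done here, is only a cheap approximation, and the data are synthetic.

```python
# Sketch: empirical sliced vs. (approximate) max-sliced Wasserstein distance
# between two point clouds. Maximizing over a finite set of random directions
# only approximates the true max-sliced distance, which optimizes the
# projection direction itself.
import numpy as np

def wasserstein_1d(u, v):
    """p=1 Wasserstein distance between two equal-size 1-D samples."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

def sliced_distances(P, Q, n_dirs=100, seed=0):
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, P.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    w = np.array([wasserstein_1d(P @ d, Q @ d) for d in dirs])
    return w.mean(), w.max()   # sliced (mean) vs. approx. max-sliced

rng = np.random.default_rng(1)
P = rng.normal(size=(500, 10))                 # samples from distribution 1
Q = rng.normal(size=(500, 10)) + 0.5           # shifted distribution 2
sw, msw = sliced_distances(P, Q)
print(f"sliced W ~ {sw:.3f},  approx. max-sliced W ~ {msw:.3f}")
```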


Over-Parameterized Deep Neural Networks Have No Strict Local Minima For Any Continuous Activations

arXiv.org Machine Learning

Recently, the application of deep neural networks [1] has led to phenomenal success in various artificial intelligence areas, e.g., computer vision, natural language processing, and audio recognition. However, the theoretical understanding of neural networks is still limited. One of the main difficulties in analyzing neural networks is the non-convexity of the objective function, which may cause many local minima. In practice, it is observed that when the number of parameters is sufficiently large, common optimization algorithms such as stochastic gradient descent (SGD) can achieve small training error [2-6]. These observations are often explained by the intuition that more parameters can smooth the landscape [4,7]. Among various definitions of over-parameterization, a popular one is that the last hidden layer has more neurons than the number of training samples. Even under this assumption, it is unclear to what extent we can prove a rigorous result. For instance, can we prove that for any neuron activation function, every local minimum is a global minimum? If not, what exactly can we prove, and what can we not prove?
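
The over-parameterization condition mentioned above (last hidden layer wider than the number of training samples) can be illustrated with a small experiment: with random first-layer weights, the hidden feature matrix is generically full row rank, so fitting only the output layer already drives the training error to essentially zero. The toy data and architecture below are illustrative assumptions.

```python
# Sketch: the over-parameterization condition discussed above (last hidden
# layer wider than the number of training samples). With random first-layer
# weights, the n x m hidden feature matrix generically has rank n, so solving
# only for the output layer already gives (numerically) zero training error.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 60                     # m >= n: over-parameterized last layer
X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))

W1 = rng.normal(size=(d, m))
H = np.tanh(X @ W1)                      # hidden features, generically rank n
W2, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit the last layer only
train_mse = np.mean((H @ W2 - y) ** 2)
print(f"last-hidden width m={m} >= n={n}:  training MSE = {train_mse:.2e}")
```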


On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

arXiv.org Machine Learning

This paper studies a class of adaptive gradient-based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the "Adam-type", includes popular algorithms such as Adam, AMSGrad and AdaGrad. Despite their popularity in training deep neural networks, the convergence of these algorithms for solving non-convex problems remains an open question. This paper provides a set of mild sufficient conditions that guarantee the convergence of the Adam-type methods. We prove that under our derived conditions, these methods can achieve a convergence rate of order $O(\log{T}/\sqrt{T})$ for non-convex stochastic optimization. We show that the conditions are essential in the sense that violating them may make the algorithm diverge. Moreover, we propose and analyze a class of (deterministic) incremental adaptive gradient algorithms, which has the same $O(\log{T}/\sqrt{T})$ convergence rate. Our study could also be extended to a broader class of adaptive gradient methods in machine learning and optimization.
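
A generic "Adam-type" update, as described above, builds both the search direction and the per-coordinate learning rate from past gradients. The sketch below shows one such update with an optional AMSGrad-style maximum on the second-moment estimate; the hyper-parameters and the toy quadratic objective are illustrative assumptions.

```python
# Sketch: a generic "Adam-type" update, in which both the search direction
# and the per-coordinate learning rate are built from past gradients.
# beta1=0.9, beta2=0.999 gives Adam; amsgrad=True gives AMSGrad.
import numpy as np

def adam_type_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, amsgrad=False):
    m, v, v_hat_max, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad          # momentum (search direction)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    if amsgrad:
        v_hat_max = np.maximum(v_hat_max, v_hat)   # non-increasing step sizes
        v_hat = v_hat_max
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, v_hat_max, t)

# Usage on a simple quadratic, f(theta) = 0.5 * ||theta||^2 (gradient = theta).
theta = np.ones(4)
state = (np.zeros(4), np.zeros(4), np.zeros(4), 0)
for _ in range(2000):
    theta, state = adam_type_step(theta, theta, state, amsgrad=True)
print("theta after 2000 AMSGrad steps:", theta)
```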


Understanding the Loss Surface of Neural Networks for Binary Classification

arXiv.org Machine Learning

It is widely conjectured that training algorithms for neural networks succeed because all local minima lead to similar performance; see, for example, (LeCun et al., 2015; Choromanska et al., 2015; Dauphin et al., 2014). Performance is typically measured in terms of two metrics: training performance and generalization performance. Here we focus on the training performance of single-layered neural networks for binary classification, and provide conditions under which the training error is zero at all local minima of a smooth hinge loss function. Our conditions are roughly of the following form: the neuron activation functions have to be strictly convex, and the surrogate loss function should be a smooth version of the hinge loss. We also provide counterexamples to show that when the loss function is replaced with the quadratic loss or the logistic loss, the result may not hold.
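
The conditions above involve two concrete ingredients: a strictly convex neuron activation and a smooth surrogate of the hinge loss. The sketch below writes down one possible pair (softplus and a quadratically smoothed hinge) and the resulting training loss for a single-hidden-layer classifier; these are illustrative choices consistent with the abstract's wording, not necessarily the paper's exact assumptions.

```python
# Sketch of objects matching the stated conditions, as read from the abstract:
# a smooth surrogate of the hinge loss and a strictly convex neuron activation
# (softplus). Illustrative choices, not necessarily the paper's assumptions.
import numpy as np

def smooth_hinge(z):
    """A C^1 smooth version of the hinge loss max(0, 1 - z):
    zero for z >= 1, quadratic on (0, 1), linear for z <= 0."""
    return np.where(z >= 1, 0.0,
           np.where(z <= 0, 0.5 - z, 0.5 * (1 - z) ** 2))

def softplus(x):
    """Strictly convex activation: its second derivative is
    sigmoid(x) * (1 - sigmoid(x)) > 0 everywhere (numerically stable form)."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

def train_loss(W, v, X, y):
    # Single-hidden-layer binary classifier with strictly convex neurons.
    scores = softplus(X @ W) @ v
    return np.mean(smooth_hinge(y * scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.choice([-1.0, 1.0], size=30)
print(train_loss(rng.normal(size=(4, 8)), rng.normal(size=8), X, y))
```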


Improved Iteration Complexity Bounds of Cyclic Block Coordinate Descent for Convex Problems

Neural Information Processing Systems

The iteration complexity of block-coordinate descent (BCD) type algorithms has been under extensive investigation. It was recently shown that for convex problems the classical cyclic BCGD (block coordinate gradient descent) achieves an O(1/r) complexity, where r is the number of passes over all blocks. However, such bounds depend at least linearly on $K$ (the number of variable blocks), and are thus at least $K$ times worse than those of the gradient descent (GD) and proximal gradient (PG) methods. In this paper, we close this theoretical performance gap between cyclic BCD and GD/PG. First, we show that for a family of quadratic nonsmooth problems, the complexity bounds for cyclic Block Coordinate Proximal Gradient (BCPG), a popular variant of BCD, can match those of GD/PG in terms of the dependency on $K$ (up to a $\log^2(K)$ factor). Second, we establish an improved complexity bound for Coordinate Gradient Descent (CGD) for general convex problems, which can match that of GD in certain scenarios. Our bounds are sharper than the known bounds, which are always at least $K$ times worse than those of GD. Our analyses do not depend on the update order of block variables inside each cycle; thus our results also apply to BCD methods with random permutation (random sampling without replacement, another popular variant).
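
For reference, the cyclic BCGD scheme analyzed above cycles through blocks of variables and takes a gradient step on each block. The sketch below runs it on a convex quadratic with per-block Lipschitz step sizes; the problem instance and block split are illustrative assumptions.

```python
# Sketch: cyclic block coordinate gradient descent (BCGD) on a convex
# quadratic f(x) = 0.5 * x^T A x - b^T x, split into K equal blocks. Each
# block step uses the per-block Lipschitz constant (largest eigenvalue of the
# corresponding diagonal block of A). Problem sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, K = 20, 4                                  # dimension and number of blocks
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)                       # symmetric positive definite
b = rng.normal(size=n)
blocks = np.array_split(np.arange(n), K)
L = [np.linalg.eigvalsh(A[np.ix_(blk, blk)]).max() for blk in blocks]

x = np.zeros(n)
for it in range(200):                         # passes over all blocks (r)
    for blk, L_blk in zip(blocks, L):         # one cycle through the blocks
        g_blk = A[blk] @ x - b[blk]           # block of the full gradient
        x[blk] -= g_blk / L_blk               # block gradient step
x_star = np.linalg.solve(A, b)
print("distance to optimum after 200 cycles:", np.linalg.norm(x - x_star))
```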