AITopics

2007.13985

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Castiglia, Timothy, Das, Anirban, Patterson, Stacy

Multi-Level Local SGD for Heterogeneous Hierarchical Networks

We propose Multi-Level Local SGD, a distributed gradient method for learning a smooth, non-convex objective in a heterogeneous multi-level network. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple worker nodes; further, worker nodes may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We back up our theoretical results via simulation-based experiments using both convex and non-convex objectives.

artificial intelligence, iteration, machine learning, (16 more...)

2007.13819

Country:

North America > United States > New York > Rensselaer County > Troy (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Universality of Gradient Descent Neural Network Training

Welper, G.

It has been observed that design choices of neural networks are often crucial for their successful optimization. In this article, we therefore discuss the question if it is always possible to redesign a neural network so that it trains well with gradient descent. This yields the following universality result: If, for a given network, there is any algorithm that can find good network weights for a classification task, then there exists an extension of this network that reproduces these weights and the corresponding forward output by mere gradient descent training. The construction is not intended for practical computations, but it provides some orientation on the possibilities of meta-learning and related approaches.

artificial intelligence, machine learning, turing machine, (16 more...)

2007.13664

Country:

North America > United States > Florida > Orange County > Orlando (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(5 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)

Binary Search and First Order Gradient Based Method for Stochastic Optimization

Pandey, Vijay

In this paper, we present a novel stochastic optimization method, which uses the binary search technique with first order gradient based optimization method, called Binary Search Gradient Optimization (BSG) or BiGrad. In this optimization setup, a non-convex surface is treated as a set of convex surfaces. In BSG, at first, a region is defined, assuming region is convex. If region is not convex, then the algorithm leaves the region very fast and defines a new one, otherwise, it tries to converge at the optimal point of the region. In BSG, core purpose of binary search is to decide, whether region is convex or not in logarithmic time, whereas, first order gradient based method is primarily applied, to define a new region. In this paper, Adam is used as a first order gradient based method, nevertheless, other methods of this class may also be considered. In deep neural network setup, it handles the problem of vanishing and exploding gradient efficiently. We evaluate BSG on the MNIST handwritten digit, IMDB, and CIFAR10 data set, using logistic regression and deep neural networks. We produce more promising results as compared to other first order gradient based optimization methods. Furthermore, proposed algorithm generalizes significantly better on unseen data as compared to other methods.

algorithm, artificial intelligence, machine learning, (17 more...)

2007.13413

Country:

Asia > Middle East > Jordan (0.04)
Asia > India > West Bengal > Kharagpur (0.04)

Genre: Research Report > New Finding (0.36)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Woodworth, Blake, Patel, Kumar Kshitij, Srebro, Nathan

Minibatch vs Local SGD for Heterogeneous Distributed Learning

We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that, (i) Minibatch SGD (even without acceleration) dominates all existing analysis of Local SGD in this setting, (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and (iii) present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.

artificial intelligence, machine learning, minibatch sgd, (17 more...)

2006.04735

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

arXiv.org Machine LearningJul-24-2020

Variance Reduction for Deep Q-Learning using Stochastic Recursive Gradient

Jia, Haonan, Zhang, Xiao, Xu, Jun, Zeng, Wei, Jiang, Hao, Yan, Xiaohui, Wen, Ji-Rong

Deep Q-learning algorithms often suffer from poor gradient estimations with an excessive variance, resulting in unstable training and poor sampling efficiency. Stochastic variance-reduced gradient methods such as SVRG have been applied to reduce the estimation variance (Zhao et al. 2019). However, due to the online instance generation nature of reinforcement learning, directly applying SVRG to deep Q-learning is facing the problem of the inaccurate estimation of the anchor points, which dramatically limits the potentials of SVRG. To address this issue and inspired by the recursive gradient variance reduction algorithm SARAH (Nguyen et al. 2017), this paper proposes to introduce the recursive framework for updating the stochastic gradient estimates in deep Q-learning, achieving a novel algorithm called SRG-DQN. Unlike the SVRG-based algorithms, SRG-DQN designs a recursive update of the stochastic gradient estimate. The parameter update is along an accumulated direction using the past stochastic gradient information, and therefore can get rid of the estimation of the full gradients as the anchors. Additionally, SRG-DQN involves the Adam process for further accelerating the training process. Theoretical analysis and the experimental results on well-known reinforcement learning tasks demonstrate the efficiency and effectiveness of the proposed SRG-DQN algorithm.

algorithm, gradient, srg-dqn, (13 more...)

2007.12817

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)

Fukushima, Shintaro, Nitanda, Atsushi, Yamanishi, Kenji

Online Robust and Adaptive Learning from Data Streams

arXiv.org Machine LearningJul-23-2020

In online learning from non-stationary data streams, it is both necessary to learn robustly to outliers and to adapt to changes of underlying data generating mechanism quickly. In this paper, we refer to the former nature of online learning algorithms as robustness and the latter as adaptivity. There is an obvious tradeoff between them. It is a fundamental issue to quantify and evaluate the tradeoff because it provides important information on the data generating mechanism. However, no previous work has considered the tradeoff quantitatively. We propose a novel algorithm called the Stochastic approximation-based Robustness-Adaptivity algorithm (SRA) to evaluate the tradeoff. The key idea of SRA is to update parameters of distribution or sufficient statistics with the biased stochastic approximation scheme, while dropping data points with large values of the stochastic update. We address the relation between two parameters, one of which is the step size of the stochastic approximation, and the other is the threshold parameter of the norm of the stochastic update. The former controls the adaptivity and the latter does the robustness. We give a theoretical analysis for the non-asymptotic convergence of SRA in the presence of outliers, which depends on both the step size and the threshold parameter. Since SRA is formulated on the majorization-minimization principle, it is a general algorithm including many algorithms, such as the online EM algorithm and stochastic gradient descent. Empirical experiments for both synthetic and real datasets demonstrated that SRA was superior to previous methods.

algorithm, artificial intelligence, machine learning, (12 more...)

2007.1216

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.15)
Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.05)
North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Industry: Education > Educational Setting (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Brukhim, Nataly, Hazan, Elad

Online Boosting with Bandit Feedback

arXiv.org Machine LearningJul-23-2020

We consider the problem of online boosting for regression tasks, when only limited information is available to the learner. We give an efficient regret minimization method that has two implications: an online boosting algorithm with noisy multi-point bandit feedback, and a new projection-free online convex optimization algorithm with stochastic gradient, that improves state-of-the-art guarantees in terms of efficiency.

algorithm, artificial intelligence, machine learning, (18 more...)

2007.11975

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

Yan, Xiaohan, Bijral, Avleen S.

On a Bernoulli Autoregression Framework for Link Discovery and Prediction

arXiv.org Machine LearningJul-23-2020

We present a dynamic prediction framework for binary sequences that is based on a Bernoulli generalization of the auto-regressive process. Our approach lends itself easily to variants of the standard link prediction problem for a sequence of time dependent networks. Focusing on this dynamic network link prediction/recommendation task, we propose a novel problem that exploits additional information via a much larger sequence of auxiliary networks and has important real-world relevance. To allow discovery of links that do not exist in the available data, our model estimation framework introduces a regularization term that presents a trade-off between the conventional link prediction and this discovery task. In contrast to existing work our stochastic gradient based estimation approach is highly efficient and can scale to networks with millions of nodes. We show extensive empirical results on both actual product-usage based time dependent networks and also present results on a Reddit based data set of time dependent sentiment sequences.

artificial intelligence, data mining, machine learning, (20 more...)

2007.11811

Country:

South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > Washington > King County > Redmond (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Information Technology (0.68)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
(2 more...)

Qureshi, Muhammad I., Xin, Ran, Kar, Soummya, Khan, Usman A.

S-ADDOPT: Decentralized stochastic first-order optimization over directed graphs

arXiv.org Machine LearningJul-22-2020

In this report, we study decentralized stochastic optimization to minimize a sum of smooth and strongly convex cost functions when the functions are distributed over a directed network of nodes. In contrast to the existing work, we use gradient tracking to improve certain aspects of the resulting algorithm. In particular, we propose the~\textbf{\texttt{S-ADDOPT}} algorithm that assumes a stochastic first-order oracle at each node and show that for a constant step-size~$\alpha$, each node converges linearly inside an error ball around the optimal solution, the size of which is controlled by~$\alpha$. For decaying step-sizes~$\mathcal{O}(1/k)$, we show that~\textbf{\texttt{S-ADDOPT}} reaches the exact solution sublinearly at~$\mathcal{O}(1/k)$ and its convergence is asymptotically network-independent. Thus the asymptotic behavior of~\textbf{\texttt{S-ADDOPT}} is comparable to the centralized stochastic gradient descent. Numerical experiments over both strongly convex and non-convex problems illustrate the convergence behavior and the performance comparison of the proposed algorithm.

graph, optimization, s-addopt, (15 more...)

2005.07785

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Massachusetts > Middlesex County > Medford (0.04)
North America > United States > Hawaii (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)