AITopics

1805.08728

Country: North America > United States > Colorado > Pueblo County > Pueblo (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

#artificialintelligenceNov-1-2020, 22:30:09 GMT

ML Optimization pt.1 - Gradient Descent with Python

So far in our journey through the Machine Learning universe, we covered several big topics. We investigated some regression algorithms, classification algorithms and algorithms that can be used for both types of problems (SVM, Decision Trees and Random Forest). Apart from that, we dipped our toes in unsupervised learning, saw how we can use this type of learning for clustering and learned about several clustering techniques. We also talked about how to quantify machine learning model performance and how to improve it with regularization. In all these articles, we used Python for "from the scratch" implementations and libraries like TensorFlow, Pytorch and SciKit Learn.

artificial intelligence, machine learning, ml optimization pt, (3 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

arXiv.org Machine LearningNov-1-2020

Drinking from a Firehose: Continual Learning with Web-scale Natural Language

Hu, Hexiang, Sener, Ozan, Sha, Fei, Koltun, Vladlen

Continual learning systems will interact with humans, with each other, and with the physical world through time -- and continue to learn and adapt as they do. An important open problem for continual learning is a large-scale benchmark that enables realistic evaluation of algorithms. In this paper, we study a natural setting for continual learning on a massive scale. We introduce the problem of personalized online language learning (POLL), which involves fitting personalized language models to a population of users that evolves over time. To facilitate research on POLL, we collect massive datasets of Twitter posts. These datasets, Firehose10M and Firehose100M, comprise 100 million tweets, posted by one million users over six years. Enabled by the Firehose datasets, we present a rigorous evaluation of continual learning algorithms on an unprecedented scale. Based on this analysis, we develop a simple algorithm for continual gradient descent (ConGraD) that outperforms prior continual learning methods on the Firehose datasets as well as earlier benchmarks. Collectively, the POLL problem setting, the Firehose datasets, and the ConGraD algorithm enable a complete benchmark for reproducible research on web-scale continual learning.

artificial intelligence, machine learning, natural language, (21 more...)

2007.09335

Country:

North America > United States > California (0.14)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (0.64)

Industry:

Education > Curriculum > Subject-Specific Education (0.35)
Information Technology > Services (0.34)
Education > Educational Setting (0.31)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Benson, Austin R., Damle, Anil, Townsend, Alex

Over-parametrized neural networks as under-determined linear systems

arXiv.org Machine LearningOct-29-2020

We draw connections between simple neural networks and under-determined linear systems to comprehensively explore several interesting theoretical questions in the study of neural networks. First, we emphatically show that it is unsurprising such networks can achieve zero training loss. More specifically, we provide lower bounds on the width of a single hidden layer neural network such that only training the last linear layer suffices to reach zero training loss. Our lower bounds grow more slowly with data set size than existing work that trains the hidden layer weights. Second, we show that kernels typically associated with the ReLU activation function have fundamental flaws -- there are simple data sets where it is impossible for widely studied bias-free models to achieve zero training loss irrespective of how the parameters are chosen or trained. Lastly, our analysis of gradient descent clearly illustrates how spectral properties of certain matrices impact both the early iteration and long-term training behavior. We propose new activation functions that avoid the pitfalls of ReLU in that they admit zero training loss solutions for any set of distinct data points and experimentally exhibit favorable spectral properties.

activation function, artificial intelligence, machine learning, (18 more...)

2010.15959

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > New York > Tompkins County > Ithaca (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Kelterborn, Chad, Mazur, Marcin, Petrenko, Bogdan V.

Gravilon: Applications of a New Gradient Descent Method to Machine Learning

arXiv.org Machine LearningOct-28-2020

Gradient descent algorithms have been used in countless applications since the inception of Newton's method. The explosion in the number of applications of neural networks has re-energized efforts in recent years to improve the standard gradient descent method in both efficiency and accuracy. These methods modify the effect of the gradient in updating the values of the parameters. These modifications often incorporate hyperparameters: additional variables whose values must be specified at the outset of the program. We provide, below, a novel gradient descent algorithm, called Gravilon, that uses the geometry of the hypersurface to modify the length of the step in the direction of the gradient. Using neural networks, we provide promising experimental results comparing the accuracy and efficiency of the Gravilon method against commonly used gradient descent algorithms on MNIST digit classification.

artificial intelligence, machine learning, neural network, (14 more...)

2008.1137

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > North Carolina > Orange County > Chapel Hill (0.14)
North America > United States > New York > Broome County > Binghamton (0.04)
North America > United States > Illinois > Coles County > Charleston (0.04)

Genre: Workflow (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Chen, Keyi, Langford, John, Orabona, Francesco

Better Parameter-free Stochastic Optimization with ODE Updates for Coin-Betting

arXiv.org Machine LearningOct-28-2020

Parameter-free stochastic gradient descent (PFSGD) algorithms do not require setting learning rates while achieving optimal theoretical performance. In practical applications, however, there remains an empirical gap between tuned stochastic gradient descent (SGD) and PFSGD. In this paper, we close the empirical gap with a new parameter-free algorithm based on continuous-time Coin-Betting on truncated models. The new update is derived through the solution of an Ordinary Differential Equation (ODE) and solved in a closed form. We show empirically that this new parameter-free algorithm outperforms algorithms with the "best default" learning rates and almost matches the performance of finely tuned baselines without anything to tune.

algorithm, artificial intelligence, machine learning, (16 more...)

2006.07507

Country:

North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)

Brochard, Antoine, Błaszczyszyn, Bartłomiej, Mallat, Stéphane, Zhang, Sixin

Particle gradient descent model for point process generation

This paper introduces a generative model for planar point processes in a square window, built upon a single realization of a stationary, ergodic point process observed in this window. Inspired by recent advances in gradient descent methods for maximum entropy models, we propose a method to generate similar point patterns by jointly moving particles of an initial Poisson configuration towards a target counting measure. The target measure is generated via a deterministic gradient descent algorithm, so as to match a set of statistics of the given, observed realization. Our statistics are estimators of the multi-scale wavelet phase harmonic covariance, recently proposed in image modeling. They allow one to capture geometric structures through multi-scale interactions between wavelet coefficients. Both our statistics and the gradient descent algorithm scale better with the number of observed points than the classical k-nearest neighbour distances previously used in generative models for point processes, based on the rejection sampling or simulated-annealing. The overall quality of our model is evaluated on point processes with various geometric structures through spectral and topological data analysis.

artificial intelligence, descriptor, machine learning, (17 more...)

2010.14928

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)

Genre: Research Report (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Collins, Liam, Mokhtari, Aryan, Shakkottai, Sanjay

Why Does MAML Outperform ERM? An Optimization Perspective

Model-Agnostic Meta-Learning (MAML) has demonstrated widespread success in training models that can quickly adapt to new tasks via one or few stochastic gradient descent steps. However, the MAML objective is significantly more difficult to optimize compared to standard Empirical Risk Minimization (ERM), and little is understood about how much MAML improves over ERM in terms of the fast adaptability of their solutions in various scenarios. We analytically address this issue in a linear regression setting consisting of a mixture of easy and hard tasks, where hardness is determined by the number of gradient steps required to solve the task. Specifically, we prove that for $\Omega(d_{\text{eff}})$ labelled test samples (for gradient-based fine-tuning) where $d_{\text{eff}}$ is the effective dimension of the problem, in order for MAML to achieve substantial gain over ERM, the optimal solutions of the hard tasks must be closely packed together with the center far from the center of the easy task optimal solutions. We show that these insights also apply in a low-dimensional feature space when both MAML and ERM learn a representation of the tasks, which reduces the effective problem dimension. Further, our few-shot image classification experiments suggest that our results generalize beyond linear regression.

artificial intelligence, machine learning, maml, (19 more...)

2010.14672

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Kumar, Aviral, Agarwal, Rishabh, Ghosh, Dibya, Levine, Sergey

Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity in terms of a drop in the rank of the learned value network features, and show that this corresponds to a drop in performance. We demonstrate this phenomenon on widely studies domains, including Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse improves performance.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2010.14498

Country:

North America > United States (0.14)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > New Finding (0.92)

Industry:

Leisure & Entertainment > Games (0.47)
Telecommunications > Networks (0.34)
Information Technology > Networks (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.58)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

A Non-Asymptotic Analysis for Stein Variational Gradient Descent

Korba, Anna, Salim, Adil, Arbel, Michael, Luise, Giulia, Gretton, Arthur

We study the Stein Variational Gradient Descent (SVGD) algorithm, which optimises a set of particles to approximate a target probability distribution $\pi\propto e^{-V}$ on $\mathbb{R}^d$. In the population limit, SVGD performs gradient descent in the space of probability distributions on the KL divergence with respect to $\pi$, where the gradient is smoothed through a kernel integral operator. In this paper, we provide a novel finite time analysis for the SVGD algorithm. We provide a descent lemma establishing that the algorithm decreases the objective at each iteration, and rates of convergence. We also provide a convergence result of the finite particle system corresponding to the practical implementation of SVGD to its population version.

artificial intelligence, machine learning, stein, (13 more...)

2006.09797

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)