AITopics

1911.0535

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

arXiv.org Machine LearningNov-13-2019

Convergence to minima for the continuous version of Backtracking Gradient Descent

Truong, Tuyen Trung

The main result of this paper is: {\bf Theorem.} Let $f:\mathbb{R}^k\rightarrow \mathbb{R}$ be a $C^{1}$ function, so that $\nabla f$ is locally Lipschitz continuous. Assume moreover that $f$ is $C^2$ near its generalised saddle points. Fix real numbers $\delta_0>0$ and $0<\alpha <1$. Then there is a smooth function $h:\mathbb{R}^k\rightarrow (0,\delta_0]$ so that the map $H:\mathbb{R}^k\rightarrow \mathbb{R}^k$ defined by $H(x)=x-h(x)\nabla f(x)$ has the following property: (i) For all $x\in \mathbb{R}^k$, we have $f(H(x)))-f(x)\leq -\alpha h(x)||\nabla f(x)||^2$. (ii) For every $x_0\in \mathbb{R}^k$, the sequence $x_{n+1}=H(x_n)$ either satisfies $\lim_{n\rightarrow\infty}||x_{n+1}-x_n||=0$ or $ \lim_{n\rightarrow\infty}||x_n||=\infty$. Each cluster point of $\{x_n\}$ is a critical point of $f$. If moreover $f$ has at most countably many critical points, then $\{x_n\}$ either converges to a critical point of $f$ or $\lim_{n\rightarrow\infty}||x_n||=\infty$. (iii) There is a set $\mathcal{E}_1\subset \mathbb{R}^k$ of Lebesgue measure $0$ so that for all $x_0\in \mathbb{R}^k\backslash \mathcal{E}_1$, the sequence $x_{n+1}=H(x_n)$, {\bf if converges}, cannot converge to a {\bf generalised} saddle point. (iv) There is a set $\mathcal{E}_2\subset \mathbb{R}^k$ of Lebesgue measure $0$ so that for all $x_0\in \mathbb{R}^k\backslash \mathcal{E}_2$, any cluster point of the sequence $x_{n+1}=H(x_n)$ is not a saddle point, and more generally cannot be an isolated generalised saddle point. Some other results are proven.

critical point, generalised saddle point, saddle point, (15 more...)

1911.04221

Country:

Europe > Norway > Eastern Norway > Oslo (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
(4 more...)

Genre:

Research Report (0.50)
Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.41)

#artificialintelligenceNov-12-2019, 11:02:30 GMT

Intro to optimization in deep learning: Momentum, RMSProp and Adam

In another post, we covered the nuts and bolts of Stochastic Gradient Descent and how to address problems like getting stuck in a local minima or a saddle point. In this post, we take a look at another problem that plagues training of neural networks, pathological curvature. While local minima and saddle points can stall our training, pathological curvature can slow down training to an extent that the machine learning practitioner might think that search has converged to a sub-optimal minma. Let us understand in depth what pathological curvature is. Consider the following loss contour.

curvature, gradient, pathological curvature, (15 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

#artificialintelligenceNov-11-2019, 20:46:34 GMT

bims-arihec 2019-11-10 papers

OBJECTIVE: To evaluate the potential value of the machine learning (ML)-based MRI texture analysis for predicting 1p/19q codeletion status of lower-grade gliomas (LGG), using various state-of-the-art ML algorithms.MATERIALS AND METHODS: For this retrospective study, 107 patients with LGG were included from a public database. Texture features were extracted from conventional T2-weighted and contrast-enhanced T1-weighted MRI images, using LIFEx software. Training and unseen validation splits were created using stratified 10-fold cross-validation technique along with minority over-sampling. Dimension reduction was done using collinearity analysis and feature selection (ReliefF). Classifications were done using adaptive boosting, k-nearest neighbours, naive Bayes, neural network, random forest, stochastic gradient descent, and support vector machine.

algorithm, mri texture analysis, neural network, (10 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.64)

#artificialintelligenceNov-11-2019, 05:03:18 GMT

Neural Architecture Search Could Tune AI's Algorithmic Heart - InformationWeek

Data science has evolved far beyond science. It now represents the heart and soul of many disruptive business applications. Everywhere you look, enterprise data science practices have become industrialized within 24x7 DevOps workflows. Under that trend, automation has come to practically every process in the machine-learning DevOps pipeline that surrounds AI. Modeling is the next and perhaps ultimate milestone in the move toward end-to-end, data-science pipeline automation.

architecture, architecture search, neural architecture search, (11 more...)

Country: North America > United States (0.30)

Industry:

Information Technology (0.49)
Government > Regional Government > North America Government > United States Government (0.30)
Energy (0.30)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)

arXiv.org Machine LearningNov-11-2019

Nonconvex Low-Rank Symmetric Tensor Completion from Noisy Data

Cai, Changxiao, Li, Gen, Poor, H. Vincent, Chen, Yuxin

We study a noisy symmetric tensor completion problem of broad practical interest, namely, the reconstruction of a low-rank symmetric tensor from highly incomplete and randomly corrupted observations of its entries. While a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for large-scale applications, or come with sub-optimal statistical guarantees. Focusing on "incoherent" and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm --- (vanilla) gradient descent following a rough initialization --- that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves all individual tensor factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees (i.e. minimal sample complexity and optimal estimation accuracy). The estimation errors are evenly spread out across all entries, thus achieving optimal $\ell_{\infty}$ statistical accuracy. The insight conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.

lemma 4, nullnull null, probability, (14 more...)

1911.04436

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Dan, Soham, Sahoo, Dushyant

Variance Reduced Stochastic Proximal Algorithm for AUC Maximization

arXiv.org Machine LearningNov-8-2019

Stochastic Gradient Descent has been widely studied with classification accuracy as a performance measure. However, these stochastic algorithms cannot be directly used when non-decomposable pairwise performance measures are used such as Area under the ROC curve (AUC) which is a common performance metric when the classes are imbalanced. There have been several algorithms proposed for optimizing AUC as a performance metric, and one of the recent being a stochastic proximal gradient algorithm (SPAM). But the downside of the stochastic methods is that they suffer from high variance leading to slower convergence. To combat this issue, several variance reduced methods have been proposed with faster convergence guarantees than vanilla stochastic gradient descent. Again, these variance reduced methods are not directly applicable when non-decomposable performance measures are used. In this paper, we develop a Variance Reduced Stochastic Proximal algorithm for AUC Maximization (\textsc{VRSPAM}) and perform a theoretical analysis as well as empirical analysis to show that our algorithm converges faster than SPAM which is the previous state-of-the-art for the AUC maximization problem.

algorithm, convergence rate, variance, (13 more...)

1911.03548

Country:

South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > Pennsylvania (0.04)
North America > United States > California > Orange County > Irvine (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.88)

#artificialintelligenceNov-7-2019, 08:17:53 GMT

r/MachineLearning - [P] Predict figure skating world championship ranking from season performances (part 2: hybrid models learned by gradient descent)

I previous posted the write-up on the first part of my project (Github repo) to predict how skaters would rank in the figure skating world championship from earlier scores that they earned in the season. The main idea is to separate the skater effect, the intrinsic ability of each skater, from the event effect, the influence of an event on a skater's performance, so that a more accurate ranking could be built. Unfortunately, this model does not have a closed-form solution to learn the parameters as opposed to the earlier models. Therefore, gradient descent was used to learn them, which resulted in this neat little animation that tracks how the model residuals, RMSE, as well as predicted ranking gets better and better as gradient descent runs. I also explore different strategies to reduce model overfit (so that it can predict skater ranking more accurately), using familiar methods such as model penalization and early stopping. Lastly, note that this hybrid model is nothing but factorizing the event-skater score matrix into an event-specific vector and skater-specific vector, which can multiply together to approximate the score matrix.

gradient descent, hybrid model, skater, (8 more...)

Industry:

Leisure & Entertainment > Sports > Figure Skating (0.63)
Media > News (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)
Information Technology > Communications > Social Media (0.76)

Heiss, Jakob, Teichmann, Josef, Wutte, Hanna

How implicit regularization of Neural Networks affects the learned function -- Part I

arXiv.org Machine LearningNov-7-2019

Today, various forms of neural networks are trained to perform approximation tasks in many fields. However, the solutions obtained are not wholly understood. Empirical results suggest that the training favors regularized solutions. These observations motivate us to analyze properties of the solutions found by the gradient descent algorithm frequently employed to perform the training task. As a starting point, we consider one dimensional (shallow) neural networks in which weights are chosen randomly and only the terminal layer is trained. We show, that the resulting solution converges to the smooth spline interpolation of the training data as the number of hidden nodes tends to infinity. This might give valuable insight on the properties of the solutions obtained using gradient descent methods in general settings.

neural network, supp, theorem 3, (14 more...)

1911.02903

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Switzerland (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Negrea, Jeffrey, Haghifam, Mahdi, Dziugaite, Gintare Karolina, Khisti, Ashish, Roy, Daniel M.

Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates

arXiv.org Machine LearningNov-7-2019

Information-Theoretic Generalization Bounds for SGLD via Data-Dependent EstimatesJeffrey Negrea University of Toronto, V ector Institute Mahdi Haghifam University of Toronto, Element AI Gintare Karolina Dziugaite Element AI Ashish Khisti University of Toronto Daniel M. Roy University of Toronto, V ector Institute Abstract In this work, we improve upon the stepwise analysis of noisy iterative learning algorithms initiated by Pensia, Jog, and Loh (2018) and recently extended by Bu, Zou, and V eeravalli (2019). Our main contributions are significantly improved mutual information bounds for Stochastic Gradient Langevin Dynamics via data-dependent estimates. Our approach is based on the variational characterization of mutual information and the use of data-dependent priors that forecast the mini-batch gradient based on a subset of the training samples. Our approach is broadly applicable within the information-theoretic framework of Russo and Zou (2015) and Xu and Raginsky (2017). Our bound can be tied to a measure of flatness of the empirical risk surface. As compared with other bounds that depend on the squared norms of gradients, empirical investigations show that the terms in our bounds are orders of magnitude smaller. 1 Introduction Stochastic subgradient methods, especially stochastic gradient descent (SGD), are at the core of recent advances in deep-learning practice. Despite some progress, developing a precise understanding of generalization error for that class of algorithms remains wide open. Concurrently, there has been steady progress for noisy variants of SGD, such as stochastic gradient Langevin dynamics (SGLD) [13, 25, 33] and its full-batch counterpart, the Langevin algorithm [13]. The introduction of Gaussian noise to the iterates of SGD expands the set of theoretical frameworks that can be brought to bear on the study of generalization. In pioneering work, Raginsky, Rakhlin, and Telgarsky [25] exploit the fact that SGLD approximates Langevin diffusion, a continuous time Markov process, in the small step size limit. One drawback of this and related analyses involving Markov processes is the reliance on mixing. We hypothesize that SGLD is not mixing in practice, so results based upon mixing may not be representative of empirical performance. In recent work, Pensia, Jog, and Loh [23] perform a stepwise analysis of a family of noisy iterative algorithms that includes SGLD and the Langevin algorithm. At the foundation of this work is the framework of Russo and Zou [28] and Xu and Raginsky [34], where mean generalization error is controlled in terms of the mutual information between the dataset and the learned parameters.

generalization error, information, mutual information, (15 more...)

1911.02151

Country:

North America > Canada > Ontario > Toronto (1.00)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)