AITopics

1710.06952

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Training and Evaluating Improved Dependency-Based Word Embeddings

Li, Chen (BeiHang University) | Li, Jianxin (BeiHang University) | Song, Yangqiu (Hong Kong University of Science and Technology) | Lin, Ziwei (BeiHang University)

Word embedding has been widely used in many natural language processing tasks. In this paper, we focus on learning word embeddings through selective higher-order relationships in sentences to improve the embeddings to be less sensitive to local context and more accurate in capturing semantic compositionality. We present a novel multi-order dependency-based strategy to composite and represent the context under several essential constraints. In order to realize selective learning from the word contexts, we automatically assign the strengths of different dependencies between co-occurred words in the stochastic gradient descent process. We evaluate and analyze our proposed approach using several direct and indirect tasks for word embeddings. Experimental results demonstrate that our embeddings are competitive to or better than state-of-the-art methods and significantly outperform other methods in terms of context stability. The output weights and representations of dependencies obtained in our embedding model conform to most of the linguistic characteristics and are valuable for many downstream tasks.

artificial intelligence, machine learning, natural language, (20 more...)

Thirty-Second AAAI Conference on Artificial Intelligence

Country: Asia > China (0.47)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Ma, Ke (Institute of Information Engineering, Chinese Academy of Sciences) | Zeng, Jinshan (School of Cyber Security, University of Chinese Academy of Sciences) | Xiong, Jiechao (School of Computer Information Engineering, Jiangxi Normal University) | Xu, Qianqian (Hong Kong University of Science and Technology) | Cao, Xiaochun (Tencent AI Lab) | Liu, Wei (Institute of Information Engineering, Chinese Academy of Sciences) | Yao, Yuan (Institute of Information Engineering, Chinese Academy of Sciences)

Stochastic Non-Convex Ordinal Embedding With Stabilized Barzilai-Borwein Step Size

Learning representation from relative similarity comparisons, often called ordinal embedding, gains rising attention in recent years. Most of the existing methods are batch methods designed mainly based on the convex optimization, say, the projected gradient descent method. However, they are generally time-consuming due to that the singular value decomposition (SVD) is commonly adopted during the update, especially when the data size is very large. To overcome this challenge, we propose a stochastic algorithm called SVRG-SBB, which has the following features: (a) SVD-free via dropping convexity, with good scalability by the use of stochastic algorithm, i.e., stochastic variance reduced gradient (SVRG), and (b) adaptive step size choice via introducing a new stabilized Barzilai-Borwein (SBB) method as the original version for convex problems might fail for the considered stochastic non-convex optimization problem. Moreover, we show that the proposed algorithm converges to a stationary point at a rate O (1/ T ) in our setting, where T is the number of total iterations. Numerous simulations and real-world data experiments are conducted to show the effectiveness of the proposed algorithm via comparing with the state-of-the-art methods, particularly, much lower computational cost with good prediction performance.

artificial intelligence, machine learning, step size, (17 more...)

Thirty-Second AAAI Conference on Artificial Intelligence

Country: Asia > China (0.14)

Genre: Research Report (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.37)

Riemannian Stein Variational Gradient Descent for Bayesian Inference

Liu, Chang (Tsinghua University) | Zhu, Jun (Tsinghua University)

We develop Riemannian Stein Variational Gradient Descent (RSVGD), a Bayesian inference method that generalizes Stein Variational Gradient Descent (SVGD) to Riemann manifold. The benefits are two-folds: (i) for inference tasks in Euclidean spaces, RSVGD has the advantage over SVGD of utilizing information geometry, and (ii) for inference tasks on Riemann manifolds, RSVGD brings the unique advantages of SVGD to the Riemannian world. To appropriately transfer to Riemann manifolds, we conceive novel and non-trivial techniques for RSVGD, which are required by the intrinsically different characteristics of general Riemann manifolds from Euclidean spaces. We also discover Riemannian Stein's Identity and Riemannian Kernelized Stein Discrepancy. Experimental results show the advantages over SVGD of exploring distribution geometry and the advantages of particle-efficiency, iteration-effectiveness and approximation flexibility over other inference methods on Riemann manifolds.

artificial intelligence, machine learning, manifold, (16 more...)

Thirty-Second AAAI Conference on Artificial Intelligence

Country:

Asia (0.28)
Europe (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.71)

Statistical Inference Using SGD

Li, Tianyang (University of Texas at Austin) | Liu, Liu (University of Texas at Austin) | Kyrillidis, Anastasios ( IBM T.J. Watson Research Center, Yorktown Heights ) | Caramanis, Constantine (University of Texas at Austin)

We present a novel method for frequentist statistical inference in M-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. To show the merits of our scheme, we apply it to both synthetic and real data sets, and demonstrate that its accuracy is comparable to classical statistical methods, while requiring potentially far less computation.

artificial intelligence, bayesian inference, machine learning, (19 more...)

Thirty-Second AAAI Conference on Artificial Intelligence

Country: North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.96)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)

Deep Reinforcement Learning That Matters

In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Thirty-Second AAAI Conference on Artificial Intelligence

Country:

North America > Canada (0.28)
Europe (0.28)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Huang, Feihu, Chen, Songcan

Mini-Batch Stochastic ADMMs for Nonconvex Nonsmooth Optimization

arXiv.org Machine LearningFeb-7-2018

In the paper, we study the mini-batch stochastic ADMMs (alternating direction method of multipliers) for the nonconvex nonsmooth optimization. We prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM without variance reduction (VR) technique is convergent and reaches the convergence rate of $O(1/T)$ to obtain a stationary point of the nonconvex optimization, where $T$ denotes the number of iterations. Moreover, we extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and SAGA-ADMM in our initial paper \citep{huang2016stochastic}, and also prove that these mini-batch stochastic ADMMs reach the convergence rate of $O(1/T)$ without the condition on the mini-batch size. In particular, we provide a specific parameter selection for step size $\eta$ of stochastic gradients and penalization parameter $\rho$ of the augmented Lagrangian function. Finally, some experimental results demonstrate the effectiveness of our algorithms.

admm, artificial intelligence, machine learning, (17 more...)

1802.03284

Country: Asia (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

Goel, Surbhi, Klivans, Adam, Meka, Raghu

Learning One Convolutional Layer with Overlapping Patches

arXiv.org Machine LearningFeb-7-2018

We give the first provably efficient algorithm for learning a one hidden layer convolutional network with respect to a general class of (potentially overlapping) patches. Additionally, our algorithm requires only mild conditions on the underlying distribution. We prove that our framework captures commonly used schemes from computer vision, including one-dimensional and two-dimensional "patch and stride" convolutions. Our algorithm-- $Convotron$ -- is inspired by recent work applying isotonic regression to learning neural networks. Convotron uses a simple, iterative update rule that is stochastic in nature and tolerant to noise (requires only that the conditional mean function is a one layer convolutional network, as opposed to the realizable setting). In contrast to gradient descent, Convotron requires no special initialization or learning-rate tuning to converge to the global optimum. We also point out that learning one hidden convolutional layer with respect to a Gaussian distribution and just $one$ disjoint patch $P$ (the other patches may be arbitrary) is $easy$ in the following sense: Convotron can efficiently recover the hidden weight vector by updating $only$ in the direction of $P$.

artificial intelligence, machine learning, neural network, (15 more...)

1802.02547

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Ma, Siyuan, Bassily, Raef, Belkin, Mikhail

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

arXiv.org Machine LearningFeb-5-2018

Stochastic Gradient Descent (SGD) with small mini-batch is a key component in modern large-scale learning. However, its efficiency has not been easy to analyze as most theoretical results require adaptive rates and show convergence rates far slower than that for gradient descent, making computational comparisons difficult. In this paper we aim to clarify the issue of fast SGD convergence. The key observation is that most modern architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero. While it is still unclear why these interpolated solutions perform well on test data, we show that these regimes allow for very fast convergence of SGD, comparable in the number of iterations to full gradient descent. Specifically, consider the setting with a quadratic objective function, or a general objective function in the proximity of a minimum, where the quadratic term is dominant. First, we obtain the optimal convergence rate for any mini-batch SGD and derive the optimal step size as a function of the batch size $m$. Second, we show: (1) $m=1$ is optimal in terms of number of computations required to achieve any desired accuracy. (2) There is a critical mini-batch size $m^*$ such that: (2a) SGD iteration with batch size $m\leq m^*$ is nearly equivalent to $m$ iterations of batch size $1$. (2b) SGD iteration with mini-batch $m> m^*$ is nearly equivalent to a full gradient descent iteration. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying $O(n)$ acceleration over GD per unit of computation. These theoretical analyses are verified by experimental results. Finally, we show how the interpolation perspective and our results fit in the deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.

artificial intelligence, computation, machine learning, (14 more...)

1712.06559

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial IntelligenceFeb-4-2018, 18:00:00 GMT

Towards Shockingly Easy Structured Classification: A Search-based Probabilistic Online Learning Framework

Sun, Xu

There are two major approaches for structured classification. One is the probabilistic gradient-based methods such as conditional random fields (CRF), which has high accuracy but with drawbacks: slow training, and no support of search-based optimization (which is important in many cases). The other one is the search-based learning methods such as perceptrons and margin infused relaxed algorithm (MIRA), which have fast training but also with drawbacks: low accuracy, no probabilistic information, and non-convergence in real-world tasks. We propose a novel and "shockingly easy" solution, a search-based probabilistic online learning method, to address most of those issues. This method searches the output candidates, derives probabilities, and conduct efficient online learning. We show that this method is with fast training, support search-based optimization, very easy to implement, with top accuracy, with probabilities, and with theoretical guarantees of convergence. Experiments on well-known tasks show that our method has better accuracy than CRF and almost as fast training speed as perceptron and MIRA. Results also show that SAPO can easily beat the state-of-the-art systems on those highly-competitive tasks, achieving record-breaking accuracies. The codes can be found at https://github.com/lancopku

artificial intelligence, learning method, machine learning, (15 more...)

arXiv.org Artificial Intelligence

1503.08381

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)