Learning Graphical Models
Self-Sustaining Iterated Learning
In this form of iterated learning, agents teach each other in sequence: X teaches Y, who then teaches Z, who then teaches... [1-10]. By a classic result of Griffiths and Kalish [3], Quenya will vanish after a finite number of iterations, at which point the agents, assumed to be rational, will be "teaching" each other plain English. In other words, after a while, learners will be taught nothing they don't already know: iterated learning is not self-sustaining. Such findings are hard to validate empirically, but variants of them are within the reach of experimental psychology. As early as 1932, in fact, the English psychologist Frederic Bartlett used iterated learning to expose hidden biases among humans. He presented a picture of an owl to a person for a given period of time and then asked her to draw it from memory. Her picture was then shown for the same amount of time to the next learner, who in turn drew it from memory. After 20 iterations of this process, to Bartlett's surprise, what was being drawn was no longer an owl but, quite clearly, a
This work was supported in part by NSF grant CCF-1420112.
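The loss of taught content over a teaching chain can be illustrated with a small sketch. It assumes one common formalization that the text above does not spell out: each learner samples a hypothesis from its Bayesian posterior and then generates data for the next learner. Under that assumption, the distribution over hypotheses drifts toward the learners' shared prior. The two hypotheses, the prior, and the likelihood values below are hypothetical numbers chosen only for illustration.

```python
def posterior(prior, likelihood, d):
    """P(h | d) for every hypothesis h, where likelihood[h][d] = P(d | h)."""
    joint = [prior[h] * likelihood[h][d] for h in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]


def transition_matrix(prior, likelihood):
    """T[h][h2] = probability that a teacher holding h yields a learner holding h2.

    Assumes the teacher emits one datum d ~ P(d | h) and the learner
    samples h2 from the posterior P(h2 | d).
    """
    n_h, n_d = len(prior), len(likelihood[0])
    T = [[0.0] * n_h for _ in range(n_h)]
    for d in range(n_d):
        post = posterior(prior, likelihood, d)
        for h in range(n_h):
            for h2 in range(n_h):
                T[h][h2] += likelihood[h][d] * post[h2]
    return T


def iterate(dist, T, steps):
    """Push a distribution over hypotheses through `steps` generations."""
    for _ in range(steps):
        dist = [sum(dist[h] * T[h][h2] for h in range(len(dist)))
                for h2 in range(len(dist))]
    return dist


# Hypothetical setup: hypothesis 0 is favoured a priori, hypothesis 1 is not.
prior = [0.8, 0.2]
likelihood = [[0.9, 0.1],   # P(d | h = 0)
              [0.3, 0.7]]   # P(d | h = 1)
T = transition_matrix(prior, likelihood)
# Start the chain with a teacher who certainly holds the disfavoured hypothesis.
final = iterate([0.0, 1.0], T, 50)
```

In this toy chain, `final` ends up numerically indistinguishable from `prior`, regardless of the starting hypothesis: what the first teacher knew is no longer reflected in what later learners hold.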
Finite-sample and asymptotic analysis of generalization ability with an application to penalized regression
Xu, Ning, Hong, Jian, Fisher, Timothy C. G.
In this paper, we study the performance of extremum estimators from the perspective of generalization ability (GA): the ability of a model to predict outcomes in new samples from the same population. By adapting the classical concentration inequalities, we derive upper bounds on the empirical out-of-sample prediction errors as a function of the in-sample errors, in-sample data size, heaviness in the tails of the error distribution, and model complexity. We show that the error bounds may be used for tuning key estimation hyper-parameters, such as the number of folds K in cross-validation. We also show how K affects the bias-variance tradeoff for cross-validation. Simulations are used to demonstrate key results.
We would also like to acknowledge participants at the 12th International Symposium on Econometric Theory and Applications and the 26th New Zealand Econometric Study Group as well as seminar participants at Utah, UNSW, and University of Melbourne for useful questions and comments. Fisher would like to acknowledge the financial support of the Australian Research Council, grant DP0663477.
1 Introduction
Traditionally in econometrics, an estimation method is implemented on sample data in order to infer patterns in a population. Put another way, inference centers on generalizing to the population the pattern learned from the sample and evaluating how well the sample pattern fits the population. An alternative perspective is to consider how well a sample pattern fits another sample. In this paper, we study the ability of a model estimated from a given sample to fit new samples from the same population, referred to as the generalization ability (GA) of the model. As a way of evaluating the external validity of sample estimates, the concept of GA has been implemented in recent empirical research. For example, in the policy evaluation literature [Belloni et al., 2013; Gechter, 2015; Dolton, 2006; Blundell et al., 2004], the central question is whether any treatment effect estimated from a pilot program can be generalized to out-of-sample individuals.
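To make the role of the number of folds K concrete, here is a minimal sketch of K-fold cross-validation for a penalized regression. It is not the paper's procedure and does not implement its error bounds. The one-dimensional ridge estimator without intercept, the contiguous fold assignment, and the function names are all assumptions made for illustration.

```python
def ridge_fit(xs, ys, lam):
    """Slope of a 1-D ridge regression without intercept (illustrative choice)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_error(xs, ys, lam, k):
    """Mean squared out-of-sample prediction error over the k held-out folds."""
    n = len(xs)
    total = 0.0
    for fold in kfold_indices(n, k):
        held_out = set(fold)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        beta = ridge_fit(train_x, train_y, lam)
        total += sum((ys[i] - beta * xs[i]) ** 2 for i in fold)
    return total / n
```

Calling `cv_error(xs, ys, lam, k)` over a grid of `lam` and `k` values gives the empirical out-of-sample errors that a tuning rule would compare. Larger `k` trains each model on more data but evaluates it on smaller held-out folds, which is where the bias-variance tradeoff mentioned above enters.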
Noisy Inductive Matrix Completion Under Sparse Factor Models
Soni, Akshay, Chevalier, Troy, Jain, Swayambhoo
Inductive Matrix Completion (IMC) is an important class of matrix completion problems that allows direct inclusion of available features to enhance estimation capabilities. These models have found applications in personalized recommendation systems, multilabel learning, dictionary learning, etc. This paper examines a general class of noisy matrix completion tasks where the underlying matrix follows an IMC model, i.e., it is formed by a mixing matrix (a priori unknown) sandwiched between two known feature matrices. The mixing matrix here is assumed to be well approximated by the product of two sparse matrices, referred to here as "sparse factor models." We leverage the main theorem of Soni:2016:NMC and extend it to provide theoretical error bounds for the sparsity-regularized maximum likelihood estimators for the class of problems discussed in this paper. The main result is general in the sense that it can be used to derive error bounds for various noise models. In this paper, we instantiate our main result for the case of Gaussian noise and provide corresponding error bounds in terms of squared loss.
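The model structure described above can be written out as a small sketch. The names `A` and `B` (known feature matrices) and `D` and `C` (sparse factors of the unknown mixing matrix) are hypothetical labels, not the paper's notation. The objective assumes Gaussian noise, so the negative log-likelihood reduces to a squared error over the observed entries up to constants. The sparsity penalty, a count of nonzero factor entries weighted by `lam`, is likewise an assumption and may differ from the regularizer the paper analyzes.

```python
def matmul(P, Q):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(P):
    return [list(col) for col in zip(*P)]


def imc_matrix(A, D, C, B):
    """M = A (D C) B^T: mixing matrix D C sandwiched between feature matrices."""
    return matmul(matmul(A, matmul(D, C)), transpose(B))


def count_nonzeros(P):
    return sum(1 for row in P for v in row if v != 0)


def penalized_squared_loss(Y, observed, A, D, C, B, lam):
    """Squared error on observed entries plus an assumed nonzero-count penalty.

    `observed` is an iterable of (row, column) index pairs of Y that were seen.
    """
    M = imc_matrix(A, D, C, B)
    fit = sum((Y[i][j] - M[i][j]) ** 2 for i, j in observed)
    return fit + lam * (count_nonzeros(D) + count_nonzeros(C))
```

An estimator in this spirit would search over sparse `D` and `C` to minimize `penalized_squared_loss`; the search itself is left out because the abstract does not describe it.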
Information Theoretic Structure Learning with Confidence
Moon, Kevin R., Noshad, Morteza, Sekeh, Salimeh Yasaei, Hero, Alfred O. III
Information theoretic measures (e.g. the Kullback-Leibler divergence and Shannon mutual information) have been used for exploring possibly nonlinear multivariate dependencies in high dimensions. If these dependencies are assumed to follow a Markov factor graph model, this exploration process is called structure discovery. For discrete-valued samples, estimates of the information divergence over the parametric class of multinomial models lead to structure discovery methods whose mean squared error achieves parametric convergence rates as the sample size grows. However, a naive application of this method to continuous nonparametric multivariate models converges much more slowly. In this paper we introduce a new method for nonparametric structure discovery that uses weighted ensemble divergence estimators that achieve parametric convergence rates and obey an asymptotic central limit theorem that facilitates hypothesis testing and other types of statistical validation.
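For the discrete-valued case mentioned above, a plug-in estimate of Shannon mutual information can be computed directly from empirical frequencies. The sketch below shows only that naive multinomial estimator, not the weighted ensemble estimator the paper introduces. The function name and the input format (a list of paired observations) are assumptions.

```python
from collections import Counter
from math import log


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in nats from a list of (x, y) observations.

    Empirical frequencies stand in for the joint and marginal distributions.
    """
    n = len(pairs)
    joint = Counter(pairs)
    marg_x = Counter(x for x, _ in pairs)
    marg_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log(p_xy / ((marg_x[x] / n) * (marg_y[y] / n)))
    return mi
```

In a structure discovery setting, an estimate like this would be computed for each candidate pair of variables, with pairs whose estimate is near zero treated as lacking a direct dependence. Deciding what counts as "near zero" is the hypothesis-testing question that the central limit theorem in the abstract addresses.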
Policy Networks with Two-Stage Training for Dialogue Systems
Fatemi, Mehdi, Asri, Layla El, Schulz, Hannes, He, Jing, Suleman, Kaheer
In this paper, we propose to use deep policy networks which are trained with an advantage actor-critic method for statistically optimised dialogue systems. First, we show that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian process methods. Summary state and action spaces lead to good performance but require pre-engineering effort, RL knowledge, and domain expertise. In order to remove the need to define such summary spaces, we show that deep RL can also be trained efficiently on the original state and action spaces. Dialogue systems based on partially observable Markov decision processes are known to require many dialogues to train, which makes them unappealing for practical deployment. We show that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently. Indeed, with only a few hundred dialogues collected with a handcrafted policy, the actor-critic deep learner is considerably bootstrapped from a combination of supervised and batch RL. In addition, convergence to an optimal policy is significantly sped up compared to other deep RL methods initialized on the data with batch RL. All experiments are performed on a restaurant domain derived from the Dialogue State Tracking Challenge 2 (DSTC2) dataset.
On the Relationship between Online Gaussian Process Regression and Kernel Least Mean Squares Algorithms
Van Vaerenbergh, Steven, Fernandez-Bes, Jesus, Elvira, Víctor
ABSTRACT
We study the relationship between online Gaussian process (GP) regression and kernel least mean squares (KLMS) algorithms. While the latter cannot store the entire posterior distribution during online learning, we discover that their operation corresponds to the assumption of a fixed posterior covariance that follows a simple parametric model. Interestingly, several well-known KLMS algorithms correspond to specific cases of this model. The probabilistic perspective allows us to understand how each of them handles uncertainty, which could explain some of their performance differences.
Index Terms: online learning, regression, Gaussian processes, kernel least-mean squares
1. INTRODUCTION
Gaussian Process (GP) regression is a state-of-the-art Bayesian technique for nonlinear regression [1].
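As background for the comparison, the basic KLMS update can be sketched in a few lines: each new sample becomes a kernel center whose coefficient is the step size times the current prediction error. This is the plain, non-sparsified form of the algorithm, shown under assumptions (scalar inputs, a Gaussian kernel, and the hypothetical parameter names `step` and `width`). It does not reproduce the GP correspondence derived in the paper.

```python
from math import exp


def gaussian_kernel(a, b, width=1.0):
    """Gaussian (RBF) kernel on scalar inputs."""
    return exp(-((a - b) ** 2) / (2.0 * width ** 2))


def predict(x, centers, coeffs, width=1.0):
    """Current estimate f(x) = sum_i coeffs[i] * k(x, centers[i])."""
    return sum(c * gaussian_kernel(x, ctr, width)
               for c, ctr in zip(coeffs, centers))


def klms(xs, ys, step=0.5, width=1.0):
    """Run basic KLMS over a stream of (x, y) pairs.

    Returns the kernel centers, their coefficients, and the sequence of
    a-priori prediction errors (errors measured before each update).
    """
    centers, coeffs, errors = [], [], []
    for x, y in zip(xs, ys):
        err = y - predict(x, centers, coeffs, width)
        centers.append(x)           # every sample becomes a new center
        coeffs.append(step * err)   # LMS-style correction in feature space
        errors.append(err)
    return centers, coeffs, errors
```

The update keeps only a point estimate of the function and no covariance, which is the limitation the abstract refers to when it says KLMS does not store the full posterior.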
Nonparametric risk bounds for time-series forecasting
McDonald, Daniel J., Shalizi, Cosma Rohilla, Schervish, Mark
Generalization error bounds are probabilistically valid, non-asymptotic tools for characterizing the predictive ability of forecasting models. This methodology is fundamentally about choosing particular prediction functions out of some class of plausible alternatives so that, with high reliability, the resulting predictions will be nearly as accurate as possible ("probably approximately correct"). While many of these results are aimed at classification problems with independent and identically distributed (i.i.d.) data, this paper adapts and extends these methods to time-series models, so that economic and financial forecasting techniques can be evaluated rigorously. In particular, these methods control the expected accuracy of future predictions from mis-specified models based on finite samples. This allows for immediate model comparisons which neither appeal to asymptotics nor make strong assumptions about the data-generating process, in stark contrast to such popular model-selection tools as AIC.
Google's DeepMind has learnt how to talk like a human
Anyone who might be concerned about computers taking over should look away now, because they are a step closer to sounding just like humans. Researchers in the UK at Google's DeepMind unit have been working on making computer-generated speech sound as "natural" as humans. The technology, called WaveNet, which is focused on the area of speech synthesis, or text-to-speech, was found to sound more natural than any of Google's products. However, this was only achieved after the WaveNet artificial neural network was trained to produce English and Chinese speech, which required copious amounts of computing power, so the technology probably won't be hitting the mainstream any time soon. WaveNet uses a convolutional neural network, a type of model used in deep learning: it is trained on data and can then make inferences about new data as well as generate new data.
PyData Carolinas 2016 Presentation: Deep Finch? A Continued Comparison of Machine Learning Models to Label Birdsong Syllables
Songbirds provide a model system that neuroscientists use to understand how the brain learns and controls speech and similar skills. Much like infants learning to speak from their parents, songbirds learn their song from a tutor and practice it millions of times before reaching maturity. Also like humans, songbirds have evolved special brain regions for learning and producing their vocalizations. These newly-evolved brain regions in songbirds, known as the song system, are found within broader brain areas shared by birds and humans across evolution. So by studying how the song system works, we can learn about our own brains.
Singularity structures and impacts on parameter estimation in finite mixtures of distributions
Singularities of a statistical model are the elements of the model's parameter space which make the corresponding Fisher information matrix degenerate. These are the points for which estimation techniques such as the maximum likelihood estimator and standard Bayesian procedures do not admit the root-$n$ parametric rate of convergence. We propose a general framework for the identification of singularity structures of the parameter space of finite mixtures, and study the impacts of the singularity levels on minimax lower bounds and rates of convergence for the maximum likelihood estimator over a compact parameter space. Our study makes explicit the deep links between model singularities, parameter estimation convergence rates and minimax lower bounds, and the algebraic geometry of the parameter space for mixtures of continuous distributions. The theory is applied to establish concrete convergence rates of parameter estimation for finite mixtures of skew-normal distributions. This rich and increasingly popular mixture model is shown to exhibit a remarkably complex range of asymptotic behaviors which have not hitherto been reported in the literature.