Statistical Learning
A Natural Policy Gradient
Sham Kakade, Gatsby Computational Neuroscience Unit, 17 Queen Square, London, UK WC1N 3AR, http://www.gatsby.ucl.ac.uk, sham@gatsby.ucl.ac.uk. Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1 Introduction: There has been a growing interest in direct policy-gradient methods for approximate planning in large Markov decision problems (MDPs). Unfortunately, the standard gradient descent rule is noncovariant. In this paper, we present a covariant gradient by defining a metric based on the underlying structure of the policy.
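As a rough illustration of a gradient taken with respect to a metric, the sketch below rescales an ordinary gradient by the inverse of a metric matrix (in natural gradient methods this is typically the Fisher information matrix). The function name, the restriction to two parameters, and the example numbers are assumptions made for illustration; this is not the paper's implementation.

```python
def natural_gradient(grad, metric):
    """Return metric^{-1} @ grad for a 2x2 metric (hypothetical helper).

    grad   -- ordinary gradient, a list of two floats
    metric -- 2x2 positive-definite matrix as nested lists
    """
    (a, b), (c, d) = metric
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("metric matrix is singular")
    g0, g1 = grad
    # Closed-form inverse of a 2x2 matrix applied to the gradient.
    return [(d * g0 - b * g1) / det, (a * g1 - c * g0) / det]


# Illustrative numbers only: the two parameters are scaled very differently,
# so the ordinary gradient [1, 1] is rescaled per direction.
step = natural_gradient([1.0, 1.0], [[4.0, 0.0], [0.0, 0.25]])
print(step)  # [0.25, 4.0]
```

The rescaling is what makes the update direction independent of how the parameters happen to be scaled, which is the covariance property the abstract refers to.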
Batch Value Function Approximation via Support Vectors
Dietterich, Thomas G., Wang, Xin
One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily adjust the complexity of the function approximator to fit the complexity of the value function.
Face Recognition Using Kernel Methods
Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account.
Active Learning in the Drug Discovery Process
Warmuth, Manfred K., Rätsch, Gunnar, Mathieson, Michael, Liao, Jun, Lemmen, Christian
We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data.
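The first selection strategy mentioned above can be sketched as follows: given a linear separator, rank the unlabeled compounds by their distance to the hyperplane and screen the closest ones next. The function name, the explicit weight vector `w` and bias `b`, and the toy data are assumptions for illustration; the abstract does not specify how the hyperplane is represented.

```python
import math


def closest_to_hyperplane(w, b, unlabeled, batch_size):
    """Return indices of the batch_size unlabeled points nearest to w.x + b = 0.

    Hypothetical helper: w and b are assumed to come from an already
    trained maximum margin classifier.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))

    def distance(x):
        # Geometric distance from point x to the hyperplane.
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

    ranked = sorted(range(len(unlabeled)), key=lambda i: distance(unlabeled[i]))
    return ranked[:batch_size]


# Toy data: distances to the hyperplane x0 = 0 are 3.0, 0.5, 0.1 and 2.0.
pool = [[3.0, 1.0], [0.5, 2.0], [-0.1, 0.0], [-2.0, 5.0]]
print(closest_to_hyperplane([1.0, 0.0], 0.0, pool, 2))  # [2, 1]
```

Points near the hyperplane are the ones the current classifier is least sure about, so labeling them is expected to be most informative for the next iteration.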
Prodding the ROC Curve: Constrained Optimization of Classifier Performance
Mozer, Michael C., Dodier, Robert, Colagrosso, Michael D., Guerra-Salcedo, Cesar, Wolniewicz, Richard
When designing a two-alternative classifier, one ordinarily aims to maximize the classifier's ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance.
Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
Chapados, Nicolas, Bengio, Yoshua, Vincent, Pascal, Ghosn, Joumana, Dugas, Charles, Takeuchi, Ichiro, Meng, Linyan
This conditional expected claim amount is called the pure premium and it is the basis of the gross premium charged to the insured. This expected value is conditioned on information available about the insured and about the contract, which we call the input profile here. This regression problem is difficult for several reasons: large number of examples, large number of variables (most of which are discrete and multi-valued), non-stationarity of the distribution, and a conditional distribution of the dependent variable which is very different from those usually encountered in typical applications of
Learning Body Pose via Specialized Maps
Rosales, Rómer, Sclaroff, Stan
A nonlinear supervised learning model, the Specialized Mappings Architecture (SMA), is described and applied to the estimation of human body pose from monocular images. The SMA consists of several specialized forward mapping functions and an inverse mapping function. Each specialized function maps certain domains of the input space (image features) onto the output space (body pose parameters). The key algorithmic problems faced are those of learning the specialized domains and mapping functions in an optimal way, as well as performing inference given inputs and knowledge of the inverse function. Solutions to these problems employ the EM algorithm and alternating choices of conditional independence assumptions. Performance of the approach is evaluated with synthetic and real video sequences of human motion. 1 Introduction In everyday life, humans can easily estimate body part locations (body pose) from relatively low-resolution images of the projected 3D world (e.g., when viewing a photograph or a video). However, body pose estimation is a very difficult computer vision problem.
Categorization by Learning and Combining Object Parts
Heisele, Bernd, Serre, Thomas, Pontil, Massimiliano, Vetter, Thomas, Poggio, Tomaso
We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at a significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
Speech Recognition using SVMs
An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space.
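To make the Fisher score idea concrete, the sketch below maps a sequence of any length to a fixed-length vector: the gradient of the sequence log-likelihood with respect to the parameters of a generative model. A one-dimensional Gaussian with independent frames is assumed here purely to keep the example short; it stands in for whatever generative model a real speech system would use, and the function name is hypothetical.

```python
def fisher_score(sequence, mu, sigma):
    """Gradient of sum_t log N(x_t; mu, sigma) with respect to (mu, sigma).

    Assumes a 1-D Gaussian model with independent frames (illustrative
    only). The output always has two entries, whatever the input length.
    """
    d_mu = sum((x - mu) / sigma ** 2 for x in sequence)
    d_sigma = sum((x - mu) ** 2 / sigma ** 3 - 1.0 / sigma for x in sequence)
    return [d_mu, d_sigma]


# Sequences of different lengths map to vectors of the same dimension,
# which is what lets a standard SVM operate on them.
print(fisher_score([3.0], 2.0, 1.0))            # [1.0, 0.0]
print(fisher_score([1.0, 3.0, 2.0], 2.0, 1.0))  # [0.0, -1.0]
```

A likelihood-ratio score-space, as described in the abstract, would start from the log-likelihood ratio of two class-conditional models rather than from a single model's log-likelihood; that variant is not shown here.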