Technology
Greedy Importance Sampling
I present a simple variation of importance sampling that explicitly searches for important regions in the target distribution. I prove that the technique yields unbiased estimates, and show empirically that it can reduce the variance of standard Monte Carlo estimators. This is achieved by concentrating samples in more significant regions of the sample space.
1 Introduction
It is well known that general inference and learning with graphical models is computationally hard [1], and it is therefore necessary to consider restricted architectures [13] or approximate algorithms to perform these tasks [3, 7]. Among the most convenient and successful techniques are stochastic methods, which are guaranteed to converge to a correct solution in the limit of large samples [10, 11, 12, 15]. These methods can be easily applied to complex inference problems that overwhelm deterministic approaches.
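For orientation, the standard importance sampling estimator that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of plain importance sampling, not the greedy variant the paper proposes. The target N(0, 1), the proposal N(0, 2^2), the test function x^2, and all function names are assumptions chosen for the example.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(f, n_samples, seed=0):
    """Estimate E_p[f(x)] for target p = N(0, 1) using proposal q = N(0, 2^2).

    Each draw x ~ q is reweighted by p(x) / q(x), which keeps the
    estimator unbiased as long as q is nonzero wherever p is.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 2.0)                               # sample from q
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)
        total += f(x) * weight
    return total / n_samples


# E_p[x^2] is the variance of N(0, 1), so the estimate should be close to 1.
estimate = importance_sampling_estimate(lambda x: x * x, 20000)
print(estimate)
```

The variance of this estimator depends on how well the proposal covers the regions where the weighted integrand is large, which is the issue the abstract addresses by searching for important regions explicitly.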
The Infinite Gaussian Mixture Model
In a Bayesian mixture model it is not necessary a priori to limit the number of components to be finite. In this paper an infinite Gaussian mixture model is presented which neatly sidesteps the difficult problem of finding the "right" number of mixture components. Inference in the model is done using an efficient parameter-free Markov chain that relies entirely on Gibbs sampling.
A Multi-class Linear Learning Algorithm Related to Winnow
Committee is an algorithm for combining the predictions of a set of sub-experts in the online mistake-bounded model of learning. A sub-expert is a special type of attribute that predicts with a distribution over a finite number of classes. Committee learns a linear function of sub-experts and uses this function to make class predictions. We provide bounds for Committee that show it performs well when the target can be represented by a few relevant sub-experts. We also show how Committee can be used to solve more traditional problems composed of attributes. This leads to a natural extension that learns on multi-class problems that contain both traditional attributes and sub-experts.
The Relaxed Online Maximum Margin Algorithm
We describe a new incremental algorithm for training linear threshold functions: the Relaxed Online Maximum Margin Algorithm, or ROMMA. ROMMA can be viewed as an approximation to the algorithm that repeatedly chooses the hyperplane that classifies previously seen examples correctly with the maximum margin. It is known that such a maximum-margin hypothesis can be computed by minimizing the length of the weight vector subject to a number of linear constraints. ROMMA works by maintaining a relatively simple relaxation of these constraints that can be efficiently updated. We prove a mistake bound for ROMMA that is the same as that proved for the perceptron algorithm. Our analysis implies that the more computationally intensive maximum-margin algorithm also satisfies this mistake bound; this is the first worst-case performance guarantee for this algorithm. We describe some experiments using ROMMA, and a variant that updates its hypothesis more aggressively, as batch algorithms to recognize handwritten digits. The computational complexity and simplicity of these algorithms are similar to those of the perceptron algorithm, but their generalization is much better. We describe a sense in which the performance of ROMMA converges to that of SVM in the limit if bias isn't considered.
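As a point of reference for the mistake-bound comparison above, the classical perceptron update for a linear threshold function without a bias term can be sketched as follows. This is the baseline perceptron, not ROMMA itself; the function names and the toy data set are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(examples, max_epochs=100):
    """Mistake-driven perceptron training with no bias term.

    examples is a list of (x, y) pairs with y in {-1, +1}.
    Returns the weight vector and the total number of mistakes made.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in examples:
            if y * dot(w, x) <= 0:                 # wrong side (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:                             # a full pass with no mistakes
            break
    return w, mistakes


# Hypothetical linearly separable toy data.
examples = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w, mistakes = train_perceptron(examples)
print(w, mistakes)
```

The perceptron changes its weights only when it makes a mistake, which is why its performance is naturally stated as a bound on the number of mistakes.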
Inference for the Generalization Error
Nadeau, Claude, Bengio, Yoshua
In order to compare learning algorithms, experimental results reported in the machine learning literature often use statistical tests of significance. Unfortunately, most of these tests do not take into account the variability due to the choice of training set. We perform a theoretical investigation of the variance of the cross-validation estimate of the generalization error that takes into account the variability due to the choice of training sets. This allows us to propose two new ways to estimate this variance. We show, via simulations, that these new statistics perform well relative to the statistics considered by Dietterich (Dietterich, 1998).
1 Introduction
When applying a learning algorithm (or comparing several algorithms), one is typically interested in estimating its generalization error. Its point estimation is rather trivial through cross-validation. Providing a variance estimate of that estimation, so that hypothesis testing and/or confidence intervals are possible, is more difficult, especially, as pointed out in (Hinton et al., 1995), if one wants to take into account the variability due to the choice of the training sets (Breiman, 1996). A notable effort in that direction is Dietterich's work (Dietterich, 1998). Careful investigation of the variance to be estimated allows us to provide new variance estimates, which turn out to perform well. Let us first lay out the framework in which we shall work.
Neural Computation with Winner-Take-All as the Only Nonlinear Operation
Everybody "knows" that neural networks need more than a single layer of nonlinear units to compute interesting functions. We show that this is false if one employs winner-take-all as the nonlinear unit:
- Any boolean function can be computed by a single k-winner-take-all unit applied to weighted sums of the input variables.
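To make the operation concrete, a k-winner-take-all unit applied to weighted sums of the inputs can be sketched as below. This is a minimal sketch under stated assumptions: output i is 1 exactly when the i-th weighted sum is among the k largest, and ties are broken in favor of the lower index. The tie-breaking rule, the function names, and the example weights are hypothetical and not taken from the paper.

```python
def k_winner_take_all(values, k):
    """Return a 0/1 vector marking the k largest entries of values.

    Ties are broken in favor of the lower index (an assumption of this
    sketch), relying on the stability of Python's sort.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    winners = set(order[:k])
    return [1 if i in winners else 0 for i in range(len(values))]


def wta_on_weighted_sums(weights, inputs, k):
    """Apply k-winner-take-all to the weighted sums of the input variables.

    weights holds one row of coefficients per competing weighted sum.
    """
    sums = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return k_winner_take_all(sums, k)


# Hypothetical weights: three weighted sums over two input variables.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = wta_on_weighted_sums(weights, [1.0, 0.5], k=2)
print(outputs)  # sums are [1.0, 0.5, 1.5], so this prints [1, 0, 1]
```

The only nonlinearity here is the comparison among the weighted sums; each sum is itself a linear function of the inputs.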
A Recurrent Model of the Interaction Between Prefrontal and Inferotemporal Cortex in Delay Tasks
Renart, Alfonso, Parga, Néstor, Rolls, Edmund T.
A very simple model of two reciprocally connected attractor neural networks is studied analytically in situations similar to those encountered in delay match-to-sample tasks with intervening stimuli and in tasks of memory-guided attention. The model qualitatively reproduces many of the experimental data on these types of tasks and provides a framework for the understanding of the experimental observations in the context of the attractor neural network scenario.