Visualizing Group Structure
Held, Marcus, Puzicha, Jan, Buhmann, Joachim M.
Cluster analysis is a fundamental principle in exploratory data analysis, providing the user with a description of the group structure of given data. A key problem in this context is the interpretation and visualization of clustering solutions in high-dimensional or abstract data spaces. In particular, probabilistic descriptions of the group structure, essential to capture inter-cluster relationships, are hardly assessable by simple inspection of the probabilistic assignment variables. We present a novel approach to the visualization of group structure. It is based on a statistical model of the object assignments which have been observed or estimated by a probabilistic clustering procedure. The objects or data points are embedded in a low-dimensional Euclidean space by approximating the observed data statistics with a Gaussian mixture model. The algorithm provides a new approach to the visualization of the inherent structure for a broad variety of data types, e.g.
Learning Nonlinear Dynamical Systems Using an EM Algorithm
Ghahramani, Zoubin, Roweis, Sam T.
The Expectation-Maximization (EM) algorithm is an iterative procedure for maximum likelihood parameter estimation from data sets with missing or hidden variables [2]. It has been applied to system identification in linear stochastic state-space models, where the state variables are hidden from the observer and both the state and the parameters of the model have to be estimated simultaneously [9]. We present a generalization of the EM algorithm for parameter estimation in nonlinear dynamical systems. The "expectation" step makes use of Extended Kalman Smoothing to estimate the state, while the "maximization" step re-estimates the parameters using these uncertain state estimates. In general, the nonlinear maximization step is difficult because it requires integrating out the uncertainty in the states.
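The abstract above relies on Extended Kalman Smoothing for its "expectation" step. As a rough illustration only, the following sketch shows one forward (filtering) step of an extended Kalman filter for a scalar state. It is not the authors' implementation: the scalar setting, the function name `ekf_step`, and the arguments `f`, `df`, `g`, `dg`, `q`, `r` are all assumptions made for this example, and a full smoother would add a backward pass over the filtered estimates.

```python
def ekf_step(mean, var, y, f, df, g, dg, q, r):
    """One predict/update step of a scalar extended Kalman filter.

    Assumed (hypothetical) model for this sketch:
        x_next = f(x) + process noise with variance q
        y      = g(x) + observation noise with variance r
    df and dg are the derivatives of f and g, used to linearize the model.
    """
    # Predict: push the mean through f and linearize f around the old mean.
    pred_mean = f(mean)
    f_slope = df(mean)
    pred_var = f_slope * var * f_slope + q

    # Update: linearize g around the predicted mean and apply the Kalman gain.
    g_slope = dg(pred_mean)
    innovation = y - g(pred_mean)
    innovation_var = g_slope * pred_var * g_slope + r
    gain = pred_var * g_slope / innovation_var

    new_mean = pred_mean + gain * innovation
    new_var = (1.0 - gain * g_slope) * pred_var
    return new_mean, new_var


# Linear sanity check: with f(x) = x and g(x) = x this reduces to the
# ordinary Kalman filter.
identity = lambda x: x
one = lambda x: 1.0
filtered_mean, filtered_var = ekf_step(0.0, 1.0, 2.0, identity, one, identity, one, 0.0, 1.0)
```

In the linear sanity check the prior N(0, 1) and the observation y = 2 with unit noise variance are weighted equally, so the filtered mean is 1.0 and the filtered variance is 0.5.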
Global Optimisation of Neural Network Models via Sequential Sampling
Freitas, João F. G. de, Niranjan, Mahesan, Doucet, Arnaud, Gee, Andrew H.
We propose a novel strategy for training neural networks using sequential sampling-importance resampling algorithms. This global optimisation strategy allows us to learn the probability distribution of the network weights in a sequential framework. It is well suited to applications involving online, nonlinear, non-Gaussian or non-stationary signal processing. 1 INTRODUCTION This paper addresses sequential training of neural networks using powerful sampling techniques. Sequential techniques are important in many applications of neural networks involving real-time signal processing, where data arrival is inherently sequential. Furthermore, one might wish to adopt a sequential training strategy to deal with non-stationarity in signals, so that information from the recent past is lent more credence than information from the distant past. One way to sequentially estimate neural network models is to use a state space formulation and the extended Kalman filter (Singhal and Wu 1988, de Freitas, Niranjan and Gee 1998).
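To make the sampling-importance resampling idea concrete, here is a minimal sketch of a single generic SIR step. It is written under stated assumptions rather than taken from the paper: the function name `sir_step` and the callables `transition` and `likelihood` are hypothetical, and in the setting of the abstract each particle would be a vector of network weights rather than a single number.

```python
import random


def sir_step(particles, observation, transition, likelihood, rng):
    """One sampling-importance resampling step (illustrative sketch).

    transition(particle, rng) proposes a new particle from the old one.
    likelihood(observation, particle) returns a non-negative importance weight.
    Both are assumptions supplied by the caller.
    """
    # Sampling: propagate every particle through the transition model.
    proposed = [transition(p, rng) for p in particles]

    # Importance: weight each proposal by how well it explains the observation.
    weights = [likelihood(observation, p) for p in proposed]
    total = sum(weights)
    if total <= 0.0:
        # Degenerate case: no proposal explains the data, so keep them all.
        return proposed

    # Resampling: draw a new population in proportion to the weights.
    return rng.choices(proposed, weights=weights, k=len(proposed))


# Hypothetical usage: track a scalar parameter under a random-walk prior
# with a Gaussian-shaped (unnormalized) likelihood.
rng = random.Random(0)
population = [rng.gauss(0.0, 1.0) for _ in range(100)]
random_walk = lambda p, r: p + r.gauss(0.0, 0.1)
closeness = lambda y, p: 2.718281828459045 ** (-0.5 * (y - p) ** 2)
population = sir_step(population, 0.5, random_walk, closeness, rng)
```

Resampling at every step keeps the population size fixed and concentrates particles where the weights are large, at the cost of duplicating some particles.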
Approximate Learning of Dynamic Models
Inference is a key component in learning probabilistic models from partially observable data. When learning temporal models, each of the many inference phases requires a traversal over an entire long data sequence; furthermore, the data structures manipulated are exponentially large, making this process computationally expensive. In [2], we describe an approximate inference algorithm for monitoring stochastic processes, and prove bounds on its approximation error. In this paper, we apply this algorithm as an approximate forward propagation step in an EM algorithm for learning temporal Bayesian networks. We provide a related approximation for the backward step, and prove error bounds for the combined algorithm.
Learning a Hierarchical Belief Network of Independent Factor Analyzers
The model parameters are learned in an unsupervised manner by maximizing the likelihood that these data are generated by the model. A multilayer belief network is a realization of such a model. Many belief networks have been proposed that are composed of binary units. The hidden units in such networks represent latent variables that explain different features of the data, and whose relation to the data is highly nonlinear. However, for tasks such as object and speech recognition which produce real-valued data, the models provided by binary networks are often inadequate. (Current address of H. Attias: Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, U.K.)
A Theory of Mean Field Approximation
I present a theory of mean field approximation based on information geometry. This theory includes in a consistent way the naive mean field approximation, as well as the TAP approach and the linear response theorem in statistical physics, giving clear information-theoretic interpretations to them. 1 INTRODUCTION Many problems of neural networks, such as learning and pattern recognition, can be cast into the framework of a statistical estimation problem. How difficult it is to solve a particular problem depends on the statistical model one employs in solving it. For Boltzmann machines [1], for example, it is computationally very hard to evaluate expectations of state variables from the model parameters. Mean field approximation [2], which originated in statistical physics, has been frequently used in practical situations in order to circumvent this difficulty.
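For readers unfamiliar with the naive mean field approximation mentioned above, the following sketch iterates the standard fixed-point equations m_i = tanh(h_i + sum_j w_ij m_j) for a Boltzmann machine with units taking values -1 or +1. It illustrates only the naive approximation, not the information-geometric theory or the TAP correction; the function name `naive_mean_field` and the `iterations` and `damping` parameters are assumptions chosen for this example.

```python
import math


def naive_mean_field(weights, biases, iterations=200, damping=0.5):
    """Approximate the unit means of a Boltzmann machine with +/-1 units.

    weights is a symmetric matrix (list of lists) of couplings w_ij and
    biases is the list of external fields h_i. The damping factor is an
    assumption added to stabilize the iteration; damping=1.0 gives the
    undamped update.
    """
    n = len(biases)
    means = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            # Effective field on unit i from its bias and the other means.
            field = biases[i] + sum(
                weights[i][j] * means[j] for j in range(n) if j != i
            )
            # Damped move toward the fixed point m_i = tanh(field).
            means[i] = (1.0 - damping) * means[i] + damping * math.tanh(field)
    return means


# Hypothetical usage: two weakly coupled units with opposite biases.
approx_means = naive_mean_field([[0.0, 0.2], [0.2, 0.0]], [0.5, -0.3])
```

The cost per sweep is quadratic in the number of units, whereas evaluating the exact expectations requires a sum over all 2^n joint states.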
Learning Curves for Gaussian Processes
I consider the problem of calculating learning curves (i.e., average generalization performance) of Gaussian processes used for regression. A simple expression for the generalization error in terms of the eigenvalue decomposition of the covariance function is derived, and used as the starting point for several approximation schemes. I identify where these become exact, and compare with existing bounds on learning curves; the new approximations, which can be used for any input space dimension, generally get substantially closer to the truth. 1 INTRODUCTION: GAUSSIAN PROCESSES Within the neural networks community, there has in the last few years been a good deal of excitement about the use of Gaussian processes as an alternative to feedforward networks [1]. The advantages of Gaussian processes are that prior assumptions about the problem to be learned are encoded in a very transparent way, and that inference, at least in the case of regression that I will consider, is relatively straightforward. One crucial question for applications is then how 'fast' Gaussian processes learn, i.e., how many training examples are needed to achieve a certain level of generalization performance.
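A learning curve of the kind discussed above can also be estimated by brute force, which is useful as a baseline for any analytical approximation. The sketch below does not implement the eigenvalue-based expressions of the paper. It assumes that the prior matches the true data-generating process, so that the generalization error at a test input equals the posterior variance there, and it averages that variance over random training sets. The squared-exponential kernel, the uniform inputs on [0, 1], and all function and parameter names are assumptions made for this example.

```python
import math
import random


def rbf(a, b, length=0.3):
    """Squared-exponential covariance function (an assumed choice)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve_spd(A, b):
    """Solve A x = b for symmetric positive definite A via Cholesky."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def posterior_variance(train_x, test_x, noise_var=0.1):
    """Posterior variance k(x,x) - k^T (K + noise_var I)^{-1} k at test_x."""
    K = [
        [rbf(a, b) + (noise_var if i == j else 0.0) for j, b in enumerate(train_x)]
        for i, a in enumerate(train_x)
    ]
    k = [rbf(a, test_x) for a in train_x]
    alpha = solve_spd(K, k)
    return rbf(test_x, test_x) - sum(ki * ai for ki, ai in zip(k, alpha))


def learning_curve(sizes, n_trials=20, n_test=50, noise_var=0.1, seed=0):
    """Average posterior variance over random training sets of each size."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        total = 0.0
        for _ in range(n_trials):
            train = [rng.random() for _ in range(n)]
            for _ in range(n_test):
                total += posterior_variance(train, rng.random(), noise_var)
        curve.append(total / (n_trials * n_test))
    return curve


# Hypothetical usage: error estimates for 0, 2, 4 and 8 training examples.
estimated_curve = learning_curve([0, 2, 4, 8], n_trials=5, n_test=20)
```

With no training data the estimate equals the prior variance k(x, x) = 1, and it decreases as examples are added. The Monte Carlo cost grows cubically with the training set size, which is one reason closed-form approximations are attractive.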
Tight Bounds for the VC-Dimension of Piecewise Polynomial Networks
O(ws(s log d + log(dqh/s))) and O(ws((h/s) log q + log(dqh/s))) are upper bounds for the VC-dimension of a set of neural networks of units with piecewise polynomial activation functions, where s is the depth of the network, h is the number of hidden units, w is the number of adjustable parameters, q is the maximum number of polynomial segments of the activation function, and d is the maximum degree of the polynomials; also, Ω(ws log(dqh/s)) is a lower bound for the VC-dimension of such a network set. These bounds are tight for the cases s = Θ(h) and constant s. For the special case q = 1, the VC-dimension is Θ(ws log d). 1 Introduction In spite of its importance, we had been unable to obtain VC-dimension values for practical types of networks, until fairly tight upper and lower bounds were obtained ([6], [8], [9], and [10]) for linear threshold element networks in which all elements perform a threshold function on a weighted sum of inputs. This is mainly because the differentiability of the functions is needed to perform backpropagation or other learning algorithms. Unfortunately, explicit bounds obtained so far for the VC-dimension of sigmoidal networks exhibit large gaps (O(w^2 h^2) ([3]), Ω(w log h) for bounded depth and Ω(wh) for unbounded depth) and are hard to improve. For the piecewise linear case, Maass obtained a result that the VC-dimension is O(w^2 log q), where q is the number of linear pieces of the function ([5]).