Statistical Learning
Initialization of Self-Organizing Maps: Principal Components Versus Random Initialization. A Case Study
Akinduko, A. A., Mirkes, E. M.
The performance of the Self-Organizing Map (SOM) algorithm is dependent on the initial weights of the map. The different initialization methods can broadly be classified into random and data analysis based initialization approach. In this paper, the performance of random initialization (RI) approach is compared to that of principal component initialization (PCI) in which the initial map weights are chosen from the space of the principal component. Performance is evaluated by the fraction of variance unexplained (FVU). Datasets were classified into quasi-linear and non-linear and it was observed that RI performed better for non-linear datasets; however the performance of PCI approach remains inconclusive for quasi-linear datasets.
Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials
Krรคhenbรผhl, Philipp, Koltun, Vladlen
Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random fields defined over pixels or image regions. While region-level models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models defined on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efficient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels. Our experiments demonstrate that dense connectivity at the pixel level substantially improves segmentation and labeling accuracy.
On Information Regularization
Corduneanu, Adrian, Jaakkola, Tommi S.
We formulate a principle for classification with the knowledge of the marginal distribution over the data points (unlabeled data). The principle is cast in terms of Tikhonov style regularization where the regularization penalty articulates the way in which the marginal density should constrain otherwise unrestricted conditional distributions. Specifically, the regularization penalty penalizes any information introduced between the examples and labels beyond what is provided by the available labeled examples. The work extends Szummer and Jaakkola's information regularization (NIPS 2002) to multiple dimensions, providing a regularizer independent of the covering of the space used in the derivation. We show in addition how the information regularizer can be used as a measure of complexity of the classification task with unlabeled data and prove a relevant sample-complexity bound. We illustrate the regularization principle in practice by restricting the class of conditional distributions to be logistic regression models and constructing the regularization penalty from a finite set of unlabeled examples.
Boltzmann Machine Learning with the Latent Maximum Entropy Principle
Wang, Shaojun, Schuurmans, Dale, Peng, Fuchun, Zhao, Yunxin
We present a new statistical learning paradigm for Boltzmann machines based on a new inference principle we have proposed: the latent maximum entropy principle (LME). LME is different both from Jaynes maximum entropy principle and from standard maximum likelihood estimation.We demonstrate the LME principle BY deriving new algorithms for Boltzmann machine parameter estimation, and show how robust and fast new variant of the EM algorithm can be developed.Our experiments show that estimation based on LME generally yields better results than maximum likelihood estimation, particularly when inferring hidden units from small amounts of data.
Learning Generative Models of Similarity Matrices
Rosales, Romer, Frey, Brendan J.
We describe a probabilistic (generative) view of affinity matrices along with inference algorithms for a subclass of problems associated with data clustering. This probabilistic view is helpful in understanding different models and algorithms that are based on affinity functions OF the data. IN particular, we show how(greedy) inference FOR a specific probabilistic model IS equivalent TO the spectral clustering algorithm.It also provides a framework FOR developing new algorithms AND extended models. AS one CASE, we present new generative data clustering models that allow us TO infer the underlying distance measure suitable for the clustering problem at hand. These models seem to perform well in a larger class of problems for which other clustering algorithms (including spectral clustering) usually fail. Experimental evaluation was performed in a variety point data sets, showing excellent performance.
On the Convergence of Bound Optimization Algorithms
Salakhutdinov, Ruslan R, Roweis, Sam T, Ghahramani, Zoubin
Many practitioners who use the EM algorithm complain that it is sometimes slow. When does this happen, and what can be done about it? In this paper, we study the general class of bound optimization algorithms - including Expectation-Maximization, Iterative Scaling and CCCP - and their relationship to direct optimization algorithms such as gradient-based methods for parameter learning. We derive a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods and identify analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and conditions under which they possess poor, first-order convergence. Based on this analysis, we consider several specific algorithms, interpret and analyze their convergence properties and provide some recipes for preprocessing input to these algorithms to yield faster convergence behavior. We report empirical results supporting our analysis and showing that simple data preprocessing can result in dramatically improved performance of bound optimizers in practice.
Markov Random Walk Representations with Continuous Distributions
Yeang, Chen-Hsiang, Szummer, Martin
Representations based on random walks can exploit discrete data distributions for clustering and classification. We extend such representations from discrete to continuous distributions. Transition probabilities are now calculated using a diffusion equation with a diffusion coefficient that inversely depends on the data density. We relate this diffusion equation to a path integral and derive the corresponding path probability measure. The framework is useful for incorporating continuous data densities and prior knowledge.
Collaborative Ensemble Learning: Combining Collaborative and Content-Based Information Filtering via Hierarchical Bayes
Yu, Kai, Schwaighofer, Anton, Tresp, Volker, Ma, Wei-Ying, Zhang, HongJiang
Collaborative filtering (CF) and content-based filtering (CBF) have widely been used in information filtering applications. Both approaches have their strengths and weaknesses which is why researchers have developed hybrid systems. This paper proposes a novel approach to unify CF and CBF in a probabilistic framework, named collaborative ensemble learning. It uses probabilistic SVMs to model each user's profile (as CBF does).At the prediction phase, it combines a society OF users profiles, represented by their respective SVM models, to predict an active users preferences(the CF idea).The combination scheme is embedded in a probabilistic framework and retains an intuitive explanation.Moreover, collaborative ensemble learning does not require a global training stage and thus can incrementally incorporate new data.We report results based on two data sets. For the Reuters-21578 text data set, we simulate user ratings under the assumption that each user is interested in only one category. In the second experiment, we use users' opinions on a set of 642 art images that were collected through a web-based survey. For both data sets, collaborative ensemble achieved excellent performance in terms of recommendation accuracy.
Efficiently Inducing Features of Conditional Random Fields
Conditional Random Fields (CRFs) are undirected graphical models, a special case of which correspond to conditionally-trained finite state machines. A key advantage of these models is their great flexibility to include a wide array of overlapping, multi-granularity, non-independent features of the input. In face of this freedom, an important question that remains is, what features should be used? This paper presents a feature induction method for CRFs. Founded on the principle of constructing only those feature conjunctions that significantly increase log-likelihood, the approach is based on that of Della Pietra et al [1997], but altered to work with conditional rather than joint probabilities, and with additional modifications for providing tractability specifically for a sequence model. In comparison with traditional approaches, automated feature induction offers both improved accuracy and more than an order of magnitude reduction in feature count; it enables the use of richer, higher-order Markov models, and offers more freedom to liberally guess about which atomic features may be relevant to a task. The induction method applies to linear-chain CRFs, as well as to more arbitrary CRF structures, also known as Relational Markov Networks [Taskar & Koller, 2002]. We present experimental results on a named entity extraction task.
Sufficient Dimensionality Reduction with Irrelevant Statistics
Globerson, Amir, Chechik, Gal, Tishby, Naftali
The problem of finding a reduced dimensionality representation of categorical variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Specifically, given a co-occurrence matrix of two variables, one often seeks a compact representation of one variable which preserves information about the other variable. We have recently introduced ``Sufficient Dimensionality Reduction' [GT-2003], a method that extracts continuous reduced dimensional features whose measurements (i.e., expectation values) capture maximal mutual information among the variables. However, such measurements often capture information that is irrelevant for a given task. Widely known examples are illumination conditions, which are irrelevant as features for face recognition, writing style which is irrelevant as a feature for content classification, and intonation which is irrelevant as a feature for speech recognition. Such irrelevance cannot be deduced apriori, since it depends on the details of the task, and is thus inherently ill defined in the purely unsupervised case. Separating relevant from irrelevant features can be achieved using additional side data that contains such irrelevant structures. This approach was taken in [CT-2002], extending the information bottleneck method, which uses clustering to compress the data. Here we use this side-information framework to identify features whose measurements are maximally informative for the original data set, but carry as little information as possible on a side data set. In statistical terms this can be understood as extracting statistics which are maximally sufficient for the original dataset, while simultaneously maximally ancillary for the side dataset. We formulate this tradeoff as a constrained optimization problem and characterize its solutions. We then derive a gradient descent algorithm for this problem, which is based on the Generalized Iterative Scaling method for finding maximum entropy distributions. The method is demonstrated on synthetic data, as well as on real face recognition datasets, and is shown to outperform standard methods such as oriented PCA.