Statistical Learning
Pairwise Exemplar Clustering
Yang, Yingzhen (University of Illinois at Urbana-Champaign) | Chu, Xinqi (University of Illinois at Urbana-Champaign) | Liang, Feng (University of Illinois at Urbana-Champaign) | Huang, Thomas S. (University of Illinois at Urbana-Champaign)
Exemplar-based clustering methods have been extensively shown to be effective in many clustering problems. They adaptively determine the number of clusters and hold the appealing advantage of not requiring the estimation of latent parameters, which is otherwise difficult in case of complicated parametric model and high dimensionality of the data. However, modeling arbitrary underlying distribution of the data is still difficult for existing exemplar-based clustering methods. We present Pairwise Exemplar Clustering (PEC) to alleviate this problem by modeling the underlying cluster distributions more accurately with non-parametric kernel density estimation. Interpreting the clusters as classes from a supervised learning perspective, we search for an optimal partition of the data that balances two quantities: 1 the misclassification rate of the data partition for separating the clusters; 2 the sum of within-cluster dissimilarities for controlling the cluster size. The broadly used kernel form of cut turns out to be a special case of our formulation. Moreover, we optimize the corresponding objective function by a new efficient algorithm for message computation in a pairwise MRF. Experimental results on synthetic and real data demonstrate the effectiveness of our method.
Semi-Supervised Kernel Matching for Domain Adaptation
Xiao, Min (Temple University) | Guo, Yuhong (Temple University)
In this paper, we propose a semi-supervised kernel matching method to address domain adaptation problems where the source distribution substantially differs from the target distribution. Specifically, we learn a prediction function on the labeled source data while mapping the target data points to similar source data points by matching the target kernel matrix to a submatrix of the source kernel matrix based on a Hilbert Schmidt Independence Criterion. We formulate this simultaneous learning and mapping process as a non-convex integer optimization problem and present a local minimization procedure for its relaxed continuous form. Our empirical results show the proposed kernel matching method significantly outperforms alternative methods on the task of across domain sentiment classification.
A Bregman Divergence Optimization Framework for Ranking on Data Manifold and Its New Extensions
Xu, Bin (Zhejiang University) | Bu, Jiajun (Zhejiang University) | Chen, Chun (Zhejiang University) | Cai, Deng (Zhejiang University)
Recently, graph-based ranking algorithms have received considerable interests in machine learning, computer vision and information retrieval communities. Ranking on data manifold (or manifold ranking, MR) is one of the representative approaches. One of the limitations of manifold ranking is its high computational complexity (O( n 3 ), where n is the number of samples in database). In this paper, we cast the manifold ranking into a Bregman divergence optimization framework under which we transform the original MR to an equivalent optimal kernel matrix learning problem.With this new formulation, two effective and efficient extensions are proposed to enhance the ranking performance. Extensive experimental results on two real world image databases show the effectiveness of the proposed approach.
Discriminative Clustering via Generative Feature Mapping
Wang, Liwei (The Chinese University of Hong Kong) | Li, Xiong (Shanghai Jiao Tong University) | Tu, Zhuowen (Microsoft Research Asia and UCLA) | Jia, Jiaya (The Chinese University of Hong Kong)
Existing clustering methods can be roughly classified into two categories: generative and discriminative approaches. Generative clustering aims to explain the data and thus is adaptive to the underlying data distribution; discriminative clustering, on the other hand, emphasizes on finding partition boundaries. In this paper, we take the advantages of both models by coupling the two paradigms through feature mapping derived from linearizing Bayesian classifiers. Such the feature mapping strategy maps nonlinear boundaries of generative clustering to linear ones in the feature space where we explicitly impose the maximum entropy principle. We also propose the unified probabilistic framework, enabling solvers using standard techniques. Experiments on a variety of datasets bear out the notable benefit of our method in terms of adaptiveness and robustness.
Manifold Warping: Manifold Alignment over Time
Vu, Hoa Trong (University of Massachusetts, Amherst) | Carey, Clifton (University of Massachusetts, Amherst) | Mahadevan, Sridhar (University of Massachusetts, Amherst)
Knowledge transfer is computationally challenging, due in part to the curse of dimensionality, compounded by source and target domains expressed using different features (e.g., documents written in different languages). Recent work on manifold learning has shown that data collected in real-world settings often have high-dimensional representations, but lie on low-dimensional manifolds. Furthermore, data sets collected from similar generating processes often present different high-dimensional views, even though their underlying manifolds are similar. The ability to align these data sets and extract this common structure is critical for many transfer learning tasks. In this paper, we present a novel framework for aligning two sequentially-ordered data sets, taking advantage of a shared low-dimensional manifold representation. Our approach combines traditional manifold alignment and dynamic time warping algorithms using alternating projections. We also show that the previously-proposed canonical time warping algorithm is a special case of our approach. We provide a theoretical formulation as well as experimental results on synthetic and real-world data, comparing manifold warping to other alignment methods.
Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching
Treeratpituk, Pucktada (Pennsylvania State University) | Giles, C. Lee (Pennsylvania State University)
Personal names are important and common information in many data sources, ranging from social networks and news articles to patient records and scientific documents.They are often used as queries for retrieving records and also as key information for linking documents from multiple sources. Matching personal names can be challenging due to variations in spelling and various formatting of names. While many approximated name matching techniques have been proposed, most are generic string-matching algorithms. Unlike other types of proper names, personal names are highly cultural. Many ethnicities have their own unique naming systems and identifiable characteristics. In this paper we explore such relationships between ethnicities and personal names to improve the name matching performance. First, we propose a name-ethnicity classifier based on the multinomial logistic regression. Our model can effectively identify name-ethnicity from personal names in Wikipedia, which we use to define name-ethnicity, to within 85\% accuracy.Next, we propose a novel alignment-based name matching algorithm, based on Smith–Waterman algorithm and logistic regression.Different name matching models are then trained for different name-ethnicity groups.Our preliminary experimental result on DBLP's disambiguated author dataset yields a performance of 99\% precision and 89\% recall.Surprisingly, textual features carry more weight than phonetic ones in name-ethnicity classification.
Hierarchical Double Dirichlet Process Mixture of Gaussian Processes
Tayal, Aditya (University of Waterloo) | Poupart, Pascal (University of Waterloo) | Li, Yuying (University of Waterloo)
We consider an infinite mixture model of Gaussian processes that share mixture components between non-local clusters in data. Meeds and Osindero (2006) use a single Dirichlet process prior to specify a mixture of Gaussian processes using an infinite number of experts. In this paper, we extend this approach to allow for experts to be shared non-locally across the input domain. This is accomplished with a hierarchical double Dirichlet process prior, which builds upon a standard hierarchical Dirichlet process by incorporating local parameters that are unique to each cluster while sharing mixture components between them. We evaluate the model on simulated and real data, showing that sharing Gaussian process components non-locally can yield effective and useful models for richly clustered non-stationary, non-linear data.
Investigating the Effectiveness of Laplacian-Based Kernels in Hub Reduction
Suzuki, Ikumi (Nara Institute of Science and Technology) | Hara, Kazuo (National Institute of Genetics) | Shimbo, Masashi (Nara Institute of Science and Technology) | Matsumoto, Yuji (Nara Institute of Science and Technology) | Saerens, Marco (Universite Catholique de Louvain)
A “hub” is an object closely surrounded by, or very similar to, many other objects in the dataset. Recent studies by Radovanovi´c et al. indicate that in high dimensional spaces, hubs almost always emerge, and objects close to the data centroid tend to become hubs. In this paper, we show that the family of kernels based on the graph Laplacian makes all objects in the dataset equally similar to the centroid, and thus they are expected to make less hubs when used as a similarity measure. We investigate this hypothesis using both synthetic and real-world data. It turns out that these kernels suppress hubs in some cases but not always, and the results seem to be affected by the size of the data—a factor not discussed previously. However, for the datasets in which hubs are indeed reduced by the Laplacian-based kernels, these kernels work well in ranking and classification tasks. This result suggests that the amount of hubs, which can be readily computed in an unsupervised fashion, can be a yardstick of whether Laplacian-based kernels work effectively for a given data.
Rule Ensemble Learning Using Hierarchical Kernels in Structured Output Spaces
Nair, Naveen (IITB-Monash Research Academy, IIT Bombay, Monash University.) | Saha, Amrita (IIT Bombay) | Ramakrishnan, Ganesh (IIT Bombay) | Krishnaswamy, Shonali (Institute for Infocomm Research (I2R), Singapore)
The goal in Rule Ensemble Learning (REL) is simultaneous discovery of a small set of simple rules and their optimal weights that lead to good generalization. Rules are assumed to be conjunctions of basic propositions concerning the values taken by the input features. It has been shown that rule ensembles for classification can be learnt optimally and efficiently using hierarchical kernel learning approaches that explore the exponentially large space of conjunctions by exploiting its hierarchical structure. The regularizer employed penalizes large features and thereby selects a small set of short features. In this paper, we generalize the rule ensemble learning using hierarchical kernels (RELHKL) framework to multi class structured output spaces. We build on the StructSVM model for sequence prediction problems and employ a ρ-norm hierarchical regularizer for observation features and a conventional 2-norm regularizer for state transition features. The exponentially large feature space is searched using an active set algorithm and the exponentially large set of constraints are handled using a cutting plane algorithm. The approach can be easily extended to other structured output problems. We perform experiments on activity recognition datasets which are prone to noise, sparseness and skewness. We demonstrate that our approach outperforms other approaches.
Compressed Least-Squares Regression on Sparse Spaces
Fard, Mahdi Milani (McGill University) | Grinberg, Yuri (McGill University) | Pineau, Joelle (McGill University) | Precup, Doina (McGill University)
Recent advances in the area of compressed sensing suggest that it is possible to reconstruct high-dimensional sparse signals from a small number of random projections. Domains in which the sparsity assumption is applicable also offer many interesting large-scale machine learning prediction tasks. It is therefore important to study the effect of random projections as a dimensionality reduction method under such sparsity assumptions. In this paper we develop the bias-variance analysis of a least-squares regression estimator in compressed spaces when random projections are applied on sparse input signals. Leveraging the sparsity assumption, we are able to work with arbitrary non i.i.d. sampling strategies and derive a worst-case bound on the entire space. Empirical results on synthetic and real-world datasets shows how the choice of the projection size affects the performance of regression on compressed spaces, and highlights a range of problems where the method is useful.