Statistical Learning
Robust Multi-View Spectral Clustering via Low-Rank and Sparse Decomposition
Xia, Rongkai (Sun Yat-sen University) | Pan, Yan (Sun Yat-sen University) | Du, Lei (Sun Yat-sen University) | Yin, Jian (Sun Yat-sen University)
Multi-view clustering, which seeks a partition of the data inmultiple views that often provide complementary information to eachother, has received considerable attention in recent years. In reallife clustering problems, the data in each view may haveconsiderable noise. However, existing clustering methods blindlycombine the information from multi-view data with possiblyconsiderable noise, which often degrades their performance. In thispaper, we propose a novel Markov chain method for RobustMulti-view Spectral Clustering (RMSC). Our method has a flavor oflow-rank and sparse decomposition, where we firstly construct atransition probability matrix from each single view, and then usethese matrices to recover a shared low-rank transition probabilitymatrix as a crucial input to the standard Markov chain methodfor clustering. The optimization problem of RMSC has a low-rankconstraint on the transition probability matrix, and simultaneouslya probabilistic simplex constraint on each of its rows. To solvethis challenging optimization problem, we propose an optimization procedurebased on the Augmented Lagrangian Multiplier scheme. Experimentalresults on various real world datasets show that theproposed method has superior performance over severalstate-of-the-art methods for multi-view clustering.
Learning Relative Similarity by Stochastic Dual Coordinate Ascent
Wu, Pengcheng (Nanyang Technological University) | Ding, Yi (Nanyang Technological University) | Zhao, Peilin (Rutgers University) | Miao, Chunyan (Nanyang Technological University) | Hoi, Steven C. H. (Nanyang Technological University)
Learning relative similarity from pairwise instances is an important problem in machine learning and has a wide range of applications. Despite being studied for years, some existing methods solved by Stochastic Gradient Descent (SGD) techniques generally suffer from slow convergence. In this paper, we investigate the application of Stochastic Dual Coordinate Ascent (SDCA) technique to tackle the optimization task of relative similarity learning by extending from vector to matrix parameters. Theoretically, we prove the optimal linear convergence rate for the proposed SDCA algorithm, beating the well-known sublinear convergence rate by the previous best metric learning algorithms. Empirically, we conduct extensive experiments on both standard and large-scale data sets to validate the effectiveness of the proposed algorithm for retrieval tasks.
Small-Variance Asymptotics for Dirichlet Process Mixtures of SVMs
Wang, Yining (Tsinghua University) | Zhu, Jun (Tsinghua University)
Infinite SVM (iSVM) is a Dirichlet process (DP) mixture of large-margin classifiers. Though flexible in learning nonlinear classifiers and discovering latent clustering structures, iSVM has a difficult inference task and existing methods could hinder its applicability to large-scale problems. This paper presents a small-variance asymptotic analysis to derive a simple and efficient algorithm, which monotonically optimizes a max-margin DP-means (M2DPM) problem, an extension of DP-means for both predictive learning and descriptive clustering. Our analysis is built on Gibbs infinite SVMs, an alternative DP mixture of large-margin machines, which admits a partially collapsed Gibbs sampler without truncation by exploring data augmentation techniques. Experimental results show that M2DPM runs much faster than similar algorithms without sacrificing prediction accuracies.
Cross-Domain Metric Learning Based on Information Theory
Wang, Hao (Chinese Academy of Sciences) | Wang, Wei (Chinese Academy of Sciences) | Zhang, Chen (Chinese Academy of Sciences) | Xu, Fanjiang (Chinese Academy of Sciences)
Supervised metric learning plays a substantial role in statistical classification. Conventional metric learning algorithms have limited utility when the training data and testing data are drawn from related but different domains (i.e., source domain and target domain). Although this issue has got some progress in feature-based transfer learning, most of the work in this area suffers from non-trivial optimization and pays little attention to preserving the discriminating information. In this paper, we propose a novel metric learning algorithm to transfer knowledge from the source domain to the target domain in an information-theoretic setting, where a shared Mahalanobis distance across two domains is learnt by combining three goals together: 1) reducing the distribution difference between different domains; 2) preserving the geometry of target domain data; 3) aligning the geometry of source domain data with its label information. Based on this combination, the learnt Mahalanobis distance effectively transfers the discriminating power and propagates standard classifiers across these two domains. More importantly, our proposed method has closed-form solution and can be efficiently optimized. Experiments in two real-world applications demonstrate the effectiveness of our proposed method.
Reconsidering Mutual Information Based Feature Selection: A Statistical Significance View
Vinh, Nguyen Xuan (The University of Melbourne) | Chan, Jeffrey (The University of Melbourne) | Bailey, James (The University of Melbourne)
Mutual information (MI) based approaches are a popular feature selection paradigm. Although the stated goal of MI-based feature selection is to identify a subset of features that share the highest mutual information with the class variable, most current MI-based techniques are greedy methods that make use of low dimensional MI quantities. The reason for using low dimensional approximation has been mostly attributed to the difficulty associated with estimating the high dimensional MI from limited samples. In this paper, we argue a different viewpoint that, given a very large amount of data, the high dimensional MI objective is still problematic to be employed as a meaningful optimization criterion, due to its overfitting nature: the MI almost always increases as more features are added, thus leading to a trivial solution which includes all features. We propose a novel approach to the MI-based feature selection problem, in which the overfitting phenomenon is controlled rigourously by means of a statistical test. We develop local and global optimization algorithms for this new feature selection model, and demonstrate its effectiveness in the applications of explaining variables and objects.
Locality Preserving Projection for Domain Adaptation with Multi-Objective Learning
Shu, Le (Temple University) | Ma, Tianyang (Temple University) | Latecki, Longin Jan (Temple University)
In many practical cases, we need to generalize a model trained in a source domain to a new target domain.However, the distribution of these two domains may differ very significantly, especially sometimes some crucial target features may not have support in the source domain.This paper proposes a novel locality preserving projection method for domain adaptation task,which can find a linear mapping preserving the 'intrinsic structure' for both source and target domains.We first construct two graphs encoding the neighborhood information for source and target domains separately.We then find linear projection coefficients which have the property of locality preserving for each graph.Instead of combing the two objective terms under compatibility assumption and requiring the user to decide the importance of each objective function,we propose a multi-objective formulation for this problem and solve it simultaneously using Pareto optimization.The Pareto frontier captures all possible good linear projection coefficients that are preferred by one or more objectives.The effectiveness of our approach is justified by both theoretical analysis and empirical results on real world data sets.The new feature representation shows better prediction accuracy as our experiments demonstrate.
Sparse Compositional Metric Learning
Shi, Yuan (University of Southern California) | Bellet, Aurรฉlien (University of Southern California) | Sha, Fei (University of Southern California)
We propose a new approach for metric learning by framing it as learning a sparse combination of locally discriminative metrics that are inexpensive to generate from the training data. This flexible framework allows us to naturally derive formulations for global, multi-task and local metric learning. The resulting algorithms have several advantages over existing methods in the literature: a much smaller number of parameters to be estimated and a principled way to generalize learned metrics to new testing data points. To analyze the approach theoretically, we derive a generalization bound that justifies the sparse combination. Empirically, we evaluate our algorithms on several datasets against state-of-the-art metric learning methods. The results are consistent with our theoretical findings and demonstrate the superiority of our approach in terms of classification performance and scalability.
Anytime Active Learning
Ramirez-Loaiza, Maria Eugenia (Illinois Institute of Technology) | Culotta, Aron (Illinois Institute of Technology) | Bilgic, Mustafa (Illinois Institute of Technology)
A common bottleneck in deploying supervised learning systems is collecting human-annotated examples. In many domains, annotators form an opinion about the label of an example incrementally -- e.g., each additional word read from a document or each additional minute spent inspecting a video helps inform the annotation. In this paper, we investigate whether we can train learning systems more efficiently by requesting an annotation before inspection is fully complete -- e.g., after reading only 25 words of a document. While doing so may reduce the overall annotation time, it also introduces the risk that the annotator might not be able to provide a label if interrupted too early. We propose an anytime active learning approach that optimizes the annotation time and response rate simultaneously. We conduct user studies on two document classification datasets and develop simulated annotators that mimic the users. Our simulated experiments show that anytime active learning outperforms several baselines on these two datasets. For example, with an annotation budget of one hour, training a classifier by annotating the first 25 words of each document reduces classification error by 17% over annotating the first 100 words of each document.
Bagging by Design (on the Suboptimality of Bagging)
Papakonstantinou, Periklis (Tsinghua University) | Xu, Jia (Tsinghua University) | Cao, Zhu (Tsinghua University)
Bagging (Breiman 1996) and its variants is one of the most popular methods in aggregating classifiers and regressors. Originally, its analysis assumed that the bootstraps are built from an unlimited, independent source of samples, therefore we call this form of bagging ideal-bagging. However in the real world, base predictors are trained on data subsampled from a limited number of training samples and thus they behave very differently. We analyze the effect of intersections between bootstraps, obtained by subsampling, to train different base predictors. Most importantly, we provide an alternative subsampling method called design-bagging based on a new construction of combinatorial designs, and prove it universally better than bagging. Methodologically, we succeed at this level of generality because we compare the prediction accuracy of bagging and design-bagging relative to the accuracy ideal-bagging. This finds potential applications in more involved bagging-based methods. Our analytical results are backed up by experiments on classification and regression settings.
Robust Non-Negative Dictionary Learning
Pan, Qihe (Beihang University) | Kong, Deguang (University of Texas Arlington) | Ding, Chris (University of Texas Arlington) | Luo, Bin (Anhui University)
Dictionary learning plays an important role in machine learning, where data vectors are modeled as a sparse linear combinations of basis factors (i.e., dictionary). However, how to conduct dictionary learning in noisy environment has not been well studied. Moreover, in practice, the dictionary (i.e., the lower rank approximation of the data matrix) and the sparse representations are required to be nonnegative, such as applications for image annotation, document summarization, microarray analysis. In this paper, we propose a new formulation for non-negative dictionary learning in noisy environment, where structure sparsity is enforced on sparse representation. The proposed new formulation is also robust for data with noises and outliers, due to a robust loss function used. We derive an efficient multiplicative updating algorithm to solve the optimization problem, where dictionary and sparse representation are updated iteratively. We prove the convergence and correctness of proposed algorithm rigorously.We show the differences of dictionary at different level of sparsity constraint.The proposed algorithm can be adapted for clustering and semi-supervised learning.