Optimization
Towards Safe Semi-Supervised Learning for Multivariate Performance Measures
Li, Yu-Feng (Nanjing University) | Kwok, James T. (Hong Kong University of Science and Technology) | Zhou, Zhi-Hua (Nanjing University, China)
Semi-supervised learning (SSL) is an important research problem in machine learning. While it is usually expected that the use of unlabeled data can improve performance, in many cases SSL is outperformed by supervised learning using only labeled data. To this end, the construction of a performance-safe SSL method has become a key issue of SSL study. To alleviate this problem, we propose in this paper the UMVP (safe semi-sUpervised learning for MultiVariate Performance measure) method, because of the need of various performance measures in practical tasks. The proposed method integrates multiple semi-supervised learners, and maximizes the worst-case performance gain to derive the final prediction. The overall problem is formulated as a maximin optimization. In oder to solve the resultant difficult maximin optimization, this paper shows that when the performance measure is the Top- k Precision, F ฮฒ score or AUC, a minimax convex relaxation of the maximin optimization can be solved efficiently. Experimental results show that the proposed method can effectively improve the safeness of SSL under multiple multivariate performance measures.
Uncorrelated Group LASSO
Kong, Deguang (Samsung Research America) | Liu, Ji (University of Rochester) | Liu, Bo (Philips Research North America) | Bao, Xuan (Google)
l 2,1 -norm is an effective regularization to enforce a simple group sparsity for feature learning. To capture some subtle structures among feature groups, we propose a new regularization called exclusive group l 2,1 -norm. It enforces the sparsity at the intra-group level by using l 2,1 -norm, while encourages the selected features to distribute in different groups by using l 2 norm at the inter-group level. The proposed exclusivegroup l 2,1 -norm is capable of eliminating the feature correlationsin the context of feature selection, if highly correlated features are collected in the same groups. To solve the generic exclusive group l 2,1 -norm regularized problems, we propose an efficient iterative re-weighting algorithm and provide a rigorous convergence analysis. Experiment results on real world datasets demonstrate the effectiveness of the proposed new regularization and algorithm.
Improving Predictive State Representations via Gradient Descent
Jiang, Nan (University of Michigan) | Kulesza, Alex (University of Michigan) | Singh, Satinder (University of Michigan)
Predictive state representations (PSRs) model dynamical systems using appropriately chosen predictions about future observations as a representation of the current state. In contrast to the hidden states posited by HMMs or RNNs, PSR states are directly observable in the training data; this gives rise to a moment-matching spectral algorithm for learning PSRs that is computationally efficient and statistically consistent when the model complexity matches that of the true system generating the data. In practice, however, model mismatch is inevitable and while spectral learning remains appealingly fast and simple it may fail to find optimal models. To address this problem, we investigate the use of gradient methods for improving spectrally-learned PSRs. We show that only a small amount of additional gradient optimization can lead to significant performance gains, and moreover that initializing gradient methods with the spectral learning solution yields better models in significantly less time than starting from scratch.
Optimal Discrete Matrix Completion
Huo, Zhouyuan (University of Texas at Arlington) | Liu, Ji (University of Rochester) | Huang, Heng (University of Texas at Arlington)
In recent years, matrix completion methods have been successfully applied to solve recommender system applications. Most of them focus on the matrix completion problem in real number domain, and produce continuous prediction values. However, these methods are not appropriate in some occasions where the entries of matrix are discrete values, such as movie ratings prediction, social network relation and interaction prediction, because their continuous outputs are not probabilities and uninterpretable. In this case, an additional step to process the continuous results with either heuristic threshold parameters or complicated mapping is necessary, while it is inefficient and may diverge from the optimal solution. There are a few matrix completion methods working on discrete number domain, however, they are not applicable to sparse and large-scale data set. In this paper, we propose a novel optimal discrete matrix completion model, which is able to learn optimal thresholds automatically and also guarantees an exact low-rank structure of the target matrix. We use stochastic gradient descent algorithm with momentum method to optimize the new objective function and speed up optimization. In the experiments, it is proved that our method can predict discrete values with high accuracy, very close to or even better than these values obtained by carefully tuned thresholds on Movielens and YouTube data sets. Meanwhile, our model is able to handle online data and easy to parallelize.
Joint Multi-View Representation Learning and Image Tagging
Xue, Zhe (University of Chinese Academy of Sciences) | Li, Guorong (University of Chinese Academy of Sciences) | Huang, Qingming (University of Chinese Academy of Sciences)
Automatic image annotation is an important problem in several machine learning applications such as image search. Since there exists a semantic gap between low-level image features and high-level semantics, the description ability of image representation can largely affect annotation results. In fact, image representation learning and image tagging are two closely related tasks. A proper image representation can achieve better image annotation results, and image tags can be treated as guidance to learn more effective image representation. In this paper, we present an optimal predictive subspace learning method which jointly conducts multi-view representation learning and image tagging. The two tasks can promote each other and the annotation performance can be further improved. To make the subspace to be more compact and discriminative, both visual structure and semantic information are exploited during learning. Moreover, we introduce powerful predictors (SVM) for image tagging to achieve better annotation performance. Experiments on standard image annotation datasets demonstrate the advantages of our method over the existing image annotation methods.
A Framework for Outlier Description Using Constraint Programming
Kuo, Chia-Tung (University of California, Davis) | Davidson, Ian (University of California, Davis)
Outlier detection has been studied extensively and employed in diverse applications in the past decades. In this paper we formulate a related yet understudied problem which we call outlier description. This problem often arises in practice when we have a small number of data instances that had been identified to be outliers and we wish to explain why they are outliers. We propose a framework based on constraint programming to find an optimal subset of features that most differentiates the outliers and normal instances. We further demonstrate the framework offers great flexibility in incorporating diverse scenarios arising in practice such as multiple explanations and human in the loop extensions. We empirically evaluate our proposed framework on real datasets, including medical imaging and text corpus, and demonstrate how the results are useful and interpretable in these domains.
Column Sampling Based Discrete Supervised Hashing
Kang, Wang-Cheng (Nanjing University) | Li, Wu-Jun (Nanjing University) | Zhou, Zhi-Hua (Nanjing University)
By leveraging semantic (label) information, supervised hashing has demonstrated better accuracy than unsupervised hashing in many real applications. Because the hashing-code learning problem is essentially a discrete optimization problem which is hard to solve, most existing supervised hashing methods try to solve a relaxed continuous optimization problem by dropping the discrete constraints. However, these methods typically suffer from poor performance due to the errors caused by the relaxation. Some other methods try to directly solve the discrete optimization problem. However, they are typically time-consuming and unscalable. In this paper, we propose a novel method, called column sampling based discrete supervised hashing (COSDISH), to directly learn the discrete hashing code from semantic information. COSDISH is an iterative method, in each iteration of which several columns are sampled from the semantic similarity matrix and then the hashing code is decomposed into two parts which can be alternately optimized in a discrete way. Theoretical analysis shows that the learning (optimization) algorithm of COSDISH has a constant-approximation bound in each step of the alternating optimization procedure. Empirical results on datasets with semantic labels illustrate that COSDISH can outperform the state-of-the-art methods in real applications like image retrieval.
Using Decomposition-Parameters for QBF: Mind the Prefix!
Eiben, Eduart (Technische Universitรคt Wien) | Ganian, Robert (Technische Universitรคt Wien) | Ordyniak, Sebastian (Technische Universitรคt Wien)
Similar to the satisfiability (SAT) problem, which can be seen to be the archetypical problem for NP, the quantified Boolean formula problem (QBF) is the archetypical problem for PSPACE. Recently, Atserias and Oliva (2014) showed that, unlike for SAT, many of the well-known decompositional parameters (such as treewidth and pathwidth) do not allow efficient algorithms for QBF. The main reason for this seems to be the lack of awareness of these parameters towards the dependencies between variables of a QBF formula. In this paper we extend the ordinary pathwidth to the QBF-setting by introducing prefix pathwidth, which takes into account the dependencies between variables in a QBF, and show that it leads to an efficient algorithm for QBF. We hope that our approach will help to initiate the study of novel tailor-made decompositional parameters for QBF and thereby help to lift the success of these decompositional parameters from SAT to QBF.
Linearized Alternating Direction Method with Penalization for Nonconvex and Nonsmooth Optimization
Wang, Yiyang (Dalian University of Technology) | Liu, Risheng (Dalian University of Technology) | Song, Xiaoliang (Dalian University of Technology) | Su, Zhixun (Dalian University of Technology and National Engineering Research Center of Digital Life)
Being one of the most effective methods, Alternating Direction Method (ADM) has been extensively studied in numerical analysis for solving linearly constrained convex program. However, there are few studies focusing on the convergence property of ADM under nonconvex framework though it has already achieved well-performance on applying to various nonconvex tasks. In this paper, a linearized algorithm with penalization is proposed on the basis of ADM for solving nonconvex and nonsmooth optimization. We start from analyzing the convergence property for the classical constrained problem with two variables and then establish a similar result for multi-block case. To demonstrate the effectiveness of our proposed algorithm, experiments with synthetic and real-world data have been conducted on specific applications in signal and image processing.
Fast ADMM Algorithm for Distributed Optimization with Adaptive Penalty
Song, Changkyu (Rutgers University) | Yoon, Sejong (Rutgers University) | Pavlovic, Vladimir (Rutgers University)
We propose new methods to speed up convergence of the Alternating Direction Method of Multipliers (ADMM), a common optimization tool in the context of large scale and distributed learning. The proposed method accelerates the speed of convergence by automatically deciding the constraint penalty needed for parameter consensus in each iteration. In addition, we also propose an extension of the method that adaptively determines the maximum number of iterations to update the penalty. We show that this approach effectively leads to an adaptive, dynamic network topology underlying the distributed optimization. The utility of the new penalty update schemes is demonstrated on both synthetic and real data, including an instance of the probabilistic matrix factorization task known as the structure from motion problem.