Inductive Learning
Learning Max-Margin Tree Predictors
Meshi, Ofer, Eban, Elad, Elidan, Gal, Globerson, Amir
Structured prediction is a powerful framework for coping with joint prediction of interacting outputs. A central difficulty in using this framework is that often the correct label dependence structure is unknown. At the same time, we would like to avoid an overly complex structure that will lead to intractable prediction. In this work we address the challenge of learning tree structured predictive models that achieve high accuracy while at the same time facilitate efficient (linear time) inference. We start by proving that this task is in general NP-hard, and then suggest an approximate alternative. Briefly, our CRANK approach relies on a novel Circuit-RANK regularizer that penalizes non-tree structures and that can be optimized using a CCCP procedure. We demonstrate the effectiveness of our approach on several domains and show that, despite the relative simplicity of the structure, prediction accuracy is competitive with a fully connected model that is computationally costly at prediction time.
Reference Distance Estimator
Abstract: A theoretical study is presented for a simple linear classifier called reference distance estimator (RDE), which assigns the weight of each feature j as P(r j)-P(r), where r is a reference feature relevant to the target class y. The analysis shows that if r performs better than random guess in predicting y and is conditionally independent with each feature j, the RDE will have the same classification performance as that from P(y j)-P(y), a classifier trained with the gold standard y. Since the estimation of P(r j)-P(r) does not require labeled data, under the assumption above, RDE trained with a large number of unlabeled examples would be close to that trained with infinite labeled examples. For the case the assumption does not hold, we theoretically analyze the factors that influence the closeness of the RDE to the perfect one under the assumption, and present an algorithm to select reference features and combine multiple RDEs from different reference features using both labeled and unlabeled data. The experimental results on 10 text classification tasks show that the semi-supervised learning method improves supervised methods using 5,000 labeled examples and 13 million unlabeled ones, and in many tasks, its performance is even close to a classifier trained with 13 million labeled examples. In addition, the bounds in the theorems provide good estimation of the classification performance and can be useful for new algorithm design.
Semi-Supervised Learning with Manifold Fitted Graphs
Zhang, Tongtao (Columbia University) | Ji, Rongrong (Xiamen University) | Liu, Wei (IBM Research) | Tao, Dacheng (University of Technology, Sydney) | Hua, Gang (Stevens Institute of Technology)
In this paper, we propose a locality-constrained and sparsity-encouraged manifold fitting approach, aiming at capturing the locally sparse manifold structure into neighborhood graph construction by exploiting a principled optimization model. The proposed model formulates neighborhood graph construction as a sparse coding problem with the locality constraint, therefore achieving simultaneous neighbor selection and edge weight optimization. The core idea underlying our model is to perform a sparse manifold fitting task for each data point so that close-by points lying on the same local manifold are automatically chosen to connect and meanwhile the connection weights are acquired by simple geometric reconstruction. We term the novel neighborhood graph generated by our proposed optimization model M-Fitted Graph since such a graph stems from sparse manifold fitting. To evaluate the robustness and effectiveness ofM -fitted graphs, we leverage graph-based semisupervised learning as the testbed. Extensive experiments carried out on six benchmark datasets validate that the proposed M -fitted graph is superior to state-of-the-art neighborhood graphs in terms of classification accuracy using popular graph-based semi-supervised learning methods.
Multi-Instance Multi-Label Learning with Weak Label
Yang, Shu-Jun (Nanjing University) | Jiang, Yuan (Nanjing University) | Zhou, Zhi-Hua (Nanjing University)
Multi-Instance Multi-Label learning (MIML) deals with data objects that are represented by a bag of instances and associated with a set of class labels simultaneously. Previous studies typically assume that for every training example, all positive labels are tagged whereas the untagged labels are all negative. In many real applications such as image annotation, however, the learning problem often suffers from weak label; that is, users usually tag only a part of positive labels, and the untagged labels are not necessarily negative. In this paper, we propose the MIMLwel approach which works by assuming that highly relevant labels share some common instances, and the underlying class means of bags for each label are with a large margin. Experiments validate the effectiveness of MIMLwel in handling the weak label problem.
Deep Feature Learning Using Target Priors with Applications in ECoG Signal Decoding for BCI
Wang, Zuoguan (Rensselaer Polytechnic Institute) | Lyu, Siwei (University at Albany, SUNY) | Schalk, Gerwin (Wadsworth Center) | Ji, Qiang (Rensselaer Polytechnic Institute)
Recent years have seen a great interest in using deep architectures for feature learning from data. One drawback of the commonly used unsupervised deep feature learning methods is that for supervised or semi-supervised learning tasks, the information in the target variables are not used until the final stage when the classifier or regressor is trained on the learned features. This could lead to over-generalized features that are not competitive on the specific supervised or semi-supervised learning tasks. In this work, we describe a new learning method that combines deep feature learning on mixed labeled and unlabeled data sets. Specifically, we describe a weakly supervised learning method of a prior supervised convolutional stacked auto-encoders (PCSA), of which information in the target variables is represented probabilistically using a Gaussian Bernoulli restricted Boltzmann machine (RBM). We apply this method to the decoding problem of an ECoG based Brain Computer Interface (BCI) system. Our experimental results show that PCSA achieves significant improvement in decoding performance on benchmark data sets compared to the unsupervised feature learning as well as to the current state-of-the-art algorithms that are based on manually crafted features.
Probabilistic Multi-Label Classi๏ฌcation with Sparse Feature Learning
Guo, Yuhong (Temple University) | Xue, Wei (Temple University)
Multi-label classi๏ฌcation is a critical problem in many areas of data analysis such as image labeling and text categorization. In this paper we propose a probabilistic multi-label classi๏ฌcation model based on novel sparse feature learning. By employing an individual sparsity inducing โ1-norm and a group sparsity inducing โ2,1-norm, the proposed model has the capacity of capturing both label interdependencies and common predictive model structures. We formulate this sparse norm regularized learning problem as a non-smooth convex optimization problem, and develop a fast proximal gradient algorithm to solve it for an optimal solution. Our empirical study demonstrates the ef๏ฌcacy of the proposed method on a set of multi-label tasks given a limited number of labeled training instances.
Does generalization performance of $l^q$ regularization learning depend on $q$? A negative example
Lin, Shaobo, Xu, Chen, Zeng, Jingshan, Fang, Jian
$l^q$-regularization has been demonstrated to be an attractive technique in machine learning and statistical modeling. It attempts to improve the generalization (prediction) capability of a machine (model) through appropriately shrinking its coefficients. The shape of a $l^q$ estimator differs in varying choices of the regularization order $q$. In particular, $l^1$ leads to the LASSO estimate, while $l^{2}$ corresponds to the smooth ridge regression. This makes the order $q$ a potential tuning parameter in applications. To facilitate the use of $l^{q}$-regularization, we intend to seek for a modeling strategy where an elaborative selection on $q$ is avoidable. In this spirit, we place our investigation within a general framework of $l^{q}$-regularized kernel learning under a sample dependent hypothesis space (SDHS). For a designated class of kernel functions, we show that all $l^{q}$ estimators for $0< q < \infty$ attain similar generalization error bounds. These estimated bounds are almost optimal in the sense that up to a logarithmic factor, the upper and lower bounds are asymptotically identical. This finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact in terms of the generalization capability. From this perspective, $q$ can be arbitrarily specified, or specified merely by other no generalization criteria like smoothness, computational complexity, sparsity, etc..
Application of three graph Laplacian based semi-supervised learning methods to protein function prediction problem
Protein function prediction is the important problem in modern biology. In this paper, the un-normalized, symmetric normalized, and random walk graph Laplacian based semi-supervised learning methods will be applied to the integrated network combined from multiple networks to predict the functions of all yeast proteins in these multiple networks. These multiple networks are network created from Pfam domain structure, co-participation in a protein complex, protein-protein interaction network, genetic interaction network, and network created from cell cycle gene expression measurements. Multiple networks are combined with fixed weights instead of using convex optimization to determine the combination weights due to high time complexity of convex optimization method. This simple combination method will not affect the accuracy performance measures of the three semi-supervised learning methods. Experiment results show that the un-normalized and symmetric normalized graph Laplacian based methods perform slightly better than random walk graph Laplacian based method for integrated network. Moreover, the accuracy performance measures of these three semi-supervised learning methods for integrated network are much better than the best accuracy performance measures of these three methods for the individual network.
A Reliable Effective Terascale Linear Learning System
Agarwal, Alekh, Chapelle, Olivier, Dudik, Miroslav, Langford, John
We present a system and a set of techniques for learning linear predictors with convex losses on terascale datasets, with trillions of features, {The number of features here refers to the number of non-zero entries in the data matrix.} billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, up to our knowledge, the most scalable and efficient linear learning system reported in the literature (as of 2011 when our experiments were conducted). We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices.
Online Inference-Rule Learning from Natural-Language Extractions
Raghavan, Sindhu (The University of Texas at Austin) | Mooney, Raymond J. (The University of Texas at Austin)
In this paper, we consider the problem of learning commonsenseknowledge in the form of first-order rules from incomplete and noisynatural-language extractions produced by an off-the-shelf informationextraction (IE) system. Much of the information conveyed in text mustbe inferred from what is explicitly stated since easily inferablefacts are rarely mentioned. The proposed rule learner accounts forthis phenomenon by learning rules in which the body of the rulecontains relations that are usually explicitly stated, while the heademploys a less-frequently mentioned relation that is easilyinferred. The rule learner processes training examples in an onlinemanner to allow it to scale to large text corpora. Furthermore, wepropose a novel approach to weighting rules using a curated lexicalontology like WordNet. The learned rules along with their parametersare then used to infer implicit information using a Bayesian LogicProgram. Experimental evaluation on a machine reading testbeddemonstrates the efficacy of the proposed methods.