Supervised Learning
Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
Socher, Richard, Huang, Eric H., Pennington, Jeffrey, Manning, Christopher D., Ng, Andrew Y.
Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus.
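The pooling step described above can be illustrated with a minimal sketch. It assumes the variable-sized matrix holds Euclidean distances between node vectors and that each grid cell keeps the minimum of its region; the abstract does not fix either choice. The function name `dynamic_pool`, the grid size, and the random vectors are hypothetical.

```python
import numpy as np

def dynamic_pool(sim, out_size, reduce=np.min):
    """Pool a variable-sized 2-D matrix into an out_size x out_size grid."""
    sim = np.asarray(sim, dtype=float)
    rows, cols = sim.shape
    # If an axis is shorter than the target, repeat its entries so that
    # every pooling region is non-empty (an assumption of this sketch).
    if rows < out_size:
        sim = np.repeat(sim, -(-out_size // rows), axis=0)
    if cols < out_size:
        sim = np.repeat(sim, -(-out_size // cols), axis=1)
    rows, cols = sim.shape
    # Integer edges that split each axis into out_size contiguous chunks.
    row_edges = (np.arange(out_size + 1) * rows) // out_size
    col_edges = (np.arange(out_size + 1) * cols) // out_size
    pooled = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            block = sim[row_edges[i]:row_edges[i + 1],
                        col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = reduce(block)
    return pooled

# Hypothetical feature vectors for the words and phrases of two sentences.
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 16))
b = rng.normal(size=(8, 16))
dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # shape (5, 8)
pooled = dynamic_pool(dist, out_size=4)
print(pooled.shape)  # (4, 4)
```

Whatever the two sentence lengths, the classifier downstream always receives a grid of the same shape.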
Hodge Theory on Metric Spaces
Bartholdi, Laurent, Schick, Thomas, Smale, Nat, Smale, Steve, Baker, Anthony W.
Hodge theory is a beautiful synthesis of geometry, topology, and analysis, which has been developed in the setting of Riemannian manifolds. On the other hand, spaces of images, which are important in the mathematical foundations of vision and pattern recognition, do not fit this framework. This motivates us to develop a version of Hodge theory on metric spaces with a probability measure. We believe that this constitutes a step towards understanding the geometry of vision. The appendix by Anthony Baker provides a separable, compact metric space with infinite-dimensional \alpha-scale homology.
Effective End-User Interaction with Machine Learning
Amershi, Saleema (University of Washington) | Fogarty, James (University of Washington) | Kapoor, Ashish (Microsoft Research) | Tan, Desney (Microsoft Research)
End-user interactive machine learning is a promising tool for enhancing human productivity and capabilities with large unstructured data sets. Recent work has shown that we can create end-user interactive machine learning systems for specific applications. However, we still lack a generalized understanding of how to design effective end-user interaction with interactive machine learning systems. This work presents three explorations in designing for effective end-user interaction with machine learning in CueFlik, a system developed to support Web image search. These explorations demonstrate that interactions designed to balance the needs of end-users and machine learning algorithms can significantly improve the effectiveness of end-user interactive machine learning.
Learning Instance Specific Distance for Multi-Instance Classification
Wang, Hua (University of Texas at Arlington) | Nie, Feiping (University of Texas at Arlington) | Huang, Heng (University of Texas at Arlington)
Multi-Instance Learning (MIL) deals with problems where each training example is a bag, and each bag contains a set of instances. Multi-instance representation is useful in many real world applications, because it is able to capture more structural information than traditional flat single-instance representation. However, it also brings new challenges. Specifically, the distance between data objects in MIL is a set-to-set distance, which is harder to estimate than vector distances used in single-instance data. Moreover, because in MIL labels are assigned to bags instead of instances, although a bag belongs to a class, some, or even most, of its instances may not be truly related to the class. In order to address these difficulties, in this paper we propose a novel Instance Specific Distance (ISD) method for MIL, which computes the Class-to-Bag (C2B) distance by further considering the relevances of training instances with respect to their labeled classes. Taking into account the outliers caused by the weak label association in MIL, we learn ISD by solving an l0+-norm minimization problem. An efficient algorithm to solve the optimization problem is presented, together with the rigorous proof of its convergence. The promising results on five benchmark multi-instance data sets and two real world multi-instance applications validate the effectiveness of the proposed method.
Partially Supervised Text Classification with Multi-Level Examples
Liu, Tao (Renmin University of China) | Du, Xiaoyong (Renmin University of China) | Xu, Yongdong (Harbin Institute of Technology) | Li, Minghui (Microsoft) | Wang, Xiaolong (Harbin Institute of Technology)
Partially supervised text classification has received great research attention since it only uses positive and unlabeled examples as training data. This problem can be solved by automatically labeling some negative (and more positive) examples from unlabeled examples before training a text classifier. But it is difficult to guarantee both high quality and quantity of the new labeled examples. In this paper, a multi-level example based learning method for partially supervised text classification is proposed, which can make full use of all unlabeled examples. A heuristic method is proposed to assign possible labels to unlabeled examples and partition them into multiple levels according to their labeling confidence. A text classifier is trained on these multi-level examples using weighted support vector machines. Experiments show that the multi-level example based learning method is effective for partially supervised text classification, and outperforms the existing popular methods such as Biased-SVM, ROC-SVM, S-EM and WL.
Wrapper Maintenance: A Machine Learning Approach
Knoblock, C. A., Lerman, K., Minton, S. N.
The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.
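As a quick check on how the reported figures are defined, the sketch below computes recall from the counts given above (35 of 37 changes detected). The abstract does not break the 16 mistakes down into false alarms and misses, so the false-positive count is left as a parameter of the hypothetical `precision` helper rather than filled in.

```python
def precision(true_pos, false_pos):
    """Fraction of raised alarms that were real wrapper changes."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Fraction of real wrapper changes that were detected."""
    return true_pos / (true_pos + false_neg)

detected, total_changes = 35, 37
print(round(recall(detected, total_changes - detected), 2))  # 0.95
```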
ProDiGe: PRioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples
Mordelet, Fantine, Vert, Jean-Philippe
Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. Here we propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.
Negative Example Aided Transcription Factor Binding Site Search
Computational approaches to transcription factor binding site identification have been actively researched for the past decade. Negative examples have long been utilized in de novo motif discovery and have been shown useful in transcription factor binding site search as well. However, understanding of the roles of negative examples in binding site search is still very limited. We propose the 2-centroid and optimal discriminating vector methods, taking into account negative examples. Cross-validation results on E. coli transcription factors show that the proposed methods benefit from negative examples, outperforming the centroid and position-specific scoring matrix methods. We further show that our proposed methods perform better than a state-of-the-art method. We characterize the proposed methods in the context of the other compared methods and show that, coupled with motif subtype identification, the proposed methods can be effectively applied to a wide range of transcription factors. Finally, we argue that the proposed methods are well-suited for eukaryotic transcription factors as well. Software tools are available at: http://biogrid.engr.uconn.edu/tfbs_search/.
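For orientation, here is a minimal sketch of the position-specific scoring matrix baseline that the abstract compares against. It is not the proposed 2-centroid or optimal discriminating vector method. It assumes aligned, equal-length known sites, a uniform background, and add-one pseudocounts; the example sites and sequence are made up.

```python
import math

ALPHABET = "ACGT"

def build_pssm(sites, background=None, pseudocount=1.0):
    """Log-odds scoring matrix from aligned binding sites of equal length."""
    background = background or {base: 0.25 for base in ALPHABET}
    pssm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({base: math.log2(counts[base] / total / background[base])
                     for base in ALPHABET})
    return pssm

def score(pssm, window):
    """Sum of per-position log-odds scores for one candidate window."""
    return sum(column[base] for column, base in zip(pssm, window))

def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max((score(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

sites = ["TTGACA", "TTGACA", "TTTACA", "TTGATA"]  # hypothetical known sites
pssm = build_pssm(sites)
hit_score, position = best_hit(pssm, "GGCCTTGACAGGCC")
print(position)  # 4
```

This baseline uses positive examples only; the methods proposed above additionally draw on negative examples.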
Transfer Learning Framework for Early Detection of Fatigue Using Non-invasive Surface Electromyogram Signals (SEMG)
Chattopadhyay, Rita (Arizona State University) | Ye, Jieping (Arizona State University) | Panchanathan, Sethuraman (Professor and Deputy Vice President of Research and Economic Affairs, School of Computing, Informatics, and Decision Systems Engineering, Computer Science and Engineering Faculty)
Surface Electromyogram (SEMG) signals are physiological signals processed to assess the intensity of activity and the fatigue state of the muscles, non-invasively (Kumar, Pah, and Bradley 2003; Georgakis, Stergioulas, and Giakas 2003; Koumantakis et al. 2001; Gerdle, Larsson, and Karlsson 2000). However, researchers observed significant differences between the data collected from different subjects, even though they performed the same activity under similar experimental conditions (Contessa, Adam, and Luca 2009; Gerdle, Larsson, and Karlsson 2000). Because of its highly subject-specific nature, SEMG-based fatigue assessment requires subject-specific calibration and is hence confined to clinical environments related to training and rehabilitation. The fundamental assumption being, any hypothesis found to approximate well over a sufficiently large set of training examples will also approximate well over other unobserved examples (Mitchell 1997), belonging to the same distribution as the training data. But if this basic assumption is violated, as in the case of SEMG data over multiple subjects, direct application of traditional data mining and machine learning methods would not work. Figure 1 shows a typical distribution of SEMG data for three different subjects, collected over a fatiguing exercise at varying speed, representing the four physiological phases corresponding to four classes: (1) low intensity of activity and low fatigue, (2) high intensity of activity and moderate fatigue, (3) low intensity of activity and moderate fatigue, and (4) high intensity of activity and high fatigue.
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
Ross, Stephane, Gordon, Geoffrey J., Bagnell, J. Andrew
Sequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches provide stronger guarantees in this setting, but remain somewhat unsatisfactory as they train either non-stationary or stochastic policies and require a large number of iterations. In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no regret algorithm in an online learning setting. We show that any such no regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
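The iterative idea can be illustrated with a toy sketch. It assumes a dataset-aggregation loop: states visited by the current policy are labeled by an expert, pooled with all earlier data, and used to retrain. The abstract does not spell out these details. The environment (a walk on the integers), the expert, and the tabular learner are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def expert(state):
    """Hypothetical expert: step toward the origin (-1, 0, or +1)."""
    return (state < 0) - (state > 0)

def rollout(policy, start, horizon=10):
    """States visited when `policy` controls a walk on the integers."""
    states, state = [], start
    for _ in range(horizon):
        states.append(state)
        state += policy(state)
    return states

def fit(dataset):
    """Tabular classifier: most frequent expert label seen for each state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, 1)  # unseen states default to stepping right

def dagger(n_iters=5, rollouts_per_iter=20, seed=0):
    rng = random.Random(seed)
    dataset, policy = [], expert  # the first iteration rolls out the expert
    for _ in range(n_iters):
        for _ in range(rollouts_per_iter):
            start = rng.randint(-5, 5)
            # Visit states under the CURRENT policy, label them with the expert.
            dataset += [(s, expert(s)) for s in rollout(policy, start)]
        policy = fit(dataset)  # retrain on everything gathered so far
    return policy

learned = dagger()
print([learned(s) for s in (-3, 0, 3)])
```

Because the training states come from the learner's own rollouts, mistakes that carry it into unfamiliar states are corrected in later iterations, and the final policy is a single deterministic mapping from states to actions.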