Perceptrons
Predtron: A Family of Online Algorithms for General Prediction Problems
Jain, Prateek, Natarajan, Nagarajan, Tewari, Ambuj
Modern prediction problems arising in multilabel learning and learning to rank pose unique challenges to the classical theory of supervised learning. These problems have large prediction and label spaces of a combinatorial nature and involve sophisticated loss functions. We offer a general framework to derive mistake driven online algorithms and associated loss bounds. The key ingredients in our framework are a general loss function, a general vector space representation of predictions, and a notion of margin with respect to a general norm. Our general algorithm, Predtron, yields the perceptron algorithm and its variants when instantiated on classic problems such as binary classification, multiclass classification, ordinal regression, and multilabel classification. For multilabel ranking and subset ranking, we derive novel algorithms, notions of margins, and loss bounds. A simulation study confirms the behavior predicted by our bounds and demonstrates the flexibility of the design choices in our framework.
A Hybrid Neural Model for Type Classification of Entity Mentions
Dong, Li (Beihang University) | Wei, Furu (Microsoft Research) | Sun, Hong (Microsoft Corporation) | Zhou, Ming (Microsoft Research) | Xu, Ke (Beihang University)
The semantic class (i.e., type) of an entity plays a vital role in many natural language processing tasks, such as question answering. However, most of existing type classification systems extensively rely on hand-crafted features. This paper introduces a hybrid neural model which classifies entity mentions to a wide-coverage set of 22 types derived from DBpedia. It consists of two parts. The mention model uses recurrent neural networks to recursively obtain the vector representation of an entity mention from the words it contains. The context model, on the other hand, employs multilayer perceptrons to obtain the hidden representation for contextual information of a mention. Representations obtained by the two parts are used together to predict the type distribution. Using automatically generated data, these two parts are jointly learned. Experimental studies illustrate that the proposed approach outperforms baseline methods. Moreover, when type information provided by our method is used in a question answering system, we observe a 14.7% relative improvement for the top-1 accuracy of answers.
Multitask Coactive Learning
Goetschalckx, Robby (Oregon State University) | Fern, Alan (Oregon State University) | Tadepalli, Prasad (Oregon State University)
In this paper we investigate the use of coactive learning in a multitask setting. In coactive learning, an expert presents the learner with a problem and the learner returns a candidate solution. The expert then improves on the solution if necessary and presents the improved solution to the learner. The goal for the learner is to learn to produce solutions which cannot be further improved by the expert while minimizing the average expert effort. In this paper, we consider the setting where there are multiple experts (tasks), and in each iteration one expert presents a problem to the learner. While the experts are expected to have different solution preferences, they are also assumed to share similarities, which should enable generalization across experts. We analyze several algorithms for this setting and derive bounds on the average expert effort during learning. Our main contribution is the balanced Perceptron algorithm, which is the first coactive learning algorithm that is both able to generalize across experts when possible, while also guaranteeing convergence to optimal solutions for individual experts. Our experiments in three domains confirm that this algorithm is effective in the multitask setting, compared to natural baselines.
Surrogate Functions for Maximizing Precision at the Top
Kar, Purushottam, Narasimhan, Harikrishna, Jain, Prateek
The problem of maximizing precision at the top of a ranked list, often dubbed Precision@k (prec@k), finds relevance in myriad learning applications such as ranking, multi-label classification, and learning with severe label imbalance. However, despite its popularity, there exist significant gaps in our understanding of this problem and its associated performance measure. The most notable of these is the lack of a convex upper bounding surrogate for prec@k. We also lack scalable perceptron and stochastic gradient descent algorithms for optimizing this performance measure. In this paper we make key contributions in these directions. At the heart of our results is a family of truly upper bounding surrogates for prec@k. These surrogates are motivated in a principled manner and enjoy attractive properties such as consistency to prec@k under various natural margin/noise conditions. These surrogates are then used to design a class of novel perceptron algorithms for optimizing prec@k with provable mistake bounds. We also devise scalable stochastic gradient descent style methods for this problem with provable convergence bounds. Our proofs rely on novel uniform convergence bounds which require an in-depth analysis of the structural properties of prec@k and its surrogates. We conclude with experimental results comparing our algorithms with state-of-the-art cutting plane and stochastic gradient algorithms for maximizing prec@k.
Techniques for Learning Binary Stochastic Feedforward Neural Networks
Raiko, Tapani, Berglund, Mathias, Alain, Guillaume, Dinh, Laurent
Stochastic binary hidden units in a multi-layer perceptron (MLP) network give at least three potential benefits when compared to deterministic MLP networks. (1) They allow to learn one-to-many type of mappings. (2) They can be used in structured prediction problems, where modeling the internal structure of the output is important. (3) Stochasticity has been shown to be an excellent regularizer, which makes generalization performance potentially better in general. However, training stochastic networks is considerably more difficult. We study training using M samples of hidden activations per input. We show that the case M=1 leads to a fundamentally different behavior where the network tries to avoid stochasticity. We propose two new estimators for the training gradient and propose benchmark tests for comparing training algorithms. Our experiments confirm that training stochastic networks is difficult and show that the proposed two estimators perform favorably among all the five known estimators.
Joint Anaphoricity Detection and Coreference Resolution with Constrained Latent Structures
Lassalle, Emmanuel (Universitรฉ Paris-Diderot) | Denis, Pascal (INRIA)
This paper introduces a new structured model for learning anaphoricity detection and coreference resolution in a joint fashion. Specifically,we use a latent tree to represent the full coreference and anaphoric structure of a document at a global level, and we jointly learn the parameters of the two models using a version of the structured perceptron algorithm. Our joint structured model is further refined by the use of pairwise constraints which help the model to capture accurately certain patterns of coreference. Our experiments on the CoNLL-2012 English datasets show large improvements in both coreference resolution and anaphoricity detection, compared to various competing architectures. Our best coreference system obtains a CoNLL score of 81.97 on gold mentions, which is to date the best score reported on this setting.
Distributed Decision Trees
ฤฐrsoy, Ozan, Alpaydฤฑn, Ethem
Recently proposed budding tree is a decision tree algorithm in which every node is part internal node and part leaf. This allows representing every decision tree in a continuous parameter space, and therefore a budding tree can be jointly trained with backpropagation, like a neural network. Even though this continuity allows it to be used in hierarchical representation learning, the learned representations are local: Activation makes a soft selection among all root-to-leaf paths in a tree. In this work we extend the budding tree and propose the distributed tree where the children use different and independent splits and hence multiple paths in a tree can be traversed at the same time. This ability to combine multiple paths gives the power of a distributed representation, as in a traditional perceptron layer. We show that distributed trees perform comparably or better than budding and traditional hard trees on classification and regression tasks.
Reducing the Effects of Detrimental Instances
Smith, Michael R., Martinez, Tony
Not all instances in a data set are equally beneficial for inducing a model of the data. Some instances (such as outliers or noise) can be detrimental. However, at least initially, the instances in a data set are generally considered equally in machine learning algorithms. Many current approaches for handling noisy and detrimental instances make a binary decision about whether an instance is detrimental or not. In this paper, we 1) extend this paradigm by weighting the instances on a continuous scale and 2) present a methodology for measuring how detrimental an instance may be for inducing a model of the data. We call our method of identifying and weighting detrimental instances reduced detrimental instance learning (RDIL). We examine RIDL on a set of 54 data sets and 5 learning algorithms and compare RIDL with other weighting and filtering approaches. RDIL is especially useful for learning algorithms where every instance can affect the classification boundary and the training instances are considered individually, such as multilayer perceptrons trained with backpropagation (MLPs). Our results also suggest that a more accurate estimate of which instances are detrimental can have a significant positive impact for handling them.
Transformed Representations for Convolutional Neural Networks in Diabetic Retinopathy Screening
Lim, Gilbert (National University of Singapore) | Lee, Mong Li (National University of Singapore) | Hsu, Wynne (National University of Singapore) | Wong, Tien Yin (Singapore National Eye Centre)
Convolutional neural networks (CNNs) are flexible, biologically-inspired variants of multi-layer perceptrons that have proven themselves to be exceptionally suited to discriminative vision tasks. However, relatively little is known on whether they can be made both more efficient and more accurate, by introducing suitable transformations that exploit general knowledge of the target classes. We demonstrate this functionality through pre-segmentation of input images with a fast and robust but loose segmentation step, to obtain a set of candidate objects. These objects then undergo a spatial transformation into a reduced space, retaining but a compact high-level representation of their appearance. Additional attributes may be abstracted as raw features that are incorporated after the convolutional phase of the network. Finally, we compare its performance against existing approaches on the challenging problem of detecting lesions in retinal images.
Data Assimilation by Artificial Neural Networks for an Atmospheric General Circulation Model: Conventional Observation
Cintra, Rosangela S., Velho, Haroldo F. de Campos
This paper presents an approach for employing artificial neural networks (NN) to emulate an ensemble Kalman filter (EnKF) as a method of data assimilation. The assimilation methods are tested in the Simplified Parameterizations PrimitivE-Equation Dynamics (SPEEDY) model, an atmospheric general circulation model (AGCM), using synthetic observational data simulating localization of balloon soundings. For the data assimilation scheme, the supervised NN, the multilayer perceptrons (MLP-NN), is applied. The MLP-NN are able to emulate the analysis from the local ensemble transform Kalman filter (LETKF). After the training process, the method using the MLP-NN is seen as a function of data assimilation. The NN were trained with data from first three months of 1982, 1983, and 1984. A hind-casting experiment for the 1985 data assimilation cycle using MLP-NN were performed with synthetic observations for January 1985. The numerical results demonstrate the effectiveness of the NN technique for atmospheric data assimilation. The results of the NN analyses are very close to the results from the LETKF analyses, the differences of the monthly average of absolute temperature analyses is of order 0.02. The simulations show that the major advantage of using the MLP-NN is better computational performance, since the analyses have similar quality. The CPU-time cycle assimilation with MLP-NN is 90 times faster than cycle assimilation with LETKF for the numerical experiment.