 Inductive Learning


Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

arXiv.org Machine Learning

ABSTRACT: In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as "mixout", motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE.
1 INTRODUCTION: Transfer learning has been widely used for the tasks in natural language processing (NLP) (Collobert et al., 2011; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Phang et al., 2018). In particular, Devlin et al. (2018) recently demonstrated the effectiveness of finetuning a large-scale language model pretrained on a large, unannotated corpus on a wide range of NLP tasks including question answering and language inference. They have designed two variants of models, BERT_LARGE (340M parameters) and BERT_BASE (110M parameters). Although BERT_LARGE outperforms BERT_BASE generally, it was observed that finetuning sometimes fails when a target dataset has fewer than 10,000 training instances (Devlin et al., 2018; Phang et al., 2018).
When finetuning a big, pretrained language model, dropout (Srivastava et al., 2014) has been used as a regularization technique to prevent co-adaptation of neurons (Vaswani et al., 2017; Devlin et al., 2018; Yang et al., 2019).
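The statement that mixout "stochastically mixes the parameters of two models" can be illustrated with a minimal sketch. The function name `mixout`, the flat-list parameter layout, and the `rng` argument are hypothetical choices for this illustration. The rescaling step is an assumption made by analogy with inverted dropout, so that the mixed value equals the current weight in expectation; the abstract itself does not spell it out.

```python
import random


def mixout(weights, pretrained, p, rng=None):
    """Mix current weights with pretrained ones, element by element.

    Each parameter is replaced by its pretrained value with probability p.
    The result is then rescaled so that its expectation equals the current
    weight (an assumption modelled on inverted dropout).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    if len(weights) != len(pretrained):
        raise ValueError("weights and pretrained must have the same length")
    rng = rng or random.Random()
    mixed = []
    for w, u in zip(weights, pretrained):
        chosen = u if rng.random() < p else w
        # E[chosen] = p*u + (1-p)*w, so (chosen - p*u) / (1-p) has mean w.
        mixed.append((chosen - p * u) / (1.0 - p))
    return mixed


# Example: mix three current weights toward their pretrained values.
current = [0.9, -0.4, 1.3]
pretrained = [1.0, -0.5, 1.0]
print(mixout(current, pretrained, p=0.5, rng=random.Random(0)))
```

Under this sketch, setting the pretrained values to zero reduces the operation to ordinary inverted dropout on the weights, which is consistent with the abstract's description of mixout as motivated by dropout.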


Learning definable hypotheses on trees

arXiv.org Artificial Intelligence

We study the problem of learning properties of nodes in tree structures. Those properties are specified by logical formulas, such as formulas from first-order or monadic second-order logic. We think of the tree as a database encoding a large dataset and therefore aim for learning algorithms which depend at most sublinearly on the size of the tree. We present a learning algorithm for quantifier-free formulas where the running time only depends polynomially on the number of training examples, but not on the size of the background structure. By a previous result on strings we know that for general first-order or monadic second-order (MSO) formulas a sublinear running time cannot be achieved. However, we show that by building an index on the tree in a linear time preprocessing phase, we can achieve a learning algorithm for MSO formulas with a logarithmic learning phase.


PDE-Inspired Algorithms for Semi-Supervised Learning on Point Clouds

arXiv.org Machine Learning

Given a data set and a subset of labels, the problem of semi-supervised learning on point clouds is to extend the labels to the entire data set. In this paper we extend the labels by minimising the constrained discrete $p$-Dirichlet energy. Under suitable conditions the discrete problem can be connected, in the large data limit, with the minimiser of a weighted continuum $p$-Dirichlet energy with the same constraints. We take advantage of this connection by designing numerical schemes that first estimate the density of the data and then apply PDE methods, such as pseudo-spectral methods, to solve the corresponding Euler-Lagrange equation. We prove that our scheme is consistent in the large data limit for two methods of density estimation: kernel density estimation and spline kernel density estimation.
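As a rough illustration of what "minimising the constrained discrete $p$-Dirichlet energy" means, the sketch below evaluates an energy of the form sum over i, j of w_ij |f_i - f_j|^p on a three-point graph and finds the free label by brute-force grid search. The graph, the function names, and the omission of any normalisation constant are assumptions made for this example; the grid search stands in for a real solver and is not the PDE-based scheme the abstract describes.

```python
def p_dirichlet_energy(weights, values, p):
    """Sum of w[i][j] * |f_i - f_j|**p over all ordered pairs (i, j)."""
    n = len(values)
    return sum(
        weights[i][j] * abs(values[i] - values[j]) ** p
        for i in range(n)
        for j in range(n)
    )


# Hypothetical path graph 0 - 1 - 2 with unit edge weights.
W = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]


def energy_with_middle(t, p=2):
    """Energy when the end labels are fixed to 0 and 1 and the middle is t."""
    return p_dirichlet_energy(W, [0.0, t, 1.0], p)


# The constraints fix the two labelled points; search only the free value.
candidates = [k / 100 for k in range(101)]
best = min(candidates, key=energy_with_middle)
print(best)
```

For p = 2 on this symmetric graph the free label settles midway between the two fixed labels, which matches the intuition that the minimiser interpolates the known labels smoothly.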


Generate More Training Data When You Don't Have Enough

#artificialintelligence

Computers outperform humans in image and object recognition. Big corporations like Google and Microsoft have beaten the human benchmark on image recognition [1, 2]. On average, humans make an error on image recognition tasks about 5% of the time. As of 2015, Microsoft's image recognition software reached an error rate of 4.94%, and at around the same time, Google announced that its software achieved a reduced error rate of 4.8% [3]. This was made possible by training deep convolutional neural networks on millions of training examples from the ImageNet dataset, which contains hundreds of object categories [1].


Positive-Unlabeled Compression on the Cloud

arXiv.org Machine Learning

Many attempts have been made to extend the great success of convolutional neural networks (CNNs) achieved on high-end GPU servers to portable devices such as smart phones. Providing compression and acceleration service of deep learning models on the cloud is therefore of significance and is attractive for end users. However, existing network compression and acceleration approaches usually fine-tune the svelte model by requesting the entire original training data (e.g., ImageNet), which could be more cumbersome than the network itself and cannot be easily uploaded to the cloud. In this paper, we present a novel positive-unlabeled (PU) setting for addressing this problem. In practice, only a small portion of the original training set is required as positive examples and more useful training examples can be obtained from the massive unlabeled data on the cloud through a PU classifier with an attention based multi-scale feature extractor. We further introduce a robust knowledge distillation (RKD) scheme to deal with the class imbalance problem of these newly augmented training examples. The superiority of the proposed method is verified through experiments conducted on the benchmark models and datasets. We can use only $8\%$ of uniformly selected data from ImageNet to obtain an efficient model with comparable performance to the baseline ResNet-34.


Learning Your Way Without Map or Compass: Panoramic Target Driven Visual Navigation

arXiv.org Artificial Intelligence

David Watkins-Valls, Jingxi Xu, Nicholas Waytowich and Peter Allen
Abstract: We present a robot navigation system that uses an imitation learning framework to successfully navigate in complex environments. Our framework takes a pre-built 3D scan of a real environment and trains an agent from pre-generated expert trajectories to navigate to any position given a panoramic view of the goal and the current visual input without relying on map, compass, odometry, GPS or relative position of the target at runtime. Our end-to-end trained agent uses RGB and depth (RGBD) information and can handle large environments (up to 1031 m²) across multiple rooms (up to 40) and generalizes to unseen targets. We show that when compared to several baselines using deep reinforcement learning and RGBD SLAM, our method (1) requires fewer training examples and less training time, (2) reaches the goal location with higher accuracy, (3) produces better solutions with shorter paths for long-range navigation tasks, and (4) generalizes to unseen environments given an RGBD map of the environment.
INTRODUCTION: The ability to navigate efficiently and accurately within an environment is fundamental to intelligent behavior and has been a focus of research in robotics for many years. Traditionally, robotic navigation is solved using model-based methods with an explicit focus on position inference and mapping, such as Simultaneous Localization and Mapping (SLAM) [1]. These models use path planning algorithms, such as Probabilistic Roadmaps (PRM) [2] and Rapidly Exploring Random Trees (RRT) [3], [4] to plan a collision-free path. These methods ignore the rich information from visual input and are highly sensitive to robot odometry and noise in sensor data.


An Automated Engineering Assistant: Learning Parsers for Technical Drawings

arXiv.org Artificial Intelligence

From a set of technical drawings and expert knowledge, we automatically learn a parser to interpret such a drawing. This enables automatic reasoning and learning on top of a large database of technical drawings. In this work, we develop a similarity based search algorithm to help engineers and designers find or complete designs more easily and flexibly. This is part of an ongoing effort to build an automated engineering assistant. The proposed methods make use of both neural methods to learn to interpret images, and symbolic methods to learn to interpret the structure in the technical drawing and incorporate expert knowledge.


Everything Happens for a Reason: Discovering the Purpose of Actions in Procedural Text

arXiv.org Artificial Intelligence

Our goal is to better comprehend procedural text, e.g., a paragraph about photosynthesis, by not only predicting what happens, but why some actions need to happen before others. Our approach builds on a prior process comprehension framework for predicting actions' effects, to also identify subsequent steps that those effects enable. We present our new model (XPAD) that biases effect predictions towards those that (1) explain more of the actions in the paragraph and (2) are more plausible with respect to background knowledge. We also extend an existing benchmark dataset for procedural text comprehension, ProPara, by adding the new task of explaining actions by predicting their dependencies. We find that XPAD significantly outperforms prior systems on this task, while maintaining the performance on the original task in ProPara. The dataset is available at http://data.allenai.org/propara


Stacking Models for Nearly Optimal Link Prediction in Complex Networks

arXiv.org Machine Learning

Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up the collection of network data and improve the validity of network models. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 548 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity via meta-learning to construct a series of "stacked" models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are also superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state-of-the-art for link prediction comes from combining individual algorithms, which achieves nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvement of these results.


A Tsetlin Machine with Multigranular Clauses

arXiv.org Artificial Intelligence

The recently introduced Tsetlin Machine (TM) has provided competitive pattern recognition accuracy in several benchmarks; however, it requires a 3-dimensional hyperparameter search. In this paper, we introduce the Multigranular Tsetlin Machine (MTM). The MTM eliminates the specificity hyperparameter, used by the TM to control the granularity of the conjunctive clauses that it produces for recognizing patterns. Instead of using a fixed global specificity, we encode varying specificity as part of the clauses, rendering the clauses multigranular. This makes it easier to configure the TM because the dimensionality of the hyperparameter search space is reduced to only two dimensions. Indeed, it turns out that there is significantly less hyperparameter tuning involved in applying the MTM to new problems. Further, we demonstrate empirically that the MTM provides similar performance to what is achieved with a finely specificity-optimized TM, by comparing their performance on both synthetic and real-world datasets.