Grammars & Parsing
Grammar as a Foreign Language
Vinyals, Oriol, Kaiser, Lukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, Hinton, Geoffrey
Syntactic constituency parsing is a fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain-specific, complex, and inefficient. In this paper we show that the domain-agnostic, attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset when trained on a large synthetic corpus that was annotated using existing parsers. It also matches the performance of standard parsers when trained only on a small human-annotated dataset, which shows that this model is highly data-efficient, in contrast to sequence-to-sequence models without the attention mechanism. Our parser is also fast, processing over a hundred sentences per second with an unoptimized CPU implementation.
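For a sequence-to-sequence model to emit a parse, the tree has to be written out as a flat token sequence. The abstract does not say which encoding is used, so the sketch below assumes one plausible choice for illustration only: a depth-first bracket linearization. The tuple-based tree format and the names `linearize` and `delinearize` are hypothetical.

```python
# Hypothetical illustration: depth-first linearization of a constituency tree.
# A tree is a (label, children) tuple; a leaf is a plain string (a word).
# This encoding is an assumption, not a format taken from the abstract.

def linearize(tree):
    """Return the tree as a flat list of bracket tokens and words."""
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens


def delinearize(tokens):
    """Rebuild the tree from a token list produced by linearize."""
    stack = [("ROOT", [])]  # artificial root that collects the finished tree
    for tok in tokens:
        if tok.startswith("(") and len(tok) > 1:
            stack.append((tok[1:], []))          # open a new constituent
        elif tok.startswith(")") and len(tok) > 1:
            label, children = stack.pop()        # close the current constituent
            stack[-1][1].append((label, children))
        else:
            stack[-1][1].append(tok)             # a word attaches to the open constituent
    return stack[0][1][0]


example = ("S", [("NP", ["John"]), ("VP", [("V", ["sleeps"])])])
tokens = linearize(example)
print(" ".join(tokens))
```

Because `delinearize` inverts `linearize`, a predicted token sequence can be mapped back to a tree for evaluation, provided its brackets are balanced.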
Learning to Search Better Than Your Teacher
Chang, Kai-Wei, Krishnamurthy, Akshay, Agarwal, Alekh, Daumé, Hal III, Langford, John
Methods for learning to search for structured prediction typically imitate a reference policy, with existing theoretical guarantees demonstrating low regret compared to that reference. This is unsatisfactory in many applications where the reference policy is suboptimal and the goal of learning is to improve upon it. Can learning to search work even when the reference is poor? We provide a new learning to search algorithm, LOLS, which does well relative to the reference policy, but additionally guarantees low regret compared to deviations from the learned policy: a local-optimality guarantee. Consequently, LOLS can improve upon the reference policy, unlike previous algorithms. This enables us to develop structured contextual bandits, a partial information structured prediction setting with many potential applications.
Towards a Visual Turing Challenge
Malinowski, Mateusz, Fritz, Mario
As language and visual understanding by machines progresses rapidly, we are observing an increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process. This trend has allowed the community to progress towards more challenging and open tasks and refueled the hope of achieving the old AI dream of building machines that could pass a Turing test in open domains. In order to steadily make progress towards this goal, we realize that quantifying performance becomes increasingly difficult. Therefore we ask how we can precisely define such challenges and how we can evaluate different algorithms on these open tasks. In this paper, we summarize and discuss such challenges as well as try to give answers where appropriate options are available in the literature. We exemplify some of the solutions on a recently presented dataset for a question-answering task based on real-world indoor images that establishes a visual Turing challenge. Finally, we argue that despite the success of unique ground-truth annotation, we likely have to step away from carefully curated datasets and rather rely on 'social consensus' as the main driving force to create suitable benchmarks. Providing coverage in this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area.
A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video
Yu, Haonan, Siddharth, N., Barbu, Andrei, Siskind, Jeffrey Mark
We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models which represent the meanings of entire sentences composed out of the meanings of the words in those sentences mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions) affect the meaning of a sentence and how it is grounded in video. We exploit this methodology in three ways. In the first, a video clip along with a sentence are taken as input and the participants in the event described by the sentence are highlighted, even when the clip depicts multiple similar simultaneous events. In the second, a video clip is taken as input without a sentence and a sentence is generated that describes an event in that clip. In the third, a corpus of video clips is paired with sentences which describe some of the events in those clips and the meanings of the words in those sentences are learned. We learn these meanings without needing to specify which attribute of the video clips each word in a given sentence refers to. The learned meaning representations are shown to be intelligible to humans.
Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields
Schmidt, Mark, Babanezhad, Reza, Ahmed, Mohamed Osama, Defazio, Aaron, Clifton, Ann, Sarkar, Anoop
Conditional random fields (CRFs) [Lafferty et al., 2001] are a ubiquitous tool in natural language processing. They are used for part-of-speech tagging [McCallum et al., 2003], semantic role labeling [Cohn and Blunsom, 2005], topic modeling [Zhu and Xing, 2010], information extraction [Peng and McCallum, 2006], shallow parsing [Sha and Pereira, 2003], named-entity recognition [Settles, 2004], as well as a host of other applications in natural language processing and in other fields such as computer vision [Nowozin and Lampert, 2011]. Similar to generative Markov random field (MRF) models, CRFs allow us to model probabilistic dependencies between output variables. The key advantage of discriminative CRF models is the ability to use a very high-dimensional feature set, without explicitly building a model for these features (as required by MRF models). Despite the widespread use of CRFs, a major disadvantage of these models is that they can be very slow to train, and the time needed for numerical optimization in CRF models remains a bottleneck in many applications. Due to the high cost of evaluating the CRF objective function on even a single training example, it is now common to train CRFs using stochastic gradient methods [Vishwanathan et al., 2006]. These methods are advantageous over deterministic methods because on each iteration they only require computing the gradient of a single example (and not of all examples, as in deterministic methods). Thus, if we have a data set with n training examples, the iterations of stochastic gradient methods are n times faster than those of deterministic methods. However, the number of stochastic gradient iterations required might be very high.
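The per-iteration cost argument can be made concrete with a small sketch. It is not a CRF: a toy one-parameter least-squares objective stands in for the CRF objective, and the data, step size, and function names are all assumptions chosen for illustration. The point is only that a deterministic step evaluates n example gradients while a stochastic step evaluates one.

```python
import random


def example_gradient(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to the scalar weight w."""
    return (w * x - y) * x


def full_gradient(w, data):
    """Deterministic step: averages the gradient over all n examples."""
    return sum(example_gradient(w, x, y) for x, y in data) / len(data)


def stochastic_gradient(w, data, rng):
    """Stochastic step: uses the gradient of one uniformly sampled example."""
    x, y = rng.choice(data)
    return example_gradient(w, x, y)


rng = random.Random(0)
# Noiseless toy data generated from y = 3 * x (an assumption of this sketch).
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(1000))]

n = len(data)
cost_full = n  # example-gradient evaluations per deterministic iteration
cost_sgd = 1   # example-gradient evaluations per stochastic iteration

w = 0.0
for _ in range(5000):
    w -= 0.1 * stochastic_gradient(w, data, rng)

print(f"speed-up per iteration: {cost_full // cost_sgd}x, learned w = {w:.4f}")
```

Each stochastic iteration here is n = 1000 times cheaper than a call to `full_gradient`, but the loop needs thousands of iterations, which is the trade-off the paragraph above describes.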
Modeling the Lifespan of Discourse Entities with Application to Coreference Resolution
de Marneffe, Marie-Catherine, Recasens, Marta, Potts, Christopher
A discourse typically involves numerous entities, but few are mentioned more than once. Distinguishing those that die out after just one mention (singleton) from those that lead longer lives (coreferent) would dramatically simplify the hypothesis space for coreference resolution models, leading to increased performance. To realize these gains, we build a classifier for predicting the singleton/coreferent distinction. The model's feature representations synthesize linguistic insights about the factors affecting discourse entity lifespans (especially negation, modality, and attitude predication) with existing results about the benefits of surface (part-of-speech and n-gram-based) features for coreference resolution. The model is effective in its own right, and the feature representations help to identify the anchor phrases in bridging anaphora as well. Furthermore, incorporating the model into two very different state-of-the-art coreference resolution systems, one rule-based and the other learning-based, yields significant performance improvements.
Inferring Team Task Plans from Human Meetings: A Generative Modeling Approach with Logic-Based Prior
Kim, Been, Chacha, Caleb M., Shah, Julie A.
We aim to reduce the burden of programming and deploying autonomous systems to work in concert with people in time-critical domains such as military field operations and disaster response. Deployment plans for these operations are frequently negotiated on-the-fly by teams of human planners. A human operator then translates the agreed-upon plan into machine instructions for the robots. We present an algorithm that reduces this translation burden by inferring the final plan from a processed form of the human team's planning conversation. Our hybrid approach combines probabilistic generative modeling with logical plan validation used to compute a highly structured prior over possible plans, enabling us to overcome the challenge of performing inference over a large solution space with only a small amount of noisy data from the team planning session. We validate the algorithm through human-subject experiments and show that it is able to infer a human team's final plan with 86% accuracy on average. We also describe a robot demonstration in which two people plan and execute a first-response collaborative task with a PR2 robot. To the best of our knowledge, this is the first work to integrate a logical planning technique within a generative model to perform plan inference.
Detecting Rumor and Disinformation by Web Mining
Galitsky, Boris (Knowledge-Trail)
A method for determining whether a given text is a rumor or disinformation is proposed, based on web mining and linguistic technology comparing two paragraphs of text. We hypothesize about a family of content generation algorithms which are capable of producing disinformation from a portion of genuine, original text. We then propose a disinformation detection algorithm which finds a candidate source of text on the web and compares it with the given text, applying parse thicket technology. A parse thicket is a graph combined from a sequence of parse trees augmented with inter-sentence relations for anaphora and rhetoric structures. We evaluate our algorithm in the domain of customer reviews, considering a product review as an instance of possible disinformation. It is confirmed as a plausible way to detect rumor and disinformation in a web document. The linguistic approach presented here complements the social network structure-based approaches described in the body of research on disinformation detection.
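The parse thicket described above, a sequence of parse trees joined by inter-sentence relations, can be pictured as a small container type. The sketch below is a guess at a minimal representation and not the authors' implementation: the class name `ParseThicket`, the tuple-based trees, and the (sentence index, word index) addressing are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ParseThicket:
    """Sentence-level parse trees plus arcs that cross sentence boundaries."""

    trees: list = field(default_factory=list)  # trees[i] is the parse of sentence i
    arcs: list = field(default_factory=list)   # (relation, (sent, word), (sent, word))

    def add_tree(self, tree):
        """Store one sentence's parse tree and return its sentence index."""
        self.trees.append(tree)
        return len(self.trees) - 1

    def add_arc(self, relation, source, target):
        """Link a word in one sentence to a word in a different sentence."""
        for sent, _ in (source, target):
            if not 0 <= sent < len(self.trees):
                raise IndexError("unknown sentence index")
        if source[0] == target[0]:
            raise ValueError("inter-sentence arcs must join different sentences")
        self.arcs.append((relation, source, target))

    def arcs_of(self, relation):
        """Return all arcs labeled with the given relation type."""
        return [arc for arc in self.arcs if arc[0] == relation]


thicket = ParseThicket()
s0 = thicket.add_tree(("S", [("NP", ["The", "camera"]), ("VP", ["broke"])]))
s1 = thicket.add_tree(("S", [("NP", ["It"]), ("VP", ["was", "new"])]))
# Anaphora arc from "It" (sentence 1, word 0) to "camera" (sentence 0, word 1).
thicket.add_arc("anaphora", (s1, 0), (s0, 1))
```

Comparing two paragraphs would then amount to comparing two such graphs; how that comparison is carried out is not specified in the abstract and is left out of this sketch.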
Visual Commonsense for Scene Understanding Using Perception, Semantic Parsing and Reasoning
Aditya, Somak (Arizona State University) | Yang, Yezhou (University of Maryland, College Park) | Baral, Chitta (Arizona State University) | Fermuller, Cornelia (Associate Research Scientist, University of Maryland, College Park) | Aloimonos, Yiannis (University of Maryland, College Park)
In this paper we explore the use of visual common-sense knowledge and other kinds of knowledge (such as domain knowledge, background knowledge, linguistic knowledge) for scene understanding. In particular, we combine visual processing with techniques from natural language understanding (especially semantic parsing), common-sense reasoning and knowledge representation and reasoning to improve visual perception to reason about finer aspects of activities.
Latent Predicate Networks: Concept Learning with Probabilistic Context-Sensitive Grammars
Dechter, Eyal (Massachusetts Institute of Technology) | Rule, Joshua (Massachusetts Institute of Technology) | Tenenbaum, Joshua B. (Massachusetts Institute of Technology)
For humans, learning abstract concepts and learning language go hand in hand: we acquire abstract knowledge primarily through linguistic experience, and acquiring abstract concepts is a crucial step in learning the meanings of linguistic expressions. Number knowledge is a case in point: we largely acquire concepts such as seventy-three through linguistic means, and we can only know what the sentence "seventy-three is more than twice as big as thirty-one" means if we can grasp the meanings of its component number words. How do we begin to solve this problem? One approach is to estimate the distribution from which sentences are drawn, and, in doing so, infer the latent concepts and relationships that best explain those sentences. We present early work on a learning framework called Latent Predicate Networks (LPNs) which learns concepts by inferring the parameters of probabilistic context-sensitive grammars over sentences. We show that for a small fragment of sentences expressing relationships between English number words, we can use hierarchical Bayesian inference to learn grammars that can answer simple queries about previously unseen relationships within this domain. These generalizations demonstrate LPNs' promise as a tool for learning and representing conceptual knowledge in language.