If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."
However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …
Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. We focus on filling these knowledge gaps in the Science Entailment task, by leveraging an external structured knowledge base (KB) of science facts. Our new architecture combines standard neural entailment models with a knowledge lookup module. To facilitate this lookup, we propose a fact-level decomposition of the hypothesis, and verifying the resulting sub-facts against both the textual premise and the structured KB. Our model, NSnet, learns to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSnet outperforms a simpler combination of the two predictions by 3% and the base entailment model by 5%.
Comprehending procedural text, e.g., a paragraph describing photosynthesis, requires modeling actions and the state changes they produce, so that questions about entities at different timepoints can be answered. Although several recent systems have shown impressive progress in this task, their predictions can be globally inconsistent or highly improbable. In this paper, we show how the predicted effects of actions in the context of a paragraph can be improved in two ways: (1) by incorporating global, commonsense constraints (e.g., a non-existent entity cannot be destroyed), and (2) by biasing reading with preferences from large-scale corpora (e.g., trees rarely move). Unlike earlier methods, we treat the problem as a neural structured prediction task, allowing hard and soft constraints to steer the model away from unlikely predictions. We show that the new model significantly outperforms earlier systems on a benchmark dataset for procedural text comprehension (+8% relative gain), and that it also avoids some of the nonsensical predictions that earlier systems make.
This document gives a knowledge-oriented analysis of about 20 interesting Recognizing Textual Entailment (RTE) examples, drawn from the 2005 RTE5 competition test set. The analysis ignores shallow statistical matching techniques between T and H, and rather asks: What would it take to reasonably infer that T implies H? What world knowledge would be needed for this task? Although such knowledge-intensive techniques have not had much success in RTE evaluations, ultimately an intelligent system should be expected to know and deploy this kind of world knowledge required to perform this kind of reasoning. The selected examples are typically ones which our RTE system (called BLUE) got wrong and ones which require world knowledge to answer. In particular, the analysis covers cases where there was near-perfect lexical overlap between T and H, yet the entailment was NO, i.e., examples that most likely all current RTE systems will have got wrong. A nice example is #341 (page 26), that requires inferring from "a river floods" that "a river overflows its banks". Seems it should be easy, right? Enjoy!
Our goal is to answer questions about paragraphs describing processes (e.g., photosynthesis). Texts of this genre are challenging because the effects of actions are often implicit (unstated), requiring background knowledge and inference to reason about the changing world states. To supply this knowledge, we leverage VerbNet to build a rulebase (called the Semantic Lexicon) of the preconditions and effects of actions, and use it along with commonsense knowledge of persistence to answer questions about change. Our evaluation shows that our system, ProComp, significantly outperforms two strong reading comprehension (RC) baselines. Our contributions are two-fold: the Semantic Lexicon rulebase itself, and a demonstration of how a simulation-based approach to machine reading can outperform RC methods that rely on surface cues alone. Since this work was performed, we have developed neural systems that outperform ProComp, described elsewhere (Dalvi et al., NAACL'18). However, the Semantic Lexicon remains a novel and potentially useful resource, and its integration with neural systems remains a currently unexplored opportunity for further improvements in machine reading about processes.
This working note discusses the topic of story generation, with a view to identifying the knowledge required to understand aviation incident narratives (which have structural similarities to stories), following the premise that to understand aviation incidents, one should at least be able to generate examples of them. We give a brief overview of aviation incidents and their relation to stories, and then describe two of our earlier attempts (using `scripts' and `story grammars') at incident generation which did not evolve promisingly. Following this, we describe a simple incident generator which did work (at a `toy' level), using a `world simulation' approach. This generator is based on Meehan's TALE-SPIN story generator (1977). We conclude with a critique of the approach.
We present a new dataset and model for textual entailment, derived from treating multiple-choice question-answering as an entailment problem. SciTail is the first entailment set that is created solely from natural sentences that already exist independently ``in the wild'' rather than sentences authored specifically for the entailment task. Different from existing entailment datasets, we create hypotheses from science questions and the corresponding answer candidates, and premises from relevant web sentences retrieved from a large corpus. These sentences are often linguistically challenging. This, combined with the high lexical similarity of premise and hypothesis for both entailed and non-entailed pairs, makes this new entailment task particularly difficult. The resulting challenge is evidenced by state-of-the-art textual entailment systems achieving mediocre performance on SciTail, especially in comparison to a simple majority class baseline. As a step forward, we demonstrate that one can improve accuracy on SciTail by 5% using a new neural model that exploits linguistic structure.
Given recent successes in AI (e.g., AlphaGo's victory against Lee Sedol in the game of GO), it's become increasingly important to assess: how close are AI systems to human-level intelligence? This paper describes the Allen AI Science Challenge---an approach towards that goal which led to a unique Kaggle Competition, its results, the lessons learned, and our next steps.
Clark, Peter (Allen Institute for AI) | Etzioni, Oren (Allen Institute for AI) | Khot, Tushar (Allen Institute for AI) | Sabharwal, Ashish (Allen Institute for AI) | Tafjord, Oyvind (Allen Institute for AI) | Turney, Peter (Allen Institute for AI) | Khashabi, Daniel (Univ. Illinois at Urbana-Champaign)
What capabilities are required for an AI system to pass standard 4th Grade Science Tests? Previous work has examined the use of Markov Logic Networks (MLNs) to represent the requisite background knowledge and interpret test questions, but did not improve upon an information retrieval (IR) baseline. In this paper, we describe an alternative approach that operates at three levels of representation and reasoning: information retrieval, corpus statistics, and simple inference over a semi-automatically constructed knowledge base, to achieve substantially improved results. We evaluate the methods on six years of unseen, unedited exam questions from the NY Regents Science Exam (using only non-diagram, multiple choice questions), and show that our overall system’s score is 71.3%, an improvement of 23.8% (absolute) over the MLN-based method described in previous work. We conclude with a detailed analysis, illustrating the complementary strengths of each method in the ensemble. Our datasets are being released to enable further research.
Given the well-known limitations of the Turing Test, there is a need for objective tests to both focus attention on, and measure progress towards, the goals of AI. In this paper we argue that machine performance on standardized tests should be a key component of any new measure of AI, because attaining a high level of performance requires solving significant AI problems involving language understanding and world modeling - critical skills for any machine that lays claim to intelligence. In addition, standardized tests have all the basic requirements of a practical test: they are accessible, easily comprehensible, clearly measurable, and offer a graduated progression from simple tasks to those requiring deep understanding of the world.
Given the well-known limitations of the Turing Test, there is a need for objective tests to both focus attention on, and measure progress towards, the goals of AI. In this paper we argue that machine performance on standardized tests should be a key component of any new measure of AI, because attaining a high level of performance requires solving significant AI problems involving language understanding and world modeling - critical skills for any machine that lays claim to intelligence. In addition, standardized tests have all the basic requirements of a practical test: they are accessible, easily comprehensible, clearly measurable, and offer a graduated progression from simple tasks to those requiring deep understanding of the world. Here we propose this task as a challenge problem for the community, summarize our state-of-the-art results on math and science tests, and provide supporting datasets