Clark, Peter
SciTail: A Textual Entailment Dataset from Science Question Answering
Khot, Tushar (Allen Institute for Artificial Intelligence) | Sabharwal, Ashish (Allen Institute for Artificial Intelligence) | Clark, Peter (Allen Institute for Artificial Intelligence)
We present a new dataset and model for textual entailment, derived from treating multiple-choice question-answering as an entailment problem. SciTail is the first entailment dataset created solely from natural sentences that already exist independently "in the wild" rather than sentences authored specifically for the entailment task. Unlike existing entailment datasets, we create hypotheses from science questions and the corresponding answer candidates, and premises from relevant web sentences retrieved from a large corpus. These sentences are often linguistically challenging. This, combined with the high lexical similarity of premise and hypothesis for both entailed and non-entailed pairs, makes this new entailment task particularly difficult. The resulting challenge is evidenced by state-of-the-art textual entailment systems achieving mediocre performance on SciTail, especially in comparison to a simple majority-class baseline. As a step forward, we demonstrate that one can improve accuracy on SciTail by 5% using a new neural model that exploits linguistic structure.
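As an illustration of the construction recipe this abstract describes, here is a minimal Python sketch that turns a multiple-choice question into premise-hypothesis pairs. The naive question-plus-answer concatenation, the retrieve() helper, and the automatic labeling are all hypothetical simplifications: the actual dataset uses a more careful question-to-statement conversion, and labels were assigned by human annotators.

    # Hypothetical sketch of the SciTail-style construction described above.
    # make_hypothesis() is a naive stand-in for question-to-statement conversion,
    # retrieve() is an assumed interface to a large web corpus, and the crude
    # proxy labels below were human-verified in the real dataset.

    def make_hypothesis(question, answer):
        """Combine a question and one answer candidate into a candidate hypothesis."""
        return question.rstrip("?").strip() + " " + answer + "."

    def build_pairs(question, candidates, correct_answer, retrieve):
        """Pair each candidate hypothesis with retrieved premise sentences."""
        pairs = []
        for answer in candidates:
            hypothesis = make_hypothesis(question, answer)
            for premise in retrieve(hypothesis):  # lexically similar web sentences
                pairs.append({
                    "premise": premise,
                    "hypothesis": hypothesis,
                    # proxy label only; the real dataset is annotator-verified
                    "label": "entails" if answer == correct_answer else "neutral",
                })
        return pairs

Note how every premise is retrieved by lexical similarity to the hypothesis, which is exactly why both entailed and non-entailed pairs end up lexically close, the property the abstract identifies as making the task hard.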
Moving Beyond the Turing Test with the Allen AI Science Challenge
Schoenick, Carissa | Clark, Peter | Tafjord, Oyvind | Turney, Peter | Etzioni, Oren
Given recent successes in AI (e.g., AlphaGo's victory against Lee Sedol in the game of Go), it has become increasingly important to ask how close AI systems are to human-level intelligence. This paper describes the Allen AI Science Challenge -- an approach towards that goal that took the form of a unique Kaggle competition -- along with its results, the lessons learned, and our next steps.
Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions
Clark, Peter (Allen Institute for AI) | Etzioni, Oren (Allen Institute for AI) | Khot, Tushar (Allen Institute for AI) | Sabharwal, Ashish (Allen Institute for AI) | Tafjord, Oyvind (Allen Institute for AI) | Turney, Peter (Allen Institute for AI) | Khashabi, Daniel (University of Illinois at Urbana-Champaign)
What capabilities are required for an AI system to pass standard 4th Grade Science Tests? Previous work has examined the use of Markov Logic Networks (MLNs) to represent the requisite background knowledge and interpret test questions, but did not improve upon an information retrieval (IR) baseline. In this paper, we describe an alternative approach that operates at three levels of representation and reasoning: information retrieval, corpus statistics, and simple inference over a semi-automatically constructed knowledge base, to achieve substantially improved results. We evaluate the methods on six years of unseen, unedited exam questions from the NY Regents Science Exam (using only non-diagram, multiple choice questions), and show that our overall system's score is 71.3%, an improvement of 23.8% (absolute) over the MLN-based method described in previous work. We conclude with a detailed analysis, illustrating the complementary strengths of each method in the ensemble. Our datasets are being released to enable further research.
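To make the three-level architecture concrete, here is a minimal Python sketch of an ensemble that combines per-option scores from an IR solver, a corpus-statistics solver, and an inference solver. The solver functions, the normalization, and the weights are hypothetical stand-ins, not the paper's actual components or combination method.

    # Minimal sketch of a multi-solver ensemble in the spirit of the approach
    # described above. The three solvers and their weights are invented for
    # illustration; each maps (question, option) to a non-negative score.

    def normalize(scores):
        """Scale per-option scores to sum to 1 so different solvers are comparable."""
        total = sum(scores)
        if total <= 0:
            return [1.0 / len(scores)] * len(scores)  # abstaining solver: uniform
        return [s / total for s in scores]

    def ensemble_answer(question, options, solvers, weights):
        """Pick the option with the highest weighted sum of normalized solver scores."""
        combined = [0.0] * len(options)
        for solver, weight in zip(solvers, weights):
            scores = normalize([solver(question, option) for option in options])
            for i, score in enumerate(scores):
                combined[i] += weight * score
        return options[max(range(len(options)), key=lambda i: combined[i])]

    # Usage with dummy solvers standing in for IR, corpus statistics, and inference:
    ir = lambda q, o: 2.0 if "gravity" in o else 1.0
    stats = lambda q, o: 1.5 if "gravity" in o else 1.0
    rules = lambda q, o: 0.0  # the inference solver abstains on this question
    print(ensemble_answer("What force pulls objects toward Earth?",
                          ["magnetism", "gravity"], [ir, stats, rules], [0.4, 0.3, 0.3]))

The design point the sketch captures is the one the abstract emphasizes: each level contributes a comparable per-option signal, so methods with complementary strengths can be combined without any one of them having to answer every question.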
My Computer Is an Honor Student -- but How Intelligent Is It? Standardized Tests as a Measure of AI
Clark, Peter (Allen Institute for Artificial Intelligence) | Etzioni, Oren (Allen Institute for Artificial Intelligence)
Given the well-known limitations of the Turing Test, there is a need for objective tests to both focus attention on, and measure progress towards, the goals of AI. In this paper we argue that machine performance on standardized tests should be a key component of any new measure of AI, because attaining a high level of performance requires solving significant AI problems involving language understanding and world modeling -- critical skills for any machine that lays claim to intelligence. In addition, standardized tests have all the basic requirements of a practical test: they are accessible, easily comprehensible, clearly measurable, and offer a graduated progression from simple tasks to those requiring deep understanding of the world. Here we propose this task as a challenge problem for the community, summarize our state-of-the-art results on math and science tests, and provide supporting datasets.
Elementary School Science and Math Tests as a Driver for AI: Take the Aristo Challenge!
Clark, Peter (Allen Institute for AI)
While there has been an explosion of impressive, data-driven AI applications in recent years, machines still largely lack a deeper understanding of the world to answer questions that go beyond information explicitly stated in text, and to explain and discuss those answers. To reach this next generation of AI applications, it is imperative to make faster progress in areas of knowledge, modeling, reasoning, and language. Standardized tests have often been proposed as a driver for such progress, with good reason: Many of the questions require sophisticated understanding of both language and the world, pushing the boundaries of AI, while other questions are easier, supporting incremental progress. In Project Aristo at the Allen Institute for AI, we are working on a specific version of this challenge, namely having the computer pass Elementary School Science and Math exams. Even at this level there is a rich variety of problems and question types, the most difficult requiring significant progress in AI. Here we propose this task as a challenge problem for the community, and are providing supporting datasets. Solutions to many of these problems would have a major impact on the field so we encourage you: Take the Aristo Challenge!
Inquire Biology: A Textbook that Answers Questions
Chaudhri, Vinay K. (SRI International) | Cheng, Britte (SRI International) | Overholtzer, Adam (SRI International) | Roschelle, Jeremy (SRI International) | Spaulding, Aaron (SRI International) | Clark, Peter (Vulcan Inc.) | Greaves, Mark (Pacific Northwest National Laboratory) | Gunning, Dave (Palo Alto Research Center)
Inquire Biology is a prototype of a new kind of intelligent textbook -- one that answers students' questions, engages their interest, and improves their understanding. Inquire Biology provides unique capabilities via a knowledge representation that captures conceptual knowledge from the textbook and uses inference procedures to answer students' questions. Students ask questions by typing free-form natural language queries or by selecting passages of text. The system then attempts to answer the question and also generates suggested questions related to the query or selection. The questions supported by the system were chosen to be educationally useful, for example: What is the structure of X? Compare X and Y. How does X relate to Y? In user studies, students found this question-answering capability to be extremely useful while reading and while doing problem solving. In an initial controlled experiment, community college students using the Inquire Biology prototype outperformed students using either a hardcopy or conventional e-book version of the same biology textbook. While additional research is needed to fully develop Inquire Biology, the initial prototype clearly demonstrates the promise of applying knowledge representation and question-answering technology to electronic textbooks.
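To give a flavor of the question templates listed above, here is a minimal Python sketch that answers "What is the structure of X?" and "Compare X and Y" over a toy triple store. The tiny KB and the template handlers are invented for this sketch; Inquire Biology's actual knowledge representation and inference procedures are far richer.

    # Illustrative toy version of template-based question answering over a
    # concept knowledge base. All facts and relations below are invented.

    KB = {
        ("mitochondrion", "has-part"): ["inner membrane", "outer membrane", "matrix"],
        ("mitochondrion", "function"): ["ATP synthesis"],
        ("chloroplast", "has-part"): ["thylakoid", "stroma"],
        ("chloroplast", "function"): ["photosynthesis"],
    }

    def structure_of(x):
        """Answer 'What is the structure of X?' by listing X's known parts."""
        parts = KB.get((x, "has-part"), [])
        return f"{x} consists of: {', '.join(parts)}" if parts else f"No parts of {x} are known."

    def compare(x, y):
        """Answer 'Compare X and Y' by aligning their values on shared relations."""
        lines = []
        for relation in ("has-part", "function"):
            lines.append(f"{relation}: {x}={KB.get((x, relation), [])} vs {y}={KB.get((y, relation), [])}")
        return "\n".join(lines)

    print(structure_of("mitochondrion"))
    print(compare("mitochondrion", "chloroplast"))

The point of the template style is that each supported question form maps to a fixed traversal of the knowledge representation, which is what lets the system both answer a query and suggest related, educationally useful questions.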
Project Halo: Towards a Digital Aristotle
Friedland, Noah S. | Allen, Paul G. | Matthews, Gavin | Witbrock, Michael | Baxter, David | Curtis, Jon | Shepard, Blake | Miraglia, Pierluigi | Angele, Jurgen | Staab, Steffen | Moench, Eddie | Oppermann, Henrik | Wenke, Dirk | Israel, David | Chaudhri, Vinay | Porter, Bruce | Barker, Ken | Fan, James | Chaw, Shaw Yi | Yeh, Peter | Tecuci, Dan | Clark, Peter
Project Halo is a multistage effort, sponsored by Vulcan Inc., aimed at creating Digital Aristotle, an application that will encompass much of the world's scientific knowledge and be capable of applying sophisticated problem solving to answer novel questions. Vulcan envisions two primary roles for Digital Aristotle: as a tutor to instruct students in the sciences and as an interdisciplinary research assistant to help scientists in their work. As a first step towards this goal, we have just completed a six-month pilot phase designed to assess the state of the art in applied knowledge representation and reasoning (KR&R). Vulcan selected three teams, each of which was to formally represent 70 pages from the advanced placement (AP) chemistry syllabus and deliver knowledge-based systems capable of answering questions on that syllabus. The evaluation quantified each system's coverage of the syllabus in terms of its ability to answer novel, previously unseen questions and to provide human-readable answer justifications. These justifications will play a critical role in building user trust in the question-answering capabilities of Digital Aristotle. Prior to the final evaluation, a "failure taxonomy" was collaboratively developed in an attempt to standardize failure analysis and to facilitate cross-platform comparisons. Despite differences in approach, all three systems did very well on the challenge, achieving performance comparable to the human median. The analysis also provided key insights into how the approaches might be scaled, while at the same time suggesting how the cost of producing such systems might be reduced. This outcome leaves us highly optimistic that the technical challenges facing this effort in the years to come can be identified and overcome. This article presents the motivation and long-term goals of Project Halo, describes in detail the six-month first phase of the project -- the Halo Pilot -- its KR&R challenge, empirical evaluation, results, and failure analysis. The pilot's outcome is used to define challenges for the next phase of the project and beyond.