Chu-Carroll, Jennifer
LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic
Kalyanpur, Aditya, Saravanakumar, Kailash, Barres, Victor, Chu-Carroll, Jennifer, Melville, David, Ferrucci, David
We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs) by combining them with an Automated Reasoning Critic (ARC). LLM-ARC employs an Actor-Critic method in which the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests, and provides feedback on test failures for iterative refinement. Implemented using Answer Set Programming (ASP), LLM-ARC achieves a new state-of-the-art accuracy of 88.32% on the FOLIO benchmark, which tests complex logical reasoning capabilities. Our experiments demonstrate significant improvements over LLM-only baselines, highlighting the importance of logic test generation and iterative self-refinement. We achieve our best result using a fully automated self-supervised training loop in which the Actor is trained on end-to-end dialog traces with Critic feedback. We discuss potential enhancements and provide a detailed error analysis, showcasing the robustness and efficacy of LLM-ARC for complex natural language reasoning tasks.
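A minimal sketch of the Actor-Critic refinement loop described above, not the paper's implementation: the generate_program LLM call is a hypothetical placeholder, and the Critic here only checks satisfiability with the clingo ASP solver rather than running the Actor-generated semantic tests.

    import clingo  # Python bindings for an ASP solver, used here as the Critic's backend

    def asp_critic(program: str):
        """Critic step: ground and solve the candidate ASP program, returning
        (ok, feedback). A real critic would also run the Actor's semantic tests."""
        ctl = clingo.Control(["0"])                # enumerate all answer sets
        try:
            ctl.add("base", [], program)
            ctl.ground([("base", [])])
        except RuntimeError as err:                # syntax/grounding errors become feedback
            return False, f"grounding error: {err}"
        result = ctl.solve()
        ok = bool(result.satisfiable)
        return ok, "" if ok else "program is unsatisfiable"

    def llm_arc_loop(problem: str, generate_program, max_rounds: int = 3) -> str:
        """Actor-Critic refinement: the Actor (an LLM call, hypothetical here) drafts
        an ASP program; the Critic executes it and feeds errors back for the next draft."""
        program, feedback = "", ""
        for _ in range(max_rounds):
            program = generate_program(problem, feedback)
            ok, feedback = asp_critic(program)
            if ok:
                break
        return program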
Beyond LLMs: Advancing the Landscape of Complex Reasoning
Chu-Carroll, Jennifer, Beck, Andrew, Burnham, Greg, Melville, David OS, Nachman, David, Özcan, A. Erdem, Ferrucci, David
Since the advent of Large Language Models a few years ago, they have often been considered the de facto solution for many AI problems. However, in addition to the many deficiencies of LLMs that prevent their broad industry adoption, such as reliability, cost, and speed, there is a whole class of common real-world problems on which Large Language Models perform poorly, namely, constraint satisfaction and optimization problems. These problems are ubiquitous and current solutions are highly specialized and expensive to implement. At Elemental Cognition, we developed our EC AI platform, which takes a neuro-symbolic approach to solving constraint satisfaction and optimization problems. The platform employs, at its core, a precise and high-performance logical reasoning engine, and it leverages LLMs for knowledge acquisition and user interaction. The platform supports developers in specifying application logic in natural and concise language while generating application user interfaces to interact with users effectively. We evaluated LLMs against systems built on the EC AI platform in three domains and found the EC AI systems to significantly outperform LLMs on constructing valid and optimal solutions, on validating proposed solutions, and on repairing invalid solutions.
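To illustrate the class of problems involved (a generic toy solver, not the EC AI platform's reasoning engine), here is a minimal backtracking constraint-satisfaction sketch for a small scheduling problem:

    def backtrack_schedule(meetings, slots, conflicts, assignment=None):
        """Tiny backtracking CSP solver: assign a slot to each meeting so that
        no two conflicting meetings share a slot. Returns a dict or None."""
        assignment = assignment or {}
        if len(assignment) == len(meetings):
            return assignment
        meeting = next(m for m in meetings if m not in assignment)
        for slot in slots:
            if all(assignment.get(other) != slot for other in conflicts.get(meeting, [])):
                result = backtrack_schedule(meetings, slots, conflicts,
                                            {**assignment, meeting: slot})
                if result is not None:
                    return result
        return None

    # Example: three meetings, two slots, A conflicts with both B and C.
    conflicts = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
    print(backtrack_schedule(["A", "B", "C"], ["9am", "10am"], conflicts))
    # e.g. {'A': '9am', 'B': '10am', 'C': '10am'}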
Open-Domain Frame Semantic Parsing Using Transformers
Kalyanpur, Aditya, Biran, Or, Breloff, Tom, Chu-Carroll, Jennifer, Diertani, Ariel, Rambow, Owen, Sammons, Mark
Frame semantic parsing is a complex problem that includes multiple underlying subtasks. Recent approaches have employed joint learning of subtasks (such as predicate and argument detection) and multi-task learning of related tasks (such as syntactic and semantic parsing). In this paper, we explore multi-task learning of all subtasks with transformer-based models. We show that a purely generative encoder-decoder architecture handily beats the previous state of the art in FrameNet 1.7 parsing, and that a mixed decoding multi-task approach achieves even better performance. Finally, we show that the multi-task model also outperforms recent state-of-the-art systems for PropBank SRL parsing on the CoNLL 2012 benchmark.
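A minimal sketch of the purely generative formulation, using a stock Hugging Face T5 checkpoint; the linearized frame output shown in the comment is an assumed format, and the actual model would be fine-tuned on FrameNet 1.7 rather than used off the shelf.

    from transformers import T5ForConditionalGeneration, T5TokenizerFast

    # Stock t5-base stands in here; a frame parser would be fine-tuned so that the
    # decoder emits a linearized frame structure for the input sentence.
    tokenizer = T5TokenizerFast.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    sentence = "parse frames: She handed the report to her manager."
    inputs = tokenizer(sentence, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=64)
    # A fine-tuned model might decode to something like:
    # "Giving(Donor=She, Theme=the report, Recipient=her manager)"
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))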
GLUCOSE: GeneraLized and COntextualized Story Explanations
Mostafazadeh, Nasrin, Kalyanpur, Aditya, Moon, Lori, Buchanan, David, Berkowitz, Lauren, Biran, Or, Chu-Carroll, Jennifer
When humans read or listen, they make implicit commonsense inferences that frame their understanding of what happened and why. As a step toward AI systems that can build similar mental models, we introduce GLUCOSE, a large-scale dataset of implicit commonsense causal knowledge, encoded as causal mini-theories about the world, each grounded in a narrative context. To construct GLUCOSE, we drew on cognitive psychology to identify ten dimensions of causal explanation, focusing on events, states, motivations, and emotions. Each GLUCOSE entry includes a story-specific causal statement paired with an inference rule generalized from the statement. This paper details two concrete contributions: First, we present our platform for effectively crowdsourcing GLUCOSE data at scale, which uses semi-structured templates to elicit causal explanations. Using this platform, we collected 440K specific statements and general rules that capture implicit commonsense knowledge about everyday situations. Second, we show that existing knowledge resources and pretrained language models do not include or readily predict GLUCOSE's rich inferential content. However, when state-of-the-art neural models are trained on this knowledge, they can start to make commonsense inferences on unseen stories that match humans' mental models.
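A sketch of what one GLUCOSE entry might look like as a data structure; the field names and example values are illustrative, following the dataset's pairing of a story-grounded causal statement with its generalized rule along one of the ten dimensions.

    from dataclasses import dataclass

    @dataclass
    class GlucoseEntry:
        """One GLUCOSE annotation: a causal statement grounded in a specific story,
        paired with the general rule it instantiates, for one of the ten dimensions."""
        story_id: str
        dimension: int              # 1-10, covering events, states, motivations, emotions
        specific_statement: str     # grounded in the story's sentences
        general_rule: str           # the generalized, reusable mini-theory

    entry = GlucoseEntry(
        story_id="story_0421",
        dimension=1,
        specific_statement="Gage skated on the sidewalk >Causes/Enables> Gage fell",
        general_rule="Someone_A skates on a hard surface >Causes/Enables> Someone_A falls",
    )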
To Test Machine Comprehension, Start by Defining Comprehension
Dunietz, Jesse, Burnham, Gregory, Bharadwaj, Akash, Rambow, Owen, Chu-Carroll, Jennifer, Ferrucci, David
Many tasks aim to measure machine reading comprehension (MRC), often focusing on question types presumed to be difficult. Rarely, however, do task designers start by considering what systems should in fact comprehend. In this paper we make two key contributions. First, we argue that existing approaches do not adequately define comprehension; they are too unsystematic about what content is tested. Second, we present a detailed definition of comprehension -- a "Template of Understanding" -- for a widely useful class of texts, namely short narratives. We then conduct an experiment that strongly suggests existing systems are not up to the task of narrative understanding as we define it.
WatsonPaths: Scenario-Based Question Answering and Inference over Unstructured Information
Lally, Adam (Information Technology and Services) | Bagchi, Sugato (IBM Research) | Barborak, Michael A. (IBM T. J. Watson Research Center) | Buchanan, David W. (IBM T. J. Watson Research Center) | Chu-Carroll, Jennifer (IBM Research) | Ferrucci, David A. (Bridgewater) | Glass, Michael R. (IBM Research) | Kalyanpur, Aditya (IBM T. J. Watson Research Center) | Mueller, Erik T. (Capital One) | Murdock, J. William (IBM T. J. Watson Research Center) | Patwardhan, Siddharth (IBM T. J. Watson Research Center) | Prager, John M. (IBM T. J. Watson Research Center)
We present WatsonPaths, a novel system that can answer scenario-based questions. These include medical questions that present a patient summary and ask for the most likely diagnosis or most appropriate treatment. WatsonPaths builds on the IBM Watson question answering system. WatsonPaths breaks down the input scenario into individual pieces of information, asks relevant subquestions of Watson to conclude new information, and represents these results in a graphical model. Probabilistic inference is performed over the graph to conclude the answer. On a set of medical test preparation questions, WatsonPaths shows a significant improvement in accuracy over multiple baselines.
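A toy sketch of the graph-based aggregation idea, not the actual WatsonPaths inference: confidences from subquestions that support each candidate diagnosis are combined with a simple noisy-OR, and the made-up numbers are placeholders.

    from math import prod

    def noisy_or(confidences):
        """Combine independent supporting-evidence confidences into one belief."""
        return 1.0 - prod(1.0 - c for c in confidences)

    # Toy assertion graph: each candidate diagnosis is supported by edges whose
    # confidences came from answers to relevant subquestions.
    support = {
        "diabetes mellitus": [0.7, 0.4],
        "diabetes insipidus": [0.3],
    }
    scores = {diagnosis: noisy_or(conf) for diagnosis, conf in support.items()}
    print(max(scores, key=scores.get), scores)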
Leveraging Wikipedia Characteristics for Search and Candidate Generation in Question Answering
Chu-Carroll, Jennifer (IBM T. J. Watson Research Center) | Fan, James (IBM T. J. Watson Research Center)
Most existing Question Answering (QA) systems adopt a type-and-generate approach to candidate generation that relies on a pre-defined domain ontology. This paper describes a type independent search and candidate generation paradigm for QA that leverages Wikipedia characteristics. This approach is particularly useful for adapting QA systems to domains where reliable answer type identification and type-based answer extraction are not available. We present a three-pronged search approach motivated by relations an answer-justifying title-oriented document may have with the question/answer pair. We further show how Wikipedia metadata such as anchor texts and redirects can be utilized to effectively extract candidate answers from search results without a type ontology. Our experimental results show that our strategies obtained high binary recall in both search and candidate generation on TREC questions, a domain that has mature answer type extraction technology, as well as on Jeopardy! questions, a domain without such technology. Our high-recall search and candidate generation approach has also led to high overall QA performance in Watson, our end-to-end system.
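A toy sketch of metadata-driven candidate extraction in the spirit of the approach described, with made-up redirect and anchor-text dictionaries standing in for Wikipedia metadata; no answer-type ontology is involved.

    # Map surface strings seen in retrieved passages to canonical Wikipedia titles
    # via redirects and anchor texts (toy dictionaries for illustration only).
    redirects = {"Big Blue": "IBM"}                        # redirect page -> target title
    anchors = {"the computer giant": "IBM", "IBM": "IBM"}  # anchor text -> linked title

    def candidates_from_passage(passage: str):
        surface_to_title = {**redirects, **anchors}
        return {title for surface, title in surface_to_title.items() if surface in passage}

    print(candidates_from_passage("Big Blue, the computer giant, was founded in 1911."))
    # {'IBM'}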
Building Watson: An Overview of the DeepQA Project
Ferrucci, David (IBM T. J. Watson Research Center) | Brown, Eric (IBM T. J. Watson Research Center) | Chu-Carroll, Jennifer (IBM T. J. Watson Research Center) | Fan, James (IBM T. J. Watson Research Center) | Gondek, David (IBM T. J. Watson Research Center) | Kalyanpur, Aditya A. (IBM T. J. Watson Research Center) | Lally, Adam (IBM T. J. Watson Research Center) | Murdock, J. William (IBM T. J. Watson Research Center) | Nyberg, Eric (Carnegie Mellon University) | Prager, John (IBM T. J. Watson Research Center) | Schlaefer, Nico (Carnegie Mellon University) | Welty, Chris (IBM T. J. Watson Research Center)
IBM Research undertook a challenge to build a computer system that could compete at the human champion level in real time on the American TV quiz show Jeopardy! The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise. The Jeopardy! Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After 3 years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy! quiz show. Our results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.
The AAAI Spring Symposia
Green, Nancy, Chu-Carroll, Jennifer, Kortenkamp, David, Schultz, Alan, Coen, Michael H., Radev, Dragomir R., Hovy, Eduard, Haddawy, Peter, Hanks, Steve, Freuder, Eugene, Ortiz, Charlie, Sen, Sandip
The Association for the Advancement of Artificial Intelligence, in cooperation with Stanford University's Department of Computer Science, held the 1998 Spring Symposium Series on 23 to 25 March at Stanford University. The topics of the eight symposia were (1) Applying Machine Learning to Discourse Processing, (2) Integrating Robotic Research: Taking the Next Leap, (3) Intelligent Environments, (4) Intelligent Text Summarization, (5) Interactive and Mixed-Initiative Decision-Theoretic Systems, (6) Multimodal Reasoning, (7) Prospects for a Common-Sense Theory of Causation, and (8) Satisficing Models.