Goto

Collaborating Authors

 Commonsense Reasoning


Researchers propose game-based benchmark for AI's commonsense reasoning

#artificialintelligence

In a paper accepted to last week's International Conference on Machine Learning, researchers at University College London and the University of Oxford propose an environment -- WordCraft -- to benchmark AI agents' commonsense reasoning capabilities. Based on Little Alchemy 2, a game that tasks players with mixing ingredients to create new items, they say WordCraft is both lightweight and built upon entities and relations inspired by real-world semantics. As the researchers note, personal assistants and household robots require agents that can learn quickly and generalize well to novel situations. That's likely not possible without the ability to reason using common sense and general knowledge about the world. For instance, an agent tasked with performing common household chores that hasn't seen a dirty ashtray would need to know a reasonable set of actions, including how to clean the ashtray and to avoid feeding it to a pet.


Consolidating Commonsense Knowledge

arXiv.org Artificial Intelligence

Commonsense reasoning is an important aspect of building robust AI systems and is receiving significant attention in the natural language understanding, computer vision, and knowledge graphs communities. At present, a number of valuable commonsense knowledge sources exist, with different foci, strengths, and weaknesses. In this paper, we list representative sources and their properties. Based on this survey, we propose principles and a representation model in order to consolidate them into a Common Sense Knowledge Graph (CSKG). We apply this approach to consolidate seven separate sources into a first integrated CSKG. We present statistics of CSKG, present initial investigations of its utility on four QA datasets, and list learned lessons.


Unsupervised Evaluation of Interactive Dialog with DialoGPT

arXiv.org Artificial Intelligence

It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.


Conversational Neuro-Symbolic Commonsense Reasoning

arXiv.org Artificial Intelligence

One aspect of human commonsense reasoning is the ability to make presumptions about daily experiences, activities and social interactions with others. We propose a new commonsense reasoning benchmark where the task is to uncover commonsense presumptions implied by imprecisely stated natural language commands in the form of if-then-because statements. For example, in the command "If it snows at night then wake me up early because I don't want to be late for work" the speaker relies on commonsense reasoning of the listener to infer the implicit presumption that it must snow enough to cause traffic slowdowns. Such if-then-because commands are particularly important when users instruct conversational agents. We release a benchmark data set for this task, collected from humans and annotated with commonsense presumptions. We develop a neuro-symbolic theorem prover that extracts multi-hop reasoning chains and apply it to this problem. We further develop an interactive conversational framework that evokes commonsense knowledge from humans for completing reasoning chains.


$R^3$: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge

arXiv.org Artificial Intelligence

We propose an unsupervised approach for sarcasm generation based on a non-sarcastic input sentence. Our method employs a retrieve-and-edit framework to instantiate two major characteristics of sarcasm: reversal of valence and semantic incongruity with the context which could include shared commonsense or world knowledge between the speaker and the listener. While prior works on sarcasm generation predominantly focus on context incongruity, we show that combining valence reversal and semantic incongruity based on the commonsense knowledge generates sarcasm of higher quality. Human evaluation shows that our system generates sarcasm better than human annotators 34% of the time, and better than a reinforced hybrid baseline 90% of the time.


Machine Common Sense

arXiv.org Artificial Intelligence

Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI). There is a wide range of strategies that can be employed to make progress on this challenge. This article deals with the aspects of modeling commonsense reasoning focusing on such domain as interpersonal interactions. The basic idea is that there are several types of commonsense reasoning: one is manifested at the logical level of physical actions, the other deals with the understanding of the essence of human-human interactions. Existing approaches, based on formal logic and artificial neural networks, allow for modeling only the first type of common sense. To model the second type, it is vital to understand the motives and rules of human behavior. This model is based on real-life heuristics, i.e., the rules of thumb, developed through knowledge and experience of different generations. Such knowledge base allows for development of an expert system with inference and explanatory mechanisms (commonsense reasoning algorithms and personal models). Algorithms provide tools for a situation analysis, while personal models make it possible to identify personality traits. The system so designed should perform the function of amplified intelligence for interactions, including human-machine.


Seventy years of highs and lows in the history of machine learning

#artificialintelligence

Cold War concerns U.S. government agencies like the Defense Advanced Research Projects Agency (DARPA) fund AI research at universities such as MIT, hoping for machines that will translate Russian instantly. I'm afraid I can't do that." The winter lasts two decades, with just a few heat waves of progress. Common-sense AI Douglas Lenat sets out to construct an AI that can do common-sense reasoning. He develops it for 30 years before it is used commercially.


Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks

arXiv.org Artificial Intelligence

Numerical reasoning is often important to accurately understand the world. Recently, several format-specific datasets have been proposed, such as numerical reasoning in the settings of Natural Language Inference (NLI), Reading Comprehension (RC), and Question Answering (QA). Several format-specific models and architectures in response to those datasets have also been proposed. However, there exists a strong need for a benchmark which can evaluate the abilities of models, in performing question format independent numerical reasoning, as (i) the numerical reasoning capabilities we want to teach are not controlled by question formats, (ii) for numerical reasoning technology to have the best possible application, it must be able to process language and reason in a way that is not exclusive to a single format, task, dataset or domain. In pursuit of this goal, we introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats. We add four existing question types in our compilation. Two of the new types we add are about questions that require external numerical knowledge, commonsense knowledge and domain knowledge. For building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning: (i) detecting question format directly from data (ii) finding intermediate common format to which every format can be converted (iii) incorporating commonsense knowledge (iv) handling data imbalance across formats. We build several baselines, including a new model based on knowledge hunting using a cheatsheet. However, all baselines perform poorly in contrast to the human baselines, indicating the hardness of our benchmark. Our work takes forward the recent progress in generic system development, demonstrating the scope of these under-explored tasks.


WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge

arXiv.org Artificial Intelligence

In this paper, we present the first comprehensive categorization of essential commonsense knowledge for answering the Winograd Schema Challenge (WSC). For each of the questions, we invite annotators to first provide reasons for making correct decisions and then categorize them into six major knowledge categories. By doing so, we better understand the limitation of existing methods (i.e., what kind of knowledge cannot be effectively represented or inferred with existing methods) and shed some light on the commonsense knowledge that we need to acquire in the future for better commonsense reasoning. Moreover, to investigate whether current WSC models can understand the commonsense or they simply solve the WSC questions based on the statistical bias of the dataset, we leverage the collected reasons to develop a new task called WinoWhy, which requires models to distinguish plausible reasons from very similar but wrong reasons for all WSC questions. Experimental results prove that even though pre-trained language representation models have achieved promising progress on the original WSC dataset, they are still struggling at WinoWhy. Further experiments show that even though supervised models can achieve better performance, the performance of these models can be sensitive to the dataset distribution. WinoWhy and all codes are available at: https://github.com/HKUST-KnowComp/WinoWhy.


INFOTABS: Inference on Tables as Semi-structured Data

arXiv.org Artificial Intelligence

In this paper, we observe that semi-structured tabulated text is ubiquitous; understanding them requires not only comprehending the meaning of text fragments, but also implicit relationships between them. We argue that such data can prove as a testing ground for understanding how we reason about information. To study this, we introduce a new dataset called INFOTABS, comprising of human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes. Our analysis shows that the semi-structured, multi-domain and heterogeneous nature of the premises admits complex, multi-faceted reasoning. Experiments reveal that, while human annotators agree on the relationships between a table-hypothesis pair, several standard modeling strategies are unsuccessful at the task, suggesting that reasoning about tables can pose a difficult modeling challenge.