Knowledge that Everyone Knows. "People do not walk on their heads." The assertion comes about 900 statements deep into the 527,308 items that comprise the Open Mind common sense database. It's after "Laws are the rules of society" and before "The sky is blue during the day." This collection of mundane facts, which would take more than 20,000 pages to print out, consists entirely of statements so unremarkable they are barely worth stating. Most of us would correctly dismiss them as common sense.
– from D.C. Denison, Guess who's smarter. Boston Globe Online (page hosted at MIT), May 26, 2003.
In the new paper Does BERT Solve Commonsense Task via Commonsense Knowledge?, a team of researchers from Westlake University, Fudan University and Microsoft Research Asia dive deep into the large language model to discover how it encodes the structured commonsense knowledge it leverages on downstream commonsense tasks. The proven successes of pretrained language models such as BERT on various downstream tasks have stimulated research investigating the linguistic knowledge inside the model. Previous studies have revealed shallow syntactic, semantic and word sense knowledge in BERT; however, the question of how BERT deals with commonsense tasks has been relatively unexamined. CommonsenseQA is a multiple-choice question answering dataset built upon the CONCEPTNET knowledge graph. It was constructed by extracting from CONCEPTNET multiple target concepts that share the same semantic relation to a single source concept, and each question has one of three target concepts as the correct answer. For example, "bird" is the source concept in the question "Where does a wild bird usually live?" and "countryside" is the correct answer from the possible target concepts "cage," "windowsill," and "countryside."
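The construction described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the dataset's actual tooling: the `Triple` and `Question` types, the `candidate_targets` helper, and the hand-written triples are all hypothetical, and the triples are not claimed to be real CONCEPTNET entries.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Triple:
    """One (source, relation, target) edge of a knowledge graph (hypothetical type)."""
    source: str
    relation: str
    target: str


@dataclass(frozen=True)
class Question:
    """A multiple-choice question built around one source concept (hypothetical type)."""
    text: str
    source_concept: str
    choices: Tuple[str, ...]
    answer: str


def candidate_targets(triples: Iterable[Triple], source: str, relation: str) -> Tuple[str, ...]:
    """Collect the targets that share one relation to one source concept."""
    return tuple(t.target for t in triples if t.source == source and t.relation == relation)


# Illustrative triples written by hand for this example.
triples = [
    Triple("bird", "AtLocation", "cage"),
    Triple("bird", "AtLocation", "windowsill"),
    Triple("bird", "AtLocation", "countryside"),
    Triple("bird", "CapableOf", "fly"),  # different relation, so not a candidate
]

choices = candidate_targets(triples, "bird", "AtLocation")
question = Question(
    text="Where does a wild bird usually live?",
    source_concept="bird",
    choices=choices,
    answer="countryside",
)
```

Because every candidate shares the same relation to the source concept, a model cannot pick the answer from the relation alone; it has to use the rest of the question (here, the word "wild").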
In a paper accepted to last week's International Conference on Machine Learning, researchers at University College London and the University of Oxford propose an environment -- WordCraft -- to benchmark AI agents' commonsense reasoning capabilities. Based on Little Alchemy 2, a game that tasks players with mixing ingredients to create new items, they say WordCraft is both lightweight and built upon entities and relations inspired by real-world semantics. As the researchers note, personal assistants and household robots require agents that can learn quickly and generalize well to novel situations. That's likely not possible without the ability to reason using common sense and general knowledge about the world. For instance, an agent tasked with performing common household chores that hasn't seen a dirty ashtray would need to know a reasonable set of actions, including how to clean the ashtray and to avoid feeding it to a pet.
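To make the ingredient-mixing idea concrete, here is a toy sketch of a single-step crafting task. It is not WordCraft's code or API: the `RECIPES` table, the `combine` and `step` functions, and the 0/1 reward are assumptions chosen purely for illustration.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical recipe table: an unordered pair of ingredients maps to a result.
RECIPES: Dict[FrozenSet[str], str] = {
    frozenset({"water", "fire"}): "steam",
    frozenset({"water", "earth"}): "mud",
}


def combine(first: str, second: str) -> Optional[str]:
    """Return the item made by mixing two ingredients, or None if no recipe exists."""
    return RECIPES.get(frozenset({first, second}))


def step(goal: str, first: str, second: str) -> float:
    """Score one attempt: 1.0 if the chosen pair creates the goal item, else 0.0."""
    return 1.0 if combine(first, second) == goal else 0.0
```

An agent with some general knowledge of the world (water plus fire plausibly gives steam) can do well on combinations it has never seen, which is the kind of generalization the benchmark is meant to probe.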
The national research cloud would address a problem that is a byproduct of impressive progress in recent years. The striking gains made in tasks like language understanding, computer vision, game playing and common-sense reasoning have been attained thanks to a branch of A.I. called deep learning. That technology increasingly requires immense computing firepower. A report last year from the Allen Institute for Artificial Intelligence, working with data from OpenAI, another artificial intelligence lab, observed that the volume of calculations needed to be a leader in advanced A.I. had soared an estimated 300,000 times in the previous six years. The cost of training deep learning models, cycling endlessly through troves of data, can be millions of dollars.
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
Commonsense reasoning is an important aspect of building robust AI systems and is receiving significant attention in the natural language understanding, computer vision, and knowledge graphs communities. At present, a number of valuable commonsense knowledge sources exist, with different foci, strengths, and weaknesses. In this paper, we list representative sources and their properties. Based on this survey, we propose principles and a representation model in order to consolidate them into a Common Sense Knowledge Graph (CSKG). We apply this approach to consolidate seven separate sources into a first integrated CSKG. We present statistics of CSKG, present initial investigations of its utility on four QA datasets, and list learned lessons.
One aspect of human commonsense reasoning is the ability to make presumptions about daily experiences, activities and social interactions with others. We propose a new commonsense reasoning benchmark where the task is to uncover commonsense presumptions implied by imprecisely stated natural language commands in the form of if-then-because statements. For example, in the command "If it snows at night then wake me up early because I don't want to be late for work" the speaker relies on commonsense reasoning of the listener to infer the implicit presumption that it must snow enough to cause traffic slowdowns. Such if-then-because commands are particularly important when users instruct conversational agents. We release a benchmark data set for this task, collected from humans and annotated with commonsense presumptions. We develop a neuro-symbolic theorem prover that extracts multi-hop reasoning chains and apply it to this problem. We further develop an interactive conversational framework that evokes commonsense knowledge from humans for completing reasoning chains.
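As a rough illustration of the command format, the sketch below splits an if-then-because command into its three explicit parts. The `Command` type, the regular expression, and `parse_command` are assumptions made for this example and are not the authors' parser. The sketch recovers only what is stated; the implicit presumption (that it must snow enough to slow traffic) is exactly what this kind of surface parsing cannot supply.

```python
import re
from typing import NamedTuple, Optional


class Command(NamedTuple):
    """The three explicit parts of an if-then-because command (hypothetical type)."""
    condition: str
    action: str
    goal: str


# Lazy groups so that the first "then" and the first "because" act as separators.
_PATTERN = re.compile(
    r"^\s*if\s+(?P<condition>.+?)\s+then\s+(?P<action>.+?)"
    r"\s+because\s+(?P<goal>.+?)\s*\.?\s*$",
    re.IGNORECASE,
)


def parse_command(text: str) -> Optional[Command]:
    """Split a command into condition, action and goal; return None if it does not fit."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    return Command(match["condition"], match["action"], match["goal"])


cmd = parse_command(
    "If it snows at night then wake me up early "
    "because I don't want to be late for work"
)
```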
We propose an unsupervised approach for sarcasm generation based on a non-sarcastic input sentence. Our method employs a retrieve-and-edit framework to instantiate two major characteristics of sarcasm: reversal of valence and semantic incongruity with the context, which could include shared commonsense or world knowledge between the speaker and the listener. While prior works on sarcasm generation predominantly focus on context incongruity, we show that combining valence reversal and semantic incongruity based on commonsense knowledge generates sarcasm of higher quality. Human evaluation shows that our system generates sarcasm better than human annotators 34% of the time, and better than a reinforced hybrid baseline 90% of the time.
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI). There is a wide range of strategies that can be employed to make progress on this challenge. This article deals with modeling commonsense reasoning in the domain of interpersonal interactions. The basic idea is that there are two types of commonsense reasoning: one is manifested at the logical level of physical actions, while the other deals with understanding the essence of human-human interactions. Existing approaches, based on formal logic and artificial neural networks, allow for modeling only the first type of common sense. To model the second type, it is vital to understand the motives and rules of human behavior. The proposed model is based on real-life heuristics, i.e., rules of thumb developed through the knowledge and experience of different generations. Such a knowledge base allows for the development of an expert system with inference and explanatory mechanisms (commonsense reasoning algorithms and personal models). The algorithms provide tools for situation analysis, while the personal models make it possible to identify personality traits. A system designed this way should perform the function of amplified intelligence for interactions, including human-machine interactions.
Cold War concerns: U.S. government agencies like the Defense Advanced Research Projects Agency (DARPA) fund AI research at universities such as MIT, hoping for machines that will translate Russian instantly. "I'm afraid I can't do that." The winter lasts two decades, with just a few heat waves of progress. Common-sense AI: Douglas Lenat sets out to construct an AI that can do common-sense reasoning. He develops it for 30 years before it is used commercially.
Numerical reasoning is often important to accurately understand the world. Recently, several format-specific datasets have been proposed, such as numerical reasoning in the settings of Natural Language Inference (NLI), Reading Comprehension (RC), and Question Answering (QA). Several format-specific models and architectures have also been proposed in response to those datasets. However, there exists a strong need for a benchmark that can evaluate models' ability to perform question-format-independent numerical reasoning, because (i) the numerical reasoning capabilities we want to teach are not controlled by question formats, and (ii) for numerical reasoning technology to have the best possible application, it must be able to process language and reason in a way that is not exclusive to a single format, task, dataset or domain. In pursuit of this goal, we introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats. We include four existing question types in our compilation. Two of the new types we add are about questions that require external numerical knowledge, commonsense knowledge and domain knowledge. For building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning: (i) detecting question format directly from data, (ii) finding an intermediate common format to which every format can be converted, (iii) incorporating commonsense knowledge, and (iv) handling data imbalance across formats. We build several baselines, including a new model based on knowledge hunting using a cheatsheet. However, all baselines perform poorly in contrast to the human baselines, indicating the hardness of our benchmark. Our work takes forward the recent progress in generic system development, demonstrating the scope of these under-explored tasks.