Commonsense Reasoning
VIPHY: Probing "Visible" Physical Commonsense Knowledge
Singh, Shikhar, Qasemi, Ehsan, Chen, Muhao
In recent years, vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks (e.g. attributes, location). While such tasks measure the requisite knowledge to ground and reason over a given visual instance, they do not, however, measure the ability of VLMs to retain and generalize such knowledge. In this work, we evaluate their ability to acquire "visible" physical knowledge -- the information that is easily accessible from images of static scenes, particularly across the dimensions of object color, size and space. We build an automatic pipeline to derive a comprehensive knowledge resource for calibrating and probing these models. Our results indicate a severe gap between model and human performance across all three tasks. Furthermore, our caption pretrained baseline (CapBERT) significantly outperforms VLMs on both size and spatial tasks -- highlighting that despite sufficient access to ground language with visual modality, they struggle to retain such knowledge. The dataset and code are available at https://github.com/Axe--/ViPhy.
Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization
Kim, Seungone, Joo, Se June, Chae, Hyungjoo, Kim, Chaehyeong, Hwang, Seung-won, Yeo, Jinyoung
In this paper, we propose to leverage the unique characteristics of dialogues sharing commonsense knowledge across participants, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that solely relies on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ utilizes commonsense as supervision, where the task of generating commonsense inferences is added upon summarizing the dialogue in a multi-task learning setting. Experimental results show that with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods.
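As a rough illustration of similarity-based selection, the sketch below picks, from a set of candidate commonsense inferences, the one closest to a dialogue utterance. It is a minimal sketch under stated assumptions, not the SICK implementation: the bag-of-words cosine similarity, the helper names `bag_of_words`, `cosine_similarity` and `select_inference`, and the sample strings are all hypothetical, and the abstract does not say which similarity measure or knowledge model the authors use.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased, whitespace-tokenized word counts (an assumed, very crude encoder)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_inference(utterance, candidates):
    """Return the candidate inference most similar to the utterance."""
    u = bag_of_words(utterance)
    return max(candidates, key=lambda c: cosine_similarity(u, bag_of_words(c)))


# Hypothetical utterance and candidate inferences, for illustration only.
utterance = "i forgot my umbrella and it is raining"
candidates = [
    "the speaker wants to eat dinner",
    "the speaker will get wet because it is raining",
    "the speaker feels happy about the exam",
]
best = select_inference(utterance, candidates)
```

In practice a learned sentence encoder would likely replace the word-count vectors; the selection step (take the arg-max over similarity scores) would stay the same.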
Why Do Neural Language Models Still Need Commonsense Knowledge to Handle Semantic Variations in Question Answering?
Kwon, Sunjae, Kang, Cheongwoong, Han, Jiyeon, Choi, Jaesik
Many contextualized word representations are now learned by intricate neural network models, such as masked neural language models (MNLMs) which are made up of huge neural network structures and trained to restore the masked text. Such representations demonstrate superhuman performance in some reading comprehension (RC) tasks which extract a proper answer in the context given a question. However, identifying the detailed knowledge trained in MNLMs is challenging owing to numerous and intermingled model parameters. This paper provides new insights and empirical analyses on commonsense knowledge included in pretrained MNLMs. First, we use a diagnostic test that evaluates whether commonsense knowledge is properly trained in MNLMs. We observe that a large proportion of commonsense knowledge is not appropriately trained in MNLMs and MNLMs do not often understand the semantic meaning of relations accurately. In addition, we find that the MNLM-based RC models are still vulnerable to semantic variations that require commonsense knowledge. Finally, we discover the fundamental reason why some knowledge is not trained. We further suggest that utilizing an external commonsense knowledge repository can be an effective solution. We exemplify the possibility to overcome the limitations of the MNLM-based RC models by enriching text with the required knowledge from an external commonsense knowledge repository in controlled experiments.
Why Commonsense Knowledge is not (and can not be) Learned
Commonsense (background) knowledge, at least the kind of knowledge that we fetch and rely upon in the process of language understanding: (i) cannot be learned by processing vast amounts of text because that knowledge is never explicitly stated in the text -- and you cannot find what's not there; and (ii) cannot be learned perceptually from observation since the vast amount of the crucial background knowledge is universal, is neither probabilistic nor approximate, and so it cannot be susceptible to individual observations. The shared background knowledge needed in the process of language understanding is the kind of knowledge that obeys and respects the laws of nature and as such it has to be codified. In fact, that knowledge must be codified in a symbolic system that quantifies over variables of specific ontological types. There's a consensus among researchers investigating the neurological, psychological and evolutionary aspects of human linguistic communication that languages have evolved according to the information-theoretic principle of least effort. Specifically, it has been established that interacting communicative agents tend to produce utterances that minimize the complexity of coding a thought as well as minimize the process of decoding linguistic utterances back to the intended thought [1] -- thus finding an optimal point where the effort of both speaker and listener is minimal.
Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense Reasoning
Peng, Letian, Li, Zuchao, Zhao, Hai
Commonsense reasoning is an appealing topic in natural language processing (NLP) as it plays a fundamental role in supporting the human-like actions of NLP systems. With large-scale language models as the backbone, unsupervised pre-training on numerous corpora shows the potential to capture commonsense knowledge. Current pre-trained language model (PLM)-based reasoning follows the traditional practice using perplexity metric. However, commonsense reasoning is more than existing probability evaluation, which is biased by word frequency. This paper reconsiders the nature of commonsense reasoning and proposes a novel commonsense reasoning metric, Non-Replacement Confidence (NRC). In detail, it works on PLMs according to the Replaced Token Detection (RTD) pre-training objective in ELECTRA, in which the corruption detection objective reflects the confidence on contextual integrity that is more relevant to commonsense reasoning than existing probability. Our proposed novel method boosts zero-shot performance on two commonsense reasoning benchmark datasets and further seven commonsense question-answering datasets. Our analysis shows that pre-endowed commonsense knowledge, especially for RTD-based PLMs, is essential in downstream reasoning.
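For reference, the perplexity-based practice mentioned above can be sketched as follows: each answer candidate is scored by the perplexity a language model assigns to it, PPL = exp(-(1/N) * sum_i log p_i), and the lowest-perplexity candidate is chosen. This is a minimal sketch, not the paper's code: the per-token log-probabilities are made-up numbers standing in for real PLM output, and `perplexity` and `pick_lowest_perplexity` are hypothetical helper names. NRC itself is not reproduced here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def pick_lowest_perplexity(candidates):
    """candidates maps each answer string to its list of per-token log-probs."""
    return min(candidates, key=lambda answer: perplexity(candidates[answer]))


# Made-up log-probabilities (assumption): a real system would obtain these from a PLM.
scores = {
    "The trophy did not fit because the suitcase was too small.": [-1.2, -0.8, -1.0, -0.9],
    "The trophy did not fit because the suitcase was too large.": [-1.2, -0.8, -1.0, -2.5],
}
choice = pick_lowest_perplexity(scores)
```

Because the score depends only on token probabilities, an infrequent word can raise a candidate's perplexity regardless of whether the candidate is sensible, which is the word-frequency bias the abstract points to.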
Accurate Action Recommendation for Smart Home via Two-Level Encoders and Commonsense Knowledge
Jeon, Hyunsik, Kim, Jongjin, Yoon, Hoyoung, Lee, Jaeri, Kang, U
How can we accurately recommend actions for users to control their devices at home? Action recommendation for smart home has attracted increasing attention due to its potential impact on the markets of virtual assistants and Internet of Things (IoT). However, designing an effective action recommender system for smart home is challenging because it requires handling context correlations, considering both queried contexts and previous histories of users, and dealing with capricious intentions in history. In this work, we propose SmartSense, an accurate action recommendation method for smart home. For individual action, SmartSense summarizes its device control and its temporal contexts in a self-attentive manner, to reflect the importance of the correlation between them. SmartSense then summarizes sequences of users considering queried contexts in a query-attentive manner to extract the query-related patterns from the sequential actions. SmartSense also transfers the commonsense knowledge from routine data to better handle intentions in action sequences. As a result, SmartSense addresses all three main challenges of action recommendation for smart home, and achieves the state-of-the-art performance giving up to 9.8% higher mAP@1 than the best competitor.
Few-shot Adaptation Works with UnpredicTable Data
Chan, Jun Shern, Pieler, Michael, Jao, Jonathan, Scheurer, Jérémy, Perez, Ethan
Prior work on language models (LMs) shows that training on a large number of diverse tasks improves few-shot learning (FSL) performance on new tasks. We take this to the extreme, automatically extracting 413,299 tasks from internet tables - orders of magnitude more than the next-largest public datasets. Finetuning on the resulting dataset leads to improved FSL performance on Natural Language Processing (NLP) tasks, but not proportionally to dataset scale. In fact, we find that narrow subsets of our dataset sometimes outperform more diverse datasets. For example, finetuning on software documentation from support.google.com raises FSL performance by a mean of +7.5% on 52 downstream tasks, which beats training on 40 human-curated NLP datasets (+6.7%). Finetuning on various narrow datasets leads to similar broad improvements across test tasks, suggesting that the gains are not from domain adaptation but adapting to FSL in general. We do not observe clear patterns between the datasets that lead to FSL gains, leaving open questions about why certain data helps with FSL.
PACS: A Dataset for Physical Audiovisual CommonSense Reasoning
Yu, Samuel, Wu, Peter, Liang, Paul Pu, Salakhutdinov, Ruslan, Morency, Louis-Philippe
In order for AI to be safely deployed in real-world scenarios such as hospitals, schools, and the workplace, it must be able to robustly reason about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is fundamentally a multi-sensory task, since physical properties are manifested through multiple modalities - two of them being vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance the research field of physical reasoning by bringing audio as a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on our new challenging task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude the paper by demonstrating the importance of multimodal reasoning and providing possible avenues for future research.
Reasoning about Actions over Visual and Linguistic Modalities: A Survey
Sampat, Shailaja Keyur, Patel, Maitreya, Das, Subhasish, Yang, Yezhou, Baral, Chitta
'Actions' play a vital role in how humans interact with the world and enable them to achieve desired goals. As a result, most common sense (CS) knowledge for humans revolves around actions. While 'Reasoning about Actions & Change' (RAC) has been widely studied in the Knowledge Representation community, it has recently piqued the interest [...] As pointed out by [Davis and Marcus, 2015], imagine a guest asks a robot for a glass of wine; if the robot sees that the glass is broken or has a dead cockroach inside, it should not pour the wine and serve it. Similarly, if a cat runs in front of a house-cleaning robot, the robot should neither run it over nor sweep it up nor put it away on a shelf. Hence, the ability of artificial agents to perform reasoning and integrate CS knowledge about actions is highly desirable.
A Theoretically Grounded Benchmark for Evaluating Machine Commonsense
Santos, Henrique, Shen, Ke, Mulvehill, Alice M., Razeghi, Yasaman, McGuinness, Deborah L., Kejriwal, Mayank
Programming machines with commonsense reasoning (CSR) abilities is a longstanding challenge in the Artificial Intelligence community. Current CSR benchmarks use multiple-choice (and in relatively fewer cases, generative) question-answering instances to evaluate machine commonsense. Recent progress in transformer-based language representation models suggests that considerable progress has been made on existing benchmarks. However, although tens of CSR benchmarks currently exist, and are growing, it is not evident that the full suite of commonsense capabilities has been systematically evaluated. Furthermore, there are doubts about whether language models are 'fitting' to a benchmark dataset's training partition by picking up on subtle, but normatively irrelevant (at least for CSR), statistical features to achieve good performance on the testing partition. To address these challenges, we propose a benchmark called Theoretically-Grounded Commonsense Reasoning (TG-CSR) that is also based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense, such as space, time, and world states. TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs. The benchmark is also designed to be few-shot (and in the future, zero-shot), with only a few training and validation examples provided. This report discusses the structure and construction of the benchmark. Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks. Benchmark access and leaderboard: https://codalab.lisn.upsaclay.fr/competitions/3080 Benchmark website: https://usc-isi-i2.github.io/TGCSR/