Collaborating Authors

Commonsense Reasoning

Nvidia and Microsoft's new model may trump GPT-3 in race to NLP supremacy


Chipmaker Nvidia and Microsoft claim they have built the world's largest artificial intelligence (AI) powered language model to date. The model, called the Megatron-Turing Natural Language Generation (MT-NLP) is a successor to the two companies' earlier work, which gave rise to the Turing NLG 17B and Megatron-LM models. It contains 530 parameters, which the companies claim will bring "unmatched" accuracy when the AI is put to work on natural learning tasks. This includes reading, common sense reasoning, word sense disambiguation and natural language inferences. In comparison to MT-NLP, OpenAI's GPT-3 AI has only 175 billion parameters.

Conversational Multi-Hop Reasoning with Neural Commonsense Knowledge and Symbolic Logic Rules Artificial Intelligence

One of the challenges faced by conversational agents is their inability to identify unstated presumptions of their users' commands, a task trivial for humans due to their common sense. In this paper, we propose a zero-shot commonsense reasoning system for conversational agents in an attempt to achieve this. Our reasoner uncovers unstated presumptions from user commands satisfying a general template of if-(state), then-(action), because-(goal). Our reasoner uses a state-of-the-art transformer-based generative commonsense knowledge base (KB) as its source of background knowledge for reasoning. We propose a novel and iterative knowledge query mechanism to extract multi-hop reasoning chains from the neural KB which uses symbolic logic rules to significantly reduce the search space. Similar to any KBs gathered to date, our commonsense KB is prone to missing knowledge. Therefore, we propose to conversationally elicit the missing knowledge from human users with our novel dynamic question generation strategy, which generates and presents contextualized queries to human users. We evaluate the model with a user study with human users that achieves a 35% higher success rate compared to SOTA.

Benchmarking Commonsense Knowledge Base Population with an Effective Evaluation Dataset Artificial Intelligence

Reasoning over commonsense knowledge bases (CSKB) whose elements are in the form of free-text is an important yet hard task in NLP. While CSKB completion only fills the missing links within the domain of the CSKB, CSKB population is alternatively proposed with the goal of reasoning unseen assertions from external resources. In this task, CSKBs are grounded to a large-scale eventuality (activity, state, and event) graph to discriminate whether novel triples from the eventuality graph are plausible or not. However, existing evaluations on the population task are either not accurate (automatic evaluation with randomly sampled negative examples) or of small scale (human annotation). In this paper, we benchmark the CSKB population task with a new large-scale dataset by first aligning four popular CSKBs, and then presenting a high-quality human-annotated evaluation set to probe neural models' commonsense reasoning ability. We also propose a novel inductive commonsense reasoning model that reasons over graphs. Experimental results show that generalizing commonsense reasoning on unseen assertions is inherently a hard task. Models achieving high accuracy during training perform poorly on the evaluation set, with a large gap between human performance. We will make the data publicly available for future contributions. Codes and data are available at

Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning Artificial Intelligence

Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions including East Asia, South Asia, and Africa is significantly lower than that for Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at

STaCK: Sentence Ordering with Temporal Commonsense Knowledge Artificial Intelligence

Sentence order prediction is the task of finding the correct order of sentences in a randomly ordered document. Correctly ordering the sentences requires an understanding of coherence with respect to the chronological sequence of events described in the text. Document-level contextual understanding and commonsense knowledge centered around these events are often essential in uncovering this coherence and predicting the exact chronological order. In this paper, we introduce STaCK -- a framework based on graph neural networks and temporal commonsense knowledge to model global information and predict the relative order of sentences. Our graph network accumulates temporal evidence using knowledge of `past' and `future' and formulates sentence ordering as a constrained edge classification problem. We report results on five different datasets, and empirically show that the proposed method is naturally suitable for order prediction. The implementation of this work is publicly available at:

CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge Artificial Intelligence

Most benchmark datasets targeting commonsense reasoning focus on everyday scenarios: physical knowledge like knowing that you could fill a cup under a waterfall [Talmor et al., 2019], social knowledge like bumping into someone is awkward [Sap et al., 2019], and other generic situations. However, there is a rich space of commonsense inferences anchored to knowledge about specific entities: for example, deciding the truthfulness of a claim "Harry Potter can teach classes on how to fly on a broomstick." Can models learn to combine entity knowledge with commonsense reasoning in this fashion? We introduce CREAK, a testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities (Harry Potter is a wizard and is skilled at riding a broomstick) with commonsense inferences (if you're good at a skill you can teach others how to do it). Our dataset consists of 13k human-authored English claims about entities that are either true or false, in addition to a small contrast set. Crowdworkers can easily come up with these statements and human performance on the dataset is high (high 90s); we argue that models should be able to blend entity knowledge and commonsense reasoning to do well here. In our experiments, we focus on the closed-book setting and observe that a baseline model finetuned on existing fact verification benchmark struggles on CREAK. Training a model on CREAK improves accuracy by a substantial margin, but still falls short of human performance. Our benchmark provides a unique probe into natural language understanding models, testing both its ability to retrieve facts (e.g., who teaches at the University of Chicago?) and unstated commonsense knowledge (e.g., butlers do not yell at guests).

Technical Perspective: The Importance of WINOGRANDE

Communications of the ACM

Excelling at a test often does not translate into excelling at the skills the test purports to measure. This is true not only of humans but also of AI systems, and the more so the greater the claims of the test's significance. This became evident less than a decade after the introduction of the Winograd Schema Challenge (WSC),3 a test designed to measure an AI system's commonsense reasoning (CSR) ability by answering simple questions. An example would be, given the information: The sculpture rolled off the shelf because it wasn't anchored, answering: What wasn't anchored? There are multiple AI systems2 that achieve human performance on the WSC but are not capable of performing CSR.


Communications of the ACM

Commonsense reasoning remains a major challenge in AI, and yet, recent progresses on benchmarks may seem to suggest otherwise. In particular, the recent neural language models have reported above 90% accuracy on the Winograd Schema Challenge (WSC),22 a commonsense benchmark originally designed to be unsolvable for statistical models that rely simply on word associations. This raises an important question--whether these models have truly acquired robust commonsense capabilities or they rely on spurious biases in the dataset that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) large-scale crowdsourcing, followed by (2) systematic bias reduction using a novel AFLITE algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. Our experiments demonstrate that state-of-the-art models achieve considerably lower accuracy (59.4%-79.1%) Furthermore, we report new state-of-the-art results on five related benchmarks with emphasis on their dual implications. On the one hand, they demonstrate the effectiveness of WINOGRANDE when used as a resource for transfer learning. On the other hand, the high performance on all these benchmarks suggests the extent to which spurious biases are prevalent in all such datasets, which motivates further research on algorithmic bias reduction. Commonsense reasoning has been a long-standing open research question in AI.5 The Winograd Schema Challenge (WSC),22 proposed as an alternative to the Turing Test,39 has been regarded as a prototypical benchmark to test commonsense capabilities in AI.

Interpretable Visual Understanding with Cognitive Attention Network Artificial Intelligence

While image understanding on recognition-level has achieved remarkable advancements, reliable visual scene understanding requires comprehensive image understanding on recognition-level but also cognition-level, which calls for exploiting the multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module to fuse information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query and response. Extensive experiments on large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at

SAPPHIRE: Approaches for Enhanced Concept-to-Text Generation Artificial Intelligence

We motivate and propose a suite of simple but effective improvements for concept-to-text generation called SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. We demonstrate their effectiveness on generative commonsense reasoning, a.k.a. the CommonGen task, through experiments using both BART and T5 models. Through extensive automatic and human evaluation, we show that SAPPHIRE noticeably improves model performance. An in-depth qualitative analysis illustrates that SAPPHIRE effectively addresses many issues of the baseline model generations, including lack of commonsense, insufficient specificity, and poor fluency.