Collaborating Authors

Lin, Bill Yuchen

Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning Artificial Intelligence

Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey Corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-agnostic probing task for fairly evaluating the common sense of popular ML-LMs across different languages. In addition, we create two new datasets, X-CSQA and X-CODAH, by translating their English versions into 15 other languages, so that we can evaluate popular ML-LMs for cross-lingual commonsense reasoning. To improve performance beyond English, we propose a simple yet effective method -- multilingual contrastive pre-training (MCP). It significantly enhances sentence representations, yielding a large performance gain on both benchmarks.

FedNLP: A Research Platform for Federated Learning in Natural Language Processing Artificial Intelligence

Increasing concerns and regulations about data privacy necessitate the study of privacy-preserving methods for natural language processing (NLP) applications. Federated learning (FL) provides promising methods for a large number of clients (i.e., personal devices or organizations) to collaboratively learn a shared global model that benefits all clients, while allowing users to keep their data local. To facilitate FL research in NLP, we present FedNLP, a research platform for federated learning in NLP. FedNLP supports various popular task formulations in NLP such as text classification, sequence tagging, question answering, seq2seq generation, and language modeling. We also implement an interface between Transformer language models (e.g., BERT) and FL methods (e.g., FedAvg, FedOpt, etc.) for distributed training. The evaluation protocol of this interface supports a comprehensive collection of non-IID partitioning strategies. Our preliminary experiments with FedNLP reveal that there exists a large performance gap between learning on decentralized and centralized datasets -- opening intriguing and exciting future research directions aimed at developing FL methods suited to NLP tasks.
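The FedAvg method mentioned above aggregates client models by averaging their parameters, weighting each client by the amount of local data it trained on. The following is a minimal sketch of that aggregation step only, not FedNLP's actual interface: the `fedavg` function, the `Weights` type, and the toy client values are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def fedavg(client_updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client parameters, weighting each client by its number of local examples."""
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("need at least one training example across clients")

    # Start from zeros with the same shape as the first client's parameters.
    first_weights, _ = client_updates[0]
    averaged = {name: [0.0] * len(values) for name, values in first_weights.items()}

    for weights, n in client_updates:
        share = n / total  # this client's fraction of all training examples
        for name, values in weights.items():
            for i, value in enumerate(values):
                averaged[name][i] += share * value
    return averaged


# Toy example (made-up numbers): two clients holding 10 and 30 examples.
clients = [
    ({"w": [1.0, 2.0], "b": [0.0]}, 10),
    ({"w": [3.0, 4.0], "b": [1.0]}, 30),
]
global_weights = fedavg(clients)
```

In a full round, the server would send `global_weights` back to the clients for the next round of local training; that communication loop is omitted here.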

RiddleSense: Answering Riddle Questions as Commonsense Reasoning Artificial Intelligence

A riddle is a mystifying, puzzling question about everyday concepts. For example, the riddle "I have five fingers but I am not alive. What am I?" asks about the concept of a glove. Solving riddles is a challenging cognitive process for humans, in that it requires complex commonsense reasoning abilities and an understanding of figurative language. However, there are currently no commonsense reasoning datasets that test these abilities. We propose RiddleSense, a novel multiple-choice question answering challenge for benchmarking higher-order commonsense reasoning models. It is the first large dataset for riddle-style commonsense question answering, with distractors crowdsourced from human annotators. We systematically evaluate a wide range of reasoning models on it and find a large gap between the best supervised model and human performance -- pointing to interesting future research on higher-order commonsense reasoning and computational creativity.

Pre-training Text-to-Text Transformers for Concept-centric Common Sense Artificial Intelligence

Pre-trained language models (PTLMs) have achieved impressive results in a range of natural language understanding (NLU) and generation (NLG) tasks. However, current pre-training objectives such as masked token prediction (for BERT-style PTLMs) and masked span infilling (for T5-style PTLMs) do not explicitly model the relational commonsense knowledge about everyday concepts, which is crucial to many downstream tasks that need common sense to understand or generate. To augment PTLMs with concept-centric commonsense knowledge, in this paper, we propose both generative and contrastive objectives for learning common sense from text, and use them as intermediate self-supervised learning tasks for incrementally pre-training PTLMs (before task-specific fine-tuning on downstream datasets). Furthermore, we develop a joint pre-training framework to unify generative and contrastive objectives so that they can mutually reinforce each other. We show that while only incrementally pre-trained on a relatively small corpus for a few steps, our concept-aware language model (CALM) outperforms baseline methods by a consistent margin and is even comparable with some larger PTLMs, which suggests that CALM can serve as a general, "plug-and-play" method for improving the commonsense reasoning ability of a PTLM.
Pre-trained language models (PTLMs) such as BERT (Devlin et al., 2018) and T5 (Raffel et al., 2019) have revolutionized the field of NLP, yielding impressive performance on various conventional natural language understanding (NLU) and generation (NLG) tasks. BERT and its novel variants such as RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019) capture syntactical and semantic knowledge mainly from the pre-training task of masked language modeling, while T5-style models such as BART (Lewis et al., 2019) instead focus on masked span infilling tasks. Though they yield better performance on many downstream tasks, these pre-training objectives do not explicitly guide the models to reason with concept-centric commonsense knowledge from language, including the relation and composition of daily concepts in our lives. This leaves room for equipping current PTLMs with richer commonsense reasoning ability.

Differentiable Open-Ended Commonsense Reasoning Artificial Intelligence

Current commonsense reasoning research mainly focuses on developing models that use commonsense knowledge to answer multiple-choice questions. However, systems designed to answer multiple-choice questions may not be useful in applications that do not provide a small list of possible candidate answers to choose from. As a step towards making commonsense reasoning research more realistic, we propose to study open-ended commonsense reasoning (OpenCSR) -- the task of answering a commonsense question without any pre-defined choices, using as a resource only a corpus of commonsense facts written in natural language. The task is challenging due to a much larger decision space, and because many commonsense questions require multi-hop reasoning. We propose an efficient differentiable model for multi-hop reasoning over knowledge facts, named DrFact. We evaluate our approach on a collection of re-formatted, open-ended versions of popular tests targeting commonsense reasoning, and show that our approach outperforms strong baseline methods by a large margin.

RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms Artificial Intelligence

Pre-trained language models (PTLMs) have impressive performance on commonsense inference benchmarks, but their ability to practically employ commonsense to communicate with humans is fiercely debated. Prior evaluations of PTLMs have focused on factual world knowledge or the ability to reason when the necessary knowledge is provided explicitly. However, effective communication with humans requires inferences based on implicit commonsense relationships, and robustness despite paraphrasing. In the pursuit of advancing fluid human-AI communication, we propose a new challenge, RICA, that evaluates the capabilities of making commonsense inferences and the robustness of these inferences to language variations. In our work, we develop a systematic procedure to probe PTLMs across three different evaluation settings. Extensive experiments on our generated probe sets show that PTLMs perform no better than random guessing (even with fine-tuning), are heavily impacted by statistical biases, and are not robust to perturbation attacks. Our framework and probe sets can help future work improve PTLMs' inference abilities and robustness to linguistic variations, bringing us closer to more fluid communication.

Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models Artificial Intelligence

Recent works show that pre-trained language models (PTLMs), such as BERT, possess certain commonsense and factual knowledge. They suggest that it is promising to use PTLMs as "neural knowledge bases" via predicting masked words. Surprisingly, we find that this may not work for numerical commonsense knowledge (e.g., a bird usually has two legs). In this paper, we investigate whether and to what extent we can induce numerical commonsense knowledge from PTLMs as well as the robustness of this process. To study this, we introduce a novel probing task with a diagnostic dataset, NumerSense, containing 13.6k masked-word-prediction probes (10.5k for fine-tuning and 3.1k for testing). Our analysis reveals that: (1) BERT and its stronger variant RoBERTa perform poorly on the diagnostic dataset prior to any fine-tuning; (2) fine-tuning with distant supervision brings some improvement; (3) the best supervised model still performs poorly as compared to human performance (54.06% vs 96.3% in accuracy).

LEAN-LIFE: A Label-Efficient Annotation Framework Towards Learning from Explanation Artificial Intelligence

Successfully training a deep neural network demands a huge corpus of labeled data. However, each label only provides limited information to learn from, and collecting the requisite number of labels involves massive human effort. In this work, we introduce LEAN-LIFE, a web-based, Label-Efficient AnnotatioN framework for sequence labeling and classification tasks, with an easy-to-use UI that not only allows an annotator to provide the needed labels for a task, but also enables LearnIng From Explanations for each labeling decision. Such explanations enable us to generate useful additional labeled data from unlabeled instances, bolstering the pool of available training data. On three popular NLP tasks (named entity recognition, relation extraction, sentiment analysis), we find that using this enhanced supervision allows our models to surpass competitive baseline F1 scores by more than 5-10 percentage points, while using 2X fewer labeled instances. Our framework is the first to utilize this enhanced supervision technique and does so for three important tasks -- thus providing improved annotation recommendations to users and an ability to build datasets of (data, label, explanation) triples instead of the regular (data, label) pair.

CommonGen: A Constrained Text Generation Dataset Towards Generative Commonsense Reasoning Artificial Intelligence

Rational humans can generate sentences that cover a certain set of concepts while describing natural and common scenes. For example, given {apple(noun), tree(noun), pick(verb)}, humans can easily come up with scenes like "a boy is picking an apple from a tree" via their generative commonsense reasoning ability. However, we find this capacity has not been well learned by machines. Most prior works in machine commonsense focus on discriminative reasoning tasks with a multi-choice question answering setting. Herein, we present CommonGen: a challenging dataset for testing generative commonsense reasoning with a constrained text generation task. We collect 37k concept-sets as inputs and 90k human-written sentences as associated outputs. Additionally, we provide high-quality rationales behind the reasoning process for the development and test sets from the human annotators. We demonstrate the difficulty of the task by examining a wide range of sequence generation methods with both automatic metrics and human evaluation. The state-of-the-art pre-trained generation model, UniLM, is still far from human performance in this task. Our data and code are publicly available.
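The constraint in this task is that a generated sentence should mention every concept in the input set, possibly in an inflected form ("picking" for "pick"). The sketch below illustrates one simple way to check that constraint; it is not CommonGen's official evaluation code. The `naive_stem` helper is a deliberately crude, hypothetical stand-in for a real lemmatizer, and `concept_coverage` is a name chosen here for illustration.

```python
import re
from typing import Iterable


def naive_stem(token: str) -> str:
    """Crude suffix stripping; an assumption standing in for a real lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain, so "is" stays "is".
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def concept_coverage(concepts: Iterable[str], sentence: str) -> float:
    """Fraction of input concepts whose stem appears among the sentence's word stems."""
    stems = {naive_stem(tok) for tok in re.findall(r"[a-z]+", sentence.lower())}
    concept_list = list(concepts)
    if not concept_list:
        return 0.0
    covered = sum(1 for c in concept_list if naive_stem(c.lower()) in stems)
    return covered / len(concept_list)


# The example concept set and sentence from the abstract above.
full = concept_coverage(["apple", "tree", "pick"], "a boy is picking an apple from a tree")
# A sentence that mentions only one of the three concepts.
partial = concept_coverage(["apple", "tree", "pick"], "a boy climbs a tree")
```

Coverage alone says nothing about whether the sentence describes a plausible scene, which is why the abstract pairs automatic metrics with human evaluation.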

KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning Artificial Intelligence

Commonsense reasoning aims to empower machines with the human ability to make presumptions about ordinary situations in our daily life. In this paper, we propose a textual inference framework for answering commonsense questions, which effectively utilizes external, structured commonsense knowledge graphs to perform explainable inferences. The framework first grounds a question-answer pair from the semantic space to the knowledge-based symbolic space as a schema graph, a related sub-graph of external knowledge graphs. It represents schema graphs with a novel knowledge-aware graph network module named KagNet, and finally scores answers with graph representations. Our model is based on graph convolutional networks and LSTMs, with a hierarchical path-based attention mechanism. The intermediate attention scores make it transparent and interpretable, and thus produce trustworthy inferences. Using ConceptNet as the only external resource for BERT-based models, we achieved state-of-the-art performance on CommonsenseQA, a large-scale dataset for commonsense reasoning.
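To make the idea of a schema graph more concrete, the sketch below enumerates short relational paths between a question concept and an answer concept in a toy knowledge graph. It is only an illustration under stated assumptions: the triples are made up in a ConceptNet-like style, `build_adjacency` and `find_paths` are hypothetical helpers, and plain breadth-first search is used here in place of whatever grounding and pruning procedure the paper actually applies.

```python
from collections import deque
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head concept, relation, tail concept)


def build_adjacency(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by node, adding an inverse edge so paths can run in both directions."""
    adj: Dict[str, List[Tuple[str, str]]] = {}
    for head, rel, tail in triples:
        adj.setdefault(head, []).append((rel, tail))
        adj.setdefault(tail, []).append((rel + "^-1", head))
    return adj


def find_paths(adj: Dict[str, List[Tuple[str, str]]],
               source: str, target: str, max_hops: int = 2) -> List[List[str]]:
    """Breadth-first enumeration of simple paths [node, rel, node, ...] within max_hops."""
    paths: List[List[str]] = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        hops = len(path) // 2  # a path with k edges has 2k + 1 entries
        if node == target and hops > 0:
            paths.append(path)
            continue
        if hops == max_hops:
            continue
        for rel, nxt in adj.get(node, []):
            if nxt not in path[::2]:  # even positions hold nodes; avoid revisiting
                queue.append(path + [rel, nxt])
    return paths


# Toy, made-up triples; not taken from ConceptNet itself.
toy_triples = [
    ("child", "Desires", "play"),
    ("play", "RelatedTo", "toy"),
    ("child", "AtLocation", "school"),
]
adjacency = build_adjacency(toy_triples)
paths = find_paths(adjacency, "child", "toy", max_hops=2)
```

The union of such paths over all question-answer concept pairs gives a small sub-graph of the kind a graph network could then encode and score.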