Knowledge that Everyone Knows. "People do not walk on their heads." The assertion comes about 900 statements deep into the 527,308 items that comprise the Open Mind common sense database. It's after "Laws are the rules of society" and before "The sky is blue during the day." This collection of mundane facts, which would take more than 20,000 pages to print out, consists entirely of statements so unremarkable they are barely worth stating. Most of us would correctly dismiss them as common sense.
– from D.C. Denison, Guess who's smarter. Boston Globe Online (page hosted at MIT), May 26, 2003.
This observation--that to understand Proust's text requires knowledge of various kinds--is not a new one. We came across it before, in the context of the Cyc project. Remember that Cyc was supposed to be given knowledge corresponding to the whole of consensus reality, and the Cyc hypothesis was that this would yield human-level general intelligence. Researchers in knowledge-based AI would be keen for me to point out to you that, decades ago, they anticipated exactly this issue. But it is not obvious that just continuing to refine deep learning techniques will address this problem.
OAKLAND, California, Dec. 14, 2020 /Press Release/ -- Silicon Valley Robotics, the world's largest cluster of innovation in robotics, announces the inaugural 'Good Robot' Industry Awards, celebrating the robotics, automation and Artificial Intelligence (AI) that will help us solve global challenges. These 52 companies and individuals have all contributed to innovation that will improve the quality of our lives, whether it's weed-free, pesticide-free farming, like FarmWise or Iron Ox; helping health workers and the elderly manage health care treatment regimes, like Catalia Health or Multiply Labs; or reimagining the logistics industry so that the transfer of physical goods becomes as efficient as the transfer of information, like Cruise, Embark, Matternet and Zipline. The categories Innovation, Vision and Commercialization represent the stages robotics companies go through, firstly with an innovative technology or product, then with a vision to change the world (and occasionally the investment to match), and finally with real evidence of customer traction. The criterion for our Commercialization Award is achieving $1 million in revenue, which is a huge milestone for a startup building a new invention. Tessa Lau, Founder and CEO of Dusty Robotics, an Innovation Awardee, said, "We're almost there." Dusty Robotics' FieldPrinter automates the painstaking, time-consuming process of marking building plans in the field, replacing a traditional process using measuring tape and chalk lines that hasn't changed in 5000 years. The company's vision of creating robot-powered tools for the modern construction workforce resonates strongly with commercial construction companies. Dusty's robot fleet is now in production, producing highly accurate layouts in record time on every floor of two multi-family residential towers going up in San Francisco. The SVR 'Good Robot' Industry Awards also highlight diverse robotics companies.
In our Visionary Category, Zoox is the first billion dollar company led by an African-American woman, Aicha Evans, and Robust AI shows diversity at every level of the organization. Diversity of thought will be critical as Robust AI tackles the challenge of building a cognitive engine for robotics that incorporates common sense reasoning. "Robotics and AI will shape the next century in the same way the Industrial revolution shaped the 20th century.
Reasoning is a critical ability for complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods that adopt the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods have utilized commonsense knowledge in visual commonsense reasoning, which we believe will be greatly helpful in this task. With the support of commonsense knowledge, complex questions can be answered with cognitive reasoning even if the required information is not depicted in the image. Therefore, we incorporate commonsense knowledge into the cross-modal BERT, and propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. In addition to the visual and linguistic contents taken as input, external commonsense knowledge extracted from ConceptNet is integrated into the multi-layer Transformer. In order to preserve the structural information and semantic representation of the original sentence, we propose using relative position embedding and mask-self-attention to weaken the effect between the injected commonsense knowledge and other unrelated components in the input sequence. Compared to other task-specific models and general task-agnostic pre-training models, our KVL-BERT outperforms them by a large margin.
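The abstract above does not spell out how relative positions and the attention mask are built. The following is a minimal sketch of one plausible reading, assuming that each injected knowledge token is attached to a single anchor token of the original sentence. The function names, the `anchors` encoding, and the toy sentence are illustrative assumptions, not the authors' implementation.

```python
def build_visibility_mask(anchors):
    """Return an n x n boolean matrix; mask[i][j] says whether token i may attend to token j.

    anchors[i] is None for a token of the original sentence, or the index of the
    original token that injected knowledge token i is attached to (assumed encoding).
    """
    n = len(anchors)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a_i, a_j = anchors[i], anchors[j]
            if a_i is None and a_j is None:
                # Original tokens always see each other.
                mask[i][j] = True
            elif a_i is not None and a_j is not None:
                # Injected tokens see only tokens from the same injection.
                mask[i][j] = a_i == a_j
            elif a_i is None:
                # An original token sees injected tokens only if it is their anchor.
                mask[i][j] = a_j == i
            else:
                # An injected token sees only its own anchor among original tokens.
                mask[i][j] = a_i == j
    return mask


def relative_positions(anchors):
    """Assign position indices so the original sentence keeps its own ordering.

    Original tokens are numbered consecutively as if nothing had been injected;
    injected tokens continue counting from their anchor. Assumes every anchor
    appears before the tokens attached to it.
    """
    positions = [0] * len(anchors)
    next_original = 0
    injected_so_far = {}
    for i, anchor in enumerate(anchors):
        if anchor is None:
            positions[i] = next_original
            next_original += 1
        else:
            injected_so_far[anchor] = injected_so_far.get(anchor, 0) + 1
            positions[i] = positions[anchor] + injected_so_far[anchor]
    return positions


# Toy sequence: "[IsA] animal" is hypothetical knowledge attached to "dog" (index 1).
tokens = ["the", "dog", "[IsA]", "animal", "barks"]
anchors = [None, None, 1, 1, None]
mask = build_visibility_mask(anchors)
positions = relative_positions(anchors)
```

In this sketch "barks" keeps position 2, the position it would have had without the injected tokens, and it cannot attend to "[IsA]" or "animal", which is the sense in which the injected knowledge is kept from disturbing unrelated parts of the sequence.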
Causality knowledge is crucial for many artificial intelligence systems. Conventional text-based causality knowledge acquisition methods typically require laborious and expensive human annotations. As a result, their scale is often limited. Moreover, as no context is provided during the annotation, the resulting causality knowledge records (e.g., ConceptNet) typically do not take the context into consideration. To explore a more scalable way of acquiring causality knowledge, in this paper, we step outside the textual domain and investigate the possibility of learning contextual causality from the visual signal. Compared with pure text-based approaches, learning causality from the visual signal has the following advantages: (1) Causality knowledge is a kind of commonsense knowledge, which is rarely expressed in text but rich in videos; (2) Most events in a video are naturally time-ordered, which provides a rich resource for us to mine causality knowledge from; (3) All the objects in the video can be used as context to study the contextual property of causal relations. In detail, we first propose a high-quality dataset, Vis-Causal, and then conduct experiments to demonstrate that with good language and visual representation models as well as enough training signals, it is possible to automatically discover meaningful causal knowledge from videos. Further analysis also shows that the contextual property of causal relations indeed exists, that taking it into consideration might be crucial if we want to use the causality knowledge in real applications, and that the visual signal could serve as a good resource for learning such contextual causality.
Acquiring commonsense knowledge and reasoning is recognized as an important frontier in achieving general Artificial Intelligence (AI). Recent research in the Natural Language Processing (NLP) community has demonstrated significant progress in this problem setting. Despite this progress, which is mainly on multiple-choice question answering tasks in limited settings, there is still a lack of understanding (especially at scale) of the nature of commonsense knowledge itself. In this paper, we propose and conduct a systematic study to enable a deeper understanding of commonsense knowledge by doing an empirical and structural analysis of the ConceptNet knowledge base. ConceptNet is a freely available knowledge base containing millions of commonsense assertions presented in natural language. Detailed experimental results on three carefully designed research questions, using state-of-the-art unsupervised graph representation learning ('embedding') and clustering techniques, reveal deep substructures in ConceptNet relations, allowing us to make data-driven and computational claims about the meaning of phenomena such as 'context' that are traditionally discussed only in qualitative terms. Furthermore, our methodology provides a case study in how to use data-science and computational methodologies for understanding the nature of an everyday (yet complex) psychological phenomenon that is an essential feature of human intelligence.
The Winograd Schema Challenge (WSC) is a common-sense reasoning task that requires background knowledge. In this paper, we contribute to tackling WSC in four ways. Firstly, we suggest a keyword method to define a restricted domain where distinctive high-level semantic patterns can be found. A thanking domain was defined by keywords, and the data set in this domain is used in our experiments. Secondly, we develop a high-level knowledge-based reasoning method using semantic roles, which is based on the method of Sharma. Thirdly, we propose an ensemble method to combine knowledge-based reasoning and machine learning, which shows the best performance in our experiments. As a machine learning method, we used Bidirectional Encoder Representations from Transformers (BERT) [Kocijan et al., 2019]. Lastly, in terms of evaluation, we suggest a "robust" accuracy measurement by modifying that of Trichelair et al. As with their switching method, we evaluate a model by considering its performance on trivial variants of each sentence in the test set.
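The abstract does not give the exact definition of the "robust" accuracy. One plausible reading, sketched below, is that a test item only counts as solved when the model answers the original sentence and every trivial variant of it correctly. The item layout, the function names, and the toy predictor are assumptions made for illustration.

```python
def plain_accuracy(items, predict):
    """Fraction of items whose original sentence is answered correctly."""
    correct = sum(1 for item in items if predict(item["original"][0]) == item["original"][1])
    return correct / len(items)


def robust_accuracy(items, predict):
    """Fraction of items answered correctly on the original AND on all variants."""
    solved = 0
    for item in items:
        cases = [item["original"]] + item["variants"]
        if all(predict(sentence) == answer for sentence, answer in cases):
            solved += 1
    return solved / len(items)


# Toy data in a thanking domain; each case is (sentence, expected referent).
items = [
    {
        "original": ("Ann thanked Bob because he helped.", "Bob"),
        "variants": [("Bob thanked Ann because she helped.", "Ann")],
    },
    {
        "original": ("Tom thanked Sue after she gave advice.", "Sue"),
        "variants": [("After giving advice, Sue was thanked by Tom.", "Sue")],
    },
]


def toy_predict(sentence):
    """Deliberately shallow stand-in for a model: always returns the third word."""
    return sentence.split()[2]


print(plain_accuracy(items, toy_predict))   # 1.0
print(robust_accuracy(items, toy_predict))  # 0.5
```

The shallow predictor scores perfectly on the original sentences but fails the reworded variant of the second item, so its robust score drops. That gap is what a measure of this kind is meant to expose.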
Pre-trained language models (PTLM) have achieved impressive results in a range of natural language understanding (NLU) and generation (NLG) tasks. However, current pre-training objectives such as masked token prediction (for BERT-style PTLMs) and masked span infilling (for T5-style PTLMs) do not explicitly model the relational commonsense knowledge about everyday concepts, which is crucial to many downstream tasks that need common sense to understand or generate. To augment PTLMs with concept-centric commonsense knowledge, in this paper, we propose both generative and contrastive objectives for learning common sense from the text, and use them as intermediate self-supervised learning tasks for incrementally pre-training PTLMs (before task-specific fine-tuning on downstream datasets). Furthermore, we develop a joint pre-training framework to unify generative and contrastive objectives so that they can mutually reinforce each other. We show that while only incrementally pre-trained on a relatively small corpus for a few steps, CALM outperforms baseline methods by a consistent margin and is even comparable with some larger PTLMs, which suggests that CALM can serve as a general, "plug-and-play" method for improving the commonsense reasoning ability of a PTLM. Pre-trained language models (PTLMs) such as BERT (Devlin et al., 2018) and T5 (Raffel et al., 2019) have revolutionized the field of NLP, yielding impressive performance on various conventional natural language understanding (NLU) and generation (NLG) tasks. BERT and its novel variants such as RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019) capture syntactical and semantic knowledge mainly from the pre-training task of masked language modeling, while T5-style models such as BART (Lewis et al., 2019) instead focus on masked span infilling tasks.
Though they yield better performance on many downstream tasks, these pre-training objectives do not explicitly guide the models to reason with concept-centric commonsense knowledge from language, including the relation and composition of daily concepts in our lives. This leaves room for equipping current PTLMs with richer commonsense reasoning ability.
Recently, transformer-based methods such as RoBERTa and GPT-3 have led to significant experimental advances in natural language processing tasks such as question answering and commonsense reasoning. The latter is typically evaluated through multiple benchmarks framed as multiple-choice instances of the former. According to influential leaderboards hosted by the Allen Institute (evaluating state-of-the-art performance on commonsense reasoning benchmarks), models based on such transformer methods are approaching human-like performance and have average accuracy well over 80% on many benchmarks. Since these are commonsense benchmarks, a model that generalizes on commonsense reasoning should not experience much performance loss across multiple commonsense benchmarks. In this paper, we study the generalization issue in detail by designing and conducting a rigorous scientific study. Using five common benchmarks, multiple controls and statistical analysis, we find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup, and may, in fact, be susceptible to dataset bias. We also perform selective studies, including qualitative and consistency analyses, to gain deeper insight into the problem.
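The abstract does not state how performance loss across benchmarks is computed. A minimal sketch of one way to quantify it follows, assuming the loss is the in-domain accuracy of a fine-tuned model minus its mean accuracy on the other benchmarks. The benchmark names "A", "B", "C" and all numbers are made-up placeholders, not results from the study.

```python
def generalization_loss(acc):
    """Return, per training benchmark, in-domain accuracy minus mean out-of-domain accuracy.

    acc maps (train_benchmark, test_benchmark) to accuracy in percentage points.
    """
    train_sets = sorted({train for train, _ in acc})
    losses = {}
    for train in train_sets:
        in_domain = acc[(train, train)]
        out_of_domain = [
            score for (tr, test), score in acc.items() if tr == train and test != train
        ]
        losses[train] = in_domain - sum(out_of_domain) / len(out_of_domain)
    return losses


# Placeholder accuracies (percentage points) for three hypothetical benchmarks.
acc = {
    ("A", "A"): 80, ("A", "B"): 50, ("A", "C"): 60,
    ("B", "A"): 52, ("B", "B"): 70, ("B", "C"): 58,
    ("C", "A"): 60, ("C", "B"): 70, ("C", "C"): 90,
}

losses = generalization_loss(acc)
```

A model that truly generalized on commonsense reasoning would show losses near zero for every training benchmark; large positive values indicate that accuracy depends on matching the fine-tuning data.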
Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which, in some cases, fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention. Part of what defines us as human and fundamentally different from machines is our instinct to seek causality behind any association, say an event Y that happened as a direct result of event X. To this end, we propose iPerceive, a framework capable of understanding the "why" between events in a video by building a common-sense knowledge base using contextual cues to infer causal relationships between objects in the video. We demonstrate the effectiveness of our technique using the dense video captioning (DVC) and video question answering (VideoQA) tasks. Furthermore, while most prior work in DVC and VideoQA relies solely on visual information, other modalities such as audio and speech are vital for a human observer's perception of an environment. We formulate DVC and VideoQA tasks as machine translation problems that utilize multiple modalities. By evaluating the performance of iPerceive DVC and iPerceive VideoQA on the ActivityNet Captions and TVQA datasets respectively, we show that our approach advances the state of the art. Code and samples are available at: iperceive.amanchadha.com.