

TaBERT: A new model for understanding queries over tabular data


TaBERT is the first model that has been pretrained to learn representations for both natural language sentences and tabular data. These sorts of representations are useful for natural language understanding tasks that involve joint reasoning over natural language sentences and tables. A representative example is semantic parsing over databases, where a natural language question (e.g., "Which country has the highest GDP?") is mapped to a program executable over database (DB) tables. This is the first pretraining approach across structured and unstructured domains, and it opens new possibilities regarding semantic parsing, where one of the key challenges has been understanding the structure of a DB table and how it aligns with a query. TaBERT has been trained using a corpus of 26 million tables and their associated English sentences.
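To make the semantic parsing task concrete, here is a minimal sketch of the mapping for the example question above. The SQL program is written by hand and is not TaBERT output; the table name `countries`, its columns, and the toy rows are assumptions made purely for illustration.

```python
import sqlite3

def highest_gdp_country(rows):
    """Run a hand-written program for "Which country has the highest GDP?".

    rows: iterable of (name, gdp) tuples. The schema is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE countries (name TEXT, gdp REAL)")
    conn.executemany("INSERT INTO countries VALUES (?, ?)", rows)
    # The "program" a semantic parser would be expected to produce
    query = "SELECT name FROM countries ORDER BY gdp DESC LIMIT 1"
    (name,) = conn.execute(query).fetchone()
    conn.close()
    return name

# Placeholder values, not real GDP figures
toy_rows = [("Country A", 1.5), ("Country B", 3.2), ("Country C", 2.1)]
print(highest_gdp_country(toy_rows))
```

The hard part that a model has to learn is the step this sketch skips: working out that "country" and "GDP" in the question refer to the `name` and `gdp` columns of this particular table.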

CareCall: a Call-Based Active Monitoring Dialog Agent for Managing COVID-19 Pandemic


CareCall asks polar questions to monitored subjects, and they need to answer simply 'yes' or 'no' to the questions. Most of the monitored subjects could easily interact with the voice agent of CareCall. However, since older people tended to respond more freely, it was difficult for the dialog system to classify the utterances of older people. This is a challenging technology issue we need to tackle. Firstly, a voice-based dialog system is required to be able to understand unexpected types of user utterances. Therefore, the NLU module could be crucial in this voice-based interface.

Blender Bot -- Part 3: The Many Architectures


We have been looking into Facebook's open-sourced conversational offering, Blender Bot. In Part-1 we went over in detail the datasets used in its pre-training and fine-tuning, as well as Blender's failure cases and limitations. And in Part-2 we studied the more generic problem setting of "Multi-Sentence Scoring", the Transformer architectures used for such a task and learnt about the Poly-Encoders in particular -- which will be used to provide the encoder representations in Blender. In this 3rd and final part, we return from our respite with Poly-Encoders, back to Blender. We shall go over the different model architectures, their respective training objectives, the evaluation methods and the performance of Blender in comparison to Meena.

Usage of speaker embeddings for more inclusive speech-to-text


English is one of the most widely used languages worldwide, with approximately 1.2 billion speakers. In order to maximise the performance of speech-to-text systems, it is vital to build them in a way that recognises different accents. Recently, spoken dialogue systems have been incorporated into various devices such as smartphones, call services, and navigation systems. These intelligent agents can assist users in performing daily tasks such as booking tickets, setting up calendar items, or finding restaurants via spoken interaction. They have the potential to be more widely used in a vast range of applications in the future, especially in the education, government, healthcare, and entertainment sectors.

Amazon Uses Self-Learning to Teach Alexa to Correct its Own Mistakes


Digital assistants such as Alexa, Siri, Cortana or the Google Assistant are some of the best examples of mainstream adoption of artificial intelligence (AI) technologies. These assistants are becoming more prevalent and tackling new domain-specific tasks, which makes the maintenance of their underlying AI particularly challenging. The traditional approach to building digital assistants has been based on natural language understanding (NLU) and automatic speech recognition (ASR) methods that rely on annotated datasets. Recently, the Amazon Alexa team published a paper proposing a self-learning method that allows Alexa to correct mistakes while interacting with users. The rapid evolution of language and speech AI methods has made the promise of digital assistants a reality.

MIT work raises a question: Can robots be teammates with humans rather than slaves? ZDNet


The image that most of society has of robots is that of slaves -- creations that can be forced to do what humans want. Researchers at the Massachusetts Institute of Technology have formed an interesting take on the robot question that is less about slavery, more about cooperation. They observed that language is a function of humans cooperating on tasks, and imagined how robots might use language when working with humans to achieve some result. The word "team" appears prominently at the top of the paper, "Decision-Making for Bidirectional Communication in Sequential Human-Robot Collaborative Tasks," written by scientists Vaibhav V. Unhelkar, Shen Li, and Julie A. Shah of the Computer Science and AI Laboratories at MIT and posted on the MIT Web site on March 31st. The use of the word "team" is significant given the structure of the experiment the scientists designed.

Onboarding Virtual Assistant for Banking: Behind the Scene ( Part I ) SmartLake


In this article we are going to show how we built this simple experiment using various cloud-based services. In order for the virtual assistant to interpret what a user wants to do, we must define user intents. One example of an intent is opening an account. Once we have created the intent, we need to define how the user will express that intent. In this case, we need to input utterances, i.e. variations of possible user statements for the intent.
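The relationship between intents and utterances can be sketched as follows. This is only an illustration: the intent names, the sample utterances, and the word-overlap matcher are all assumptions, and the second intent (`check_balance`) is hypothetical. A cloud NLU service would normally train a classifier on the utterances rather than compare words directly.

```python
# Each intent maps to example utterances (all values here are illustrative).
INTENTS = {
    "open_account": [
        "i want to open an account",
        "open a new account",
        "how do i create an account",
    ],
    "check_balance": [  # hypothetical second intent, added for contrast
        "what is my account balance",
        "show me my balance",
    ],
}

def tokenize(text):
    """Lowercase the text, drop basic punctuation, and return a set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())

def detect_intent(message, intents=INTENTS, threshold=0.3):
    """Return the intent whose sample utterance best overlaps the message.

    Overlap is the Jaccard similarity between word sets. Returns None when
    no utterance reaches the threshold.
    """
    words = tokenize(message)
    best_intent, best_score = None, 0.0
    for intent, utterances in intents.items():
        for utterance in utterances:
            sample = tokenize(utterance)
            score = len(words & sample) / len(words | sample)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None

print(detect_intent("I would like to open an account"))
```

Adding more utterance variations per intent widens the range of phrasings that get matched, which is why the utterance list matters as much as the intent itself.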

You Impress Me: Dialogue Generation via Mutual Persona Perception Artificial Intelligence

Despite the continuing efforts to improve the engagingness and consistency of chit-chat dialogue systems, the majority of current work simply focuses on mimicking human-like responses, leaving understudied the aspects of modeling understanding between interlocutors. The research in cognitive science, instead, suggests that understanding is an essential signal for a high-quality chit-chat conversation. Motivated by this, we propose P^2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding. Specifically, P^2 Bot incorporates mutual persona perception to enhance the quality of personalized dialogue generation. Experiments on a large public dataset, Persona-Chat, demonstrate the effectiveness of our approach, with a considerable boost over the state-of-the-art baselines across both automatic metrics and human evaluations.

KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation Artificial Intelligence

Research on knowledge-driven conversational systems is largely limited by the lack of dialog data that consist of multi-turn conversations on multiple topics with knowledge annotations. In this paper, we propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs. Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transitions between multiple topics. To facilitate follow-up research on this corpus, we provide several benchmark models. Comparative results show that the models can be enhanced by introducing background knowledge, yet there is still considerable room for leveraging knowledge to model multi-turn conversations in further research. Results also show that there are obvious performance differences between domains, indicating that transfer learning and domain adaptation are worth exploring further. The corpus and benchmark models are publicly available.

Multi-Scale Aggregation Using Feature Pyramid Module for Text-Independent Speaker Verification Machine Learning

Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, convolutional neural networks are mainly used as a frame-level feature extractor, and speaker embeddings are extracted from the last layer of the feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor, has recently been introduced into the approach and has shown improved performance for both short and long utterances. This paper improves the MSA by using a feature pyramid module, which enhances speaker-discriminative information of features at multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings using the enhanced features that contain rich speaker information at different resolutions. Experiments on the VoxCeleb dataset show that the proposed module improves previous MSA methods with a smaller number of parameters, providing better performance than state-of-the-art approaches.
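The idea of a top-down pathway with lateral connections can be illustrated with a deliberately simplified sketch. It is not the paper's architecture. It assumes that each layer's feature map is a single 1-D sequence of floats, that lateral connections are identity mappings (a real module would typically use learned convolutions), that upsampling is nearest-neighbour, and that pooling is a plain mean. All function names are hypothetical.

```python
def upsample_nearest(coarse, target_len):
    """Stretch a coarse sequence to target_len by repeating values."""
    n = len(coarse)
    return [coarse[i * n // target_len] for i in range(target_len)]

def top_down_enhance(features):
    """Enhance features ordered from shallow (long) to deep (short).

    Starting at the deepest layer, each shallower layer receives the
    upsampled deeper features (top-down pathway) added to its own
    features (lateral connection).
    """
    enhanced = [None] * len(features)
    enhanced[-1] = list(features[-1])  # the deepest layer starts the pathway
    for level in range(len(features) - 2, -1, -1):
        top_down = upsample_nearest(enhanced[level + 1], len(features[level]))
        enhanced[level] = [a + b for a, b in zip(features[level], top_down)]
    return enhanced

def multi_scale_embedding(features):
    """Mean-pool every enhanced layer and concatenate the results."""
    return [sum(f) / len(f) for f in top_down_enhance(features)]

# Three toy "layers" at decreasing temporal resolution
layers = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0], [100.0]]
print(multi_scale_embedding(layers))  # [117.5, 115.0, 100.0]
```

In this toy run the shallow layers end up carrying information from the deeper ones, which is the effect the top-down pathway is meant to achieve before the multi-scale features are aggregated.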