Uday Kamath has more than 20 years of experience architecting and building analytics-based commercial solutions. He currently works as the Chief Analytics Officer at Digital Reasoning, one of the leading companies in AI for NLP and Speech Recognition, heading the Applied Machine Learning research group. Most recently, Uday served as the Chief Data Scientist at BAE Systems Applied Intelligence, building machine learning products and solutions for the financial industry, focused on fraud, compliance, and cybersecurity. Uday has previously authored many books on machine learning such as Machine Learning: End-to-End guide for Java developers: Data Analysis, Machine Learning, and Neural Networks simplified and Mastering Java Machine Learning: A Java developer's guide to implementing machine learning and big data architectures. Uday has published many academic papers in different machine learning journals and conferences.
The effects of adding pitch and voice quality features such as jitter and shimmer to a state-of-the-art CNN model for Automatic Speech Recognition are studied in this work. Pitch features have been previously used for improving classical HMM and DNN baselines, while jitter and shimmer parameters have proven to be useful for tasks like speaker or emotion recognition. Up to our knowledge, this is the first work combining such pitch and voice quality features with modern convolutional architectures, showing improvements up to 2% absolute WER points, for the publicly available Spanish Common Voice dataset. Particularly, our work combines these features with mel-frequency spectral coefficients (MFSCs) to train a convolutional architecture with Gated Linear Units (Conv GLUs). Such models have shown to yield small word error rates, while being very suitable for parallel processing for online streaming recognition use cases. We have added pitch and voice quality functionality to Facebook's wav2letter speech recognition framework, and we provide with such code and recipes to the community, to carry on with further experiments. Besides, to the best of our knowledge, our Spanish Common Voice recipe is the first public Spanish recipe for wav2letter.
AutoML, the application of machine learning to create new automation tools, is branching out to new use cases, making itself useful for particularly tedious data science tasks when training speech recognition models. Among the latest attempts at automating the data science workflow is an AutoML tool from Deepgram, offering what the speech recognition vendor claims is a new model training framework for machine transcription. The startup's investors include Nvidia GPU Ventures and In-Q-Tel, the venture arm of the U.S. intelligence community. Deepgram's flagship platform scans audio data to train a speech recognition tool. Its deep learning tool uses a hybrid convolutional/recurrent neural network approach, training models via GPU accelerators.
Today's voice assistants are fairly limited in their conversational abilities and we look forward to their evolution toward The ambition of AI research is not solely to create intelligent increasing capability. Smart speakers and voice applications artifacts that have the same capabilities as people; are a result of the foundational research that has come to we also seek to enhance our intelligence and, in particular, life in today's consumer products. These systems can complete to build intelligent artifacts that assist in our intellectual simple tasks well: send and read text messages; answer activities. Intelligent assistants are a central component basic informational queries; set timers and calendar of a long history of using computation to improve human entries; set reminders, make lists, and do basic math calculations; activities, dating at least back to the pioneering work control Internet-of-Things-enabled devices such of Douglas Engelbart (1962). Early examples of intelligent as thermostats, lights, alarms, and locks; and tell jokes and assistants include sales assistants (McDermott 1982), stories (Hoy 2018). Although voice assistants have greatly scheduling assistants (Fox and Smith 1984), intelligent tutoring improved in the last few years, when it comes to more complicated systems (Grignetti, Hausmann, and Gould,Anderson, routines, such as rescheduling appointments in a Boyle, and Reiser 1975, 1985), and intelligent assistants for calendar, changing a reservation at a restaurant, or having a software development and maintenance (Winograd, Kaiser, conversation, we are still looking forward to a future where Feiler, and Popovich 1973, 1988). More recent examples assistants are capable of completing these tasks. Are today's of intelligent assistants are e-commerce assistants (Lu and voice systems "conversational"? We say that intelligent assistants Smith 2007), meeting assistants (Tür et al. 2010), and systems are conversational if they are able to recognize and that offer the intelligent capabilities of modern search respond to input; to generate their own input; to deal with
Pre-trained language models have achieved huge improvement on many NLP tasks. However, these methods are usually designed for written text, so they do not consider the properties of spoken language. Therefore, this paper aims at generalizing the idea of language model pre-training to lattices generated by recognition systems. We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks. The proposed two-stage pre-training approach reduces the demands of speech data and has better efficiency. Experiments on intent detection and dialogue act recognition datasets demonstrate that our proposed method consistently outperforms strong baselines when evaluated on spoken inputs. The code is available at https://github.com/MiuLab/Lattice-ELMo.
The ongoing success of deep learning techniques depends on the quality of the representations automatically discovered from data 1. These representations must capture important underlying structures from the raw input, e.g., intermediate concepts, features, or latent variables that are useful for the downstream task. While supervised learning using large annotated corpora can leverage useful representations, collecting large amounts of annotated examples is costly, time-consuming, and not always feasible. This is particularly problematic for a large variety of applications. In the speech domain, for instance, there are many low-resource languages, where the progress is dramatically slower than in high-resource languages such as English.
Muralidharan, Deepak, Moniz, Joel Ruben Antony, Gao, Sida, Yang, Xiao, Li, Lin, Kao, Justine, Pulman, Stephen, Kothari, Atish, Shen, Ray, Pan, Yinying, Kaul, Vivek, Ibrahim, Mubarak Seyed, Xiang, Gang, Dun, Nan, Zhou, Yidan, O, Andy, Zhang, Yuan, Wang, Pooja Chitkara Xuan, Patel, Alkesh, Tayal, Kushal, Zheng, Roger, Grasch, Peter, Williams, Jason
Named Entity Understanding (NEU) plays an essential role in interactions between users and voice assistants, since successfully identifying entities and correctly linking them to their standard forms is crucial to understanding the user's intent. NEU is a challenging task in voice assistants due to the ambiguous nature of natural language and because noise introduced by speech transcription and user errors occur frequently in spoken natural language queries. In this paper, we propose an architecture with novel features that jointly solves the recognition of named entities (a.k.a. Named Entity Recognition, or NER) and the resolution to their canonical forms (a.k.a. Entity Linking, or EL). We show that by combining NER and EL information in a joint reranking module, our proposed framework improves accuracy in both tasks. This improved performance and the features that enable it, also lead to better accuracy in downstream tasks, such as domain classification and semantic parsing.
Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between each signal. In this paper, we propose to capture these inter- and intra- structural dependencies with quaternion neural networks, which can jointly process multiple signals as whole quaternion entities. The quaternion algebra replaces the standard dot product with the Hamilton one, thus offering a simple and elegant way to model dependencies between elements. The quaternion layers are then coupled with a recurrent neural network, which can learn long-term dependencies in the time domain. We show that a quaternion long-short term memory neural network (QLSTM), trained on the concatenated multi-channel speech signals, outperforms equivalent real-valued LSTM on two different tasks of multi-channel distant speech recognition.
To build Sounding Board, we develop a system architecture that is capable of accommodating dialog strategies that we designed for socialbot conversations. The architecture consists of a multi-dimensional language understanding module for analyzing user utterances, a hierarchical dialog management framework for dialog context tracking and complex dialog control, and a language generation process that realizes the response plan and makes adjustments for speech synthesis. Additionally, we construct a new knowledge base to power the socialbot by collecting social chat content from a variety of sources. An important contribution of the system is the synergy between the knowledge base and the dialog management, i.e., the use of a graph structure to organize the knowledge base that makes dialog control very efficient in bringing related content to the discussion. Using the data collected from Sounding Board during the competition, we carry out in-depth analyses of socialbot conversations and user ratings which provide valuable insights in evaluation methods for socialbots. We additionally investigate a new approach for system evaluation and diagnosis that allows scoring individual dialog segments in the conversation. Finally, observing that socialbots suffer from the issue of shallow conversations about topics associated with unstructured data, we study the problem of enabling extended socialbot conversations grounded on a document. To bring together machine reading and dialog control techniques, a graph-based document representation is proposed, together with methods for automatically constructing the graph. Using the graph-based representation, dialog control can be carried out by retrieving nodes or moving along edges in the graph. To illustrate the usage, a mixed-initiative dialog strategy is designed for socialbot conversations on news articles.
We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were unknown, the best results were obtained when all automatically transcribed data was used for training and not just the utterances classified as English-isiZulu. Despite reducing perplexity, the semi-supervised language model was not able to improve the ASR performance.