Erdem, Erkut (Hacettepe University, Ankara, Turkey) | Kuyu, Menekse (Hacettepe University, Ankara, Turkey) | Yagcioglu, Semih (Hacettepe University, Ankara, Turkey) | Frank, Anette (Heidelberg University, Heidelberg, Germany) | Parcalabescu, Letitia (Heidelberg University, Heidelberg, Germany) | Plank, Barbara (IT University of Copenhagen, Copenhagen, Denmark) | Babii, Andrii (Kharkiv National University of Radio Electronics, Ukraine) | Turuta, Oleksii (Kharkiv National University of Radio Electronics, Ukraine) | Erdem, Aykut | Calixto, Iacer (New York University, U.S.A. / University of Amsterdam, Netherlands) | Lloret, Elena (University of Alicante, Alicante, Spain) | Apostol, Elena-Simona (University Politehnica of Bucharest, Bucharest, Romania) | Truică, Ciprian-Octavian (University Politehnica of Bucharest, Bucharest, Romania) | Šandrih, Branislava (University of Belgrade, Belgrade, Serbia) | Martinčić-Ipšić, Sanda (University of Rijeka, Rijeka, Croatia) | Berend, Gábor (University of Szeged, Szeged, Hungary) | Gatt, Albert (University of Malta, Malta) | Korvel, Gražina (Vilnius University, Vilnius, Lithuania)
Developing artificial learning systems that can understand and generate natural language has been one of the long-standing goals of artificial intelligence. Recent decades have witnessed impressive progress on both of these problems, giving rise to a new family of approaches. In particular, advances in deep learning over the past few years have led to neural approaches to natural language generation (NLG). These methods combine generative language learning techniques with neural-network-based frameworks. With a wide range of applications in natural language processing, neural NLG (NNLG) is a new and fast-growing field of research. In this state-of-the-art report, we investigate recent developments and applications of NNLG from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability and learning strategies. We summarize the fundamental building blocks of NNLG approaches from these aspects and provide detailed reviews of commonly used preprocessing steps and basic neural architectures. This report also focuses on the seminal applications of these NNLG models such as machine translation, description generation, automatic speech recognition, abstractive summarization, text simplification, question answering and generation, and dialogue generation. Finally, we conclude with a thorough discussion of the described frameworks by pointing out some open research directions.
Computer scientists and researchers are increasingly investigating techniques that can create backdoors in machine-learning (ML) models -- first to understand the potential threat, but also as an anti-copying protection to identify when ML implementations have been used without permission. Originally known as BadNets, backdoored neural networks represent both a threat and a promise of creating unique watermarks to protect the intellectual property of ML models, researchers say. The training technique aims to produce a specially crafted output, or watermark, if a neural network is given a particular trigger as an input: A specific pattern of shapes, for example, could trigger a visual recognition system, while a particular audio sequence could trigger a speech recognition system. Originally, the research into backdooring neural networks was meant as a warning to researchers to make their ML models more robust and to allow them to detect such manipulations. But now research has pivoted to using the technique to detect when a machine-learning model has been copied, says Sofiane Lounici, a data engineer and machine-learning specialist at SAP Labs France.
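The trigger-and-watermark idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used by the researchers quoted here: images are plain nested lists of pixel values, the model is any callable that returns a label, and the names `stamp_trigger` and `watermark_match_rate` are hypothetical. The idea is that an owner who knows the secret trigger can stamp it onto probe inputs and check how often a suspect model returns the watermark label.

```python
from typing import Callable, List, Sequence

Image = List[List[int]]  # assumption: a grayscale image as rows of pixel values


def stamp_trigger(image: Image, patch: Image, top: int = 0, left: int = 0) -> Image:
    """Return a copy of `image` with the trigger `patch` pasted at (top, left)."""
    stamped = [row[:] for row in image]  # copy so the caller's image is untouched
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            stamped[top + r][left + c] = value
    return stamped


def watermark_match_rate(
    model: Callable[[Image], int],
    images: Sequence[Image],
    patch: Image,
    target_label: int,
) -> float:
    """Fraction of triggered inputs for which `model` outputs the watermark label."""
    if not images:
        raise ValueError("need at least one probe image")
    hits = sum(
        1 for image in images if model(stamp_trigger(image, patch)) == target_label
    )
    return hits / len(images)
```

A match rate close to 1.0 on triggered inputs, combined with normal behavior on clean inputs, would be the kind of evidence a watermark check looks for; the threshold to use is a design choice this sketch leaves open.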
Real-time feedback helps drive learning. This is especially important for designing presentations, learning new languages, and strengthening other essential skills that are critical to success in today's workplace. However, many students and lifelong learners lack access to effective face-to-face instruction to hone these skills. In addition, with the rapid adoption of remote learning, educators are seeking more effective ways to engage their students and provide feedback and guidance in online learning environments. Bongo is filling that gap using video-based engagement and personalized feedback.
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural Language and Speech Processing, Computer Vision and Cognitive Science. The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas. We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work. We then summarize the main modeling architectures and offer an exhaustive overview of the evaluation metrics and analysis techniques.
In this chapter, we provide a review of conversational agents (CAs), discussing chatbots, intended for casual conversation with a user, as well as task-oriented agents that generally engage in discussions intended to reach one or several specific goals, often (but not always) within a specific domain. We also consider the concept of embodied conversational agents, briefly reviewing aspects such as character animation and speech processing. The many different approaches for representing dialogue in CAs are discussed in some detail, along with methods for evaluating such agents, emphasizing the important topics of accountability and interpretability. A brief historical overview is given, followed by an extensive overview of various applications, especially in the fields of health and education. We end the chapter by discussing benefits and potential risks regarding the societal impact of current and future CA technology.
This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge. We adopted a serialized output training (SOT) based multi-speaker ASR system trained with large-scale simulated data. First, we investigated a set of front-end methods, including multi-channel weighted prediction error (WPE), beamforming, speech separation, and speech enhancement, among others, to process the training, validation, and test sets. Based on the experimental results, we kept only WPE and beamforming as our front-end methods. Second, we invested heavily in data augmentation for multi-speaker ASR, mainly adding noise and reverberation, overlapped speech simulation, multi-channel speech simulation, speed perturbation, and front-end processing, which brought a large performance improvement. Finally, to exploit the complementarity of different model architectures, we trained a standard Conformer-based joint CTC/attention model (Conformer) and a U2++ ASR model with a bidirectional attention decoder, a modification of Conformer, and fused their results. Compared with the official baseline system, our system achieved a 12.22% absolute character error rate (CER) reduction on the validation set and 12.11% on the test set.
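For readers unfamiliar with the metric, the sketch below shows how CER is commonly computed: the character-level edit distance between reference and hypothesis, divided by the reference length. It is a generic illustration, not the challenge's official scoring script, and the function names are hypothetical. An "absolute" reduction is the plain difference between two CER values in percentage points, as opposed to a relative reduction.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hypothesis `hyp` against reference `ref`."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def absolute_reduction(baseline_cer: float, system_cer: float) -> float:
    """Absolute CER reduction: the plain difference between the two rates."""
    return baseline_cer - system_cer
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.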
Magic Data, a global AI data service provider, has launched more than 200,000 hours of training datasets, including 140,000 hours of conversational AI training data and 60,000 hours of read speech data, covering Asian languages, English dialects, and European languages, to support the rapid development of human-computer interaction in artificial intelligence. Experiments suggest that conversational data performs better for training ASR models. The Magic Data R&D Center compared conversational and read speech data by training Automatic Speech Recognition (ASR) models on 3,000 hours of each for customer service, broadcasting, and navigation command scenarios. Compared with read speech data, conversational speech data improved word accuracy by up to 84% in relative terms. The results also show that the more conversational data is used, the higher the word accuracy.
We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC), which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition. These experiments show that STC can recover most of the performance of a supervised baseline when up to 70% of the labels are missing. We also perform experiments in handwriting recognition to show that our method easily applies to other sequence classification tasks.
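The role of the star token can be illustrated without any WFST machinery. The sketch below is not the STC loss or the GTN API. It only assumes that a star may stand for zero or more tokens of any kind wherever a label could be missing, so a full transcript is compatible with a partial label exactly when the partial label is a subsequence of it. The function name `allowed_by_star` is hypothetical.

```python
from typing import Sequence


def allowed_by_star(full: Sequence[str], partial: Sequence[str]) -> bool:
    """True if `full` can be produced from `partial` by filling star gaps.

    Assumption: a star sits before, between and after the partial tokens,
    and each star may expand to any run of tokens, including none.
    """
    matched = 0  # number of partial-label tokens matched so far, in order
    for token in full:
        if matched < len(partial) and token == partial[matched]:
            matched += 1
    return matched == len(partial)
```

For example, the partial label `["h", "l", "o"]` is compatible with the full sequence `["h", "e", "l", "l", "o"]`, while `["o", "h"]` is not, because the order of the observed tokens must be preserved.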
The next wave of AI will be powered by the democratization of data. Open-source frameworks such as TensorFlow and PyTorch have brought machine learning to a huge developer base, but most state-of-the-art models still rely on training datasets which are either wholly proprietary or prohibitively expensive to license. As a result, the best automated speech recognition (ASR) models for converting speech audio into text are only available commercially, and are trained on data unavailable to the general public. Furthermore, only widely spoken languages receive industry attention due to market incentives, limiting the availability of cutting-edge speech technology to English and a handful of other languages. The first is prohibitive licensing: Several free datasets do exist, but most of those with sufficient size and quality to make models truly shine are barred from commercial use. In response, we created The People's Speech, a massive English-language dataset of audio transcriptions of full sentences (see Sample 1).
People perceive speech both by listening to it and watching the lip movements of speakers. In fact, studies show that visual cues play a key role in language learning. By contrast, AI speech recognition systems are built mostly -- or entirely -- on audio. And they require a substantial amount of training data, typically tens of thousands of hours of recordings. To investigate whether visuals -- specifically footage of mouth movement -- can improve the performance of speech recognition systems, researchers at Meta (formerly Facebook) developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech by both watching and hearing people speak.