Fixed-point quantization aware training for on-device keyword-spotting
Macha, Sashank, Oza, Om, Escott, Alex, Caliva, Francesco, Armitano, Robbie, Cheekatmalla, Santosh Kumar, Parthasarathi, Sree Hari Krishnan, Liu, Yuzong
Fixed-point (FXP) inference has proven suitable for embedded devices with limited computational resources, and yet model training is continually performed in floating-point (FLP). FXP training has not been fully explored, and the non-trivial conversion from FLP to FXP presents an unavoidable performance drop. We propose a novel method to train and obtain FXP convolutional keyword-spotting (KWS) models. We combine our methodology with two quantization-aware-training (QAT) techniques, squashed weight distribution and absolute cosine regularization, for model parameters, and propose techniques for extending QAT over transient variables, otherwise neglected by previous paradigms. Experimental results on the Google Speech Commands v2 dataset show that we …

Computational requirements can be reduced further using low-precision inference via quantization, which allows increased operations per accessed memory byte [5, 7]. Such quantization is typically achieved by means of post-training quantization (PTQ) [8], which however causes severe information loss affecting model accuracy. To maintain overall accuracy for quantized DNNs, quantization can be incorporated in the training phase, leading to quantization-aware training (QAT). QAT introduces quantization noise during training by means of deterministic rounding [9, 10, 11], reparametrization [12, 13] or regularization [14, 15], among other techniques, allowing DNNs to adapt to inference quantization. Notable work has shown that with QAT, model parameters can be learned at binary and ternary precision [16, 17].
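The deterministic-rounding flavor of QAT mentioned above is commonly implemented as "fake quantization" with a straight-through estimator. Below is a minimal PyTorch sketch of that generic pattern, assuming symmetric per-tensor fixed-point scaling; it illustrates the idea only and does not reproduce the paper's squashed weight distribution or absolute cosine regularization techniques.

    import torch

    def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
        # Round weights to a symmetric fixed-point grid in the forward pass.
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.detach().abs().max().clamp(min=1e-8) / qmax
        w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
        # Straight-through estimator: forward returns w_q, but gradients
        # flow to w as if quantization were the identity function.
        return w + (w_q - w).detach()

During QAT, each layer would apply fake_quantize to its weights (and, if transient variables are also quantized, to its activations) before use, so the network learns under inference-time quantization noise.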
Conversational Text-to-SQL: An Odyssey into State-of-the-Art and Challenges Ahead
Parthasarathi, Sree Hari Krishnan, Zeng, Lu, Hakkani-Tur, Dilek
We adapt the two reranking methods from [16], query plan (QP) and schema linking (SL), and show that both methods can help improve multi-turn text-to-SQL. With accuracy on CoSQL being reported using exact-set-match accuracy (EM) and execution accuracy (EX), with T5-Large we observed: a) MT leads to 2.4% and 1.7% absolute improvement on EM and EX; b) combined reranking approaches yield 1.9% and 2.2% improvements; c) combining MT with reranking, with T5-Large we obtain improvements of 2.1% in EM and 3.7% in EX over a T5-Large PICARD baseline. This improvement is consistent on larger models: using T5-3B yielded about 1.0% in …

Text-to-SQL is an important research topic in semantic parsing [1, 2, 3, 4, 5, 6, 7]. The Spider [3] and CoSQL [5] datasets allow for making progress in complex, cross-domain, single-turn and multi-turn text-to-SQL tasks respectively, utilizing a common set of databases, with competitive leaderboards demonstrating the difficulty of the tasks. In contrast to Spider, CoSQL was collected as entire dialogues, and hence includes additional challenges for the text-to-SQL task in terms of integrating dialogue context. In addition to the challenges in general-purpose code generation [8, 9], where the output of the system is constrained to follow a grammar, the text-to-SQL problem is underspecified without a schema.
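As a rough illustration of how reranking of this kind is typically wired in (helper names and the interpolation scheme are hypothetical, not the paper's implementation), beam candidates from the text-to-SQL model can be rescored by combining the decoder log-probability with a reranker score:

    from typing import Callable, List, Tuple

    def rerank(candidates: List[Tuple[str, float]],
               reranker_score: Callable[[str], float],
               alpha: float = 0.5) -> str:
        # candidates: (sql, log_prob) pairs from beam search.
        # reranker_score: e.g. a schema-linking or query-plan scorer.
        best_sql, _ = max(
            candidates,
            key=lambda c: (1 - alpha) * c[1] + alpha * reranker_score(c[0]),
        )
        return best_sql

A QP or SL model would plug in as reranker_score; alpha is an illustrative interpolation weight one would tune on development data.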
Realizing Petabyte Scale Acoustic Modeling
Parthasarathi, Sree Hari Krishnan, Sivakrishnan, Nitin, Ladkat, Pranav, Strom, Nikko
Large-scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from the vast firehose of untranscribed audio data. Learning an AM from 1 million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource-efficient SSL system for AM. Employing the student/teacher learning paradigm, we focus on the student learning subsystem: a scalable and robust data pipeline that generates features and targets from raw audio, and an efficient model pipeline, including the distributed trainer, that builds a student model. Our evaluations show that, even without extensive hyper-parameter tuning, we obtain relative accuracy improvements in the 10 to 20% range, with higher gains in noisier conditions. The end-to-end processing time of this SSL system was 12 days, and several components in this system can trivially scale linearly with more compute resources.
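A minimal sketch of the core student-learning step in such a student/teacher SSL system (module and argument names are illustrative assumptions, not the system's actual API): the frozen teacher scores untranscribed features to produce soft targets, and the student is updated to match them.

    import torch
    import torch.nn.functional as F

    def student_update(student, teacher, feats, optimizer):
        # feats: a batch of acoustic features produced by the data pipeline.
        with torch.no_grad():  # the teacher is fixed during student training
            targets = F.softmax(teacher(feats), dim=-1)  # soft posteriors
        loss = F.kl_div(F.log_softmax(student(feats), dim=-1),
                        targets, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()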
Lessons from Building Acoustic Models with a Million Hours of Speech
Parthasarathi, Sree Hari Krishnan, Strom, Nikko
This is a report of our lessons learned building acoustic models from 1 million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on unlabeled data, helping scale out target generation in comparison to confidence-model-based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation, we store high-valued logits from the teacher model. Introducing the notion of scheduled learning, we interleave learning on unlabeled and labeled data. To scale distributed training across a large number of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on labeled data with gradient-threshold-compression SGD using 16 GPUs. Our experiments show that extremely large amounts of data are indeed useful; with little hyper-parameter tuning, we obtain relative WER improvements in the 10 to 20% range, with higher gains in noisier conditions.
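The scheduled-learning idea, interleaving unlabeled and labeled data, might look like the following sketch (the ratio and the names are illustrative assumptions, not the paper's actual schedule):

    from itertools import cycle

    def interleaved_batches(unlabeled_loader, labeled_loader, ratio: int = 4):
        # Yield (batch, is_labeled) pairs: every `ratio` unlabeled batches
        # are followed by one labeled batch, cycling over the labeled set.
        labeled = cycle(labeled_loader)
        for i, batch in enumerate(unlabeled_loader):
            yield batch, False
            if (i + 1) % ratio == 0:
                yield next(labeled), True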
Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning
Mošner, Ladislav, Wu, Minhua, Raju, Anirudh, Parthasarathi, Sree Hari Krishnan, Kumatani, Kenichi, Sundaram, Shiva, Maas, Roland, Hoffmeister, Björn
In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method which only preserves the k highest values, to prevent wrong emphasis of knowledge from the teacher and to reduce the bandwidth needed for transferring data. We incorporate up to 8,000 hours of untranscribed data for training and present our results on sequence-trained models apart from cross-entropy-trained ones. The best sequence-trained student model yields relative word error rate (WER) reductions of approximately 10.1%, 28.7% and 19.6% on our clean, simulated noisy and real test sets respectively, compared to a sequence-trained teacher.

Index Terms: automatic speech recognition, noise robustness, teacher-student training, domain adaptation

With the exponential growth of big data and computing power, automatic speech recognition (ASR) technology has been successfully used in many applications. People can do voice search using mobile devices.
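A minimal sketch of the logits-selection idea described above, assuming frame-level output distributions (names are illustrative): only the k highest teacher logits are kept, the teacher distribution is renormalized over them, and the student, fed the noisy parallel utterance, is trained against that truncated distribution.

    import torch
    import torch.nn.functional as F

    def topk_ts_loss(student_logits, teacher_logits, k: int = 20):
        vals, idx = teacher_logits.topk(k, dim=-1)      # keep k best teacher logits
        teacher_post = F.softmax(vals, dim=-1)          # renormalize over the top-k
        student_logp = F.log_softmax(student_logits, dim=-1).gather(-1, idx)
        return -(teacher_post * student_logp).sum(-1).mean()  # top-k cross-entropy

Here student_logits would come from the noise-corrupted utterance and teacher_logits from the clean parallel version, so the student learns to be robust to the added noise.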