Raju, Anirudh
Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion
Wang, Jinhan, Chen, Long, Khare, Aparna, Raju, Anirudh, Dheram, Pranav, He, Di, Wu, Minhua, Stolcke, Andreas, Ravichandran, Venkatesh
We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models to enable more natural conversational interaction between humans and speech-enabled AI agents.
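For illustration, a minimal late-fusion sketch in PyTorch: an acoustic embedding and an LLM context embedding are concatenated and classified into turn-taking events. All dimensions, class labels, and layer choices here are assumptions, not the paper's actual architecture.

```python
# Hypothetical late-fusion classifier for turn-taking / backchannel prediction.
# Names and dimensions are illustrative; the paper's model may differ.
import torch
import torch.nn as nn

class FusionTurnTakingModel(nn.Module):
    def __init__(self, acoustic_dim=256, llm_dim=768, hidden_dim=256, num_classes=3):
        super().__init__()
        # num_classes: e.g. {continue, turn-switch, backchannel} (assumed labels)
        self.proj = nn.Linear(acoustic_dim + llm_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, acoustic_emb, llm_emb):
        # acoustic_emb: (batch, acoustic_dim) segment-level acoustic encoding
        # llm_emb:      (batch, llm_dim) LLM encoding of the dialogue context
        fused = torch.cat([acoustic_emb, llm_emb], dim=-1)
        return self.classifier(torch.relu(self.proj(fused)))

model = FusionTurnTakingModel()
logits = model(torch.randn(4, 256), torch.randn(4, 768))  # (4, 3)
```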
Cross-utterance ASR Rescoring with Graph-based Label Propagation
Tankasala, Srinath, Chen, Long, Stolcke, Andreas, Raju, Anirudh, Deng, Qianli, Chandak, Chander, Khare, Aparna, Maas, Roland, Ravichandran, Venkatesh
We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.
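As a rough illustration of graph-based label propagation over cross-utterance acoustic similarity, the numpy sketch below iteratively blends each utterance's N-best scores with those of its acoustic neighbors. The similarity kernel, the propagation rule (a standard alpha-blended update), and the score semantics are assumptions, not the paper's exact formulation.

```python
# Minimal label-propagation sketch over an utterance similarity graph.
import numpy as np

def propagate_scores(emb, init_scores, alpha=0.8, iters=20):
    # emb:         (n_utts, d) acoustic embeddings, assumed L2-normalized
    # init_scores: (n_utts, n_best) initial ASR hypothesis scores
    sim = emb @ emb.T                       # cosine-like affinity
    np.fill_diagonal(sim, 0.0)
    d = sim.sum(axis=1, keepdims=True)
    d[d == 0] = 1.0
    w = sim / d                             # row-normalized transition matrix
    scores = init_scores.copy()
    for _ in range(iters):
        # blend each utterance's scores with its acoustic neighbors'
        scores = alpha * (w @ scores) + (1 - alpha) * init_scores
    return scores                           # rerank: argmax per row
```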
Adaptive Endpointing with Deep Contextual Multi-armed Bandits
Min, Do June, Stolcke, Andreas, Raju, Anirudh, Vaz, Colin, He, Di, Ravichandran, Venkatesh, Trinh, Viet Anh
Current endpointing (EP) solutions learn in a supervised framework, which does not allow the model to incorporate feedback and improve in an online setting. It is also common practice to use a costly grid search to find the best configuration for an endpointing model. In this paper, we provide a solution for adaptive endpointing by proposing an efficient method for choosing an optimal endpointing configuration from utterance-level audio features in an online setting, while avoiding hyperparameter grid search. Our method requires no ground-truth or annotated labels, learning online from reward signals alone. Specifically, we propose a deep contextual multi-armed bandit approach that combines the representational power of neural networks with the action-exploration behavior of Thompson sampling. We compare our approach to several baselines and show that our deep bandit models reduce early-cutoff errors while maintaining low latency.
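A minimal contextual Thompson sampling sketch, with a linear-Gaussian model per arm standing in for the paper's deep bandit; the feature and reward definitions are illustrative assumptions.

```python
# Sketch: pick an endpointing configuration (arm) from utterance-level features.
import numpy as np

class LinearThompsonBandit:
    def __init__(self, n_arms, dim, lam=1.0):
        self.A = [lam * np.eye(dim) for _ in range(n_arms)]  # per-arm precision
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x):
        # x: utterance-level audio features; sample a weight vector per arm
        samples = []
        for A, b in zip(self.A, self.b):
            cov = np.linalg.inv(A)
            theta = np.random.multivariate_normal(cov @ b, cov)
            samples.append(theta @ x)
        return int(np.argmax(samples))

    def update(self, arm, x, reward):
        # reward could trade off latency against early-cutoff errors (assumed)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```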
Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities
Dheram, Pranav, Ramakrishnan, Murugesan, Raju, Anirudh, Chen, I-Fan, King, Brian, Powell, Katherine, Saboowala, Melissa, Shetty, Karan, Stolcke, Andreas
As with other forms of AI, speech recognition has recently been examined with respect to performance disparities across different user cohorts. One approach to achieving fairness in speech recognition is to (1) identify speaker cohorts that suffer from subpar performance and (2) apply fairness mitigation measures targeting the cohorts discovered. In this paper, we report on initial findings with both discovery and mitigation of performance disparities using data from a product-scale AI assistant speech recognition system. We compare cohort discovery based on geographic and demographic information to a more scalable method that groups speakers without human labels, using speaker embedding technology. For fairness mitigation, we find that oversampling of underrepresented cohorts, as well as modeling speaker cohort membership by additional input variables, reduces the gap between top- and bottom-performing cohorts, without deteriorating overall recognition accuracy.
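A toy sketch of the label-free cohort discovery and oversampling steps, assuming precomputed speaker embeddings and per-utterance WERs; the cluster count and oversampling factor are placeholders, not the production pipeline.

```python
# Cluster speaker embeddings into cohorts, then oversample the worst cohort.
import numpy as np
from sklearn.cluster import KMeans

def discover_and_oversample(spk_emb, per_utt_wer, utt_speaker, n_cohorts=8, factor=3):
    # spk_emb: (n_speakers, d); per_utt_wer: (n_utts,); utt_speaker: (n_utts,) int ids
    cohort_of_speaker = KMeans(n_clusters=n_cohorts, n_init=10).fit_predict(spk_emb)
    cohort_of_utt = cohort_of_speaker[utt_speaker]
    cohort_wer = [per_utt_wer[cohort_of_utt == c].mean() for c in range(n_cohorts)]
    worst = int(np.argmax(cohort_wer))      # bottom-performing cohort
    idx = np.arange(len(per_utt_wer))
    extra = np.repeat(idx[cohort_of_utt == worst], factor - 1)
    return np.concatenate([idx, extra])     # training-set indices after oversampling
```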
Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding
Wei, Kai, Tran, Thanh, Chang, Feng-Ju, Sathyendra, Kanthashree Mysore, Muniyappa, Thejaswi, Liu, Jing, Raju, Anirudh, McGowan, Ross, Susanj, Nathan, Rastrow, Ariya, Strimel, Grant P.
Recent years have seen significant advances in end-to-end (E2E) spoken language understanding (SLU) systems, which directly predict intents and slots from spoken audio. While dialogue history has been exploited to improve conventional text-based natural language understanding systems, current E2E SLU approaches have not yet incorporated such critical contextual signals in multi-turn and task-oriented dialogues. In this work, we propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts (actions taken by the voice assistant) of a multi-turn dialogue. We detail alternative methods to integrate these contexts into state-of-the-art recurrent and transformer-based models. When applied to a large de-identified dataset of utterances collected by a voice assistant, our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively. We also present results on a publicly available dataset and show that our method significantly improves performance over a non-contextual baseline.
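A minimal sketch of the contextual carryover idea in PyTorch: current-utterance encoder states attend over encoded previous utterances and dialogue acts via multi-head attention. The dimensions and residual fusion are assumptions for illustration.

```python
# Cross-attention over dialogue context before the SLU decoder (illustrative).
import torch
import torch.nn as nn

class ContextCarryover(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur_enc, ctx_enc):
        # cur_enc: (batch, T, dim) current-utterance encoder states (query)
        # ctx_enc: (batch, N, dim) encoded prior utterances + dialogue acts (keys/values)
        ctx_summary, _ = self.attn(cur_enc, ctx_enc, ctx_enc)
        # fuse by residual addition (one possible integration method)
        return cur_enc + ctx_summary
```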
Multi-task Language Modeling for Improving Speech Recognition of Rare Words
Yang, Chao-Han Huck, Liu, Linda, Gandhe, Ankur, Gu, Yile, Raju, Anirudh, Filimonov, Denis, Bulyko, Ivan
End-to-end automatic speech recognition (ASR) systems are increasingly popular due to their relative architectural simplicity and competitive performance. However, even though the average accuracy of these systems may be high, performance on rare content words often lags behind hybrid ASR systems. To address this problem, second-pass rescoring is often applied. In this paper, we propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance. We show that our rescoring model trained with these additional tasks outperforms the baseline rescoring model, trained with only the language modeling task, by 1.4% on a general test set and by 2.6% on a rare-word test set in terms of relative word error rate (WERR).
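A hedged sketch of what such a multi-task objective could look like: a language-modeling loss plus weighted auxiliary intent and slot losses over a shared encoder's outputs. The head shapes and loss weights are illustrative, not the paper's settings.

```python
# Combined LM + intent + slot loss for a second-pass rescoring model (sketch).
import torch
import torch.nn.functional as F

def multitask_loss(lm_logits, lm_targets, intent_logits, intent_target,
                   slot_logits, slot_targets, w_intent=0.1, w_slot=0.1):
    # lm_logits: (batch, T, vocab); slot_logits: (batch, T, n_slots)
    lm = F.cross_entropy(lm_logits.transpose(1, 2), lm_targets)
    intent = F.cross_entropy(intent_logits, intent_target)
    slots = F.cross_entropy(slot_logits.transpose(1, 2), slot_targets)
    return lm + w_intent * intent + w_slot * slots
```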
Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning
Mošner, Ladislav, Wu, Minhua, Raju, Anirudh, Parthasarathi, Sree Hari Krishnan, Kumatani, Kenichi, Sundaram, Shiva, Maas, Roland, Hoffmeister, Björn
In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus to improve automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method which preserves only the k highest values, to prevent wrong emphasis of knowledge from the teacher and to reduce the bandwidth needed for transferring data. We incorporate up to 8000 hours of untranscribed data for training and present results on sequence-trained models in addition to cross-entropy-trained ones. The best sequence-trained student model yields relative word error rate (WER) reductions of approximately 10.1%, 28.7%, and 19.6% on our clean, simulated noisy, and real test sets, respectively, compared to a sequence-trained teacher.
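A minimal sketch of T/S distillation with top-k logit selection: only the teacher's k highest logits define the target distribution, which curbs misleading low-probability targets and reduces the data that must be transferred. The values of k and the temperature are illustrative assumptions.

```python
# Frame-level teacher-student loss restricted to the teacher's top-k logits.
import torch
import torch.nn.functional as F

def topk_ts_loss(student_logits, teacher_logits, k=20, temp=1.0):
    # logits: (batch, T, n_senones) frame-level acoustic model outputs
    vals, idx = teacher_logits.topk(k, dim=-1)           # keep k highest teacher logits
    s_sel = student_logits.gather(-1, idx)               # matching student logits
    t_prob = F.softmax(vals / temp, dim=-1)              # renormalize over top-k
    s_logprob = F.log_softmax(s_sel / temp, dim=-1)
    return -(t_prob * s_logprob).sum(-1).mean()          # cross entropy vs. teacher
```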
On Evaluating and Comparing Open Domain Dialog Systems
Venkatesh, Anu, Khatri, Chandra, Ram, Ashwin, Guo, Fenfei, Gabriel, Raefer, Nagar, Ashish, Prasad, Rohit, Cheng, Ming, Hedayatnia, Behnam, Metallinou, Angeliki, Goel, Rahul, Yang, Shaohua, Raju, Anirudh
Conversational agents are exploding in popularity. However, much work remains in the area of non-goal-oriented conversations, despite significant growth in research interest over recent years. To advance the state of the art in conversational AI, Amazon launched the Alexa Prize, a 2.5-million-dollar university competition in which sixteen selected university teams built conversational agents to deliver the best social conversational experience. The Alexa Prize provided the academic community with the unique opportunity to perform research with a live system used by millions of users. The subjectivity associated with evaluating conversations is a key element underlying the challenge of building non-goal-oriented dialogue systems. In this paper, we propose a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics that correlate well with human judgment. The proposed metrics provide granular analysis of the conversational agents, which is not captured in human ratings. We show that these metrics can be used as a reasonable proxy for human judgment. We provide a mechanism to unify the metrics for selecting the top-performing agents, which has also been applied throughout the Alexa Prize competition. To our knowledge, this is to date the largest setting for evaluating agents, with millions of conversations and hundreds of thousands of ratings from users. We believe that this work is a step towards an automatic evaluation process for conversational AIs.
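As a toy illustration of unifying several metrics into a single ranking score, the sketch below z-normalizes each metric across agents and takes a weighted sum; the metric names and weights are placeholders, not the Alexa Prize formula.

```python
# Combine per-agent metric values into one score for ranking (illustrative).
import numpy as np

def unified_score(metrics, weights):
    # metrics: dict name -> (n_agents,) raw values; weights: dict name -> float
    total = np.zeros(len(next(iter(metrics.values()))))
    for name, vals in metrics.items():
        vals = np.asarray(vals, dtype=float)
        z = (vals - vals.mean()) / (vals.std() + 1e-8)   # normalize across agents
        total += weights.get(name, 1.0) * z
    return total  # rank agents by descending score
```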
Data Augmentation for Robust Keyword Spotting under Playback Interference
Raju, Anirudh, Panchapagesan, Sankaran, Liu, Xing, Mandal, Arindam, Strom, Nikko
Accurate on-device keyword spotting (KWS) with low false accept and false reject rates is crucial to customer experience for far-field voice control of conversational agents. It is particularly challenging to maintain a low false reject rate in real-world conditions where there is (a) ambient noise from external sources such as TV, household appliances, or other speech that is not directed at the device, and (b) imperfect cancellation of the audio playback from the device, resulting in residual echo after processing by the Acoustic Echo Cancellation (AEC) system. In this paper, we propose a data augmentation strategy to improve keyword spotting performance under these challenging conditions. The training set audio is artificially corrupted by mixing in music and TV/movie audio at different signal-to-interference ratios. Our results show a 30-45% relative reduction in false reject rates, across a range of false alarm rates, under audio playback from such devices.
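A minimal sketch of the core augmentation step, assuming float waveform arrays: scale an interference signal (music or TV/movie audio) so that it mixes with the speech at a target signal-to-interference ratio (SIR). Sampling of SIR values and audio I/O are omitted.

```python
# Mix interference into a training utterance at a target SIR (in dB).
import numpy as np

def mix_at_sir(speech, interference, sir_db):
    # tile/trim interference to match the speech length
    reps = int(np.ceil(len(speech) / len(interference)))
    noise = np.tile(interference, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # scale noise so that 10*log10(p_speech / p_noise_scaled) == sir_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (sir_db / 10)))
    return speech + scale * noise
```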
Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
Sun, Ming, Raju, Anirudh, Tucker, George, Panchapagesan, Sankaran, Fu, Gengshen, Mandal, Arindam, Matsoukas, Spyros, Strom, Nikko, Vitaladevuni, Shiv
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, a randomly initialized LSTM trained with max-pooling loss performs better than a cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, yielding a 67.6% relative reduction over the baseline feed-forward DNN in the Area Under the Curve (AUC) measure.
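A hedged sketch of a max-pooling loss, assuming a two-class (keyword vs. background) frame-level output: positive utterances contribute cross entropy only through the frame with the highest keyword posterior, while negatives are penalized at every frame. The shapes and two-class setup are simplifying assumptions.

```python
# Max-pooling loss over frame-level keyword posteriors (illustrative).
import torch
import torch.nn.functional as F

def max_pooling_loss(logits, is_keyword):
    # logits: (batch, T, 2) per-frame LSTM outputs (class 1 = keyword)
    # is_keyword: (batch,) 1 if the utterance contains the keyword, else 0
    log_post = F.log_softmax(logits, dim=-1)
    best = log_post[..., 1].max(dim=1).values            # max keyword frame
    pos = -best * is_keyword.float()                     # maximize that posterior
    # negatives: standard cross entropy on the background class at every frame
    neg = -log_post[..., 0].mean(dim=1) * (1 - is_keyword.float())
    return (pos + neg).mean()
```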