AITopics

2210.1369

Country: Asia (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

arXiv.org Artificial Intelligence, Mar-6-2023

Parameter-Free Attentive Scoring for Speaker Verification

Pelecanos, Jason, Wang, Quan, Huang, Yiling, Moreno, Ignacio Lopez

This paper presents a novel study of parameter-free attentive scoring for speaker verification. Parameter-free scoring provides the flexibility of comparing speaker representations without the need for an accompanying parametric scoring model. Inspired by the attention component in Transformer neural networks, we propose a variant of the scaled dot product attention mechanism to compare enrollment and test segment representations. In addition, this work explores the effect on performance of (i) different types of normalization, (ii) independent versus tied query/key estimation, (iii) varying the number of key-value pairs and (iv) pooling multiple enrollment utterance statistics. Experimental results for a 4-task average show that a simple parameter-free attentive scoring mechanism can improve the average EER by 10% over the best cosine similarity baseline.
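As a rough illustration of the idea in this abstract, the sketch below scores a test segment against an enrollment segment with scaled dot product attention and no trainable scoring parameters. The abstract does not give the exact formulation, so everything here is an assumption: the name attentive_score, the choice of test queries attending over enrollment keys, and the final cosine comparison of values are one plausible reading, not the authors' method.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def attentive_score(test_queries, test_values, enroll_keys, enroll_values):
    """Hypothetical parameter-free attentive score (assumed formulation).

    Each test query attends over the enrollment keys; the attention
    weights pool the enrollment values, and the pooled vector is compared
    with the matching test value by cosine similarity.
    """
    d = len(enroll_keys[0])
    scores = []
    for q, v in zip(test_queries, test_values):
        # Scaled dot product attention weights over enrollment keys.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in enroll_keys])
        # Attention-weighted average of the enrollment values.
        pooled = [
            sum(w * ev[i] for w, ev in zip(weights, enroll_values))
            for i in range(len(v))
        ]
        scores.append(cosine(pooled, v))
    # Average over all test key-value pairs.
    return sum(scores) / len(scores)
```

Nothing in the scoring function is learned, which is the sense of "parameter-free" used here; the queries, keys and values would come from the speaker representation network itself.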

artificial intelligence, machine learning, utterance, (18 more...)

2203.05642

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Artificial Intelligence, Feb-24-2023

Locale Encoding For Scalable Multilingual Keyword Spotting Models

Zhu, Pai, Park, Hyun Jin, Park, Alex, Scarpati, Angelo Scorza, Moreno, Ignacio Lopez

A Multilingual Keyword Spotting (KWS) system detects spoken keywords over multiple locales. Conventional monolingual KWS approaches do not scale well to multilingual scenarios because of high development/maintenance costs and lack of resource sharing. To overcome this limit, we propose two locale-conditioned universal models with locale feature concatenation and feature-wise linear modulation (FiLM). We compare these models with two baseline methods: locale-specific monolingual KWS, and a single universal model trained over all data. Experiments over 10 localized language datasets show that locale-conditioned models substantially improve accuracy over baseline methods across all locales in different noise conditions. FiLM performed the best, improving on average FRR by 61% (relative) compared to monolingual KWS models of similar sizes.
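The two locale-conditioning schemes named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' model: the locale table LOCALE_FILM_PARAMS, its values, and the function names are hypothetical, and in a trained system the per-locale scale and shift would be learned rather than hard-coded.

```python
# Hypothetical per-locale (gamma, beta) pairs; a real model would learn these.
LOCALE_FILM_PARAMS = {
    "en-US": ([1.0, 0.5], [0.0, 1.0]),
    "ja-JP": ([0.8, 1.2], [0.1, -0.1]),
}


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def film_for_locale(features, locale):
    """Apply FiLM using the (assumed) parameters stored for a locale."""
    gamma, beta = LOCALE_FILM_PARAMS[locale]
    return film(features, gamma, beta)


def concat_locale(features, locale, locales=("en-US", "ja-JP")):
    """Alternative conditioning: append a one-hot locale vector."""
    one_hot = [1.0 if name == locale else 0.0 for name in locales]
    return list(features) + one_hot


modulated = film_for_locale([2.0, 4.0], "en-US")
concatenated = concat_locale([2.0, 4.0], "ja-JP")
```

Concatenation widens the input of the layer that follows, whereas FiLM keeps the feature dimension unchanged and only rescales and shifts it per locale.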

artificial intelligence, locale, machine learning, (18 more...)

2302.12961

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

arXiv.org Artificial Intelligence, Dec-3-2022

Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss

Zhao, Guanlong, Wang, Quan, Lu, Han, Huang, Yiling, Moreno, Ignacio Lopez

In this work we propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance. The conventional T-T based SCD model loss optimizes all output tokens equally. Due to the sparsity of the speaker changes in the training data, the conventional T-T based SCD model loss leads to sub-optimal detection accuracy. To mitigate this issue, we use a customized edit-distance algorithm to estimate the token-level SCD false accept (FA) and false reject (FR) rates during training and optimize model parameters to minimize a weighted combination of the FA and FR, focusing the model on accurately predicting speaker changes. We also propose a set of evaluation metrics that align better with commercial use cases. Experiments on a group of challenging real-world datasets show that the proposed training method can significantly improve the overall performance of the SCD model with the same number of parameters.
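To make the weighted FA/FR objective in this abstract concrete, here is a simplified sketch. It assumes the reference and hypothesis token sequences are already aligned one-to-one, which stands in for the customized edit-distance alignment the abstract mentions; the speaker-change token "<st>", the function name and the weights are all hypothetical. It computes a count-based quantity for illustration only and is not the differentiable training loss used in the paper.

```python
def weighted_fa_fr(reference, hypothesis, change_token="<st>",
                   fa_weight=1.0, fr_weight=1.0):
    """Weighted sum of token-level false accept and false reject rates.

    Assumes `reference` and `hypothesis` are pre-aligned, equal-length
    token lists (a simplification of an edit-distance alignment).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must be aligned to the same length")

    pairs = list(zip(reference, hypothesis))
    ref_changes = sum(1 for r, _ in pairs if r == change_token)
    ref_others = len(pairs) - ref_changes

    # False accept: a speaker change is predicted where none exists.
    fa = sum(1 for r, h in pairs if h == change_token and r != change_token)
    # False reject: a true speaker change is missed.
    fr = sum(1 for r, h in pairs if r == change_token and h != change_token)

    fa_rate = fa / ref_others if ref_others else 0.0
    fr_rate = fr / ref_changes if ref_changes else 0.0
    return fa_weight * fa_rate + fr_weight * fr_rate
```

Because speaker changes are sparse, normalizing FA and FR separately keeps the rare change tokens from being swamped by ordinary word tokens, which is the motivation the abstract gives for moving away from a loss that treats all tokens equally.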

artificial intelligence, machine learning, natural language, (20 more...)

2211.06482

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Machine Learning, Apr-26-2021

SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System

Chojnacka, Roza, Pelecanos, Jason, Wang, Quan, Moreno, Ignacio Lopez

In this paper, we describe SpeakerStew - a hybrid system to perform speaker verification on 46 languages. Two core ideas were explored in this system: (1) Pooling training data of different languages together for multilingual generalization and reducing development cycles; (2) A triage mechanism between text-dependent and text-independent models to reduce runtime cost and expected latency. To the best of our knowledge, this is the first study of speaker verification systems at the scale of 46 languages. The problem is framed from the perspective of using a smart speaker device with interactions consisting of a wake-up keyword (text-dependent) followed by a speech query (text-independent). Experimental evidence suggests that training on multiple languages can generalize to unseen varieties while maintaining performance on seen varieties. We also found that it can reduce computational requirements for training models by an order of magnitude. Furthermore, during model inference on English data, we observe that leveraging a triage framework can reduce the number of calls to the more computationally expensive text-independent system by 73% (and reduce latency by 60%) while maintaining an EER no worse than the text-independent setup.
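The triage idea in this abstract can be illustrated with a small decision function. This is a sketch under stated assumptions: the thresholds, the name triage_verify, and the rule of deferring only when the text-dependent score is inconclusive are hypothetical choices, since the abstract does not specify how the triage decision is made.

```python
def triage_verify(td_score, ti_scorer, low=0.2, high=0.8, ti_threshold=0.5):
    """Decide accept/reject, calling the costly TI model only when needed.

    td_score: score from the cheap text-dependent (keyword) model.
    ti_scorer: zero-argument callable that runs the text-independent model.
    Returns (accepted, ti_was_called). All thresholds are assumed values.
    """
    if td_score >= high:
        # Confident accept from the text-dependent model alone.
        return True, False
    if td_score <= low:
        # Confident reject from the text-dependent model alone.
        return False, False
    # Inconclusive: fall back to the text-independent model.
    return ti_scorer() >= ti_threshold, True


# Count how often the expensive model runs over a batch of trials.
trials = [0.95, 0.5, 0.1, 0.85, 0.3]
results = [triage_verify(s, lambda: 0.7) for s in trials]
ti_calls = sum(1 for _, called in results if called)
```

Passing the text-independent model as a callable means it is evaluated lazily, so confident keyword decisions never pay its cost.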

acoustic processing, speech recognition, td system, (19 more...)

2104.02125

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.71)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

arXiv.org Machine Learning, Sep-9-2020

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

Wang, Quan, Moreno, Ignacio Lopez, Saglam, Mert, Wilson, Kevin, Chiao, Alan, Liu, Renjie, He, Yanzhang, Li, Wei, Pelecanos, Jason, Nika, Marily, Gruenstein, Alexander

We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: It should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under all other acoustic conditions. Besides, this model must be tiny, fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss, and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.

deep learning, speech recognition, voicefilter-lite model, (17 more...)

2009.04323

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

arXiv.org Machine Learning, Oct-21-2019

Signal Combination for Language Identification

Wang, Shengye, Wan, Li, Yu, Yang, Moreno, Ignacio Lopez

Abstract: Google's multilingual speech recognition system combines low-level acoustic signals with language-specific recognizer signals to better predict the language of an utterance. This paper presents our experience with different signal combination methods to improve overall language identification accuracy. We compare the performance of a lattice-based ensemble model and a deep neural network model to combine signals from recognizers with that of a baseline that only uses low-level acoustic signals. Experimental results show that the deep neural network model outperforms the lattice-based ensemble model, and it reduced the error rate from 5.5% in the baseline to 4.3%.

Index Terms: Signal combination, language identification, lattice regression, deep neural network

1. Introduction: Multilingual speech recognition is an important feature for modern speech recognition systems allowing users to speak in more than a single, preset language.

deep learning, language identification, speech recognition, (18 more...)

1910.09687

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.77)

arXiv.org Machine Learning, Aug-12-2019

Personal VAD: Speaker-Conditioned Voice Activity Detection

Ding, Shaojin, Wang, Quan, Chang, Shuo-yiin, Wan, Li, Moreno, Ignacio Lopez

Abstract: In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. With our optimal setup, we are able to train a 130KB model that outperforms a baseline system where individually trained standard VAD and speaker recognition network are combined to perform the same task.

Index Terms: Personal VAD, voice activity detection, speaker recognition, speech recognition

1. Introduction: In modern speech processing systems, voice activity detection (VAD) usually lives in the upstream of other speech components such as speech recognition and speaker recognition. As a gating module, VAD not only improves the performance of downstream components by discarding non-speech signal, but also significantly reduces the overall computational cost due to its relatively small size.

deep learning, speech recognition, target speaker, (18 more...)

1908.04284

Country: North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology (0.46)
Energy (0.34)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)