test utterance
MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains
Xue, Leyan, Zhang, Changqing, Xue, Kecheng, Liu, Xiaohong, Wang, Guangyu, Han, Zongbo
Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.
ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification
Ma, Yi, Wang, Shuai, Liu, Tianchi, Li, Haizhou
In speaker verification, we use computational method to verify if an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable results comparable to those from manual forensic voice comparison. A novel approach, Explainable Phonetic Trait-Oriented (ExPO) network, is proposed in this paper to introduce the speaker's phonetic trait which describes the speaker's characteristics at the phonetic level, resembling what forensic comparison does. ExPO not only generates utterance-level speaker embeddings but also allows for fine-grained analysis and visualization of phonetic traits, offering an explainable speaker verification process. Furthermore, we investigate phonetic traits from within-speaker and between-speaker variation perspectives to determine which trait is most effective for speaker verification, marking an important step towards explainable speaker verification. Our code is available at https://github.com/mmmmayi/ExPO.
Vocal Tract Length Warped Features for Spoken Keyword Spotting
Sarkar, Achintya kr., Dwivedi, Priyanka, Tan, Zheng-Hua
In this paper, we propose several methods that incorporate vocal tract length (VTL) warped features for spoken keyword spotting (KWS). The first method, VTL-independent KWS, involves training a single deep neural network (DNN) that utilizes VTL features with various warping factors. During training, a specific VTL feature is randomly selected per epoch, allowing the exploration of VTL variations. During testing, the VTL features with different warping factors of a test utterance are scored against the DNN and combined with equal weight. In the second method scores the conventional features of a test utterance (without VTL warping) against the DNN. The third method, VTL-concatenation KWS, concatenates VTL warped features to form high-dimensional features for KWS. Evaluations carried out on the English Google Command dataset demonstrate that the proposed methods improve the accuracy of KWS.
GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting
Zhu, Pai, Bartel, Jacob W., Agarwal, Dhruuv, Partridge, Kurt, Park, Hyun Jin, Wang, Quan
We propose GE2E-KWS -- a generalized end-to-end training and evaluation framework for customized keyword spotting. Specifically, enrollment utterances are separated and grouped by keywords from the training batch and their embedding centroids are compared to all other test utterance embeddings to compute the loss. This simulates runtime enrollment and verification stages, and improves convergence stability and training speed by optimizing matrix operations compared to SOTA triplet loss approaches. To benchmark different models reliably, we propose an evaluation process that mimics the production environment and compute metrics that directly measure keyword matching accuracy. Trained with GE2E loss, our 419KB quantized conformer model beats a 7.5GB ASR encoder by 23.6% relative AUC, and beats a same size triplet loss model by 60.7% AUC. Our KWS models are natively streamable with low memory footprints, and designed to continuously run on-device with no retraining needed for new keywords (zero-shot).
Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection
Do, Phat, Coler, Matt, Dijkstra, Jelske, Klabbers, Esther
We compare using a PHOIBLE-based phone mapping method and using phonological features input in transfer learning for TTS in low-resource languages. We use diverse source languages (English, Finnish, Hindi, Japanese, and Russian) and target languages (Bulgarian, Georgian, Kazakh, Swahili, Urdu, and Uzbek) to test the language-independence of the methods and enhance the findings' applicability. We use Character Error Rates from automatic speech recognition and predicted Mean Opinion Scores for evaluation. Results show that both phone mapping and features input improve the output quality and the latter performs better, but these effects also depend on the specific language combination. We also compare the recently-proposed Angular Similarity of Phone Frequencies (ASPF) with a family tree-based distance measure as a criterion to select source languages in transfer learning. ASPF proves effective if label-based phone input is used, while the language distance does not have expected effects.
Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition
Lin, Yist Y., Han, Tao, Xu, Haihua, Pham, Van Tung, Khassanov, Yerbolat, Chong, Tze Yuang, He, Yi, Lu, Lu, Ma, Zejun
One of limitations in end-to-end automatic speech recognition (ASR) framework is its performance would be compromised if train-test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate train-test utterance length mismatch issue for short-video ASR task. Specifically, we are motivated by observations that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (~3 seconds on average), while our test utterance generated from voice activity detection front-end is much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, it's observed the proposed RUC method significantly improves long utterance recognition without performance drop on short one. Overall, it achieves 5.72% word error rate reduction on average for 15 languages and improved robustness to various utterance length.
ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
A challenging, unsolved problem in the speech recognition com(cid:173) munity is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recog(cid:173) nition is to automatically remove the noise from the cepstrum se(cid:173) quence before feeding it in to a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise(cid:173) free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for prob(cid:173) abilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech.
Parameter-Free Attentive Scoring for Speaker Verification
Pelecanos, Jason, Wang, Quan, Huang, Yiling, Moreno, Ignacio Lopez
This paper presents a novel study of parameter-free attentive scoring for speaker verification. Parameter-free scoring provides the flexibility of comparing speaker representations without the need of an accompanying parametric scoring model. Inspired by the attention component in Transformer neural networks, we propose a variant of the scaled dot product attention mechanism to compare enrollment and test segment representations. In addition, this work explores the effect on performance of (i) different types of normalization, (ii) independent versus tied query/key estimation, (iii) varying the number of key-value pairs and (iv) pooling multiple enrollment utterance statistics. Experimental results for a 4 task average show that a simple parameter-free attentive scoring mechanism can improve the average EER by 10% over the best cosine similarity baseline.
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
Kye, Seong Min, Jung, Youngmoon, Lee, Hae Beom, Hwang, Sung Ju, Kim, Hoirin
In realistic settings, a speaker recognition system needs to identify a speaker given a short utterance, while the utterance used to enroll may be relatively long. However, existing speaker recognition models perform poorly with such short utterances. To solve this problem, we introduce a meta-learning scheme with imbalance length pairs. Specifically, we use a prototypical network and train it with a support set of long utterances and a query set of short utterances. However, since optimizing for only the classes in the given episode is not sufficient to learn discriminative embeddings for other classes in the entire dataset, we additionally classify both support set and query set against the entire classes in the training set to learn a well-discriminated embedding space. By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models learned in a standard supervised learning framework on short utterance (1-2 seconds) on VoxCeleb dataset. We also validate our proposed model for unseen speaker identification, on which it also achieves significant gain over existing approaches.
Neural Network based End-to-End Query by Example Spoken Term Detection
Ram, Dhananjay, Miculicich, Lesly, Bourlard, Hervé
--This paper focuses on the problem of query by example spoken term detection (QbE-STD) in zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual bottleneck features, and show that multilingual features perform increasingly better with more training languages. Previously, it has been shown that the DTW based matching can be replaced with a CNN based matching while using posterior features. Here, we show that the CNN based matching outperforms DTW based matching using bottleneck features as well. In this case, the feature extraction and pattern matching stages of our QbE-STD system are optimized independently of each other . We propose to integrate these two stages in a fully neural network based end-to-end learning framework to enable joint optimization of those two stages simultaneously. The proposed approaches are evaluated on two challenging multilingual datasets: Spoken Web Search 2013 and Query by Example Search on Speech T ask 2014, demonstrating in each case significant improvements. Query-by-example spoken term detection (QbE-STD) is defined as the task of detecting all files from an audio archive which contain a spoken query provided by a user (see Figure 1). It enables users to search through multilingual audio archives using their own speech. The primary difference from keyword spotting is that QbE-STD relies on spoken queries instead of textual queries making it a language independent task. In general, the queries and test utterances are generated by different speakers in different languages with varying acoustic conditions and without constraints on vocabulary, pronunciation lexicon, accents etc. Thus, the search is performed relying only on acoustic data of the query and test utterances with no language specific resources, as a zero-resource task. It is essentially a pattern matching problem in the context of speech data where the targeted pattern is the information represented using speech signal and given to the system as a spoken query.