AITopics | test utterance

Collaborating Authors

test utterance

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

Xue, Leyan, Zhang, Changqing, Xue, Kecheng, Liu, Xiaohong, Wang, Guangyu, Han, Zongbo

arXiv.org Artificial IntelligenceNov-17-2025

Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2511.06452

Country:

North America > United States (1.00)
Asia > China (0.68)
Europe (0.67)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.93)
Health & Medicine > Therapeutic Area > Neurology (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
(2 more...)

Add feedback

ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification

Ma, Yi, Wang, Shuai, Liu, Tianchi, Li, Haizhou

arXiv.org Artificial IntelligenceJan-14-2025

In speaker verification, we use computational method to verify if an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable results comparable to those from manual forensic voice comparison. A novel approach, Explainable Phonetic Trait-Oriented (ExPO) network, is proposed in this paper to introduce the speaker's phonetic trait which describes the speaker's characteristics at the phonetic level, resembling what forensic comparison does. ExPO not only generates utterance-level speaker embeddings but also allows for fine-grained analysis and visualization of phonetic traits, offering an explainable speaker verification process. Furthermore, we investigate phonetic traits from within-speaker and between-speaker variation perspectives to determine which trait is most effective for speaker verification, marking an important step towards explainable speaker verification. Our code is available at https://github.com/mmmmayi/ExPO.

phonetic trait, speaker verification, verification, (16 more...)

arXiv.org Artificial Intelligence

2501.05729

Country:

Asia > China > Guangdong Province > Shenzhen (0.05)
Asia > Singapore (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(2 more...)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)

Add feedback

Vocal Tract Length Warped Features for Spoken Keyword Spotting

Sarkar, Achintya kr., Dwivedi, Priyanka, Tan, Zheng-Hua

arXiv.org Artificial IntelligenceJan-6-2025

In this paper, we propose several methods that incorporate vocal tract length (VTL) warped features for spoken keyword spotting (KWS). The first method, VTL-independent KWS, involves training a single deep neural network (DNN) that utilizes VTL features with various warping factors. During training, a specific VTL feature is randomly selected per epoch, allowing the exploration of VTL variations. During testing, the VTL features with different warping factors of a test utterance are scored against the DNN and combined with equal weight. In the second method scores the conventional features of a test utterance (without VTL warping) against the DNN. The third method, VTL-concatenation KWS, concatenates VTL warped features to form high-dimensional features for KWS. Evaluations carried out on the English Google Command dataset demonstrate that the proposed methods improve the accuracy of KWS.

artificial intelligence, machine learning, warped feature, (13 more...)

arXiv.org Artificial Intelligence

2501.03523

Country: Europe > Denmark (0.14)

Genre: Research Report > Experimental Study (0.69)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Add feedback

GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting

Zhu, Pai, Bartel, Jacob W., Agarwal, Dhruuv, Partridge, Kurt, Park, Hyun Jin, Wang, Quan

arXiv.org Artificial IntelligenceOct-21-2024

We propose GE2E-KWS -- a generalized end-to-end training and evaluation framework for customized keyword spotting. Specifically, enrollment utterances are separated and grouped by keywords from the training batch and their embedding centroids are compared to all other test utterance embeddings to compute the loss. This simulates runtime enrollment and verification stages, and improves convergence stability and training speed by optimizing matrix operations compared to SOTA triplet loss approaches. To benchmark different models reliably, we propose an evaluation process that mimics the production environment and compute metrics that directly measure keyword matching accuracy. Trained with GE2E loss, our 419KB quantized conformer model beats a 7.5GB ASR encoder by 23.6% relative AUC, and beats a same size triplet loss model by 60.7% AUC. Our KWS models are natively streamable with low memory footprints, and designed to continuously run on-device with no retraining needed for new keywords (zero-shot).

large language model, machine learning, utterance, (20 more...)

arXiv.org Artificial Intelligence

2410.16647

Country:

North America > United States > New York (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Add feedback

Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection

Do, Phat, Coler, Matt, Dijkstra, Jelske, Klabbers, Esther

arXiv.org Artificial IntelligenceJun-21-2023

We compare using a PHOIBLE-based phone mapping method and using phonological features input in transfer learning for TTS in low-resource languages. We use diverse source languages (English, Finnish, Hindi, Japanese, and Russian) and target languages (Bulgarian, Georgian, Kazakh, Swahili, Urdu, and Uzbek) to test the language-independence of the methods and enhance the findings' applicability. We use Character Error Rates from automatic speech recognition and predicted Mean Opinion Scores for evaluation. Results show that both phone mapping and features input improve the output quality and the latter performs better, but these effects also depend on the specific language combination. We also compare the recently-proposed Angular Similarity of Phone Frequencies (ASPF) with a family tree-based distance measure as a criterion to select source languages in transfer learning. ASPF proves effective if label-based phone input is used, while the language distance does not have expected effects.

artificial intelligence, machine learning, speech recognition, (17 more...)

arXiv.org Artificial Intelligence

2306.1204

Country:

Europe > Netherlands (0.04)
North America > Canada > Quebec > Montreal (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.83)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.52)

Add feedback

Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Lin, Yist Y., Han, Tao, Xu, Haihua, Pham, Van Tung, Khassanov, Yerbolat, Chong, Tze Yuang, He, Yi, Lu, Lu, Ma, Zejun

arXiv.org Artificial IntelligenceMay-25-2023

One of limitations in end-to-end automatic speech recognition (ASR) framework is its performance would be compromised if train-test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate train-test utterance length mismatch issue for short-video ASR task. Specifically, we are motivated by observations that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (~3 seconds on average), while our test utterance generated from voice activity detection front-end is much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, it's observed the proposed RUC method significantly improves long utterance recognition without performance drop on short one. Overall, it achieves 5.72% word error rate reduction on average for 15 languages and improved robustness to various utterance length.

artificial intelligence, machine learning, utterance, (16 more...)

arXiv.org Artificial Intelligence

2210.15876

Country:

Oceania > New Zealand (0.04)
Oceania > Australia (0.04)
North America > Canada (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition

Neural Information Processing SystemsApr-6-2023, 16:46:54 GMT

A challenging, unsolved problem in the speech recognition com(cid:173) munity is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recog(cid:173) nition is to automatically remove the noise from the cepstrum se(cid:173) quence before feeding it in to a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise(cid:173) free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for prob(cid:173) abilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech.

cid, learning dynamic noise model, robust speech recognition, (5 more...)

Neural Information Processing Systems

Country: North America > United States > New York > New York County > New York City (0.07)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.96)

Add feedback

Parameter-Free Attentive Scoring for Speaker Verification

Pelecanos, Jason, Wang, Quan, Huang, Yiling, Moreno, Ignacio Lopez

arXiv.org Artificial IntelligenceMar-6-2023

This paper presents a novel study of parameter-free attentive scoring for speaker verification. Parameter-free scoring provides the flexibility of comparing speaker representations without the need of an accompanying parametric scoring model. Inspired by the attention component in Transformer neural networks, we propose a variant of the scaled dot product attention mechanism to compare enrollment and test segment representations. In addition, this work explores the effect on performance of (i) different types of normalization, (ii) independent versus tied query/key estimation, (iii) varying the number of key-value pairs and (iv) pooling multiple enrollment utterance statistics. Experimental results for a 4 task average show that a simple parameter-free attentive scoring mechanism can improve the average EER by 10% over the best cosine similarity baseline.

artificial intelligence, machine learning, utterance, (18 more...)

arXiv.org Artificial Intelligence

2203.05642

Country:

North America > United States (0.04)
Asia > India (0.04)
South America > Brazil (0.04)
Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs

Kye, Seong Min, Jung, Youngmoon, Lee, Hae Beom, Hwang, Sung Ju, Kim, Hoirin

arXiv.org Machine LearningApr-6-2020

In realistic settings, a speaker recognition system needs to identify a speaker given a short utterance, while the utterance used to enroll may be relatively long. However, existing speaker recognition models perform poorly with such short utterances. To solve this problem, we introduce a meta-learning scheme with imbalance length pairs. Specifically, we use a prototypical network and train it with a support set of long utterances and a query set of short utterances. However, since optimizing for only the classes in the given episode is not sufficient to learn discriminative embeddings for other classes in the entire dataset, we additionally classify both support set and query set against the entire classes in the training set to learn a well-discriminated embedding space. By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models learned in a standard supervised learning framework on short utterance (1-2 seconds) on VoxCeleb dataset. We also validate our proposed model for unseen speaker identification, on which it also achieves significant gain over existing approaches.

recognition, short utterance, utterance, (15 more...)

arXiv.org Machine Learning

2004.02863

Country: Asia > South Korea (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.84)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.72)

Add feedback

Neural Network based End-to-End Query by Example Spoken Term Detection

Ram, Dhananjay, Miculicich, Lesly, Bourlard, Hervé

arXiv.org Machine LearningNov-19-2019

--This paper focuses on the problem of query by example spoken term detection (QbE-STD) in zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual bottleneck features, and show that multilingual features perform increasingly better with more training languages. Previously, it has been shown that the DTW based matching can be replaced with a CNN based matching while using posterior features. Here, we show that the CNN based matching outperforms DTW based matching using bottleneck features as well. In this case, the feature extraction and pattern matching stages of our QbE-STD system are optimized independently of each other . We propose to integrate these two stages in a fully neural network based end-to-end learning framework to enable joint optimization of those two stages simultaneously. The proposed approaches are evaluated on two challenging multilingual datasets: Spoken Web Search 2013 and Query by Example Search on Speech T ask 2014, demonstrating in each case significant improvements. Query-by-example spoken term detection (QbE-STD) is defined as the task of detecting all files from an audio archive which contain a spoken query provided by a user (see Figure 1). It enables users to search through multilingual audio archives using their own speech. The primary difference from keyword spotting is that QbE-STD relies on spoken queries instead of textual queries making it a language independent task. In general, the queries and test utterances are generated by different speakers in different languages with varying acoustic conditions and without constraints on vocabulary, pronunciation lexicon, accents etc. Thus, the search is performed relying only on acoustic data of the query and test utterances with no language specific resources, as a zero-resource task. It is essentially a pattern matching problem in the context of speech data where the targeted pattern is the information represented using speech signal and given to the system as a spoken query.

bottleneck feature, query, test utterance, (15 more...)

arXiv.org Machine Learning

1911.08332

Country:

Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > Spain > Basque Country (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback