AITopics | term detection

Collaborating Authors

term detection

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Singh, Anup, Demuynck, Kris, Arora, Vipul

arXiv.org Artificial IntelligenceDec-21-2024

Spoken term detection (STD) is often hindered by reliance on frame-level features and the computationally intensive DTW-based template matching, limiting its practicality. To address these challenges, we propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms. Our approach focuses on generating consistent token sequences across varying utterances of the same term. We also propose a bidirectional state space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are further encoded into discrete tokens. Our analysis shows that our speech tokens exhibit greater speaker invariance than those from existing tokenizers, making them more suitable for STD tasks. Empirical evaluation on LibriSpeech and TIMIT databases indicates that our method outperforms existing STD baselines while being more efficient.

arxiv preprint arxiv, sequence, utterance, (12 more...)

arXiv.org Artificial Intelligence

2411.141

Country:

North America > United States (0.14)
Europe > Spain > Basque Country (0.04)
Europe > Belgium (0.04)
Asia > India > Uttar Pradesh > Kanpur (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Information Management > Search (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.34)

Add feedback

Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Fatemeh, Allahdadi, Rahil, Mahdian Toroghi, Hassan, Zareian

arXiv.org Artificial IntelligenceOct-5-2024

Query-by-example spoken term detection (QbE-STD) is typically constrained by transcribed data scarcity and language specificity. This paper introduces a novel, language-agnostic QbE-STD model leveraging image processing techniques and transformer architecture. By employing a pre-trained XLSR-53 network for feature extraction and a Hough transform for detection, our model effectively searches for user-defined spoken terms within any audio file. Experimental results across four languages demonstrate significant performance gains (19-54%) over a CNN-based baseline. While processing time is improved compared to DTW, accuracy remains inferior. Notably, our model offers the advantage of accurately counting query term repetitions within the target audio.

detection, representation, term detection, (12 more...)

arXiv.org Artificial Intelligence

2410.04091

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.06)
Europe > Spain > Basque Country (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Speech (1.00)
(4 more...)

Add feedback

Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units

Yusuf, Bolaji, Černocký, Jan "Honza", Saraçlar, Murat

arXiv.org Artificial IntelligenceJul-5-2024

End-to-end (E2E) keyword search (KWS) has emerged as an alternative and complimentary approach to conventional keyword search which depends on the output of automatic speech recognition (ASR) systems. While E2E methods greatly simplify the KWS pipeline, they generally have worse performance than their ASR-based counterparts, which can benefit from pretraining with untranscribed data. In this work, we propose a method for pretraining E2E KWS systems with untranscribed data, which involves using acoustic unit discovery (AUD) to obtain discrete units for untranscribed data and then learning to locate sequences of such units in the speech. We conduct experiments across languages and AUD systems: we show that finetuning such a model significantly outperforms a model trained from scratch, and the performance improvements are generally correlated with the quality of the AUD system used for pretraining.

acoustic unit, query, unit discovery, (13 more...)

arXiv.org Artificial Intelligence

2407.04652

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)
Asia > Middle East > Republic of Türkiye (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Add feedback

Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings

Švec, Jan, Šmídl, Luboš, Psutka, Josef V., Pražák, Aleš

arXiv.org Artificial IntelligenceOct-21-2022

The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by the deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech languages.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2021-1704

2210.11895

Country: Europe > Czechia (0.04)

Genre:

Research Report (1.00)
Overview > Innovation (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Neural Network based End-to-End Query by Example Spoken Term Detection

Ram, Dhananjay, Miculicich, Lesly, Bourlard, Hervé

arXiv.org Machine LearningNov-19-2019

--This paper focuses on the problem of query by example spoken term detection (QbE-STD) in zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual bottleneck features, and show that multilingual features perform increasingly better with more training languages. Previously, it has been shown that the DTW based matching can be replaced with a CNN based matching while using posterior features. Here, we show that the CNN based matching outperforms DTW based matching using bottleneck features as well. In this case, the feature extraction and pattern matching stages of our QbE-STD system are optimized independently of each other . We propose to integrate these two stages in a fully neural network based end-to-end learning framework to enable joint optimization of those two stages simultaneously. The proposed approaches are evaluated on two challenging multilingual datasets: Spoken Web Search 2013 and Query by Example Search on Speech T ask 2014, demonstrating in each case significant improvements. Query-by-example spoken term detection (QbE-STD) is defined as the task of detecting all files from an audio archive which contain a spoken query provided by a user (see Figure 1). It enables users to search through multilingual audio archives using their own speech. The primary difference from keyword spotting is that QbE-STD relies on spoken queries instead of textual queries making it a language independent task. In general, the queries and test utterances are generated by different speakers in different languages with varying acoustic conditions and without constraints on vocabulary, pronunciation lexicon, accents etc. Thus, the search is performed relying only on acoustic data of the query and test utterances with no language specific resources, as a zero-resource task. It is essentially a pattern matching problem in the context of speech data where the targeted pattern is the information represented using speech signal and given to the system as a spoken query.

bottleneck feature, query, test utterance, (15 more...)

arXiv.org Machine Learning

1911.08332

Country:

Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > Spain > Basque Country (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Weakly Supervised One-Shot Detection with Attention Siamese Networks

Keren, Gil, Schmitt, Maximilian, Kehrenberg, Thomas, Schuller, Björn

arXiv.org Machine LearningJan-12-2018

We consider the task of weakly supervised one-shot detection. In this task, we attempt to perform a detection task over a set of unseen classes, when training only using weak binary labels that indicate the existence of a class instance in a given example. The model is conditioned on a single exemplar of an unseen class and a target example that may or may not contain an instance of the same class as the exemplar. A similarity map is computed by using a Siamese neural network to map the exemplar and regions of the target example to a latent representation space and then computing cosine similarity scores between representations. An attention mechanism weights different regions in the target example, and enables learning of the one-shot detection task using the weaker labels alone. The model can be applied to detection tasks from different domains, including computer vision object detection. We evaluate our attention Siamese networks on a one-shot detection task from the audio domain, where it detects audio keywords in spoken utterances. Our model considerably outperforms a baseline approach and yields a 42.6% average precision for detection across 10 unseen classes. Moreover, architectural developments from computer vision object detection models such as a region proposal network can be incorporated into the model architecture, and results show that performance is expected to improve by doing so.

artificial intelligence, detection, machine learning, (19 more...)

arXiv.org Machine Learning

1801.03329

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Quebec > Montreal (0.05)
(14 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Fusion of Multiple Features and Supervised Learning for Chinese OOV Term Detection and POS Guessing

Zhang, Yuejie (Fudan University) | Cen, Lei (Fudan University) | Wu, Wei (Fudan University) | Jin, Cheng (Fudan University) | Xue, Xiangyang (Fudan University)

AAAI ConferencesJul-19-2011

In this paper, to support more precise Chinese Out-of-Vocabulary (OOV) term detection and Part-of-Speech (POS) guessing, a unified mechanism is proposed and formulated based on the fusion of multiple features and supervised learning. Besides all the traditional features, the new features for statistical information and global contexts are introduced, as well as some constraints and heuristic rules, which reveal the relationships among OOV term candidates. Our experiments on the Chinese corpora from both People’s Daily and SIGHAN 2005 have achieved the consistent results, which are better than those acquired by pure rule-based or statistics-based models. From the experimental results for combining our model with Chinese monolingual retrieval on the data sets of TREC-9, it is found that the obvious improvement for the retrieval performance can also be obtained.

information, oov term, term detection, (12 more...)

AAAI Conferences

Twenty-Second International Joint Conference on Artificial Intelligence

Country:

Asia > China > Shanghai > Shanghai (0.05)
North America > United States (0.04)
Europe > Serbia > Central Serbia > Belgrade (0.04)

Industry: Health & Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.61)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.49)

Add feedback

Likelihood-based semi-supervised model selection with applications to speech processing

White, Christopher M., Khudanpur, Sanjeev P., Wolfe, Patrick J.

arXiv.org Machine LearningNov-19-2009

In conventional supervised pattern recognition tasks, model selection is typically accomplished by minimizing the classification error rate on a set of so-called development data, subject to ground-truth labeling by human experts or some other means. In the context of speech processing systems and other large-scale practical applications, however, such labeled development data are typically costly and difficult to obtain. This article proposes an alternative semi-supervised framework for likelihood-based model selection that leverages unlabeled data by using trained classifiers representing each model to automatically generate putative labels. The errors that result from this automatic labeling are shown to be amenable to results from robust statistics, which in turn provide for minimax-optimal censored likelihood ratio tests that recover the nonparametric sign test as a limiting case. This approach is then validated experimentally using a state-of-the-art automatic speech recognition system to select between candidate word pronunciations using unlabeled speech data that only potentially contain instances of the words under test. Results provide supporting evidence for the utility of this approach, and suggest that it may also find use in other applications of machine learning.

artificial intelligence, machine learning, pronunciation, (16 more...)

arXiv.org Machine Learning

doi: 10.1109/JSTSP.2010.2076050

0911.3944

Country: North America > United States (0.67)

Genre: Research Report (0.50)

Industry: Law > Civil Rights & Constitutional Law (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.67)
(2 more...)

Add feedback