AITopics | keyword spotting

Collaborating Authors

keyword spotting

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset

Elvers, Gereon, Landau, Gilad, Jones, Oiwi Parker

arXiv.org Artificial IntelligenceOct-31-2025

Non-invasive brain-computer interfaces (BCIs) are beginning to benefit from large, public benchmarks. However, current benchmarks target relatively simple, foundational tasks like Speech Detection and Phoneme Classification, while application-ready results on tasks like Brain-to-Text remain elusive. We propose Keyword Spotting (KWS) as a practically applicable, privacy-aware intermediate task. Using the deep 52-hour, within-subject LibriBrain corpus, we provide standardized train/validation/test splits for reproducible benchmarking, and adopt an evaluation protocol tailored to extreme class imbalance. Concretely, we use area under the precision-recall curve (AUPRC) as a robust evaluation metric, complemented by false alarms per hour (FA/h) at fixed recall to capture user-facing trade-offs. To simplify deployment and further experimentation within the research community, we are releasing an updated version of the pnpl library with word-level dataloaders and Colab-ready tutorials. As an initial reference model, we present a compact 1-D Conv/ResNet baseline with focal loss and top-k pooling that is trainable on a single consumer-class GPU. The reference model achieves approximately 13x the permutation baseline AUPRC on held-out sessions, demonstrating the viability of the task. Exploratory analyses reveal: (i) predictable within-subject scaling - performance improves log-linearly with more training hours - and (ii) the existence of word-level factors (frequency and duration) that systematically modulate detectability.

detection, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.21038

Country: Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Genre: Research Report > Experimental Study (0.68)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Real-Time Performance Benchmarking of TinyML Models in Embedded Systems (PICO: Performance of Inference, CPU, and Operations)

Dey, Abhishek, Srivastava, Saurabh, Singh, Gaurav, Pettit, Robert G.

arXiv.org Artificial IntelligenceSep-8-2025

This paper presents PICO-TINYML-BENCHMARK, a modular and platform-agnostic framework for benchmarking the real-time performance of TinyML models on resource-constrained embedded systems. Evaluating key metrics such as inference latency, CPU utilization, memory efficiency, and prediction stability, the framework provides insights into computational trade-offs and platform-specific optimizations. We benchmark three representative TinyML models -- Gesture Classification, Keyword Spotting, and MobileNet V2 -- on two widely adopted platforms, BeagleBone AI64 and Raspberry Pi 4, using real-world datasets. Results reveal critical trade-offs: the BeagleBone AI64 demonstrates consistent inference latency for AI-specific tasks, while the Raspberry Pi 4 excels in resource efficiency and cost-effectiveness. These findings offer actionable guidance for optimizing TinyML deployments, bridging the gap between theoretical advancements and practical applications in embedded systems.

machine learning, raspberry pi 4, real time system, (17 more...)

arXiv.org Artificial Intelligence

2509.04721

Genre: Research Report (0.84)

Industry: Information Technology (0.95)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Architecture > Real Time Systems (1.00)

Add feedback

Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices

Segal-Feldman, Yael, Bradlow, Ann R., Goldrick, Matthew, Keshet, Joseph

arXiv.org Artificial IntelligenceAug-8-2025

--Open-vocabulary keyword spotting (KWS) refers to the task of detecting words or terms within speech recordings, regardless of whether they were included in the training data. This paper introduces an open-vocabulary keyword spotting model with state-of-the-art detection accuracy for small-footprint devices. The model is composed of a speech encoder, a target keyword encoder, and a detection network. The speech encoder is either a tiny Whisper or a tiny Conformer . The target keyword encoder is implemented as a hyper-network that takes the desired keyword as a character string and generates a unique set of weights for a convolutional layer, which can be considered as a keyword-specific matched filter . The detection network uses the matched-filter weights to perform a keyword-specific convolution, which guides the cross-attention mechanism of a Perceiver module in determining whether the target term appears in the recording. The results indicate that our system achieves state-of-the-art detection performance and generalizes effectively to out-of-domain conditions, including second-language (L2) speech. Notably, our smallest model, with just 4.2 million parameters, matches or outperforms models that are several times larger, demonstrating both efficiency and robustness. Keyword Spotting (KWS) is the task of identifying a predefined set of target keywords. It is an essential task in speech recognition systems, particularly crucial for voice assistants on small-footprint devices, such as smartphones and small speakers [1]. Throughout the years, various techniques have been explored for this task, beginning with basic deep neural networks based on Convolutional Neural Networks (CNNs) [2], [3] to Recurrent Neural Networks (RNNs) as audio encoders [4], [5]. V arious approaches have been proposed to address this challenge.

artificial intelligence, keyword, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2508.04857

Country:

Asia (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Industry:

Media (0.34)
Leisure & Entertainment (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Multiple-Instance, Cascaded Classification for Keyword Spotting in Narrow-Band Audio

AbdulKader, Ahmad, Nassar, Kareem, El-Geish, Mohamed, Galvez, Daniel, Patil, Chetan

arXiv.org Artificial IntelligenceApr-28-2025

We propose using cascaded classifiers for a keyword spotting (KWS) task on narrow-band (NB), 8kHz audio acquired in non-IID environments -- a more challenging task than most state-of-the-art KWS systems face. We present a model that incorporates Deep Neural Networks (DNNs), cascading, multiple-feature representations, and multiple-instance learning. The cascaded classifiers handle the task's class imbalance and reduce power consumption on computationally-constrained devices via early termination. The KWS system achieves a false negative rate of 6% at an hourly false positive rate of 0.75

artificial intelligence, classifier, machine learning, (16 more...)

arXiv.org Artificial Intelligence

1711.08058

Country:

North America > United States (0.29)
Europe (0.29)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback

Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology

Dai, Weinan, Jiang, Yifeng, Liu, Yuanjing, Chen, Jinkun, Sun, Xin, Tao, Jinglei

arXiv.org Artificial IntelligenceAug-31-2024

This paper addresses the persistent challenge in Keyword Spotting (KWS), a fundamental component in speech technology, regarding the acquisition of substantial labeled data for training. Given the difficulty in obtaining large quantities of positive samples and the laborious process of collecting new target samples when the keyword changes, we introduce a novel approach combining unsupervised contrastive learning and a unique augmentation-based technique. Our method allows the neural network to train on unlabeled data sets, potentially improving performance in downstream tasks with limited labeled data sets. We also propose that similar high-level feature representations should be employed for speech utterances with the same keyword despite variations in speed or volume. To achieve this, we present a speech augmentation-based unsupervised learning method that utilizes the similarity between the bottleneck layer feature and the audio reconstructing information for auxiliary training. Furthermore, we propose a compressed convolutional architecture to address potential redundancy and non-informative information in KWS tasks, enabling the model to simultaneously learn local features and focus on long-term information. This method achieves strong performance on the Google Speech Commands V2 Dataset. Inspired by recent advancements in sign spotting and spoken term detection, our method underlines the potential of our contrastive learning approach in KWS and the advantages of Query-by-Example Spoken Term Detection strategies. The presented CAB-KWS provide new perspectives in the field of KWS, demonstrating effective ways to reduce data collection efforts and increase the system's robustness.

classification accuracy, keyword, learning, (13 more...)

arXiv.org Artificial Intelligence

2409.00356

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > Texas > Brazos County > College Station (0.04)
North America > Canada > Nova Scotia > Halifax Regional Municipality > Halifax (0.04)
(3 more...)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors

Rusci, Manuele, Paci, Francesco, Fariselli, Marco, Flamand, Eric, Tuytelaars, Tinne

arXiv.org Artificial IntelligenceAug-22-2024

This paper proposes a self-learning framework to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after the deployment on ultra-low power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to the new recorded audio frames based on a similarity score with respect to few user recordings. By experimenting with multiple KWS models with a number of parameters up to 0.5M on two public datasets, we show an accuracy improvement of up to +19.2% and +16.0% vs. the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient Microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real-time with an average power cost of up to 8.2 mW. On the same platform, we estimate an energy cost for on-device training 10x lower than the labeling energy if sampling a new utterance every 5 s or 16.4 s with a DS-CNN-S or a DS-CNN-M model. Our empirical result paves the way to self-adaptive personalized KWS sensors at the extreme edge.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2408.12481

Country:

Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Add feedback

Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Park, Hyun Jin, Agarwal, Dhruuv, Chen, Neng, Sun, Rentao, Partridge, Kurt, Chen, Justin, Zhang, Harry, Zhu, Pai, Bartel, Jacob, Kastner, Kyle, Wang, Gary, Rosenberg, Andrew, Wang, Quan

arXiv.org Artificial IntelligenceAug-19-2024

The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations. Utilizing large amounts of text-to-speech (TTS) synthesized data can reduce the cost and time associated with KWS development. However, TTS data may contain artifacts not present in real speech, which the KWS model can exploit (overfit), leading to degraded accuracy on real speech. To address this issue, we propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data. Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss. Surprisingly, we also observed that the adversarial setup improves accuracy by up to 8%, even when trained solely on TTS and real negative speech data, without any real positive examples.

adversarial training, keyword spotting, minimize tts data overfitting

arXiv.org Artificial Intelligence

2408.10463

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Advancing Airport Tower Command Recognition: Integrating Squeeze-and-Excitation and Broadcasted Residual Learning

Lin, Yuanxi, Zhou, Tonglin, Xiao, Yang

arXiv.org Artificial IntelligenceJun-28-2024

Accurate recognition of aviation commands is vital for flight safety and efficiency, as pilots must follow air traffic control instructions precisely. This paper addresses challenges in speech command recognition, such as noisy environments and limited computational resources, by advancing keyword spotting technology. We create a dataset of standardized airport tower commands, including routine and emergency instructions. We enhance broadcasted residual learning with squeeze-and-excitation and time-frame frequency-wise squeeze-and-excitation techniques, resulting in our BC-SENet model. This model focuses on crucial information with fewer parameters. Our tests on five keyword spotting models, including BC-SENet, demonstrate superior accuracy and efficiency. These findings highlight the effectiveness of our model advancements in improving speech command recognition for aviation safety and efficiency in noisy, high-stakes environments. Additionally, BC-SENet shows comparable performance on the common Google Speech Command dataset.

convolution, dataset, keyword, (15 more...)

arXiv.org Artificial Intelligence

2406.18313

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > Singapore (0.04)
Asia > Russia (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Transportation > Air (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech (0.70)

Add feedback

ED-sKWS: Early-Decision Spiking Neural Networks for Rapid,and Energy-Efficient Keyword Spotting

Song, Zeyang, Liu, Qianhui, Yang, Qu, Peng, Yizhou, Li, Haizhou

arXiv.org Artificial IntelligenceJun-13-2024

Keyword Spotting (KWS) is essential in edge computing requiring rapid and energy-efficient responses. Spiking Neural Networks (SNNs) are well-suited for KWS for their efficiency and temporal capacity for speech. To further reduce the latency and energy consumption, this study introduces ED-sKWS, an SNN-based KWS model with an early-decision mechanism that can stop speech processing and output the result before the end of speech utterance. Furthermore, we introduce a Cumulative Temporal (CT) loss that can enhance prediction accuracy at both the intermediate and final timesteps. To evaluate early-decision performance, we present the SC-100 dataset including 100 speech commands with beginning and end timestamp annotation. Experiments on the Google Speech Commands v2 and our SC-100 datasets show that ED-sKWS maintains competitive accuracy with 61% timesteps and 52% energy consumption compared to SNN models without early-decision mechanism, ensuring rapid response and energy efficiency.

keyword, neural network, timestep, (15 more...)

arXiv.org Artificial Intelligence

2406.12726

Country:

Asia > China > Guangdong Province > Shenzhen (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Singapore > Central Region > Singapore (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.50)

Industry: Energy (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients

V, Kesavaraj, M, Anuprabha, Vuppala, Anil Kumar

arXiv.org Artificial IntelligenceMay-23-2024

Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches of user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features may face challenges in accurately identifying closely related pronunciation of audio-text pairs, due to their limited capability in capturing the temporal dynamics of the speech signal. To address this challenge, we propose to use shifted delta coefficients (SDC) which help in capturing pronunciation variability (transition between connecting phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various configurations of SDC are explored to find the suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline feature, exhibiting an improvement of 8.32% in area under the curve (AUC) and 8.69% in terms of equal error rate (EER) on the challenging Libriphrase-hard dataset. Moreover, the proposed approach demonstrated superior performance when compared to state-of-the-art UDKWS techniques.

dataset, keyword, keyword spotting, (14 more...)

arXiv.org Artificial Intelligence

2405.14489

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > India > Telangana > Hyderabad (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.68)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback