Goto

Collaborating Authors

 vad


Multi-Stage Speaker Diarization for Noisy Classrooms

arXiv.org Artificial Intelligence

Speaker diarization, the process of identifying "who spoke when" in audio recordings, is essential for understanding classroom dynamics. However, classroom settings present distinct challenges, including poor recording quality, high levels of background noise, overlapping speech, and the difficulty of accurately capturing children's voices. This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models, including self-supervised transformer-based frame-wise VAD models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions. We conduct experiments using two datasets from English speaking classrooms to separate teacher vs. student speech and to separate all speakers. Our results show that denoising significantly improves the Diarization Error Rate (DER) by reducing the rate of missed speech. Additionally, training on both denoised and noisy datasets leads to substantial performance gains in noisy conditions. The hybrid VAD model leads to further improvements in speech detection, achieving a DER as low as 17% in teacher-student experiments and 45% in all-speaker experiments. However, we also identified trade-offs between voice activity detection and speaker confusion. Overall, our study highlights the effectiveness of multi-stage diarization models and integrating ASR-based information for enhancing speaker diarization in noisy classroom environments.


GPT's Devastated and LLaMA's Content: Emotion Representation Alignment in LLMs for Keyword-based Generation

arXiv.org Artificial Intelligence

In controlled text generation using large language models (LLMs), gaps arise between the language model's interpretation and human expectations. We look at the problem of controlling emotions in keyword-based sentence generation for both GPT-4 and LLaMA-3. We selected four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. Our human evaluation looked at the Human-LLM alignment for each representation, as well as the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. However, we found that converting the originally-numeric VAD scales to Lexical scales (e.g., +4.0 becomes "High") dramatically improved agreement. Furthermore, the perception of how much a generated sentence conveys an emotion is highly dependent on the LLM, representation type, and which emotion it is.


An End-to-End Approach for Korean Wakeword Systems with Speaker Authentication

arXiv.org Artificial Intelligence

Wakeword detection plays a critical role in enabling AI assistants to listen to user voices and interact effectively. However, for languages other than English, there is a significant lack of pre-trained wakeword models. Additionally, systems that merely determine the presence of a wakeword can pose serious privacy concerns. In this paper, we propose an end-to-end approach that trains wakewords for Non-English languages, particulary Korean, and uses this to develop a Voice Authentication model to protect user privacy. Our implementation employs an open-source platform OpenWakeWord, which performs wakeword detection using an FCN (Fully-Connected Network) architecture. Once a wakeword is detected, our custom-developed code calculates cosine similarity for robust user authentication. Experimental results demonstrate the effectiveness of our approach, achieving a 16.79% and a 6.6% Equal Error Rate (EER) each in the Wakeword Detection and the Voice Authentication. These findings highlight the model's potential in providing secure and accurate wakeword detection and authentication for Korean users.


VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

arXiv.org Artificial Intelligence

The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.


Deep Learning for Video Anomaly Detection: A Review

arXiv.org Artificial Intelligence

Video anomaly detection (VAD) aims to discover behaviors or events deviating from the normality in videos. As a long-standing task in the field of computer vision, VAD has witnessed much good progress. In the era of deep learning, with the explosion of architectures of continuously growing capability and capacity, a great variety of deep learning based methods are constantly emerging for the VAD task, greatly improving the generalization ability of detection algorithms and broadening the application scenarios. Therefore, such a multitude of methods and a large body of literature make a comprehensive survey a pressing necessity. In this paper, we present an extensive and comprehensive research review, covering the spectrum of five different categories, namely, semi-supervised, weakly supervised, fully supervised, unsupervised and open-set supervised VAD, and we also delve into the latest VAD works based on pre-trained large models, remedying the limitations of past reviews in terms of only focusing on semi-supervised VAD and small model based methods. For the VAD task with different levels of supervision, we construct a well-organized taxonomy, profoundly discuss the characteristics of different types of methods, and show their performance comparisons. In addition, this review involves the public datasets, open-source codes, and evaluation metrics covering all the aforementioned VAD tasks. Finally, we provide several important research directions for the VAD community.


Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation

arXiv.org Artificial Intelligence

This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper, and employed a number of data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.


A Real-Time Voice Activity Detection Based On Lightweight Neural

arXiv.org Artificial Intelligence

Voice activity detection (VAD) is the task of detecting speech in an audio stream, which is challenging due to numerous unseen noises and low signal-to-noise ratios in real environments. Recently, neural network-based VADs have alleviated the degradation of performance to some extent. However, the majority of existing studies have employed excessively large models and incorporated future context, while neglecting to evaluate the operational efficiency and latency of the models. In this paper, we propose a lightweight and real-time neural network called MagicNet, which utilizes casual and depth separable 1-D convolutions and GRU. Without relying on future features as input, our proposed model is compared with two state-of-the-art algorithms on synthesized in-domain and out-domain test datasets. The evaluation results demonstrate that MagicNet can achieve improved performance and robustness with fewer parameter costs.


Supervised Anomaly Detection for Complex Industrial Images

arXiv.org Artificial Intelligence

Automating visual inspection in industrial production lines is essential for increasing product quality across various industries. Anomaly detection (AD) methods serve as robust tools for this purpose. However, existing public datasets primarily consist of images without anomalies, limiting the practical application of AD methods in production settings. To address this challenge, we present (1) the Valeo Anomaly Dataset (VAD), a novel real-world industrial dataset comprising 5000 images, including 2000 instances of challenging real defects across more than 20 subclasses. Acknowledging that traditional AD methods struggle with this dataset, we introduce (2) Segmentation-based Anomaly Detector (SegAD). First, SegAD leverages anomaly maps as well as segmentation maps to compute local statistics. Next, SegAD uses these statistics and an optional supervised classifier score as input features for a Boosted Random Forest (BRF) classifier, yielding the final anomaly score. Our SegAD achieves state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset (+0.4% AUROC). The code and the models are publicly available.


Online speaker diarization of meetings guided by speech separation

arXiv.org Artificial Intelligence

Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We envisage ConvTasNet and DPRNN as alternatives for the separation networks, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied on each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves the state-of-the-art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show the strength of our system particularly on overlapped speech sections.


A Survey on Visual Anomaly Detection: Challenge, Approach, and Prospect

arXiv.org Artificial Intelligence

Visual Anomaly Detection (VAD) endeavors to pinpoint deviations from the concept of normality in visual data, widely applied across diverse domains, e.g., industrial defect inspection, and medical lesion detection. This survey comprehensively examines recent advancements in VAD by identifying three primary challenges: 1) scarcity of training data, 2) diversity of visual modalities, and 3) complexity of hierarchical anomalies. Starting with a brief overview of the VAD background and its generic concept definitions, we progressively categorize, emphasize, and discuss the latest VAD progress from the perspective of sample number, data modality, and anomaly hierarchy. Through an in-depth analysis of the VAD field, we finally summarize future developments for VAD and conclude the key findings and contributions of this survey.