AITopics | vad

vad

MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection

Neural Information Processing Systems, Jun-11-2026, 12:48:58 GMT

Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos. Recently, offline VAD has garnered substantial research attention, which has been invigorated by the progress in large language models (LLMs) and vision-language models (VLMs), offering the potential for a more nuanced understanding of anomalies. However, online VAD has seldom received attention due to real-time constraints and computational intensity.

data mining, large language model, machine learning, (10 more...)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (0.81)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.38)

Multi-Stage Speaker Diarization for Noisy Classrooms

Khan, Ali Sartaz, Ogunremi, Tolulope, Attia, Ahmed Adel, Demszky, Dorottya

arXiv.org Artificial Intelligence, May-28-2025

Speaker diarization, the process of identifying "who spoke when" in audio recordings, is essential for understanding classroom dynamics. However, classroom settings present distinct challenges, including poor recording quality, high levels of background noise, overlapping speech, and the difficulty of accurately capturing children's voices. This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models, including self-supervised transformer-based frame-wise VAD models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions. We conduct experiments using two datasets from English-speaking classrooms to separate teacher vs. student speech and to separate all speakers. Our results show that denoising significantly improves the Diarization Error Rate (DER) by reducing the rate of missed speech. Additionally, training on both denoised and noisy datasets leads to substantial performance gains in noisy conditions. The hybrid VAD model leads to further improvements in speech detection, achieving a DER as low as 17% in teacher-student experiments and 45% in all-speaker experiments. However, we also identified trade-offs between voice activity detection and speaker confusion. Overall, our study highlights the effectiveness of multi-stage diarization models and integrating ASR-based information for enhancing speaker diarization in noisy classroom environments.
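The abstract does not say how the ASR word-level timestamps and the frame-level VAD predictions are combined. The following is only a minimal sketch of one plausible rule, a per-frame union: a frame counts as speech if the VAD probability clears a threshold or if the frame overlaps an ASR word. The function name `hybrid_vad`, the 20 ms frame length, and the 0.5 threshold are illustrative assumptions, not details taken from the paper or from NeMo.

```python
def hybrid_vad(frame_probs, word_spans_ms, frame_ms=20, threshold=0.5):
    """Return a per-frame speech mask (list of bool).

    frame_probs   : frame-level VAD speech probabilities, one per frame
    word_spans_ms : (start_ms, end_ms) pairs from ASR word timestamps
    frame_ms      : frame length in milliseconds (assumed value)
    threshold     : VAD decision threshold (assumed value)

    Assumed fusion rule: union of thresholded VAD and ASR word coverage.
    """
    # Step 1: threshold the frame-level VAD probabilities.
    mask = [p >= threshold for p in frame_probs]

    # Step 2: mark every frame that overlaps an ASR word as speech.
    for start_ms, end_ms in word_spans_ms:
        first = max(0, start_ms // frame_ms)
        last = min(len(mask), -(-end_ms // frame_ms))  # ceiling division
        for i in range(first, last):
            mask[i] = True
    return mask


if __name__ == "__main__":
    probs = [0.9, 0.2, 0.1, 0.1, 0.8]
    words = [(40, 70)]  # one word covering frames 2 and 3
    print(hybrid_vad(probs, words))
```

A union rule of this kind can only add speech frames, which matches the reported reduction in missed speech, but it may also raise false alarms when the ASR hallucinates words; the paper's actual trade-off analysis may rest on a different rule.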

experiment, machine learning, natural language, (18 more...)

2505.10879

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (0.86)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

GPT's Devastated and LLaMA's Content: Emotion Representation Alignment in LLMs for Keyword-based Generation

Choudhury, Shadab, Kumar, Asha, Martin, Lara J.

arXiv.org Artificial Intelligence, Mar-14-2025

In controlled text generation using large language models (LLMs), gaps arise between the language model's interpretation and human expectations. We look at the problem of controlling emotions in keyword-based sentence generation for both GPT-4 and LLaMA-3. We selected four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. Our human evaluation looked at the Human-LLM alignment for each representation, as well as the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. However, we found that converting the originally-numeric VAD scales to Lexical scales (e.g., +4.0 becomes "High") dramatically improved agreement. Furthermore, the perception of how much a generated sentence conveys an emotion is highly dependent on the LLM, representation type, and which emotion it is.
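To make the numeric-to-lexical conversion concrete, here is a small sketch. The abstract gives only one data point (+4.0 becomes "High"); the three-level scheme, the cut points at -2.0 and +2.0, and the names `to_lexical` and `describe_vad` are hypothetical choices made for illustration and are not the paper's mapping.

```python
def to_lexical(score, low_cut=-2.0, high_cut=2.0):
    """Map a numeric VAD score to a coarse word label.

    The cut points are assumed for illustration; the source only
    states that +4.0 maps to "High".
    """
    if score >= high_cut:
        return "High"
    if score <= low_cut:
        return "Low"
    return "Moderate"


def describe_vad(valence, arousal, dominance):
    """Render a numeric VAD triple as a lexical phrase for a prompt."""
    return (
        f"{to_lexical(valence)} valence, "
        f"{to_lexical(arousal)} arousal, "
        f"{to_lexical(dominance)} dominance"
    )


if __name__ == "__main__":
    print(describe_vad(-4.0, 4.0, 0.5))
```

The phrase produced by `describe_vad` could be placed in a generation prompt in place of the raw numbers, which is the kind of substitution the abstract reports as improving human-LLM agreement.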

emotion, participant, representation, (12 more...)

2503.11881

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
(21 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

An End-to-End Approach for Korean Wakeword Systems with Speaker Authentication

Seo, Geonwoo

arXiv.org Artificial Intelligence, Jan-21-2025

Wakeword detection plays a critical role in enabling AI assistants to listen to user voices and interact effectively. However, for languages other than English, there is a significant lack of pre-trained wakeword models. Additionally, systems that merely determine the presence of a wakeword can pose serious privacy concerns. In this paper, we propose an end-to-end approach that trains wakewords for non-English languages, particularly Korean, and uses this to develop a Voice Authentication model to protect user privacy. Our implementation employs the open-source platform OpenWakeWord, which performs wakeword detection using an FCN (Fully-Connected Network) architecture. Once a wakeword is detected, our custom-developed code calculates cosine similarity for robust user authentication. Experimental results demonstrate the effectiveness of our approach, achieving Equal Error Rates (EER) of 16.79% for wakeword detection and 6.6% for voice authentication. These findings highlight the model's potential in providing secure and accurate wakeword detection and authentication for Korean users.
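The authentication step described above compares voice representations by cosine similarity. The sketch below shows that comparison in isolation; how the embeddings are produced is not stated in the abstract, so the plain lists of floats, the function names, and the 0.75 acceptance threshold are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def is_authorized(enrolled, candidate, threshold=0.75):
    """Accept the speaker if the embeddings are similar enough.

    `threshold` is a hypothetical value; in practice it would be tuned,
    for example at the operating point where false accepts equal
    false rejects (the EER point).
    """
    return cosine_similarity(enrolled, candidate) >= threshold


if __name__ == "__main__":
    enrolled = [0.2, 0.9, 0.4]
    same_speaker = [0.25, 0.85, 0.45]
    other_speaker = [0.9, -0.1, 0.1]
    print(is_authorized(enrolled, same_speaker))
    print(is_authorized(enrolled, other_speaker))
```

Because cosine similarity ignores vector magnitude, a louder or quieter utterance that scales the embedding does not by itself change the decision.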

artificial intelligence, machine learning, threshold, (13 more...)

2501.12194

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.37)

VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

Ye, Muchao, Liu, Weiyang, He, Pan

arXiv.org Artificial Intelligence, Dec-1-2024

The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.

anomaly, video, vlm, (16 more...)

2412.01095

Country:

North America > United States > Iowa (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
Asia > India (0.04)

Genre: Research Report (1.00)

Industry: Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Deep Learning for Video Anomaly Detection: A Review

Wu, Peng, Pan, Chengyu, Yan, Yuting, Pang, Guansong, Wang, Peng, Zhang, Yanning

arXiv.org Artificial Intelligence, Sep-9-2024

Video anomaly detection (VAD) aims to discover behaviors or events that deviate from normality in videos. As a long-standing task in the field of computer vision, VAD has witnessed substantial progress. In the era of deep learning, with the explosion of architectures of continuously growing capability and capacity, a great variety of deep learning based methods are constantly emerging for the VAD task, greatly improving the generalization ability of detection algorithms and broadening the application scenarios. Such a multitude of methods and so large a body of literature make a comprehensive survey a pressing necessity. In this paper, we present an extensive and comprehensive research review covering five categories, namely semi-supervised, weakly supervised, fully supervised, unsupervised, and open-set supervised VAD. We also delve into the latest VAD works based on pre-trained large models, remedying the limitations of past reviews, which focused only on semi-supervised VAD and small-model-based methods. For the VAD task with different levels of supervision, we construct a well-organized taxonomy, discuss in depth the characteristics of different types of methods, and show their performance comparisons. In addition, this review covers the public datasets, open-source code, and evaluation metrics for all the aforementioned VAD tasks. Finally, we provide several important research directions for the VAD community.

anomaly detection, detection, proceedings, (12 more...)

2409.05383

Country:

Asia > Singapore > Central Region > Singapore (0.04)
Asia > China (0.04)

Genre:

Research Report (1.00)
Overview (1.00)

Industry: Information Technology (0.67)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation

Moslem, Yasmin

arXiv.org Artificial Intelligence, Jun-27-2024

This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper, and employed a number of data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.

dataset, proceedings, translation, (12 more...)

2406.17363

Country:

Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Massachusetts (0.04)
(12 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

A Real-Time Voice Activity Detection Based On Lightweight Neural Network

Jia, Jidong, Zhao, Pei, Wang, Di

arXiv.org Artificial Intelligence, May-26-2024

Voice activity detection (VAD) is the task of detecting speech in an audio stream, which is challenging due to numerous unseen noises and low signal-to-noise ratios in real environments. Recently, neural network-based VADs have alleviated the degradation of performance to some extent. However, the majority of existing studies have employed excessively large models and incorporated future context, while neglecting to evaluate the operational efficiency and latency of the models. In this paper, we propose a lightweight and real-time neural network called MagicNet, which utilizes causal, depthwise separable 1-D convolutions and a GRU. Without relying on future features as input, our proposed model is compared with two state-of-the-art algorithms on synthesized in-domain and out-of-domain test datasets. The evaluation results demonstrate that MagicNet can achieve improved performance and robustness with fewer parameters.
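For readers unfamiliar with the building block named above, the following is a generic, dependency-free sketch of a causal depthwise separable 1-D convolution. It is not MagicNet's layer: the function name, the list-of-lists layout, and the toy weights are assumptions chosen only to show the two defining properties, namely that each output frame depends on current and past frames only, and that the convolution is split into a per-channel (depthwise) pass followed by a 1x1 (pointwise) channel mix.

```python
def causal_depthwise_separable_conv1d(x, depthwise, pointwise):
    """Apply a causal depthwise separable 1-D convolution.

    x         : list of input channels, each a list of T samples
    depthwise : one kernel (list of K weights) per input channel;
                the last weight multiplies the current sample
    pointwise : out_channels x in_channels weight matrix (1x1 conv)
    Returns a list of output channels, each with T samples.
    """
    num_frames = len(x[0])

    # Depthwise stage: each channel is filtered by its own kernel.
    # Left-padding with K-1 zeros means output t never sees input > t.
    filtered = []
    for channel, kernel in zip(x, depthwise):
        k = len(kernel)
        padded = [0.0] * (k - 1) + list(channel)
        filtered.append(
            [sum(kernel[j] * padded[t + j] for j in range(k))
             for t in range(num_frames)]
        )

    # Pointwise stage: mix channels independently at every time step.
    return [
        [sum(w * filtered[c][t] for c, w in enumerate(row))
         for t in range(num_frames)]
        for row in pointwise
    ]


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
    depthwise = [[1.0, 1.0], [0.0, 1.0]]
    pointwise = [[1.0, 1.0]]
    print(causal_depthwise_separable_conv1d(x, depthwise, pointwise))
```

With C input channels, C' output channels, and kernel length K, this factorization uses C*K + C*C' weights instead of the C*C'*K of a standard convolution, which is the usual reason such layers appear in lightweight models; the left-only padding is what allows frame-by-frame streaming without future context.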

activity detection, neural network, voice activity detection, (13 more...)

2405.16797

Country:

Asia > China > Shanghai > Shanghai (0.05)
Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(7 more...)

Genre: Research Report > New Finding (0.34)

Industry: Media (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Supervised Anomaly Detection for Complex Industrial Images

Baitieva, Aimira, Hurych, David, Besnier, Victor, Bernard, Olivier

arXiv.org Artificial Intelligence, May-11-2024

Automating visual inspection in industrial production lines is essential for increasing product quality across various industries. Anomaly detection (AD) methods serve as robust tools for this purpose. However, existing public datasets primarily consist of images without anomalies, limiting the practical application of AD methods in production settings. To address this challenge, we present (1) the Valeo Anomaly Dataset (VAD), a novel real-world industrial dataset comprising 5000 images, including 2000 instances of challenging real defects across more than 20 subclasses. Acknowledging that traditional AD methods struggle with this dataset, we introduce (2) Segmentation-based Anomaly Detector (SegAD). First, SegAD leverages anomaly maps as well as segmentation maps to compute local statistics. Next, SegAD uses these statistics and an optional supervised classifier score as input features for a Boosted Random Forest (BRF) classifier, yielding the final anomaly score. Our SegAD achieves state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset (+0.4% AUROC). The code and the models are publicly available.

anomaly detection, dataset, defect, (13 more...)

2405.04953

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Online speaker diarization of meetings guided by speech separation

Gruttadauria, Elio, Fontaine, Mathieu, Essid, Slim

arXiv.org Artificial Intelligence, Jan-30-2024

Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We envisage ConvTasNet and DPRNN as alternatives for the separation networks, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied on each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves the state-of-the-art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show the strength of our system particularly on overlapped speech sections.

diarization, separation, ssep model, (13 more...)

2402.00067

Country:

Europe > Switzerland (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
