AITopics | speaker detection

speaker detection

An Efficient and Streaming Audio Visual Active Speaker Detection System

Kundu, Arnav, Jin, Yanzi, Sekhavat, Mohammad, Horton, Max, Tormoen, Danny, Naik, Devang

arXiv.org Artificial Intelligence | Sep-13-2024

This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real-time whether a person is speaking or not in a series of video frames. While previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap exists in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames utilized by the ASD model. By doing so, we alleviate the need for processing the entire sequence of future frames before a decision is made, significantly reducing latency. Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference. This tackles the persistent memory issues associated with running streaming ASD systems. Beyond these theoretical frameworks, we conduct extensive experiments to validate our approach. Our results demonstrate that constrained transformer models can achieve performance comparable to or even better than state-of-the-art recurrent models, such as uni-directional GRUs, with a significantly reduced number of context frames. Moreover, we shed light on the temporal memory requirements of ASD systems, revealing that larger past context has a more profound impact on accuracy than future context. When profiling on a CPU we find that our efficient architecture is memory bound by the amount of past context it can use and that the compute cost is negligible as compared to the memory cost.
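
The two constraints described in this abstract (a cap on future context frames and a cap on past context frames) can be pictured as a banded attention mask. The following is a minimal sketch under that assumption, not the authors' implementation; `context_mask`, `lookahead_delay_ms`, and the frame rate used in the test are hypothetical names and values chosen for illustration.

```python
from typing import List


def context_mask(num_frames: int, past: int, future: int) -> List[List[bool]]:
    """Return mask[t][s] == True when frame t may attend to frame s.

    Frame t sees at most `past` earlier frames and `future` later frames.
    This is an illustrative banded mask, not the paper's exact model.
    """
    if past < 0 or future < 0:
        raise ValueError("past and future must be non-negative")
    return [
        [-future <= t - s <= past for s in range(num_frames)]
        for t in range(num_frames)
    ]


def lookahead_delay_ms(future: int, fps: float) -> float:
    """Delay added by waiting for `future` frames at a given frame rate."""
    return 1000.0 * future / fps
```

In this picture, the future window sets how long the system waits before it can emit a decision, while the past window sets how many frames must be kept in memory during streaming.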

active speaker detection, encoder, future context, (16 more...)

arXiv: 2409.09018

Country:

Europe > Portugal > Braga > Braga (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report > New Finding (0.54)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Architecture > Real Time Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Imitation of human motion achieves natural head movements for humanoid robots in an active-speaker detection task

Ding, Bosong, Kirtay, Murat, Spigler, Giacomo

arXiv.org Artificial Intelligence | Jul-16-2024

Head movements are crucial for social human-human interaction. They can transmit important cues (e.g., joint attention, speaker detection) that cannot be achieved with verbal interaction alone. This advantage also holds for human-robot interaction. Even though modeling human motions through generative AI models has become an active research area within robotics in recent years, the use of these methods for producing head movements in human-robot interaction remains underexplored. In this work, we employed a generative AI pipeline to produce human-like head movements for a Nao humanoid robot. In addition, we tested the system on a real-time active-speaker tracking task in a group conversation setting. Overall, the results show that the Nao robot successfully imitates human head movements in a natural manner while actively tracking the speakers during the conversation. Code and data from this study are available at https://github.com/dingdingding60/Humanoids2024HRI

head movement, robot, trajectory, (15 more...)

arXiv: 2407.11915

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Humanoid Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.54)

Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization

Berghi, Davide, Jackson, Philip J. B.

arXiv.org Artificial Intelligence | Dec-21-2023

Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail whenever the face of the speaker is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features extracted from multichannel audio can perform simultaneous horizontal active speaker detection and localization (ASDL), independently of the visual modality. To address the time and cost of generating ground truth labels to train such a system, we propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach. A conventional pre-trained active speaker detector is adopted as a "teacher" network to provide the position of the speakers as pseudo-labels. The multichannel audio "student" network is trained to generate the same results. At inference, the student network can generalize and also locate occluded speakers that the teacher network is not able to detect visually, yielding considerable improvements in recall rate. Experiments on the TragicTalkers dataset show that an audio network trained with the proposed self-supervised learning approach can exceed the performance of the typical audio-visual methods and produce results competitive with the costly conventional supervised training. We demonstrate that improvements can be achieved when minimal manual supervision is introduced in the learning pipeline. Further gains may be sought with larger training sets and by integrating vision with the multichannel audio system.

detection, supervision, teacher network, (16 more...)

doi: 10.1109/TASLP.2023.3346643

arXiv: 2312.14021

Country:

Europe > United Kingdom > England > Surrey > Guildford (0.04)
Europe > United Kingdom > England > Hampshire > Southampton (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

Gurvich, Ilya, Leichter, Ido, Palle, Dharmendar Reddy, Asher, Yossi, Vinnikov, Alon, Abramovski, Igor, Gopal, Vishak, Cutler, Ross, Krupka, Eyal

arXiv.org Artificial Intelligence | Sep-15-2023

We introduce a distinctive real-time, causal, neural network-based active speaker detection system optimized for low-power edge computing. This system drives a virtual cinematography module and is deployed on a commercial device. The system uses data originating from a microphone array and a 360-degree camera. Our network requires only 127 MFLOPs per participant, for a meeting with 14 participants. Unlike previous work, we examine the error rate of our network when the computational budget is exhausted, and find that it exhibits graceful degradation, allowing the system to operate reasonably well even in this case. Departing from conventional DOA estimation approaches, our network learns to query the available acoustic data, considering the detected head locations. We train and evaluate our algorithm on a realistic meetings dataset featuring up to 14 participants in the same meeting, overlapped speech, and other challenging scenarios.

dataset, detection, participant, (16 more...)

arXiv: 2309.08295

Country: Asia > Taiwan > Taiwan Province > Taipei (0.04)

Genre: Research Report (0.40)

Industry:

Media > Film (0.67)
Leisure & Entertainment (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection

Chen, Xuanjun, Wu, Haibin, Meng, Helen, Lee, Hung-yi, Jang, Jyh-Shing Roger

arXiv.org Artificial Intelligence | Oct-3-2022

Audio-visual active speaker detection (AVASD) is well-developed and is now an indispensable front-end for several multi-modal applications. However, to the best of our knowledge, the adversarial robustness of AVASD models has not been investigated, let alone effective defenses against such attacks. In this paper, we are the first to reveal the vulnerability of AVASD models under audio-only, visual-only, and audio-visual adversarial attacks through extensive experiments. We also propose a novel audio-visual interaction loss (AVIL) that makes it difficult for attackers to find feasible adversarial examples under an allocated attack budget. The loss aims to push the inter-class embeddings (the non-speech and speech clusters) apart so that they are sufficiently disentangled, and to pull the intra-class embeddings as close together as possible to keep them compact. Experimental results show that AVIL outperforms adversarial training by 33.14 mAP (%) under multi-modal attacks.
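
The push and pull terms mentioned in this abstract can be illustrated with a generic centroid-based loss. This is only a sketch of the general idea and is not the AVIL formulation from the paper; the function names, the hinge form of the push term, and the `margin` parameter are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def push_pull_loss(speech: Sequence[Vector],
                   non_speech: Sequence[Vector],
                   margin: float = 1.0) -> float:
    """Generic push-pull loss (hypothetical form, not the paper's AVIL).

    pull: mean squared distance of each embedding to its own class centroid.
    push: hinge penalty when the two class centroids are closer than `margin`.
    """
    c_speech = centroid(speech)
    c_non = centroid(non_speech)
    n = len(speech) + len(non_speech)
    pull = (sum(sq_dist(v, c_speech) for v in speech)
            + sum(sq_dist(v, c_non) for v in non_speech)) / n
    push = max(0.0, margin - sq_dist(c_speech, c_non))
    return pull + push
```

Compact, well-separated clusters drive both terms to zero, whereas overlapping clusters are penalized by the push term.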

adversarial attack, artificial intelligence, machine learning, (17 more...)

arXiv: 2210.00753

Country:

Asia > Taiwan (0.04)
Asia > China > Hong Kong (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.84)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)

Rethinking Audio-visual Synchronization for Active Speaker Detection

Wuerkaixi, Abudukelimu, Zhang, You, Duan, Zhiyao, Zhang, Changshui

arXiv.org Artificial Intelligence | Jul-10-2022

Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers, if any, are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of active speakers. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification is motivated by our extensive experiments, through which we discover that existing ASD methods fail to model audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
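
One common way to set up cross-modal contrastive learning is an InfoNCE-style loss in which the audio and visual embeddings of the same synchronized clip form the positive pair and all other pairings in the batch act as negatives. The abstract does not specify the loss, so the sketch below is an assumption rather than the authors' method; `contrastive_loss` and `temperature` are hypothetical names, and only the audio-to-visual direction is computed.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def contrastive_loss(audio: Sequence[Vector],
                     visual: Sequence[Vector],
                     temperature: float = 0.1) -> float:
    """InfoNCE-style loss: audio[i] should match visual[i], not visual[j]."""
    if not audio or len(audio) != len(visual):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in visual]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]  # cross-entropy with target index i
    return total / len(audio)
```

With this setup, a batch whose audio and visual embeddings are correctly paired yields a lower loss than the same batch with the visual stream shuffled, which is the behavior a synchronization cue relies on.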

artificial intelligence, machine learning, synchronization, (17 more...)

arXiv: 2206.10421

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York > Monroe County > Rochester (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Bio-Inspired Modality Fusion for Active Speaker Detection

Assunção, Gustavo, Gonçalves, Nuno, Menezes, Paulo

arXiv.org Machine Learning | Feb-28-2020

Human beings have developed fantastic abilities to integrate information from various sensory sources exploring their inherent complementarity. Perceptual capabilities are therefore heightened enabling, for instance, the well known "cocktail party" and McGurk effects, i.e. speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, Neuroscience has successfully identified the superior colliculus region in the brain as the one responsible for this modality fusion, with a handful of biological models having been proposed to approach its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with achieved results confirming and greatly surpassing initial expectations.

dataset, detection, speaker detection, (14 more...)

arXiv: 2003.00063

Country:

Europe > Portugal > Coimbra > Coimbra (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
North America > United States > New York (0.04)
(7 more...)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.66)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
(3 more...)

Multimodal active speaker detection and virtual cinematography for video conferencing

Cutler, Ross, Mehran, Ramin, Johnson, Sam, Zhang, Cha, Kirk, Adam, Whyte, Oliver, Kowdle, Adarsh

arXiv.org Machine Learning | Feb-12-2020

Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting, and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC that performs within 0.3 MOS of an expert cinematographer based on subjective ratings with a 1-5 scale. This system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real-time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and to reduce switching latency, the system has no moving parts: the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques on a dataset with N=100 meetings, each 2-5 minutes in length.

asd and vc, speaker detection, video, (10 more...)

arXiv: 2002.03977

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Washington > King County > Seattle (0.05)
North America > United States > Washington > King County > Redmond (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Communications > Collaboration (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Communications > Social Media > Crowdsourcing (0.35)

Self-Supervised Vision-Based Detection of the Active Speaker as a Prerequisite for Socially-Aware Language Acquisition

Stefanov, Kalin, Beskow, Jonas, Salvi, Giampiero

arXiv.org Machine Learning | Nov-24-2017

This paper presents a self-supervised method for detecting the active speaker in a multi-person spoken interaction scenario. We argue that this capability is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. Our methods are able to detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Our methods do not rely on external annotations, thus complying with cognitive development. Instead, they use information from the auditory modality to support learning in the visual domain. The methods have been extensively evaluated on a large multi-person face-to-face interaction dataset. The results reach an accuracy of 80% in a multi-speaker setting. We believe this system represents an essential component of any artificial cognitive system or robotic platform engaging in social interaction.

artificial intelligence, experiment, machine learning, (17 more...)

arXiv: 1711.08992

Country: Europe > Sweden (0.14)

Genre: Research Report > New Finding (0.94)

Industry: Education > Curriculum (0.47)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)

An Alternative to Low-level-Sychrony-Based Methods for Speech Detection

Movellan, Javier R., Ruvolo, Paul L.

Neural Information Processing Systems | Dec-31-2010

Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human-computer interaction. One popular approach to this problem is audio-visual synchrony detection. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further, we show that this person-independent visual-only detector can be used to train very accurate audio-based, person-dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database.
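
The synchrony-based baseline that this abstract contrasts itself with can be sketched as a correlation test between a per-frame audio signal and a per-frame visual signal around a candidate speaker. The example below is a minimal illustration of that baseline idea, not the detector proposed in the paper; the choice of Pearson correlation, the input signals (audio energy and mouth-region motion), and the 0.5 threshold are assumptions.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either signal is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long signals with >= 2 samples")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def is_talking(audio_energy: Sequence[float],
               mouth_motion: Sequence[float],
               threshold: float = 0.5) -> bool:
    """Flag a candidate as talking when the two signals correlate strongly.

    `threshold` is a hypothetical value chosen only for illustration.
    """
    return pearson(audio_energy, mouth_motion) >= threshold
```

A check of this kind depends on the audio channel, which is the dependency the visual-only detector described above avoids.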

artificial intelligence, detector, machine learning, (18 more...)

Country: North America > United States > California > San Diego County (0.14)

Genre: Instructional Material > Course Syllabus & Notes (0.46)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)
