Neuroplasticity in Artificial Intelligence -- An Overview and Inspirations on Drop In & Out Learning
Li, Yupei, Milling, Manuel, Schuller, Björn W.
Artificial Intelligence (AI) has achieved new levels of performance and spread in public usage with the rise of deep neural networks (DNNs). Initially inspired by human neurons and their connections, NNs have become the foundation of AI models across many advanced architectures. However, some of the most integral processes in the human brain, particularly neurogenesis and neuroplasticity, in addition to the more widespread neuroapoptosis, have largely been ignored in DNN architecture design. Instead, contemporary AI development predominantly focuses on constructing advanced frameworks, such as large language models, which retain a static structure of neural connections during training and inference. In this light, we explore how neurogenesis, neuroapoptosis, and neuroplasticity can inspire future AI advances. Specifically, we examine analogous activities in artificial NNs, introducing the concept of "dropin" for neurogenesis and revisiting "dropout" and structural pruning for neuroapoptosis. We additionally suggest neuroplasticity, which combines the two, for future large NNs in "life-long learning" settings, following the biological inspiration. We conclude by advocating for greater research efforts in this interdisciplinary domain and identifying promising directions for future exploration.
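To make the "dropin" and structural-pruning ideas concrete, here is a minimal PyTorch sketch that grows and shrinks a linear layer during training. The function names, the small-scale initialisation of new neurons, and the norm-based pruning criterion are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

def grow_layer(layer: nn.Linear, n_new: int) -> nn.Linear:
    """Neurogenesis ("dropin"): return a copy of `layer` with `n_new` extra
    output neurons. Existing weights are preserved; new rows start at a
    small scale so the network's function changes only gradually. (A
    subsequent layer would need matching new input columns, omitted here.)"""
    grown = nn.Linear(layer.in_features, layer.out_features + n_new)
    with torch.no_grad():
        grown.weight[: layer.out_features] = layer.weight
        grown.bias[: layer.out_features] = layer.bias
        nn.init.normal_(grown.weight[layer.out_features:], std=1e-3)
        nn.init.zeros_(grown.bias[layer.out_features:])
    return grown

def prune_layer(layer: nn.Linear, n_remove: int) -> nn.Linear:
    """Neuroapoptosis (structural pruning): drop the output neurons whose
    weight rows have the smallest L2 norm."""
    keep = layer.weight.norm(dim=1).argsort(descending=True)[: layer.out_features - n_remove]
    pruned = nn.Linear(layer.in_features, len(keep))
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[keep])
        pruned.bias.copy_(layer.bias[keep])
    return pruned

layer = nn.Linear(16, 8)
layer = grow_layer(layer, n_new=4)      # neurogenesis: 8 -> 12 neurons
layer = prune_layer(layer, n_remove=2)  # neuroapoptosis: 12 -> 10 neurons
```

Alternating such growth and pruning phases over a model's lifetime is one plausible reading of the neuroplasticity-inspired "life-long learning" setting described above.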
GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations
Li, Yupei, Sun, Qiyang, Murthy, Sunil Munthumoduku Krishna, Alturki, Emran, Schuller, Björn W.
Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers the voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.
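As a rough sketch of the gating idea, the snippet below scales each utterance's fused speech-text embedding by a learned scalar gate before a recurrent pass over the conversation. A plain LSTM stands in for the xLSTM blocks, and the dimensions, fusion scheme, and module names are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedUtteranceEncoder(nn.Module):
    """Gate fused speech-text utterance embeddings, then model the dialogue.

    A per-utterance scalar gate in (0, 1) lets the model emphasise the
    sentences that drive emotional shifts; inspecting the gates offers a
    handle for interpretability."""

    def __init__(self, audio_dim=512, text_dim=512, hidden=256, n_classes=4):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + text_dim, hidden)
        self.gate = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)  # stand-in for xLSTM
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, audio_emb, text_emb):
        # audio_emb, text_emb: (batch, n_utterances, dim), e.g. CLAP outputs
        fused = torch.tanh(self.fuse(torch.cat([audio_emb, text_emb], dim=-1)))
        g = self.gate(fused)                   # (batch, n_utt, 1) importance
        out, _ = self.rnn(g * fused)           # gated utterances through time
        return self.head(out), g.squeeze(-1)   # per-utterance logits + gates

model = GatedUtteranceEncoder()
logits, gates = model(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
```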
Representation Learning with Parameterised Quantum Circuits for Advancing Speech Emotion Recognition
Rajapakshe, Thejan, Rana, Rajib, Riaz, Farina, Khalifa, Sara, Schuller, Björn W.
Speech Emotion Recognition (SER) is a complex and challenging task in human-computer interaction due to the intricate dependencies of features and the overlapping nature of emotional expressions conveyed through speech. Although traditional deep learning methods have shown effectiveness, they often struggle to capture subtle emotional variations and overlapping states. This paper introduces a hybrid classical-quantum framework that integrates Parameterised Quantum Circuits (PQCs) with conventional Convolutional Neural Network (CNN) architectures. By leveraging quantum properties such as superposition and entanglement, the proposed model enhances feature representation and captures complex dependencies more effectively than classical methods. Experimental evaluations conducted on benchmark datasets, including IEMOCAP, RECOLA, and MSP-Improv, demonstrate that the hybrid model achieves higher accuracy in both binary and multi-class emotion classification while significantly reducing the number of trainable parameters. While a few existing studies have explored the feasibility of using quantum circuits to reduce model complexity, none have successfully shown how they can enhance accuracy. This study is the first to demonstrate that quantum circuits have the potential to improve the accuracy of SER. The findings highlight the promise of Quantum Machine Learning (QML) to transform SER, suggesting a promising direction for future research and practical applications in emotion-aware systems.
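The sketch below shows one way such a hybrid model can be wired up, using PennyLane's TorchLayer to embed a parameterised quantum circuit between a small CNN front end and a linear classifier. The qubit count, circuit depth, and mapping of CNN features to rotation angles are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import pennylane as qml

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def circuit(inputs, weights):
    # Encode classical features as rotation angles, then apply trainable
    # entangling layers that exploit superposition and entanglement.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

quantum_layer = qml.qnn.TorchLayer(circuit, {"weights": (n_layers, n_qubits, 3)})

# Hybrid model: a CNN compresses a spectrogram to n_qubits features,
# the PQC transforms them, and a linear head predicts the emotion class.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, n_qubits), nn.Tanh(),  # bound the rotation angles
    quantum_layer,
    nn.Linear(n_qubits, 4),             # e.g. four emotion classes
)
logits = model(torch.randn(2, 1, 64, 64))
```

Note that the PQC contributes only n_layers * n_qubits * 3 = 24 trainable parameters here, which is where the reduction in model size comes from.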
DFingerNet: Noise-Adaptive Speech Enhancement for Hearing Aids
Tsangko, Iosif, Triantafyllopoulos, Andreas, Müller, Michael, Schröter, Hendrik, Schuller, Björn W.
The DeepFilterNet (DFN) architecture was recently proposed as a deep learning model suited for hearing aid devices. Despite its competitive performance on numerous benchmarks, it still follows a 'one-size-fits-all' approach, which aims to train a single, monolithic architecture that generalises across different noises and environments. However, its limited size and computation budget can hamper its generalisability. To mitigate this, recent work has shown that in-context adaptation can improve performance by conditioning the denoising process on additional information extracted from background recordings. These recordings can be processed outside the hearing aid, thus improving performance while adding minimal computational overhead. We introduce these principles to the DFN model, proposing the DFingerNet (DFiN) model, which shows superior performance on various benchmarks inspired by the DNS Challenge.
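One common way to realise such conditioning is FiLM-style feature modulation, sketched below: an embedding of a background recording, which can be computed off-device, scales and shifts the denoiser's channels. The module and dimensions are hypothetical and are not DFiN's actual design.

```python
import torch
import torch.nn as nn

class ConditionedDenoiserBlock(nn.Module):
    """One denoiser block whose channels are modulated (FiLM-style) by an
    embedding of a background noise recording."""

    def __init__(self, channels=32, cond_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.to_scale = nn.Linear(cond_dim, channels)
        self.to_shift = nn.Linear(cond_dim, channels)

    def forward(self, x, noise_emb):
        # x: (batch, channels, time); noise_emb: (batch, cond_dim),
        # e.g. produced by an offloaded encoder of background audio
        h = torch.relu(self.conv(x))
        scale = self.to_scale(noise_emb).unsqueeze(-1)  # (batch, ch, 1)
        shift = self.to_shift(noise_emb).unsqueeze(-1)
        return h * (1 + scale) + shift

block = ConditionedDenoiserBlock()
out = block(torch.randn(2, 32, 100), torch.randn(2, 64))
```

The hearing aid itself only pays for two small linear layers per block; the noise encoder can run on a paired device.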
Gender Bias in Text-to-Video Generation Models: A case study of Sora
Nadeem, Mohammad, Sohail, Shahab Saquib, Cambria, Erik, Schuller, Björn W., Hussain, Amir
The advent of AI-generated content (AIGC) has spurred extensive scholarly research and revolutionized industries such as content generation [3,4] and medical imaging [5,6]. Significant milestones, such as OpenAI's release of ChatGPT in late 2022, have propelled the field toward the ambitious goal of Artificial General Intelligence (AGI). Among major Generative AI tools, text-to-video (T2V) generation models have gained immense popularity due to their ability to create visually compelling and contextually accurate videos from textual descriptions [7]. Leveraging breakthroughs in Generative AI, T2V models like OpenAI's Sora [8] have showcased unprecedented capabilities in blending textual input with dynamic video output, transforming visual storytelling, advertising, and content creation. However, Generative AI models often inherit and amplify social biases and stereotypes embedded in their training data [9,10]. The training data, sourced from diverse and extensive internet repositories, frequently reflects cultural prejudices, societal inequities, and skewed portrayals of different demographics [15].
MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge
Yang, Zijiang, Song, Meishu, Jing, Xin, Zhang, Haojie, Qian, Kun, Hu, Bin, Tamada, Kota, Takumi, Toru, Schuller, Björn W., Yamamoto, Yoshiharu
The Mice Autism Detection via Ultrasound Vocalization (MAD-UV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple CNN-based classifier using three different spectrogram features. Results demonstrate the feasibility of automated ASD detection, with the considered audible-range features achieving the best performance (unweighted average recall, UAR, of 0.600 for segment-level and 0.625 for subject-level classification). This challenge bridges speech technology and biomedical research, offering opportunities to advance our understanding of ASD models through machine learning approaches. The findings suggest promising directions for vocalization analysis and highlight the potential value of audible and ultrasound vocalizations in ASD detection.
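For reference, the UAR metric is recall averaged over classes with equal weight (macro-averaged recall), which makes it robust to the class imbalance typical of such data; the toy labels below are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 0, 1, 1])  # 0 = wild-type, 1 = ASD model
y_pred = np.array([0, 0, 1, 0, 1, 0])
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}")  # (3/4 + 1/2) / 2 = 0.625
```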
Towards Friendly AI: A Comprehensive Review and New Perspectives on Human-AI Alignment
Sun, Qiyang, Li, Yupei, Alturki, Emran, Murthy, Sunil Munthumoduku Krishna, Schuller, Björn W.
As Artificial Intelligence (AI) continues to advance rapidly, Friendly AI (FAI) has been proposed to advocate for more equitable and fair development of AI. Despite its importance, there is a lack of comprehensive reviews examining FAI from an ethical perspective, as well as limited discussion on its potential applications and future directions. This paper addresses these gaps by providing a thorough review of FAI, focusing on theoretical perspectives both for and against its development, and presenting a formal definition in a clear and accessible format. Key applications are discussed from the perspectives of eXplainable AI (XAI), privacy, fairness and affective computing (AC). Additionally, the paper identifies challenges in current technological advancements and explores future research avenues. The findings emphasise the significance of developing FAI and advocate for its continued advancement to ensure ethical and beneficial AI development.
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
He, Xiangheng, Chen, Junjie, Zhang, Zixing, Schuller, Björn W.
Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation: they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder, which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM effectively improves the phrasing and intonation aspects of prosody, thereby enhancing overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement further affords ProsodyFM superior generalizability to unseen complex sentences and speakers. Our case study intuitively illustrates ProsodyFM's powerful and fine-grained controllability over phrasing and intonation.
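The intonation token bank can be pictured as a style-token-like attention layer: a pitch-derived query attends over a small set of learned shape tokens, and the resulting soft mixture conditions the decoder. The sketch below is a generic rendering under assumed dimensions, not ProsodyFM's exact design.

```python
import torch
import torch.nn as nn

class IntonationTokenBank(nn.Module):
    """Attend over a learned bank of intonation 'shape' tokens; the soft
    mixture serves as a conditioning vector for the TTS decoder."""

    def __init__(self, n_tokens=10, dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, pitch_emb):
        # pitch_emb: (batch, dim), e.g. derived from a pitch processor
        q = self.query_proj(pitch_emb)
        attn = torch.softmax(q @ self.tokens.T / self.tokens.shape[1] ** 0.5, dim=-1)
        return attn @ self.tokens  # (batch, dim) intonation conditioning

bank = IntonationTokenBank()
intonation_cond = bank(torch.randn(4, 128))
```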
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features
Li, Yupei, Milling, Manuel, Specia, Lucia, Schuller, Björn W.
The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and for texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset PlagBench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To address the challenge of detecting highly similar paraphrased texts, we propose MhBART, an encoder-decoder model designed to emulate human writing style while incorporating a novel difference score mechanism. This model outperforms strong classifier baselines and identifies deceptive sentence patterns. To better capture the structure of longer texts at the document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It yields substantial performance gains across datasets: a 15.5% absolute improvement on paraLFQA, a 4% absolute improvement on paraWP, and a 1.5% absolute improvement on M4 compared to SOTA approaches.
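To illustrate what a difference score mechanism might look like, the sketch below paraphrases an input with an off-the-shelf BART checkpoint and measures how far the rewrite drifts from the original in encoder-embedding space. The checkpoint, mean pooling, and cosine-based score are assumptions for illustration, not MhBART's actual mechanism.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def difference_score(text: str) -> float:
    """Intuition: a model tuned to write like a human rewrites
    machine-generated text more heavily than genuinely human text."""
    ids = tok(text, return_tensors="pt", truncation=True)
    rewrite_ids = bart.generate(**ids, max_new_tokens=64)
    rewrite = tok.decode(rewrite_ids[0], skip_special_tokens=True)
    ids2 = tok(rewrite, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Mean-pooled encoder states as crude document embeddings.
        orig = bart.model.encoder(**ids).last_hidden_state.mean(dim=1)
        rew = bart.model.encoder(**ids2).last_hidden_state.mean(dim=1)
    return 1.0 - torch.cosine_similarity(orig, rew).item()

score = difference_score("An example document whose provenance we want to test.")
```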
autrainer: A Modular and Extensible Deep Learning Toolkit for Computer Audition Tasks
Rampp, Simon, Triantafyllopoulos, Andreas, Milling, Manuel, Schuller, Björn W.
Reproducibility, code quality, and development speed constitute the 'impossible trinity' of contemporary experimental artificial intelligence (AI) research. Of the three, the first has attracted the most attention in recent literature [1], as reproducibility of findings is a cornerstone of science. However, the impact of the other two should not be underestimated. Development speed allows the quick iteration of ideas - a necessary prerequisite in experimental sciences and a prominent feature of AI research, as asserted by "The Bitter Lesson" of R. Sutton [2]. Similarly, code quality can be the key differentiating factor when it comes to "standing on the shoulders of giants", as shaky foundations can lead to a spectacular collapse. This is why toolkits that are easy to use and provide pre-baked reproducibility are critical for the proliferation and adoption of new ideas. The not-so-recent renaissance of deep learning (DL) has been largely driven by the creation of such toolkits.