Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge
Li, Yupei, Shao, Shuaijie, Milling, Manuel, Schuller, Björn W.
Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been applied to depression recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection on the DAIC-WOZ dataset. We extract audio features with the pre-trained Wav2Vec model and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs, specifically using a question-and-answer set to grant authoritative knowledge to the LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) over the baseline reported in the original paper. The code is available at https://github.com/myxp-lyp/Depression-detection.git
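As a rough illustration of the audio-to-LLM mapping described above, the sketch below encodes a waveform with Wav2Vec and projects the hidden states into an LLM's embedding space. The checkpoint name, projection dimension, and fusion strategy are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch (assumptions: model name, projection size, fusion strategy).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioToLLMProjector(nn.Module):
    """Maps Wav2Vec hidden states into an LLM's token-embedding space."""
    def __init__(self, llm_hidden_size: int = 4096):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.proj = nn.Linear(self.encoder.config.hidden_size, llm_hidden_size)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        hidden = self.encoder(waveform).last_hidden_state  # (B, T, 768)
        return self.proj(hidden)                           # (B, T, llm_hidden_size)

# The projected sequence could then be prepended to the text prompt's
# embeddings before being fed to the (frozen or fine-tuned) LLM.
audio_tokens = AudioToLLMProjector()(torch.randn(1, 16000))
```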
Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study
Gonzalez-Machorro, Monica, Reichel, Uwe, Hecker, Pascal, Hammer, Helly, Sagha, Hesam, Eyben, Florian, Hoepner, Robert, Schuller, Björn W.
Depression commonly co-occurs with neurodegenerative disorders like Multiple Sclerosis (MS), yet the potential of speech-based Artificial Intelligence for detecting depression in such contexts remains unexplored. This study examines the transferability of speech-based depression detection methods to people with MS (pwMS) through cross-corpus and cross-lingual analysis using English data from the general population and German data from pwMS. Our approach implements supervised machine learning models using: 1) conventional speech and language features commonly used in the field, 2) emotional dimensions derived from a Speech Emotion Recognition (SER) model, and 3) exploratory speech feature analysis. Despite limited data, our models detect depressive mood in pwMS with moderate generalisability, achieving a 66% Unweighted Average Recall (UAR) on a binary task. Feature selection further improved performance, boosting UAR to 74%. Our findings also highlight the relevant role of emotional changes as an indicator of depressive mood, both in the general population and within pwMS. This study provides an initial exploration into generalising speech-based depression detection, even in the presence of co-occurring conditions such as neurodegenerative diseases.
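A minimal sketch of the cross-corpus evaluation setup follows: train on general-population data, test on pwMS data, and report UAR (macro-averaged recall). The feature dimensionality, selector, and classifier are illustrative assumptions, not the study's exact pipeline.

```python
# Illustrative cross-corpus evaluation with feature selection and UAR.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_en, y_en = rng.normal(size=(200, 88)), rng.integers(0, 2, 200)  # general population (stand-in)
X_de, y_de = rng.normal(size=(60, 88)), rng.integers(0, 2, 60)    # pwMS corpus (stand-in)

# Train on the English general-population corpus, test cross-corpus on pwMS.
clf = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=20), LogisticRegression())
clf.fit(X_en, y_en)

# Unweighted Average Recall = recall macro-averaged over both classes.
uar = recall_score(y_de, clf.predict(X_de), average="macro")
print(f"UAR: {uar:.2f}")
```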
Non-Verbal Vocalisations and their Challenges: Emotion, Privacy, Sparseness, and Real Life
Batliner, Anton, Amiriparian, Shahin, Schuller, Björn W.
Non-Verbal Vocalisations (NVVs) are short 'non-word' utterances without proper linguistic (semantic) meaning but conveying connotations -- be this emotions/affects or other paralinguistic information. We start this contribution with a historical sketch: how they were addressed in psychology and linguistics in the last two centuries, how they were neglected later on, and how they came to the fore with the advent of emotion research. We then give an overview of types of NVVs (formal aspects) and functions of NVVs, exemplified with the typical NVV 'ah'. Interesting as they are, NVVs come, however, with a number of challenges that should be accounted for: privacy and general ethical considerations prevent them from being recorded in real-life (private) scenarios to a sufficient extent. Isolated, prompted (acted) exemplars do not necessarily model NVVs in context; yet, this is the preferred strategy so far when modelling NVVs, especially in AI. To overcome these problems, we argue in favour of corpus-based approaches. This guarantees more realistic modelling; however, we are still faced with privacy and sparse-data problems.
Breaking Resource Barriers in Speech Emotion Recognition via Data Distillation
Chang, Yi, Ren, Zhao, Zhao, Zhonghao, Nguyen, Thanh Tam, Qian, Kun, Schultz, Tanja, Schuller, Björn W.
Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset.
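For intuition, here is a minimal gradient-matching sketch of dataset distillation: a small synthetic set is optimised so that the gradients it induces in a fixed-initialisation model match those of the full dataset. Dimensions and the objective are illustrative assumptions; the paper's actual framework and SER model may differ.

```python
# Gradient-matching sketch of dataset distillation (simplified: the model
# stays fixed here, whereas full methods also re-sample/update the network).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_real, n_syn, feat_dim, n_classes = 512, 16, 128, 4

X_real = torch.randn(n_real, feat_dim)            # stand-in for speech features
y_real = torch.randint(0, n_classes, (n_real,))

# Learnable synthetic dataset, far smaller than the original.
X_syn = torch.randn(n_syn, feat_dim, requires_grad=True)
y_syn = torch.arange(n_syn) % n_classes           # fixed, balanced labels

model = nn.Linear(feat_dim, n_classes)            # fixed-initialisation model
opt = torch.optim.Adam([X_syn], lr=0.1)

for step in range(200):
    # Gradients of the loss on real vs. synthetic data w.r.t. model weights.
    g_real = torch.autograd.grad(
        F.cross_entropy(model(X_real), y_real), model.parameters())
    g_syn = torch.autograd.grad(
        F.cross_entropy(model(X_syn), y_syn), model.parameters(), create_graph=True)
    # Match the two gradient fields; only the synthetic samples are updated.
    loss = sum(F.mse_loss(a, b.detach()) for a, b in zip(g_syn, g_real))
    opt.zero_grad()
    loss.backward()
    opt.step()
```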
MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge
Jing, Xin, Wang, Jiadong, Tsangko, Iosif, Triantafyllopoulos, Andreas, Schuller, Björn W.
Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalises on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Accordingly, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, the results of our subjective experiments demonstrate a consistent performance improvement on SER.
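A hedged sketch of text-only LLM annotation in this spirit is shown below, using the OpenAI Python SDK. The prompt wording and label set are assumptions for illustration, not the exact MELT protocol.

```python
# Text-only emotion annotation via GPT-4o (prompt and labels are illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are an emotion annotation expert.
Given the dialogue context and the target utterance, assign exactly one label
from: neutral, joy, sadness, anger, fear, surprise, disgust.
Answer with the label only.

Context: {context}
Utterance: "{utterance}"
Label:"""

def annotate(context: str, utterance: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic outputs aid annotation consistency
        messages=[{"role": "user",
                   "content": PROMPT.format(context=context, utterance=utterance)}],
    )
    return response.choices[0].message.content.strip()

print(annotate("Ross drops the wedding cake.", "Oh no, not again!"))
```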
MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge
Yang, Zijiang, Song, Meishu, Jing, Xin, Zhang, Haojie, Qian, Kun, Hu, Bin, Tamada, Kota, Takumi, Toru, Schuller, Björn W., Yamamoto, Yoshiharu
The Mice Autism Detection via Ultrasound Vocalization (MAD-UV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple CNN-based classification using three different spectrogram features. Results demonstrate the feasibility of automated ASD detection, with the considered audible-range features achieving the best performance (UAR of 0.600 for segment-level and 0.625 for subject-level classification). This challenge bridges speech technology and biomedical research, offering opportunities to advance our understanding of ASD models through machine learning approaches. The findings suggest promising directions for vocalization analysis and highlight the potential value of audible and ultrasound vocalizations in ASD detection.
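The following sketch mirrors a simple CNN-based classification over spectrogram segments with UAR as the metric; layer sizes and the input shape are illustrative assumptions, not the official challenge baseline.

```python
# Small CNN over spectrogram segments, scored with segment-level UAR.
import torch
import torch.nn as nn
from sklearn.metrics import recall_score

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes: int = 2):  # wild-type vs. ASD model
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, freq_bins, frames)
        return self.net(x)

model = SpectrogramCNN()
segments = torch.randn(8, 1, 128, 256)   # stand-in spectrogram segments
labels = torch.randint(0, 2, (8,))
preds = model(segments).argmax(dim=1)

# Segment-level UAR = recall averaged over classes, robust to class imbalance.
uar = recall_score(labels.numpy(), preds.numpy(), average="macro")
```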
Towards Friendly AI: A Comprehensive Review and New Perspectives on Human-AI Alignment
Sun, Qiyang, Li, Yupei, Alturki, Emran, Murthy, Sunil Munthumoduku Krishna, Schuller, Björn W.
As Artificial Intelligence (AI) continues to advance rapidly, Friendly AI (FAI) has been proposed to advocate for more equitable and fair development of AI. Despite its importance, there is a lack of comprehensive reviews examining FAI from an ethical perspective, as well as limited discussion on its potential applications and future directions. This paper addresses these gaps by providing a thorough review of FAI, focusing on theoretical perspectives both for and against its development, and presenting a formal definition in a clear and accessible format. Key applications are discussed from the perspectives of eXplainable AI (XAI), privacy, fairness and affective computing (AC). Additionally, the paper identifies challenges in current technological advancements and explores future research avenues. The findings emphasise the significance of developing FAI and advocate for its continued advancement to ensure ethical and beneficial AI development.
Classification of Spontaneous and Scripted Speech for Multilingual Audio
Elisha, Shahar, McDowell, Andrew, Beguerisse-Díaz, Mariano, Benetos, Emmanouil
Distinguishing scripted from spontaneous speech is an essential tool for better understanding how speech styles influence speech processing research. It can also improve recommendation systems and discovery experiences for media users through better segmentation of large recorded speech catalogues. This paper addresses the challenge of building a classifier that generalises well across different formats and languages. We systematically evaluate models ranging from those based on traditional, handcrafted acoustic and prosodic features to advanced audio transformers, utilising a large, multilingual proprietary podcast dataset for training and validation. We break down the performance of each model across 11 language groups to evaluate cross-lingual biases. Our experimental analysis extends to publicly available datasets to assess the models' generalisability to non-podcast domains. Our results indicate that transformer-based models consistently outperform traditional feature-based techniques, achieving state-of-the-art performance in distinguishing between scripted and spontaneous speech across various languages.
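As a hedged sketch of the handcrafted-feature end of this spectrum, the snippet below summarises pitch and energy statistics per recording and feeds them to a conventional classifier; the concrete feature set is an assumption, not the paper's.

```python
# Prosodic summary features for scripted-vs-spontaneous classification.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def prosodic_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    # Fixed-length vector of pitch and energy summary statistics.
    return np.array([
        np.nanmean(f0), np.nanstd(f0), voiced.mean(),  # pitch level/variability, voicing rate
        rms.mean(), rms.std(),                          # energy level/variability
    ])

# Usage (paths and labels supplied by the corpus at hand):
# X = np.stack([prosodic_features(p) for p in paths])
# RandomForestClassifier().fit(X, labels)  # 1 = scripted, 0 = spontaneous
```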
autrainer: A Modular and Extensible Deep Learning Toolkit for Computer Audition Tasks
Rampp, Simon, Triantafyllopoulos, Andreas, Milling, Manuel, Schuller, Björn W.
Reproducibility, code quality, and development speed constitute the 'impossible trinity' of contemporary experimental artificial intelligence (AI) research. Of the three, the first has attracted the most attention in recent literature [1], as reproducibility of findings is a cornerstone of science. However, the impact of the other two should not be underestimated. Development speed allows the quick iteration of ideas - a necessary prerequisite in experimental sciences and a prominent feature of AI research, as asserted by "The Bitter Lesson" of R. Sutton [2]. Similarly, code quality can be the key differentiating factor when it comes to "standing on the shoulders of giants", as shaky foundations can lead to a spectacular collapse. This is why toolkits that are easy to use and provide pre-baked reproducibility are critical for the proliferation and adaptation of new ideas. The not-so-recent renaissance of deep learning (DL) has been largely driven by the creation of such toolkits.
Audio-based Kinship Verification Using Age Domain Conversion
Sun, Qiyang, Akman, Alican, Jing, Xin, Milling, Manuel, Schuller, Björn W.
Audio-based kinship verification (AKV) is important in many domains, such as home security monitoring, forensic identification, and social network analysis. A key challenge in the task arises from differences in age across samples from different individuals, which can be interpreted as a domain bias in a cross-domain verification task. To address this issue, we introduce the notion of an "age-standardised domain", wherein we utilise the optimised CycleGAN-VC3 network to perform age-audio conversion and generate in-domain audio. The generated audio dataset is employed to extract a range of features, which are then fed into a metric learning architecture to verify kinship. Experiments are conducted on the KAN_AV audio dataset, which contains age and kinship labels. The results demonstrate that the method markedly enhances the accuracy of kinship verification, while also offering novel insights for future kinship verification research.
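A minimal sketch of the metric-learning verification stage is given below, using a contrastive loss over pairs of (age-standardised) utterance embeddings; the encoder, embedding size, and margin are illustrative assumptions, with the CycleGAN-VC3 age conversion happening upstream.

```python
# Contrastive metric learning over utterance-pair embeddings for kinship.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Projects utterance-level features into a kinship embedding space."""
    def __init__(self, feat_dim: int = 512, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def contrastive_loss(e1, e2, kin, margin: float = 0.5):
    # Pull kin pairs together; push non-kin pairs beyond the margin.
    d = (e1 - e2).norm(dim=-1)
    return (kin * d.pow(2) + (1 - kin) * (margin - d).clamp(min=0).pow(2)).mean()

net = EmbeddingNet()
a, b = torch.randn(4, 512), torch.randn(4, 512)  # age-standardised audio features
kin = torch.tensor([1., 0., 1., 0.])             # 1 = kin pair, 0 = non-kin
loss = contrastive_loss(net(a), net(b), kin)
```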