AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.55)

Neural Information Processing Systems, Feb-8-2026, 16:55:05 GMT

4730d10b22261faa9a95ebf7497bc556-Paper-Conference.pdf

arxiv preprint arxiv, generspeech, representation, (13 more...)

Country:

Asia > China (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.77)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.65)

Corrêa, Pedro, Lima, João, Moreno, Victor, Ueda, Lucas, Costa, Paula Dornhofer Paro

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

arXiv.org Artificial Intelligence, Oct-31-2025

Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, suggesting that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.

artificial intelligence, emotion, natural language, (17 more...)

2510.25054

Country: South America > Brazil (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.64)

Sanchez, Ariadna, King, Simon

Can we reconstruct a dysarthric voice with the large speech model Parler TTS?

arXiv.org Artificial Intelligence, Sep-26-2025

Speech disorders can make communication hard or even impossible for those who develop them. Personalised Text-to-Speech is an attractive option as a communication aid. We attempt voice reconstruction using a large speech model, with which we generate an approximation of a dysarthric speaker's voice prior to the onset of their condition. In particular, we investigate whether a state-of-the-art large speech model, Parler TTS, can generate intelligible speech while maintaining speaker identity. We curate a dataset and annotate it with relevant speaker and intelligibility information, and use this to fine-tune the model. Our results show that the model can indeed learn to generate from the distribution of this challenging data, but struggles to control intelligibility and to maintain consistent speaker identity. We propose future directions to improve the controllability of this class of model for the voice reconstruction task.

artificial intelligence, machine learning, natural language, (18 more...)

doi: 10.21437/Interspeech.2025-2679

2506.04397

Genre: Research Report > New Finding (0.54)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.95)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)

arXiv.org Artificial Intelligence, Sep-25-2025

Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data

Wang, Qiongqiong, Sailor, Hardik Bhupendra, Liu, Tianchi, Zhang, Wenyu, Huzaifah, Muhammad, Lertcheva, Nattadaporn, Sun, Shuo, Chen, Nancy F., Wu, Jinyang, Aw, AiTi

Recent speech-LLMs have shown impressive performance in tasks like transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning: the integration of verbal content with non-verbal cues like emotion and prosody. The benchmark includes two curated question answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art speech-LLMs, both open-source and closed-source, and perform a comprehensive analysis across different question types. The top two models were further analyzed under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.

large language model, machine learning, natural language, (18 more...)

2509.16589

Country: Asia (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.48)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Matiyali, Neeraj, Srivastava, Siddharth, Sharma, Gaurav

RephraseTTS: Dynamic Length Text based Speech Insertion with Speaker Style Transfer

arXiv.org Artificial Intelligence, Aug-26-2025

We propose a method for the task of text-conditioned speech insertion, i.e. inserting a speech sample into an input speech sample, conditioned on the corresponding complete text transcript. An example use case of the task would be to update the speech audio when corrections are made to the corresponding text transcript. The proposed method follows a transformer-based non-autoregressive approach that allows speech insertions of variable lengths, which are dynamically determined during inference, based on the text transcript and tempo of the available partial input. It is capable of maintaining the speaker's voice characteristics, prosody and other spectral properties of the available speech input. Results from our experiments and user study on LibriTTS show that our method outperforms baselines based on an existing adaptive text-to-speech method. We also provide numerous qualitative results that illustrate the quality of the output from the proposed method.

artificial intelligence, machine learning, representation, (16 more...)

2508.17031

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Neural Information Processing Systems, Aug-15-2025, 16:33:51 GMT

87682805257e619d49b8e0dfdc14affa-Paper.pdf

information, representation, voice conversion, (14 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
(2 more...)

Genre: Research Report (0.68)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)

Neural Information Processing Systems, Aug-14-2025, 14:18:25 GMT

4730d10b22261faa9a95ebf7497bc556-Supplemental-Conference.pdf

generspeech, mean opinion score, visualization, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

Neural Information Processing Systems, Aug-14-2025, 14:18:20 GMT

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.

arxiv preprint arxiv, generspeech, representation, (13 more...)

Country:

Asia > China (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.77)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.65)

arXiv.org Artificial Intelligence, Jul-24-2025

Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction

Ali, Mai, Lucasius, Christopher, Patel, Tanmay P., Aitken, Madison, Vorstman, Jacob, Szatmari, Peter, Battaglia, Marco, Kundur, Deepa

Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose the treatment of patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions' progression. Our proposed approach, featuring trimodal, longitudinal MTL, is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, which is higher than each of the unimodal, single-task, and non-longitudinal methods.

large language model, machine learning, natural language, (18 more...)

2505.23822

Country: North America > Canada > Ontario > Toronto (0.16)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)