AITopics

Country:

Asia > Taiwan (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
(2 more...)

Neural Information Processing Systems Feb-11-2026, 09:56:16 GMT

Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

Non-autoregressive Transformers (NATs) have recently been applied in direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data.

machine learning, natural language, translation, (21 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence Nov-17-2025

Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

Gao, Yan, Yang, Yazheng, Lan, Zhibin, Chen, Yidong, Zhang, Min, Wei, Daimeng, Huang, Hui, Su, Jinsong

Code-switching (CS) speech translation (ST) refers to translating speech that alternates between two or more languages into text in a target language, which poses significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies tend to rely on the model itself to implicitly learn semantic modeling during training, and resort to inefficient and costly manual annotation to address these two challenges. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture of Experts (MoE) speech projector, where each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features. Additionally, we introduce a multi-stage training paradigm that utilizes readily available monolingual automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation capabilities. During training, we leverage a combination of language-specific loss and intra-group load balancing loss to guide the MoE speech projector in efficiently allocating tokens to the appropriate experts, across expert groups and within each group, respectively. To bridge the data gap across different training stages and improve adaptation to the CS scenario, we further employ a transition loss, enabling smooth transitions of data between stages, to effectively address the scarcity of high-quality CS speech translation data. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach.
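The abstract does not give the form of its load balancing loss. As a rough illustration of the general idea only, the sketch below implements the auxiliary loss commonly used with top-1 MoE routing (the Switch Transformer form): the number of experts times the sum, over experts, of the fraction of tokens routed to each expert multiplied by its mean router probability. The function names and the top-1 routing choice are assumptions made for this example, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_logits: list of per-token lists, one logit per expert.
    This is an illustrative stand-in, not the loss from the paper.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # mean_prob[i]: average router probability assigned to expert i.
    mean_prob = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, mean_prob))


balanced = load_balancing_loss([[2.0, 0.0], [0.0, 2.0]])   # tokens split evenly
collapsed = load_balancing_loss([[2.0, 0.0], [2.0, 0.0]])  # all tokens on expert 0
```

With two experts, the evenly split case yields a loss of 1.0, and the collapsed case yields a larger value, so minimizing the term pushes the router toward even expert usage.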

artificial intelligence, machine translation, natural language, (13 more...)

2511.1067

Country: Asia > China (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Neural Information Processing Systems Oct-10-2025, 20:13:10 GMT

Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

Automatic speech recognition (ASR) systems convert speech signals into their transcription texts.

boundary, phoneme prediction model, segmentation model, (12 more...)

Country:

Asia > Taiwan (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Neural Information Processing Systems Oct-9-2025, 23:39:33 GMT

Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

Non-autoregressive Transformers (NATs) have recently been applied in direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data.

diffusion model, speech unit, translation, (16 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence Oct-3-2025

Enhancing Noise Robustness of Parkinson's Disease Telemonitoring via Contrastive Feature Augmentation

Tang, Ziming, Hou, Chengbin, Zhang, Tianyu, Tian, Bangxu, Wang, Jinbao, Lv, Hairong

Parkinson's disease (PD) is one of the most common neurodegenerative disorders. PD telemonitoring emerges as a novel assessment modality enabling self-administered at-home tests of Unified Parkinson's Disease Rating Scale (UPDRS) scores, enhancing accessibility for PD patients. However, three types of noise can occur during measurements: (1) patient-induced measurement inaccuracies, (2) environmental noise, and (3) data packet loss during transmission, resulting in higher prediction errors. To address these challenges, NoRo, a noise-robust UPDRS prediction framework, is proposed. First, the original speech features are grouped into ordered bins, based on the continuous values of a selected feature, to construct contrastive pairs. Second, the contrastive pairs are employed to train a multilayer perceptron encoder for generating noise-robust features. Finally, these features are concatenated with the original features as the augmented features, which are then fed into the UPDRS prediction models. Notably, we further introduce a novel evaluation approach with a customizable noise injection module, and extensive experiments show that NoRo can successfully enhance the noise robustness of UPDRS prediction across various downstream prediction models under different noisy environments.
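The binning and concatenation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the NoRo implementation: the equal-size rank-based bins, the rule that same-bin samples form positive pairs and different-bin samples form negative pairs, and all function names are hypothetical. The MLP encoder is omitted, and its outputs are passed in as plain lists.

```python
from itertools import combinations


def assign_bins(values, num_bins):
    """Rank samples by the selected feature and split them into
    num_bins ordered, near-equal-size bins (assumed binning rule)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = min(rank * num_bins // len(values), num_bins - 1)
    return bins


def contrastive_pairs(bins):
    """Return (positives, negatives) as lists of sample-index pairs.
    Assumption: same bin -> positive pair, different bins -> negative pair."""
    positives, negatives = [], []
    for i, j in combinations(range(len(bins)), 2):
        (positives if bins[i] == bins[j] else negatives).append((i, j))
    return positives, negatives


def augment(original, encoded):
    """Concatenate each sample's original features with its encoded features."""
    return [o + e for o, e in zip(original, encoded)]


selected_feature = [0.9, 0.1, 0.5, 0.3]
bins = assign_bins(selected_feature, num_bins=2)
positives, negatives = contrastive_pairs(bins)
augmented = augment([[1.0, 2.0]], [[0.5]])
```

In a full pipeline, the positive and negative pairs would drive a contrastive loss for the encoder, and `augment` would be applied to the encoder's outputs before the downstream UPDRS regressor.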

artificial intelligence, machine learning, prediction error, (14 more...)

2510.01588

Country:

Asia > China (0.28)
North America > United States (0.28)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Parkinson's Disease (1.00)
Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

arXiv.org Artificial Intelligence Sep-25-2025

KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

Lyu, Tianle, Zhao, Junchuan, Wang, Ye

Audio-driven facial animation has made significant progress in multimedia applications, with diffusion models showing strong potential for talking-face synthesis. However, most existing works treat speech features as a monolithic representation and fail to capture their fine-grained roles in driving different facial motions, while also overlooking the importance of modeling keyframes with intense dynamics. To address these limitations, we propose KSDiff, a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. Specifically, the raw audio and transcript are processed by a Dual-Path Speech Encoder (DPSE) to disentangle expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning (KEL) module predicts the most salient motion frames. These components are integrated into a Dual-path Motion generator to synthesize coherent and realistic facial motions. Extensive experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness. Our results highlight the effectiveness of combining speech disentanglement with keyframe-aware diffusion for talking-head generation.

artificial intelligence, machine learning, natural language, (13 more...)

2509.20128

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.90)
Information Technology > Artificial Intelligence > Vision > Face Recognition (0.87)

arXiv.org Artificial Intelligence Jul-17-2025

Exploring Gender Bias in Alzheimer's Disease Detection: Insights from Mandarin and Greek Speech Perception

He, Liu, Li, Yuanchao, Feng, Rui, Han, XinRan, Liu, Yin-Long, Yang, Yuwei, Zhu, Zude, Yuan, Jiahong

Gender bias has been widely observed in speech perception tasks, influenced by the fundamental voicing differences between genders. This study reveals a gender bias in the perception of Alzheimer's Disease (AD) speech. In a perception experiment involving 16 Chinese listeners evaluating both Chinese and Greek speech, we found that male speech was more frequently identified as AD, with this bias being particularly pronounced in Chinese speech. Acoustic analysis showed that shimmer values in male speech were significantly associated with AD perception, while speech portion exhibited a significant negative correlation with AD identification. Although language did not have a significant impact on AD perception, our findings underscore the critical role of gender bias in AD speech perception. This work highlights the necessity of addressing gender bias when developing AD detection models and calls for further research to validate model performance across different linguistic contexts.
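For readers unfamiliar with the shimmer measure mentioned in the abstract, the sketch below computes local shimmer in its standard form: the mean absolute difference between the peak amplitudes of consecutive glottal periods, divided by the mean amplitude. The abstract does not say which shimmer variant or extraction tool was used, so this is an assumption. The function also takes per-period amplitudes as given, which in practice would come from a pitch-period detector.

```python
def local_shimmer(amplitudes):
    """Local shimmer: mean absolute amplitude difference between
    consecutive periods, normalized by the mean amplitude.

    amplitudes: per-period peak amplitudes (assumed already extracted).
    """
    if len(amplitudes) < 2:
        raise ValueError("need at least two periods")
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_amp = sum(amplitudes) / len(amplitudes)
    return mean_diff / mean_amp


steady = local_shimmer([1.0, 1.0, 1.0])    # no cycle-to-cycle variation
unsteady = local_shimmer([1.0, 3.0])       # large cycle-to-cycle variation
```

A perfectly steady voice gives a shimmer of zero, and larger cycle-to-cycle amplitude variation gives larger values.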

artificial intelligence, machine learning, perception, (15 more...)

2507.12356

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Chowdhury, Tahiya, Romero, Veronica

Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling

arXiv.org Artificial Intelligence Jun-16-2025

Machine learning-based behavioral models rely on features extracted from audio-visual recordings. The recordings are processed using open-source tools to extract speech features for classification models. These tools often lack validation to ensure reliability in capturing behaviorally relevant information. This gap raises concerns about reproducibility and fairness across diverse populations and contexts. Speech processing tools, when used outside of their design context, can fail to capture behavioral variations equitably and can then contribute to bias. We evaluate speech features extracted from two widely used speech analysis tools, OpenSMILE and Praat, to assess their reliability when considering adolescents with autism. We observed considerable variation in features across tools, which influenced model performance across context and demographic groups. We encourage domain-relevant verification to enhance the reliability of machine learning models in clinical applications.

artificial intelligence, machine learning, praat, (15 more...)

2506.11072

Genre:

Research Report > New Finding (0.47)
Research Report > Experimental Study (0.47)

Industry: Health & Medicine > Therapeutic Area > Neurology > Autism (0.96)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.40)

Dhar, Sandipan, Jana, Nanda Dulal, Das, Swagatam

Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements

arXiv.org Artificial Intelligence Apr-29-2025

Voice conversion (VC) stands as a crucial research area in speech synthesis, enabling the transformation of a speaker's vocal characteristics to resemble another while preserving the linguistic content. This technology has broad applications, including automated movie dubbing, speech-to-singing conversion, and assistive devices for pathological speech rehabilitation. With the increasing demand for high-quality and natural-sounding synthetic voices, researchers have developed a wide range of VC techniques. Among these, generative adversarial network (GAN)-based approaches have drawn considerable attention for their powerful feature-mapping capabilities and potential to produce highly realistic speech. Despite notable advancements, challenges such as ensuring training stability, maintaining linguistic consistency, and achieving perceptual naturalness continue to hinder progress in GAN-based VC systems. This systematic review presents a comprehensive analysis of the voice conversion landscape, highlighting key techniques, key challenges, and the transformative impact of GANs in the field. The survey categorizes existing methods, examines technical obstacles, and critically evaluates recent developments in GAN-based VC. By consolidating and synthesizing research findings scattered across the literature, this review provides a structured understanding of the strengths and limitations of different approaches. The significance of this survey lies in its ability to guide future research by identifying existing gaps, proposing potential directions, and offering insights for building more robust and efficient VC systems. Overall, this work serves as an essential resource for researchers, developers, and practitioners aiming to advance the state-of-the-art (SOTA) in voice conversion technology.

artificial intelligence, deep learning, machine learning, (19 more...)