AITopics | linguistic content

Collaborating Authors

linguistic content

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

Bian, Wesley, Lin, Xiaofeng, Cheng, Guang

arXiv.org Artificial IntelligenceNov-26-2025

Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.

artificial intelligence, machine learning, speech recognition, (13 more...)

arXiv.org Artificial Intelligence

2511.20534

Country: North America > United States > California > Los Angeles County > Los Angeles (0.29)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data

Wei-Ning Hsu, Yu Zhang, James Glass

Neural Information Processing SystemsNov-21-2025, 04:22:25 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, utterance, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Mississippi (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

Chou, Ju-Chieh, Zhou, Jiawei, Livescu, Karen

arXiv.org Artificial IntelligenceOct-23-2025

Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.

information, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2508.0935

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.51)

Add feedback

Content Anonymization for Privacy in Long-form Audio

Aggazzotti, Cristina, Garg, Ashi, Cai, Zexin, Andrews, Nicholas

arXiv.org Artificial IntelligenceOct-15-2025

Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which pose a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose new content anonymization approaches. Our approach performs a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. Then we show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.

machine learning, natural language, utterance, (17 more...)

arXiv.org Artificial Intelligence

2510.1278

Country: North America > United States (0.68)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody

Yoon, Jinsung, Jeong, Wooyeol, Gim, Jio, Suh, Young-Joo

arXiv.org Artificial IntelligenceAug-12-2025

--Emotional voice conversion (EVC) aims to modify the emotional style of speech while preserving its linguistic content. In practical EVC, controllability, the ability to independently control speaker identity and emotional style using distinct references, is crucial. However, existing methods often struggle to fully disentangle these attributes and lack the ability to model fine-grained emotional expressions such as temporal dynamics. We propose Maestro-EVC, a controllable EVC framework that enables independent control of content, speaker identity, and emotion by effectively disentangling each attribute from separate references. We further introduce a temporal emotion representation and an explicit prosody modeling with prosody augmentation to robustly capture and transfer the temporal dynamics of the target emotion, even under prosody-mismatched conditions. Experimental results confirm that Maestro-EVC achieves high-quality, controllable, and emotionally expressive speech synthesis.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2508.0689

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.95)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.49)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)
(2 more...)

Add feedback

Children's Voice Privacy: First Steps And Emerging Challenges

Kulkarni, Ajinkya, Teixeira, Francisco, Hermann, Enno, Rolland, Thomas, Trancoso, Isabel, Doss, Mathew Magimai

arXiv.org Artificial IntelligenceJun-5-2025

Children are one of the most under-represented groups in speech technologies, as well as one of the most vulnerable in terms of privacy. Despite this, anonymization techniques targeting this population have received little attention. In this study, we seek to bridge this gap, and establish a baseline for the use of voice anonymization techniques designed for adult speech when applied to children's voices. Such an evaluation is essential, as children's speech presents a distinct set of challenges when compared to that of adults. This study comprises three children's datasets, six anonymization methods, and objective and subjective utility metrics for evaluation. Our results show that existing systems for adults are still able to protect children's voice privacy, but suffer from much higher utility degradation. In addition, our subjective study displays the challenges of automatic evaluation methods for speech quality in children's speech, highlighting the need for further research.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2506.001

Country: Europe > Portugal (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Security & Privacy (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Add feedback

StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion

Li, Fengjin, Wang, Jie, Niu, Yadong, Wang, Yongqing, Meng, Meng, Luan, Jian, Wu, Zhiyong

arXiv.org Artificial IntelligenceJun-4-2025

Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech while neglecting the explicit utilization of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could enhance conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, motivating the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that first predicts text tokens before synthesizing acoustic features. The experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (i.e., WER and CER) and speaker characteristics (i.e., SECS and MOS). Audio demo can be found at: https://thuhcsi.github.io/StarVC/.

arxiv preprint arxiv, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.02414

Country: Asia > China (0.15)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis

Ahbabi, Hamdan Al, Marti, Gautier, AlMarri, Saeed, Elfadel, Ibrahim

arXiv.org Artificial IntelligenceFeb-26-2025

--Self-supervised learning models for speech processing, such as wav2vec2, HuBERT, WavLM, and Whisper, generate embeddings that capture both linguistic and paralinguistic information, making it challenging to analyze tone independently of spoken content. In this work, we introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings and using the residuals as a representation of vocal tone. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance compared to raw speech embeddings. Our results show that this method enhances linear separability, enabling improved classification even with simple models such as logistic regression. Visualization of the residual embeddings further confirms the successful removal of linguistic information while preserving tone-related features.

linguistic content, speech, tone classification, (15 more...)

arXiv.org Artificial Intelligence

2502.19387

Country: Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.16)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.92)

Add feedback

Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

Zhang, Xueyao, Zhang, Xiaohui, Peng, Kainan, Tang, Zhenyu, Manohar, Vimal, Liu, Yingru, Hwang, Jeff, Li, Dangna, Wang, Yuhao, Chan, Julian, Huang, Yuan, Wu, Zhizheng, Ma, Mingbo

arXiv.org Artificial IntelligenceFeb-10-2025

The imitation of voice, targeted on specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data, and struggle with effectively disentangling timbre and style, leading to challenges in achieving controllable generation, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zeroshot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or speech's content tokens as input, we utilize an autoregressive transformer to generate the content-style tokens, which is prompted by a style reference; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer to produce acoustic representations, which is prompted by a timbre reference. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE [1] as the tokenizer for the continuous hidden features of HuBERT [2]. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck, and adjust it carefully to obtain the disentangled speech representations. Solely self-supervised trained on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. The imitation of voice has long been an important issue in the field of speech generation. This includes the imitation of speaker identity [3, 4], the imitation of speaking style such as accent [5, 6] or emotion [7], and a broader concept of voice cloning such as in zero-shot text-to-speech (TTS) task [8]. These techniques have a wide range of applications, including spoken language learning [5, 6, 9], voice anonymization [10], voice assistants [11, 12], and video dubbing [11, 12, 13]. To achieve targeted and controllable imitation over various speech attributes, many studies focuses on factorizing speech into multiple sub-spaces [14, 15, 16, 17]. In this work, we follow this idea and decompose speech into three key attributes: linguistic content (what to speak), style (how to speak), and timbre (who speaks).

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2502.07243

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Language Modelling for Speaker Diarization in Telephonic Interviews

India, Miquel, Hernando, Javier, Fonollosa, José A. R.

arXiv.org Artificial IntelligenceJan-28-2025

The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain high discriminative speaker information, even more reliable than the acoustic ones. In this study we analyze how an appropriate fusion of both kind of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm where a LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated in a Call-Center database, which is composed of telephone interview audios. The combination of acoustic features and linguistic content shows a 84.29% improvement in terms of a word-level DER as compared to a HMM/VB baseline system. The results of this study confirms that linguistic content can be efficiently used for some speaker recognition tasks.

artificial intelligence, diarization, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2501.17893

Country:

Asia > India (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Spain (0.04)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback