AITopics | target speech

target speech

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Fang, Qingkai, Zhou, Yan, Feng, Yang

Neural Information Processing Systems | Feb-17-2026, 16:32:38 GMT

In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.

machine learning, natural language, translation, (19 more...)

Neural Information Processing Systems

Country:

Europe > Austria > Vienna (0.14)
Asia > South Korea > Incheon > Incheon (0.04)
North America > Canada > British Columbia > Vancouver (0.04)
(12 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Neural Information Processing Systems | Dec-27-2025, 01:25:26 GMT

However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG).
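As a rough illustration of what working with a DAG over output positions can involve, the toy sketch below finds the most probable path through a small DAG with dynamic programming. It is not the DASpeech or DA-Transformer implementation: the function name `best_dag_path`, the transition-matrix layout, and the idea of reading off a single best path are assumptions made here for illustration only.

```python
import math


def best_dag_path(transitions, start=0, end=None):
    """Return (log_prob, path) for the most probable start-to-end path.

    transitions[i][j] is the (hypothetical) probability of moving from
    vertex i to vertex j. Only entries with j > i are read, so the graph
    is acyclic by construction.
    """
    n = len(transitions)
    if end is None:
        end = n - 1
    best = [-math.inf] * n  # best log-probability of reaching each vertex
    prev = [None] * n       # back-pointers used to recover the path
    best[start] = 0.0
    for j in range(start + 1, n):
        for i in range(start, j):
            p = transitions[i][j]
            if p > 0 and best[i] > -math.inf:
                score = best[i] + math.log(p)
                if score > best[j]:
                    best[j] = score
                    prev[j] = i
    # Follow the back-pointers from the end vertex to the start vertex.
    path = []
    v = end
    while v is not None:
        path.append(v)
        v = prev[v]
    path.reverse()
    return best[end], path
```

Calling `best_dag_path` on an upper-triangular matrix of transition probabilities returns the log-probability of the best path and the list of vertices it visits. If the end vertex cannot be reached, the returned log-probability is `-inf`.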

daspeech, directed acyclic transformer, fast and high-quality speech-to-speech translation, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.62)

UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement

Yan, Haoyin, Liu, Chengwei, Xue, Shaofei, Liang, Xiaotao, Xue, Zheng

arXiv.org Artificial Intelligence | Oct-24-2025

The development of neural audio codecs (NACs) has greatly promoted applications of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) has not yet been verified. In this work, we propose UniSE, a unified decoder-only LM-based framework to handle different SE tasks including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling, which facilitates compatibility between the distinct learning patterns of multiple tasks. Experiments on several benchmarks indicate that the proposed UniSE can achieve competitive performance compared to discriminative and generative baselines, showing the capacity of LMs in unifying SE tasks. The demo page is available here: https://github.com/hyyan2k/UniSE.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.20441

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

TFGA-Net: Temporal-Frequency Graph Attention Network for Brain-Controlled Speaker Extraction

Si, Youhao, Liao, Yuan, Han, Qiushi, Yang, Yuhang, Dai, Rui, Huang, Liya

arXiv.org Artificial Intelligence | Oct-15-2025

The rapid development of auditory attention decoding (AAD) based on electroencephalography (EEG) signals offers the possibility of EEG-driven target speaker extraction. However, how to effectively utilize the target-speaker information shared between EEG and speech remains an unresolved problem. In this paper, we propose a model for brain-controlled speaker extraction, which utilizes the EEG recorded from the listener to extract the target speech. In order to effectively extract information from EEG signals, we derive multi-scale time-frequency features and further incorporate cortical topological structures that are selectively engaged during the task. Moreover, to effectively exploit the non-Euclidean structure of EEG signals and capture their global features, graph convolutional networks and a self-attention mechanism are used in the EEG encoder. In addition, to make full use of the fused EEG and speech features, preserve global context, and capture speech rhythm and prosody, we introduce MossFormer2, which combines MossFormer and RNN-Free Recurrent, as the separator. Experimental results on both the public Cocktail Party and KUL datasets show that our TFGA-Net model significantly outperforms the state-of-the-art method in certain objective evaluation metrics. The source code is available at: https://github.com/LaoDa-X/TFGA-NET.

artificial intelligence, machine learning, speech, (16 more...)

arXiv.org Artificial Intelligence

2510.12275

Country: Asia > China (0.14)

Genre: Research Report > New Finding (0.94)

Industry: Health & Medicine > Therapeutic Area (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

a32539cb16274581a17e679f6046f4bf-Paper-Conference.pdf

Neural Information Processing Systems | Oct-10-2025, 12:00:12 GMT

dataset, information, speech, (15 more...)

Neural Information Processing Systems

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models

Tang, Beilong, Zeng, Bang, Li, Ming

arXiv.org Artificial Intelligence | Aug-19-2025

We propose LauraTSE, an Auto-Regressive Decoder-Only Language Model for Target Speaker Extraction built upon the LauraGPT backbone. LauraTSE employs a small-scale auto-regressive decoder-only language model that generates the initial layers of the target speech's discrete codec representations from the continuous embeddings of both the mixture and reference speech. These outputs serve as coarse-grained predictions. To refine them, a one-step encoder-only language model reconstructs the full codec representation by integrating information from both the mixture and the reference speech, adding fine-grained details. Experimental results show that our approach can achieve promising performance. Additionally, we conduct ablation studies to investigate the data scalability and the contribution of the encoder-only model.

Target Speaker Extraction (TSE) aims at extracting the target speaker's speech from a mixture using auxiliary information about the target speaker, such as reference speech, spatial information, or visual information [1]. Current dominant approaches utilize discriminative models, which try to directly map the mixture speech to the target clean speech [2]-[5]. However, this approach might struggle with unseen data and sometimes even introduces undesirable distortions [6].

artificial intelligence, natural language, speech, (16 more...)

arXiv.org Artificial Intelligence

2504.07402

Country: Asia (0.14)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Language translation, and change of accent for speech-to-speech task using diffusion model

Mishra, Abhishek, Chowdhury, Ritesh Sur, Bahuguna, Vartul, Pandey, Isha, Ramakrishnan, Ganesh

arXiv.org Artificial Intelligence | May-9-2025

Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another, typically focusing on either language translation or accent adaptation. However, effective cross-cultural communication requires handling both aspects simultaneously -- translating content while adapting the speaker's accent to match the target language context. In this work, we propose a unified approach for simultaneous speech translation and change of accent, a task that remains underexplored in current literature. Our method reformulates the problem as a conditional generation task, where target speech is generated based on phonemes and guided by target speech features. Leveraging the power of diffusion models, known for high-fidelity generative capabilities, we adapt text-to-image diffusion strategies by conditioning on source speech transcriptions and generating Mel spectrograms representing the target speech with desired linguistic and accentual attributes. This integrated framework enables joint optimization of translation and accent adaptation, offering a more parameter-efficient and effective model compared to traditional pipelines.

machine learning, natural language, translation, (17 more...)

arXiv.org Artificial Intelligence

2505.04639

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction

Wu, Wenxuan, Chen, Xueyuan, Wang, Shuai, Wang, Jiadong, Meng, Lingwei, Wu, Xixin, Meng, Helen, Li, Haizhou

arXiv.org Artificial Intelligence | Apr-1-2025

Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.
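The abstract does not say what MAR masks or how the masked regions are chosen. Purely to illustrate the masking half of a generic mask-and-recover setup, the sketch below blanks out random contiguous spans of a frame sequence and records which positions were hidden, so that a recovery model could later be trained on them. The function name `mask_spans` and its parameters (`span_len`, `num_spans`, `mask_value`, `seed`) are hypothetical and are not taken from the paper.

```python
import random


def mask_spans(frames, span_len, num_spans, mask_value=0.0, seed=None):
    """Return (masked_frames, is_masked) with random contiguous spans hidden.

    frames is a list of per-frame values; the input list is not modified.
    Spans may overlap, so fewer than span_len * num_spans frames may end
    up masked.
    """
    rng = random.Random(seed)
    masked = list(frames)
    is_masked = [False] * len(frames)
    if span_len <= 0 or span_len > len(frames):
        return masked, is_masked  # nothing sensible to mask
    for _ in range(num_spans):
        start = rng.randrange(0, len(frames) - span_len + 1)
        for t in range(start, start + span_len):
            masked[t] = mask_value
            is_masked[t] = True
    return masked, is_masked
```

In this toy setup, a recovery model would receive `masked` as input and be scored only on the positions where `is_masked` is true. How MAR itself combines audio and visual context is not covered by this sketch.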

fc model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.0075

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction

Kim, Minsu, Mira, Rodrigo, Chen, Honglie, Petridis, Stavros, Pantic, Maja

arXiv.org Artificial Intelligence | Mar-11-2025

In this paper, we investigate a novel approach for Target Speech Extraction (TSE), which relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE methods that rely on pre-recorded enrollment utterances, video of the target speaker's face, spatial information, or other explicit cues to identify the target stream, our proposed method requires only a few turns of previous dialogue (or monologue) history. This approach is naturally feasible in mobile messaging environments where voice recordings are typically preceded by textual dialogue that can be leveraged implicitly. We present three CSE models and analyze their performance on three datasets. Through our experiments, we demonstrate that even when the model relies purely on dialogue history, it can achieve over 90% accuracy in identifying the correct target stream with only two previous dialogue turns. Furthermore, we show that by leveraging both textual context and enrollment utterances as cues during training, we further enhance our model's flexibility and effectiveness, allowing us to use either cue during inference, or combine both for improved performance. Samples and code are available at https://miraodasilva.github.io/cse-project-page

speech, speech extraction, speech separation, (11 more...)

arXiv.org Artificial Intelligence

2503.08798

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > India > Telangana > Hyderabad (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.70)
