AITopics | Wang, Xiaofei

Collaborating Authors

Wang, Xiaofei

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Lin, Yan-Bo, Lin, Kevin, Yang, Zhengyuan, Li, Linjie, Wang, Jianfeng, Lin, Chung-Ching, Wang, Xiaofei, Bertasius, Gedas, Wang, Lijuan

arXiv.org Artificial IntelligenceMar-26-2025

In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2503.20782

Genre: Research Report (0.82)

Industry:

Media (0.47)
Leisure & Entertainment (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)

Add feedback

Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation

Wang, Peidong, Kanda, Naoyuki, Xue, Jian, Li, Jinyu, Wang, Xiaofei, Subramanian, Aswin Shanmugam, Chen, Junkun, Sivasankaran, Sunit, Xiao, Xiong, Zhao, Yong

arXiv.org Artificial IntelligenceFeb-4-2025

Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change information can be used to create audio prompts for a zero-shot text-to-speech system, and gender can help to select speaker profiles in a conventional text-to-speech model. We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model. Our experiments demonstrate that the proposed methods can achieve high accuracy for both speaker change detection and gender classification.

machine learning, natural language, translation, (15 more...)

arXiv.org Artificial Intelligence

2502.02683

Country: North America > United States (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Summary of the NOTSOFAR-1 Challenge: Highlights and Learnings

Abramovski, Igor, Vinnikov, Alon, Shaer, Shalev, Kanda, Naoyuki, Wang, Xiaofei, Ivry, Amir, Krupka, Eyal

arXiv.org Artificial IntelligenceJan-28-2025

The first Natural Office Talkers in Settings of Far-field Audio Recordings (NOTSOFAR-1) Challenge is a pivotal initiative that sets new benchmarks by offering datasets more representative of the needs of real-world business applications than those previously available. The challenge provides a unique combination of 280 recorded meetings across 30 diverse environments, capturing real-world acoustic conditions and conversational dynamics, and a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. In this paper, we provide an overview of the systems submitted to the challenge and analyze the top-performing approaches, hypothesizing the factors behind their success. Additionally, we highlight promising directions left unexplored by participants. By presenting key findings and actionable insights, this work aims to drive further innovation and progress in DASR research and applications.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2501.17304

Country: Asia > India (0.14)

Genre:

Research Report (1.00)
Overview (0.86)

Industry:

Information Technology (0.35)
Media (0.34)
Leisure & Entertainment (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.96)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)

Add feedback

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

Zhang, YiFan, Lei, Shanglin, Qiao, Runqi, GongQue, Zhuoma, Song, Xiaoshuai, Dong, Guanting, Tan, Qiuna, Wei, Zhe, Yang, Peiqing, Tian, Ye, Xue, Yadong, Wang, Xiaofei, Zhang, Honggang

arXiv.org Artificial IntelligenceDec-17-2024

The rapidly developing field of large multimodal models (LMMs) has led to the emergence of diverse models with remarkable capabilities. However, existing benchmarks fail to comprehensively, objectively and accurately evaluate whether LMMs align with the diverse needs of humans in real-world scenarios. To bridge this gap, we propose the Multi-Dimensional Insights (MDI) benchmark, which includes over 500 images covering six common scenarios of human life. Notably, the MDI-Benchmark offers two significant advantages over existing evaluations: (1) Each image is accompanied by two types of questions: simple questions to assess the model's understanding of the image, and complex questions to evaluate the model's ability to analyze and reason beyond basic content. (2) Recognizing that people of different age groups have varying needs and perspectives when faced with the same scenario, our benchmark stratifies questions into three age categories: young people, middle-aged people, and older people. This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups. With MDI-Benchmark, the strong model like GPT-4o achieve 79% accuracy on age-related tasks, indicating that existing LMMs still have considerable room for improvement in addressing real-world applications. Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs. The MDI-Benchmark data and evaluation code are available at https://mdi-benchmark.github.io/

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2412.12606

Country: Asia (0.67)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

Yousefi, Midia, Qian, Yao, Chen, Junkun, Wang, Gang, Liu, Yanqing, Wang, Dongmei, Wang, Xiaofei, Xue, Jian

arXiv.org Artificial IntelligenceNov-11-2024

End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio, including both speech and pause segments. Previous methods often controlled the number of words or characters generated by the Machine Translation model to approximate the source sentence's length without considering the isochrony of pauses and speech segments, as duration can vary between languages. To address this, we present improvements to the duration alignment component of our sequence-to-sequence ST model. Our method controls translation length by predicting the duration of speech and pauses in conjunction with the translation process. This is achieved by providing timing information to the decoder, ensuring it tracks the remaining duration for speech and pauses while generating the translation. The evaluation on the Zh-En test set of CoVoST 2, demonstrates that the proposed Isochrony-Controlled ST achieves 0.92 speech overlap and 8.9 BLEU, which has only a 1.4 BLEU drop compared to the ST baseline.

machine learning, natural language, translation, (16 more...)

arXiv.org Artificial Intelligence

2411.07387

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

Wu, Haibin, Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Tompkins, Daniel, Tsai, Chung-Hsien, Li, Canrun, Xiao, Zhen, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki

arXiv.org Artificial IntelligenceJul-16-2024

People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. EmoCtrl-TTS leverages arousal and valence values, as well as laughter embeddings, to condition the flow-matching-based zero-shot TTS. To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling. Comprehensive evaluations demonstrate that EmoCtrl-TTS excels in mimicking the emotions of audio prompts in speech-to-speech translation scenarios. We also show that EmoCtrl-TTS can capture emotion changes, express strong emotions, and generate various NVs in zero-shot TTS. See https://aka.ms/emoctrl-tts for demo samples.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2407.12229

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.86)

Add feedback

Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal Learning

Zhang, Yupei, Wang, Xiaofei, Meng, Fangliangzi, Tang, Jin, Li, Chao

arXiv.org Artificial IntelligenceJun-20-2024

Multi-modal learning plays a crucial role in cancer diagnosis and prognosis. Current deep learning based multi-modal approaches are often limited by their abilities to model the complex correlations between genomics and histology data, addressing the intrinsic complexity of tumour ecosystem where both tumour and microenvironment contribute to malignancy. We propose a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics by decomposing the feature subspace of histology images and genomics, reflecting distinct tumour and microenvironment features. To enhance cross-modal interactions, we design a knowledge-driven subspace fusion scheme, consisting of a cross-modal deformable attention module and a gene-guided consistency strategy. Additionally, in pursuit of dynamically optimizing the subspace knowledge, we further propose a novel gradient coordination learning strategy. Extensive experiments demonstrate the effectiveness of the proposed method, outperforming state-of-the-art techniques in three downstream tasks of glioma diagnosis, tumour grading, and survival analysis. Our code is available at https://github.com/helenypzhang/Subspace-Multimodal-Learning.

artificial intelligence, information fusion, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2406.13979

Country: Asia > China (0.14)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.97)
Health & Medicine > Diagnostic Medicine (0.95)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Unified Modeling Enhanced Multimodal Learning for Precision Neuro-Oncology

Yi, Huahui, Wang, Xiaofei, Li, Kang, Li, Chao

arXiv.org Artificial IntelligenceJun-11-2024

Multimodal learning, integrating histology images and genomics, promises to enhance precision oncology with comprehensive views at microscopic and molecular levels. However, existing methods may not sufficiently model the shared or complementary information for more effective integration. In this study, we introduce a Unified Modeling Enhanced Multimodal Learning (UMEML) framework that employs a hierarchical attention structure to effectively leverage shared and complementary features of both modalities of histology and genomics. Specifically, to mitigate unimodal bias from modality imbalance, we utilize a query-based cross-attention mechanism for prototype clustering in the pathology encoder. Our prototype assignment and modularity strategy are designed to align shared features and minimizes modality gaps. An additional registration mechanism with learnable tokens is introduced to enhance cross-modal feature integration and robustness in multimodal unified modeling. Our experiments demonstrate that our method surpasses previous state-of-the-art approaches in glioma diagnosis and prognosis tasks, underscoring its superiority in precision neuro-Oncology.

artificial intelligence, machine learning, survival prediction, (11 more...)

arXiv.org Artificial Intelligence

2406.07078

Country: North America > United States (0.14)

Genre:

Research Report > New Finding (0.66)
Research Report > Promising Solution (0.48)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Yang, Hemin, Zhu, Zirun, Tang, Min, Xia, Yufei, Li, Jinzhu, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki

arXiv.org Artificial IntelligenceJun-9-2024

Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.05699

Country:

North America > United States (0.14)
Europe > United Kingdom (0.14)
Asia > Japan > Honshū (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)

Add feedback

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Zhang, Leying, Qian, Yao, Zhou, Long, Liu, Shujie, Wang, Dongmei, Wang, Xiaofei, Yousefi, Midia, Qian, Yanmin, Li, Jinyu, He, Lei, Zhao, Sheng, Zeng, Michael

arXiv.org Artificial IntelligenceMay-29-2024

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by instances generated in a single channel where one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2404.0669

Country:

Europe (0.28)
North America > United States > Pennsylvania (0.14)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
(2 more...)

Add feedback