AITopics | Zeng, Michael

Collaborating Authors

Zeng, Michael

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Zhang, Leying, Qian, Yao, Zhou, Long, Liu, Shujie, Wang, Dongmei, Wang, Xiaofei, Yousefi, Midia, Qian, Yanmin, Li, Jinyu, He, Lei, Zhao, Sheng, Zeng, Michael

arXiv.org Artificial IntelligenceMay-29-2024

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by instances generated in a single channel where one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2404.0669

Country:

Europe (0.28)
North America > United States > Pennsylvania (0.14)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
(2 more...)

Add feedback

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Le, Chenyang, Qian, Yao, Wang, Dongmei, Zhou, Long, Liu, Shujie, Wang, Xiaofei, Yousefi, Midia, Qian, Yanmin, Li, Jinyu, Zhao, Sheng, Zeng, Michael

arXiv.org Artificial IntelligenceMay-28-2024

There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complexities involved in direct translation tasks and the scarcity of data. In this study, we introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process, making it highly suitable for scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.

artificial intelligence, machine translation, natural language, (18 more...)

arXiv.org Artificial Intelligence

2405.17809

Country:

Europe > France (0.14)
North America > Canada (0.14)
Asia > China (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

Kanda, Naoyuki, Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Yang, Hemin, Zhu, Zirun, Tang, Min, Li, Canrun, Tsai, Steven, Xiao, Zhen, Xia, Yufei, Li, Jinzhu, Liu, Yanqing, Zhao, Sheng, Zeng, Michael

arXiv.org Artificial IntelligenceFeb-11-2024

Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing and variety of the laughter to be generated. In this work, we propose ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt with precise control of laughter timing and expression. Specifically, ELaTE works on the audio prompt to mimic the voice characteristic, the text prompt to indicate the contents of the generated speech, and the input to control the laughter expression, which can be either the start and end times of laughter, or the additional audio prompt that contains laughter to be mimicked. We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS, and fine-tune it with frame-level representation from a laughter detector as additional conditioning. With a simple scheme to mix small-scale laughter-conditioned data with large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS model can be readily fine-tuned to generate natural laughter with precise controllability, without losing any quality of the pre-trained zero-shot TTS model. Through the evaluations, we show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models. See https://aka.ms/elate/ for demo samples.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2402.07383

Country: North America > United States (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Automatic Prompt Optimization with "Gradient Descent" and Beam Search

Pryzant, Reid, Iter, Dan, Li, Jerry, Lee, Yin Tat, Zhu, Chenguang, Zeng, Michael

arXiv.org Artificial IntelligenceOct-19-2023

Large Language Models (LLMs) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand written with onerous trial-and-error effort. We propose a simple and nonparametric solution to this problem, Automatic Prompt Optimization (APO), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that Automatic Prompt Optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2305.03495

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.82)

Add feedback

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Le, Chenyang, Qian, Yao, Zhou, Long, Liu, Shujie, Qian, Yanmin, Zeng, Michael, Huang, Xuedong

arXiv.org Artificial IntelligenceOct-14-2023

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.

arxiv preprint arxiv, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2305.14838

Country: Asia > China (0.29)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction

Zhang, Leying, Qian, Yao, Yu, Linfeng, Wang, Heming, Wang, Xinkai, Yang, Hemin, Zhou, Long, Liu, Shujie, Qian, Yanmin, Zeng, Michael

arXiv.org Artificial IntelligenceSep-25-2023

Target Speech Extraction (TSE) is a crucial task in speech processing that focuses on isolating the clean speech of a specific speaker from complex mixtures. While discriminative methods are commonly used for TSE, they can introduce distortion in terms of speech perception quality. On the other hand, generative approaches, particularly diffusion-based methods, can enhance speech quality perceptually but suffer from slower inference speed. We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for TSE. It can handle multi- and single-speaker scenarios in both noisy and clean conditions. Additionally, we introduce Regenerate-DCEM (R-DCEM) that can regenerate and optimize speech quality based on pre-processed speech from a discriminative model. Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics and demonstrates notable strengths in inference efficiency and robustness to unseen tasks. Audio examples are available online (https://vivian556123.github.io/dcem).

artificial intelligence, machine learning, scenario, (11 more...)

arXiv.org Artificial Intelligence

2309.13874

Country:

North America > United States (0.14)
Asia > China (0.14)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization

He, Pengcheng, Peng, Baolin, Lu, Liyang, Wang, Song, Mei, Jie, Liu, Yang, Xu, Ruochen, Awadalla, Hany Hassan, Shi, Yu, Zhu, Chenguang, Xiong, Wayne, Zeng, Michael, Gao, Jianfeng, Huang, Xuedong

arXiv.org Artificial IntelligenceJun-7-2023

This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state of the art encoder-decoder model using three techniques. First, we use a two-phase pre-training process to improve model's performance on low-resource summarization tasks. The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation. Second, we replace self-attention layers in the encoder with disentangled attention layers, where each word is represented using two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method of encoding long sequences in a hierarchical manner. Z-Code++ creates new state of the art on 9 out of 13 text summarization tasks across 5 languages. Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms the competing models.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2208.0977

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Add feedback

MACSum: Controllable Summarization with Mixed Attributes

Zhang, Yusen, Liu, Yang, Yang, Ziyi, Fang, Yuwei, Chen, Yulong, Radev, Dragomir, Zhu, Chenguang, Zeng, Michael, Zhang, Rui

arXiv.org Artificial IntelligenceJun-6-2023

Controllable summarization allows users to generate customized summaries with specified attributes. However, due to the lack of designated annotations of controlled summaries, existing works have to craft pseudo datasets by adapting generic summarization benchmarks. Furthermore, most research focuses on controlling single attributes individually (e.g., a short summary or a highly abstractive summary) rather than controlling a mix of attributes together (e.g., a short and highly abstractive summary). In this paper, we propose MACSum, the first human-annotated summarization dataset for controlling mixed attributes. It contains source texts from two domains, news articles and dialogues, with human-annotated summaries controlled by five designed attributes (Length, Extractiveness, Specificity, Topic, and Speaker). We propose two simple and effective parameter-efficient approaches for the new task of mixed controllable summarization based on hard prompt tuning and soft prefix tuning. Results and analysis demonstrate that hard prompt models yield the best performance on all metrics and human evaluations. However, mixed-attribute control is still challenging for summarization tasks. Our dataset and code are available at https://github.com/psunlpgroup/MACSum.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2211.05041

Country:

Europe (1.00)
Asia > Middle East > UAE (0.14)
North America > United States > New York (0.14)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

Li, Chenda, Qian, Yao, Chen, Zhuo, Kanda, Naoyuki, Wang, Dongmei, Yoshioka, Takuya, Qian, Yanmin, Zeng, Michael

arXiv.org Artificial IntelligenceMay-30-2023

State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages. However, it remains a challenge for these models to recognize overlapped speech, which is often seen in meeting conversations. We propose an approach to adapt USMs for multi-talker ASR. We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction. That is, we predict the ASR hypotheses for all speakers, count the speakers, and estimate the utterance timestamps at the same time. We further introduce a lightweight adapter module to maintain the multilingual property of the USMs even when we perform the adaptation with only a single language. Experimental results obtained using the AMI and AliMeeting corpora show that our proposed approach effectively transfers the USMs to a strong multilingual multi-talker ASR model with timestamp prediction capability.

adapter module, artificial intelligence, machine learning, (9 more...)

arXiv.org Artificial Intelligence

2305.18747

Country:

Asia > China (0.14)
Oceania > Australia (0.14)
North America > United States (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization

Chen, Yulong, Liu, Yang, Xu, Ruochen, Yang, Ziyi, Zhu, Chenguang, Zeng, Michael, Zhang, Yue

arXiv.org Artificial IntelligenceMay-27-2023

The high annotation costs and diverse demands of various summarization tasks motivate the development of few-shot summarization. However, despite the emergence of many summarization tasks and datasets, the current training paradigm for few-shot summarization systems ignores potentially shareable knowledge in heterogeneous datasets. To this end, we propose \textsc{UniSumm}, a unified few-shot summarization model pre-trained with multiple summarization tasks and can be prefix-tuned to excel at any few-shot summarization task. Meanwhile, to better evaluate few-shot summarizers, under the principles of diversity and robustness, we assemble and release a new benchmark \textsc{SummZoo}. It consists of $8$ summarization tasks with multiple sets of few-shot samples for each task, covering diverse domains. Experimental results and analysis show that \textsc{UniSumm} outperforms strong baselines by a large margin across all sub-tasks in \textsc{SummZoo} under both automatic and human evaluations and achieves comparable results in human evaluation compared with a GPT-3.5 model.

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2211.09783

Country:

Europe > United Kingdom (1.00)
Asia (0.68)
North America > United States > New York (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.64)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Law (0.92)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.92)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Communications > Social Media (0.95)

Add feedback