LoRP-TTS: Low-Rank Personalized Text-To-Speech
Bondaruk, Łukasz, Kubiak, Jakub
Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advancements have led to the development of zero-shot systems that generate realistic speech from a wide range of speakers using their voices as additional prompts. However, they still struggle to imitate non-studio-quality samples that differ significantly from the training datasets. In this work, we demonstrate that utilizing Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach enhances speaker similarity by up to 30 percentage points while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which is crucial for all speech-related tasks.
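The core mechanism the LoRP-TTS abstract relies on, Low-Rank Adaptation, freezes the pretrained weights and learns only a low-rank update. The sketch below illustrates that idea in isolation; the function names, dimensions, and the scaling convention (alpha / r) are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass of a linear layer with a LoRA update.

    W: frozen pretrained weight, shape (out_dim, in_dim)
    A: trainable down-projection, shape (r, in_dim), with r << min(out_dim, in_dim)
    B: trainable up-projection, shape (out_dim, r), initialized to zeros so the
       adapted model starts out identical to the pretrained one
    """
    r = A.shape[0]
    scale = alpha / r
    return W @ x + scale * (B @ (A @ x))

rng = np.random.default_rng(0)
in_dim, out_dim, r = 64, 32, 4
W = rng.standard_normal((out_dim, in_dim))
A = rng.standard_normal((r, in_dim)) * 0.01
B = np.zeros((out_dim, r))  # zero init: the adapter is a no-op before training
x = rng.standard_normal(in_dim)

# With B = 0, the adapted output equals the frozen layer's output exactly.
assert np.allclose(lora_forward(x, W, A, B), W @ x)
```

Because only A and B are trained, a single noisy recording can drive the adaptation without risking catastrophic drift in the frozen base model, which is what makes the single-prompt setting described above feasible.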
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
He, Haorui, Shang, Zengqiang, Wang, Chaoren, Li, Xuyuan, Gu, Yicheng, Hua, Hua, Liu, Liwei, Yang, Chen, Li, Jiaqi, Shi, Peiyang, Wang, Yuancheng, Chen, Kai, Zhang, Pengyuan, Wu, Zhizheng
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.
Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits
Huang, Sung-Feng, Kuo, Heng-Cheng, Chen, Zhehuai, Yang, Xuesong, Yang, Chao-Han Huck, Tsao, Yu, Wang, Yu-Chiang Frank, Lee, Hung-yi, Fu, Szu-Wei
Advancements in neural speech editing have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re-implementing Voicebox training and of creating the dataset. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods. Despite this difficulty for human listeners, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech
Wu, Haibin, Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Tompkins, Daniel, Tsai, Chung-Hsien, Li, Canrun, Xiao, Zhen, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. EmoCtrl-TTS leverages arousal and valence values, as well as laughter embeddings, to condition the flow-matching-based zero-shot TTS. To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling. Comprehensive evaluations demonstrate that EmoCtrl-TTS excels in mimicking the emotions of audio prompts in speech-to-speech translation scenarios. We also show that EmoCtrl-TTS can capture emotion changes, express strong emotions, and generate various NVs in zero-shot TTS. See https://aka.ms/emoctrl-tts for demo samples.
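The abstract above describes conditioning a flow-matching decoder on frame-aligned arousal, valence, and laughter features. EmoCtrl-TTS's actual conditioning interface is not given here, so the sketch below only illustrates the general idea of stacking such features onto the decoder's per-frame conditioning; every name and shape is a hypothetical placeholder:

```python
import numpy as np

def build_conditioning(phoneme_emb, arousal, valence, laughter_emb):
    """Stack per-frame emotion features onto standard TTS conditioning.

    phoneme_emb:  (frames, d) frame-aligned text features
    arousal:      (frames,) time-varying arousal values
    valence:      (frames,) time-varying valence values
    laughter_emb: (frames, k) embeddings flagging nonverbal vocalizations
    Returns a (frames, d + 2 + k) conditioning sequence that a
    flow-matching decoder could attend to at each frame.
    """
    return np.concatenate(
        [phoneme_emb, arousal[:, None], valence[:, None], laughter_emb],
        axis=1,
    )

frames, d, k = 100, 256, 32
rng = np.random.default_rng(0)
cond = build_conditioning(
    rng.standard_normal((frames, d)),
    np.linspace(-1.0, 1.0, frames),  # e.g. arousal rising over the utterance
    np.zeros(frames),
    rng.standard_normal((frames, k)),
)
```

Making the emotion features per-frame rather than per-utterance is what allows the time-varying control ("laugh now, cry later") the paper's title refers to.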
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, Williamson, Mary, Manohar, Vimal, Adi, Yossi, Mahadeokar, Jay, Hsu, Wei-Ning
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found in \url{https://voicebox.metademolab.com}.
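The flow-matching training described above can be sketched in a few lines. This is a minimal, generic conditional flow-matching objective with a straight (optimal-transport) probability path; Voicebox additionally conditions on text and masked audio context, which is omitted here, and the placeholder model and dimensions are assumptions for illustration:

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """One simplified conditional flow-matching training step.

    x1: batch of target speech features, shape (batch, dim).
    Along the straight path x_t = (1 - t) * x0 + t * x1, the regression
    target is the constant velocity x1 - x0; the model learns to predict
    it from the noisy point x_t and the time t.
    """
    x0 = rng.standard_normal(x1.shape)      # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))  # one random time per example
    xt = (1 - t) * x0 + t * x1              # interpolate along the path
    target = x1 - x0                        # velocity the model must predict
    pred = model(xt, t)
    return np.mean((pred - target) ** 2)

# Placeholder "model" that always predicts zero velocity.
rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 80))  # e.g. 8 frames of 80-dim mel features
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), x1, rng)
```

Because the target velocity field is deterministic given (x0, x1, t), training is a plain regression with no sampling loop, which is one reason a non-autoregressive flow-matching model can be much faster at inference than an autoregressive one such as VALL-E.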
Mark Zuckerberg's scary new AI is 'too dangerous' to make public
Meta boasted Friday that it has produced 'the most versatile AI for speech generation' in existence. But it added that the company would not be making their AI model public, due to grave concerns over the advanced tech's 'potential risks of misuse.' In recent months, scammers have become adept at employing AI-generated speech to perpetrate eerie and shocking crimes, including an April attempt at faking the kidnapping of a teenage girl in Arizona, terrorizing the young girl's distraught mother with realistic AI-generated pleas. But Meta proposed a variety of more optimistic use cases in their press release, stating that Voicebox could be used to help the visually impaired hear messages from their friends and loved ones, or to allow non-native speakers to play translations of their own words, in their own voice, but in a foreign tongue.
Meta's Voicebox Generative AI Makes Anyone Speak a Foreign Language - CNET
Generative artificial intelligence like ChatGPT and Google's Bard generates text in response to a query using natural language processing and machine learning. Meta's new generative AI, Voicebox, does things a little differently -- by producing audio clips. Voicebox, announced Friday by Facebook's parent company Meta, can synthesize speech using a 2-second audio sample. With that clip, it can match the audio style as well as perform text-to-speech generation or re-create a portion of speech that may have been interrupted by some external noise. Voicebox can also take that sample and have it read English text in other languages such as French, German, Spanish, Polish or Portuguese. Meta says Voicebox can be used to give a natural-sounding voice to virtual assistants or nonplayer characters in the metaverse, the digital worlds in which people will gather to work, play and hang out.