 audiolm


Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

Zhu, Yongxin, Su, Dan, He, Liqiang, Xu, Linli, Yu, Dong

arXiv.org Artificial Intelligence

While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large corpora of speeches in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to spoken cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms the existing speech language models in terms of word error rate, speech quality, and speaker similarity. See https://youngsheen.github.io/GPST/demo for demo samples.


Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM

Nachmani, Eliya, Levkovitch, Alon, Hirsch, Roy, Salazar, Julian, Asawaroengchai, Chulayuth, Mariooryad, Soroosh, Rivlin, Ehud, Skerry-Ryan, RJ, Ramanovich, Michelle Tadmor

arXiv.org Artificial Intelligence

We present a novel approach to adapting pre-trained large language models (LLMs) to perform question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only speech-text pairs, enabling a "cross-modal" chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM as demonstrated through spoken QA datasets. Audio samples can be found at https://michelleramanovich.github.io/spectron/spectron


AudioLM: a Language Modeling Approach to Audio Generation

Borsos, Zalán, Marinier, Raphaël, Vincent, Damien, Kharitonov, Eugene, Pietquin, Olivier, Sharifi, Matt, Roblek, Dominik, Teboul, Olivier, Grangier, David, Tagliasacchi, Marco, Zeghidour, Neil

arXiv.org Artificial Intelligence

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
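The core idea in the abstract above, mapping audio to discrete tokens and then treating generation as language modeling over those tokens, can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the uniform quantizer stands in for a learned tokenizer, the bigram counter stands in for a Transformer language model, and the function names (`tokenize`, `fit_bigram`, `continue_tokens`) are hypothetical. It is not AudioLM's hybrid semantic/acoustic scheme.

```python
import math
from collections import Counter, defaultdict

def tokenize(samples, levels=8):
    """Toy tokenizer: uniformly quantize samples in [-1, 1] to integer ids.
    This stands in for a learned tokenizer; it is not AudioLM's."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the expected range
        tokens.append(min(levels - 1, int((x + 1.0) / 2.0 * levels)))
    return tokens

def fit_bigram(tokens):
    """Count next-token frequencies: the simplest possible 'language model'."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def continue_tokens(counts, prompt, n_new):
    """Greedy continuation: repeatedly append the most frequent successor."""
    out = list(prompt)
    for _ in range(n_new):
        successors = counts.get(out[-1])
        if not successors:
            break  # no observed successor for this token
        out.append(successors.most_common(1)[0][0])
    return out

# A 440 Hz sine sampled at 16 kHz serves as hypothetical training audio.
wave = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
tokens = tokenize(wave)
model = fit_bigram(tokens)
continuation = continue_tokens(model, tokens[:10], n_new=20)
```

The prompt tokens are kept as a prefix and the model extends them, which mirrors the "continuation given a short prompt" setup at toy scale. A real system would decode the continued tokens back to a waveform with the tokenizer's decoder, a step this sketch omits.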


SoundStorm: Efficient Parallel Audio Generation

Borsos, Zalán, Sharifi, Matt, Vincent, Damien, Kharitonov, Eugene, Zeghidour, Neil, Tagliasacchi, Marco

arXiv.org Artificial Intelligence

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
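Confidence-based parallel decoding, as named in the abstract above, can be sketched generically: start with every position masked, ask a bidirectional model for a guess at all masked positions at once, commit only the most confident guesses, and repeat. The sketch below is a minimal toy under stated assumptions, not SoundStorm's implementation: `toy_predict` is a hypothetical stand-in that returns random tokens and confidences, and the cosine masking schedule is an assumed choice.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet

def toy_predict(tokens, rng, vocab_size=8):
    """Hypothetical stand-in for a bidirectional model: returns a
    (token, confidence) guess for every masked position."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens) if t == MASK
    }

def parallel_decode(length, steps, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        guesses = toy_predict(tokens, rng)
        if not guesses:
            break  # everything is already decoded
        # Cosine schedule (an assumption): number of positions that
        # should remain masked after this step; reaches 0 at the last step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(guesses) - keep_masked)
        # Commit the most confident guesses; the rest stay masked
        # and are re-predicted in the next round.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens

decoded = parallel_decode(length=16, steps=4)
```

The point of the design is that the number of model calls equals `steps` rather than `length`, so a 16-token sequence is filled in 4 passes here instead of 16 sequential ones.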


SingSong: Generating musical accompaniments from singing

Donahue, Chris, Caillon, Antoine, Roberts, Adam, Manilow, Ethan, Esling, Philippe, Agostinelli, Andrea, Verzetti, Mauro, Simon, Ian, Pietquin, Olivier, Zeghidour, Neil, Engel, Jesse

arXiv.org Artificial Intelligence

We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus of music audio to produce aligned pairs of vocals and instrumental sources. Then, we adapt AudioLM (Borsos et al., 2022) -- a state-of-the-art approach for unconditional audio generation -- to be suitable for conditional "audio-to-audio" generation tasks, and train it on the source-separated (vocal, instrumental) pairs. In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong compared to those from a strong retrieval baseline. Sound examples at https://g.co/magenta/singsong


MusicLM: Generating Music From Text

Agostinelli, Andrea, Denk, Timo I., Borsos, Zalán, Engel, Jesse, Verzetti, Mauro, Caillon, Antoine, Huang, Qingqing, Jansen, Aren, Roberts, Adam, Tagliasacchi, Marco, Sharifi, Matt, Zeghidour, Neil, Frank, Christian

arXiv.org Artificial Intelligence

We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.


Google's three transformative areas of AI

#artificialintelligence

Years of research have led to rapid progress in Artificial Intelligence (AI). On November 2, Google announced three ways people are poised to benefit from the advancements in AI. Jeff Dean, senior vice president of Google Research and Health, presented three transformative areas of AI: first, using AI to make technology accessible in many more languages; second, exploring how AI might bolster creativity; and third, AI for social good, including climate adaptation. The 1,000 Languages Initiative is an ambitious research project to build an AI model that would support the 1,000 most spoken languages of the world. In order to provide AI-based language technology for the world, Google needs to make sure its models are also trained on content that is representative of the world.


Google's AudioLM: Generating Music by Hearing a Song's Snippet

#artificialintelligence

Originally published on Towards AI. AudioLM is Google's new model, capable of generating music in the same style as the prompt.


Google's new AI can hear a snippet of song--and then keep on playing

#artificialintelligence

AI-generated audio is commonplace: voices on home assistants like Alexa use natural language processing. AI music systems like OpenAI's Jukebox have already generated impressive results, but most existing techniques need people to prepare transcriptions and label text-based training data, which takes a lot of time and human labor. Jukebox, for example, uses text-based data to generate song lyrics. AudioLM, described in a non-peer-reviewed paper last month, is different: it doesn't require transcription or labeling. Instead, sound databases are fed into the program, and machine learning is used to compress the audio files into sound snippets, called "tokens," without losing too much information.


AudioLM: a Language Modeling Approach to Audio Generation

#artificialintelligence

Posted by Zalán Borsos, Research Software Engineer, and Neil Zeghidour, Research Scientist, Google Research. Generating realistic audio re...