AITopics | Zeghidour, Neil

Collaborating Authors

Zeghidour, Neil

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

High-Fidelity Simultaneous Speech-To-Speech Translation

Labiausse, Tom, Mazaré, Laurent, Grave, Edouard, Pérez, Patrick, Défossez, Alexandre, Zeghidour, Neil

arXiv.org Artificial IntelligenceFeb-5-2025

We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which unlike its consecutive counterpart, where one waits for the end of the source utterance to start translating, adapts its flow to accumulate just enough context to produce a correct translation in real-time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples as well as models and inference code.

artificial intelligence, natural language, translation, (18 more...)

arXiv.org Artificial Intelligence

2502.03382

Country:

Europe (1.00)
Asia (0.67)
North America > United States > Hawaii (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Moshi: a speech-text foundation model for real-time dialogue

Défossez, Alexandre, Mazaré, Laurent, Orsini, Manu, Royer, Amélie, Pérez, Patrick, Jégou, Hervé, Grave, Edouard, Zeghidour, Neil

arXiv.org Artificial IntelligenceOct-2-2024

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at https://github.com/kyutai-labs/moshi.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.00037

Country:

Asia (0.67)
North America > United States > Hawaii (0.14)
North America > United States > California (0.14)

Genre:

Research Report (0.81)
Workflow (0.67)

Industry: Leisure & Entertainment > Sports > Football (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

MAD Speech: Measures of Acoustic Diversity of Speech

Futeral, Matthieu, Agostinelli, Andrea, Tagliasacchi, Marco, Zeghidour, Neil, Kharitonov, Eugene

arXiv.org Artificial IntelligenceApr-16-2024

Generative spoken language models produce speech in a wide range of voices, prosody, and recording conditions, seemingly approaching the diversity of natural speech. However, the extent to which generated speech is acoustically diverse remains unclear due to a lack of appropriate metrics. We address this gap by developing lightweight metrics of acoustic diversity, which we collectively refer to as MAD Speech. We focus on measuring five facets of acoustic diversity: voice, gender, emotion, accent, and background noise. We construct the metrics as a composition of specialized, per-facet embedding models and an aggregation function that measures diversity within the embedding space. Next, we build a series of datasets with a priori known diversity preferences for each facet. Using these datasets, we demonstrate that our proposed metrics achieve a stronger agreement with the ground-truth diversity than baselines. Finally, we showcase the applicability of our proposed metrics across several real-life evaluation scenarios. MAD Speech will be made publicly accessible.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2404.10419

Country:

Europe (0.28)
North America > United States > Minnesota (0.14)
North America > United States > California (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)

Add feedback

MusicRL: Aligning Music Generation to Human Preferences

Cideron, Geoffrey, Girgin, Sertan, Verzetti, Mauro, Vincent, Damien, Kastelic, Matej, Borsos, Zalán, McWilliams, Brian, Ungureanu, Victor, Bachem, Olivier, Pietquin, Olivier, Geist, Matthieu, Hussenot, Léonard, Zeghidour, Neil, Agostinelli, Andrea

arXiv.org Artificial IntelligenceFeb-6-2024

We propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective since the concept of musicality as well as the specific intention behind a caption are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a retro guitar solo or a techno pop beat). Not only this makes supervised training of such models challenging, but it also calls for integrating continuous human feedback in their post-deployment finetuning. MusicRL is a pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete audio tokens finetuned with reinforcement learning to maximise sequence-level rewards. We design reward functions related specifically to text-adherence and audio quality with the help from selected raters, and use those to finetune MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model that incorporates human feedback at scale. Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences, indicating that text adherence and quality only account for a part of it. This underscores the prevalence of subjectivity in musical appreciation and calls for further involvement of human listeners in the finetuning of music generation models.

large language model, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2402.04229

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

Erdogan, Hakan, Wisdom, Scott, Chang, Xuankai, Borsos, Zalán, Tagliasacchi, Marco, Zeghidour, Neil, Hershey, John R.

arXiv.org Artificial IntelligenceAug-20-2023

We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.

artificial intelligence, machine learning, transcript, (17 more...)

arXiv.org Artificial Intelligence

2308.10415

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

AudioLM: a Language Modeling Approach to Audio Generation

Borsos, Zalán, Marinier, Raphaël, Vincent, Damien, Kharitonov, Eugene, Pietquin, Olivier, Sharifi, Matt, Roblek, Dominik, Teboul, Olivier, Grangier, David, Tagliasacchi, Marco, Zeghidour, Neil

arXiv.org Artificial IntelligenceJul-25-2023

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2209.03143

Country:

Europe > France (0.14)
Asia > China (0.14)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (0.93)
Media > Music (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.61)

Add feedback

DNArch: Learning Convolutional Neural Architectures by Backpropagation

Romero, David W., Zeghidour, Neil

arXiv.org Artificial IntelligenceJul-22-2023

We present Differentiable Neural Architectures (DNArch), a method that jointly learns the weights and the architecture of Convolutional Neural Networks (CNNs) by backpropagation. In particular, DNArch allows learning (i) the size of convolutional kernels at each layer, (ii) the number of channels at each layer, (iii) the position and values of downsampling layers, and (iv) the depth of the network. To this end, DNArch views neural architectures as continuous multidimensional entities, and uses learnable differentiable masks along each dimension to control their size. Unlike existing methods, DNArch is not limited to a predefined set of possible neural components, but instead it is able to discover entire CNN architectures across all feasible combinations of kernel sizes, widths, depths and downsampling. Empirically, DNArch finds performant CNN architectures for several classification and dense prediction tasks on sequential and image data. When combined with a loss term that controls the network complexity, DNArch constrains its search to architectures that respect a predefined computational budget during training.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2302.054

Country:

Europe (0.46)
North America > Canada (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Backpropagation (0.61)

Add feedback

AudioPaLM: A Large Language Model That Can Speak and Listen

Rubenstein, Paul K., Asawaroengchai, Chulayuth, Nguyen, Duc Dung, Bapna, Ankur, Borsos, Zalán, Quitry, Félix de Chaumont, Chen, Peter, Badawy, Dalia El, Han, Wei, Kharitonov, Eugene, Muckenhirn, Hannah, Padfield, Dirk, Qin, James, Rozenberg, Danny, Sainath, Tara, Schalkwyk, Johan, Sharifi, Matt, Ramanovich, Michelle Tadmor, Tagliasacchi, Marco, Tudor, Alexandru, Velimirović, Mihajlo, Vincent, Damien, Yu, Jiahui, Wang, Yongqiang, Zayats, Vicky, Zeghidour, Neil, Zhang, Yu, Zhang, Zhishuai, Zilka, Lukas, Frank, Christian

arXiv.org Artificial IntelligenceJun-22-2023

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

artificial intelligence, natural language, translation, (15 more...)

arXiv.org Artificial Intelligence

2306.12925

Country:

Europe (0.92)
Asia > Japan > Honshū (0.14)
North America > United States > Texas (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Disentangling speech from surroundings with neural embeddings

Omran, Ahmed, Zeghidour, Neil, Borsos, Zalán, Quitry, Félix de Chaumont, Slaney, Malcolm, Tagliasacchi, Marco

arXiv.org Artificial IntelligenceJun-4-2023

We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec. We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors, where one part of the embedding vector represents the speech signal, and the rest represent the environment. We achieve this by partitioning the embeddings of different input waveforms and training the model to faithfully reconstruct audio from mixed partitions, thereby ensuring each partition encodes a separate audio attribute. As use cases, we demonstrate the separation of speech from background noise or from reverberation characteristics. Our method also allows for targeted adjustments of the audio output characteristics.

artificial intelligence, machine learning, speech, (16 more...)

arXiv.org Artificial Intelligence

2203.15578

Country: North America > United States (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech (0.88)

Add feedback

SoundStorm: Efficient Parallel Audio Generation

Borsos, Zalán, Sharifi, Matt, Vincent, Damien, Kharitonov, Eugene, Zeghidour, Neil, Tagliasacchi, Marco

arXiv.org Artificial IntelligenceMay-16-2023

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2305.09636

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)

Add feedback