Dupoux, Emmanuel
Intuitive physics understanding emerges from self-supervised pretraining on natural videos
Garrido, Quentin, Ballas, Nicolas, Assran, Mahmoud, Bardes, Adrien, Najman, Laurent, Rabbat, Michael, Dupoux, Emmanuel, LeCun, Yann
We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.
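A minimal sketch of how the violation-of-expectation framework described above can be scored with a generic latent video predictor (the encoder and predictor interfaces are assumptions, not the paper's exact architecture): prediction error is measured in representation space, and the model counts as correct when the physically impossible clip is more surprising than its matched possible counterpart.

```python
import torch

def surprise(encoder, predictor, frames, context_len=8):
    """Average prediction error of future-frame representations given a context.

    encoder:   assumed to map a (T, C, H, W) clip to per-frame latents (T, D)
    predictor: assumed to map the first `context_len` latents to predictions
               for the remaining (T - context_len) latents
    Both are treated as frozen, pretrained modules; names are illustrative.
    """
    with torch.no_grad():
        z = encoder(frames)                    # (T, D) latent per frame
        pred = predictor(z[:context_len])      # predicted latents for future frames
        target = z[context_len:]
        return torch.mean((pred - target) ** 2).item()

def voe_correct(encoder, predictor, possible_clip, impossible_clip):
    # Violation of expectation: the impossible video should be more surprising.
    return surprise(encoder, predictor, impossible_clip) > surprise(encoder, predictor, possible_clip)
```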
Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach
Poli, Maxime, Chemla, Emmanuel, Dupoux, Emmanuel
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens up the path to more natural and expressive systems. On the other hand, speech-only systems require up to three orders of magnitude more data to catch up to their text-based counterparts in terms of their semantic abilities. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and that language models trained on these units achieve lexical comprehension comparable to that of models trained on a hundred times more data.
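A minimal sketch of the fine-tuning setup described above, assuming a generic frame-level speech encoder and frame-aligned phoneme labels (all names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class PhonemeClassifier(nn.Module):
    """A self-supervised speech encoder topped with a linear phoneme classifier.

    `encoder` is any model returning frame-level features of shape (B, T, dim);
    `n_phonemes` depends on the phone set used.
    """
    def __init__(self, encoder, dim, n_phonemes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(dim, n_phonemes)

    def forward(self, waveform):
        feats = self.encoder(waveform)   # (B, T, dim) frame features
        return self.head(feats)          # (B, T, n_phonemes) logits

def finetune_step(model, optimizer, waveform, phoneme_targets):
    """One step of frame-level phoneme-classification fine-tuning."""
    logits = model(waveform)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), phoneme_targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```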
Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning
Rita, Mathieu, Strub, Florian, Chaabouni, Rahma, Michel, Paul, Dupoux, Emmanuel, Pietquin, Olivier
While Reinforcement Learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, which requires computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and the LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We show the effectiveness of RCfD on three language tasks, where it achieves performance comparable to carefully tuned baselines while mitigating ROO.
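A minimal sketch of the reward-calibration idea, under the assumption that the distance between rewards is a squared gap (the abstract specifies a distance but not its exact form); the resulting score would be plugged into a standard RL fine-tuning loop in place of the raw reward:

```python
import torch

def rcfd_reward(reward_model, prompt, llm_completion, demo_completion):
    """Calibrated reward (sketch): higher when the LLM's reward-model score is
    close to the demonstration's score, instead of simply 'as high as possible'.
    The negative squared gap is an assumed choice of distance; `reward_model`
    is any callable returning a scalar score for (prompt, completion)."""
    with torch.no_grad():
        r_llm = reward_model(prompt, llm_completion)
        r_demo = reward_model(prompt, demo_completion)
    # To be maximized by the RL algorithm (e.g. PPO); exploiting the reward
    # model past the demonstration's level is no longer rewarded.
    return -(r_llm - r_demo) ** 2
```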
Language Evolution with Deep Learning
Rita, Mathieu, Michel, Paul, Chaabouni, Rahma, Pietquin, Olivier, Dupoux, Emmanuel, Strub, Florian
Social animals have been found to use some means of communication to coordinate in various contexts: foraging for food, avoiding predators, mating, etc. (Hauser, 1996). Among animals, however, humans seem to be unique in having developed a communication system, natural language, that transcends these basic needs and can represent an infinite variety of new situations (Hauser et al., 2002) to the extent that language itself becomes the basis for a new form of evolution: cultural evolution. Understanding the emergence of this unique human ability has always been a vexing scientific problem due to the lack of access to the communication systems of intermediate steps of hominid evolution (Harnad et al., 1976; Bickerton, 2007). In the absence of data, a tempting idea has been to reproduce experimentally the process of language emergence in either humans or computational models (Steels, 1997; Myers-Scotton, 2002; Kirby, 2002). Experimental paradigms with humans (Kirby et al., 2008; Raviv et al., 2019; Motamedi et al., 2019) have produced significant insights into language evolution. Still, their scope is limited due to the inability to replicate key aspects of language evolution, such as communication within and across large populations and the study of long evolutionary timescales. Computer modeling can help overcome these limitations and has played a prominent role in studying language evolution for a long time (Lieberman and Crelin, 1971).
SpiRit-LM: Interleaved Spoken and Written Language Model
Nguyen, Tu Anh, Muller, Benjamin, Yu, Bokai, Costa-jussa, Marta R., Elbayad, Maha, Popuri, Sravya, Duquenne, Paul-Ambroise, Algayres, Robin, Mavlyutov, Ruslan, Gat, Itai, Synnaeve, Gabriel, Pino, Juan, Sagot, Benoit, Dupoux, Emmanuel
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).
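A minimal sketch of word-level interleaving under assumed inputs (word-aligned pairs of BPE tokens and discrete speech units); the boundary markers and switching probability are illustrative, not the paper's exact recipe:

```python
import random

def interleave_word_level(word_alignments, p_switch=0.5, seed=0):
    """Build a single mixed token sequence from word-aligned speech/text data.

    `word_alignments` is a list of (text_tokens, speech_units) pairs, one per word.
    Modality is switched at word boundaries with probability `p_switch`; the
    [TEXT]/[SPEECH] markers and the switching rule are illustrative assumptions.
    """
    rng = random.Random(seed)
    sequence, modality = [], rng.choice(["text", "speech"])
    for text_tokens, speech_units in word_alignments:
        if rng.random() < p_switch:
            modality = "speech" if modality == "text" else "text"
        span = text_tokens if modality == "text" else speech_units
        sequence.extend([f"[{modality.upper()}]"] + list(span))
    return sequence

# Toy example: each word carries its BPE pieces and its discrete speech units.
words = [(["_the"], ["u12", "u7"]), (["_cat"], ["u3", "u3", "u41"]), (["_sat"], ["u9"])]
print(interleave_word_level(words))
```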
Textually Pretrained Speech Language Models
Hassid, Michael, Remez, Tal, Nguyen, Tu Anh, Gat, Itai, Conneau, Alexis, Kreuk, Felix, Copet, Jade, Defossez, Alexandre, Synnaeve, Gabriel, Dupoux, Emmanuel, Schwartz, Roy, Adi, Yossi
Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model. We show, using both automatic and human evaluations, that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ .
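A minimal sketch of a TWIST-style warm start using the Hugging Face transformers API; the checkpoint name and speech-unit vocabulary size are placeholders, and only the general recipe (keep the text-pretrained transformer body, retrain unit embeddings) is taken from the abstract:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

def warm_start_speech_lm(text_lm_name="facebook/opt-125m", n_speech_units=500):
    """Initialize a speech LM from a pretrained text LM (warm-start sketch).

    The transformer body keeps its text-pretrained weights; the embedding table
    is resized to the discrete speech-unit vocabulary and re-initialized, since
    speech units carry no relation to the original text tokens.
    """
    model = AutoModelForCausalLM.from_pretrained(text_lm_name)
    model.resize_token_embeddings(n_speech_units)
    nn.init.normal_(model.get_input_embeddings().weight, std=0.02)
    return model  # then continue causal-LM training on sequences of speech units
```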
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
de Seyssel, Maureen, D'Avirro, Antony, Williams, Adina, Dupoux, Emmanuel
We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a change of speaker and language. As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
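A minimal sketch of how such an evaluation could aggregate word-level emphasis decisions; the classifier interface, the word mapping, and the F1 aggregation are assumptions rather than the benchmark's exact metric:

```python
def emphasis_transfer_score(input_emphasis, output_words, classify_emphasis, word_map=None):
    """Score how well emphasis is carried from input to output speech (sketch).

    input_emphasis:    set of emphasized source words.
    output_words:      list of (word, audio_span) pairs for the model's output.
    classify_emphasis: word-level emphasis classifier (EmphaClass-like), True/False per span.
    word_map:          optional output-word -> source-word mapping (e.g. for translation).
    """
    predicted = {word_map.get(w, w) if word_map else w
                 for w, span in output_words if classify_emphasis(span)}
    tp = len(predicted & input_emphasis)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(input_emphasis) if input_emphasis else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```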
WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models
Benchekroun, Youssef, Dervishi, Megi, Ibrahim, Mark, Gaya, Jean-Baptiste, Martinet, Xavier, Mialon, Grégoire, Scialom, Thomas, Dupoux, Emmanuel, Hupkes, Dieuwke, Vincent, Pascal
We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts from the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constrained problem space.
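A toy generator in the spirit of the benchmark described above (illustrative only, not the actual WorldSense code): the arrangement and the query are sampled independently of the surface vocabulary, so lexical cues cannot predict the answer.

```python
import random

def make_order_problem(n_entities=3, seed=None):
    """Generate a toy 'arrangement of entities' question (illustrative sketch)."""
    rng = random.Random(seed)
    pool = ["a book", "a mug", "a lamp", "a plant", "a clock", "a key"]
    order = rng.sample(pool, n_entities)          # hidden left-to-right arrangement
    facts = [f"{order[i]} is directly to the left of {order[i + 1]}."
             for i in range(n_entities - 1)]
    rng.shuffle(facts)                            # description order != spatial order
    a, b = rng.sample(order, 2)
    answer = "TRUE" if order.index(a) < order.index(b) else "FALSE"
    question = f"Is {a} to the left of {b}? Answer TRUE or FALSE."
    return " ".join(facts), question, answer

print(*make_order_problem(seed=0), sep="\n")
```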
Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning
Algayres, Robin, Nabli, Adel, Sagot, Benoit, Dupoux, Emmanuel
We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. Building on similar ideas in vision and speech, we select our positive examples through a mix of time-stretching data augmentation [26] and k-Nearest Neighbors search [27, 28]. We show that when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yield competitive SSE as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. To evaluate our method, we test our model on 5 types of acoustic features: MFCCs, CPC [4, 3], HuBERT [1] and Wav2Vec 2.0 (Base and Large) [2]. On both tasks, our method pushes the state of the art by a significant margin across 5 different languages. Finally, we establish a query-by-example benchmark on LibriSpeech, pick the best method from this benchmark, and show that, when applied without any change to the task of spoken term discovery as defined in the zero resource challenges [29], it beats the state of the art on the NED/COV metric by a large margin on 5 new datasets.
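A minimal sketch of the two ingredients named above, k-NN positive mining and a contrastive objective, with simplified batching and similarity choices (assumptions, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def knn_positives(embeddings, k=5):
    """For each speech-sequence embedding, return indices of its k nearest
    neighbors (cosine similarity), to be used as positive pairs."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t()
    sim.fill_diagonal_(-1.0)              # exclude self-matches
    return sim.topk(k, dim=-1).indices    # (N, k) neighbor indices

def contrastive_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style loss between anchor embeddings and their mined positives,
    contrasted against the other items in the batch."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature      # (B, B); the diagonal holds positives
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```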
Generative Spoken Language Model based on continuous word-sized audio tokens
Algayres, Robin, Adi, Yossi, Nguyen, Tu Anh, Copet, Jade, Synnaeve, Gabriel, Sagot, Benoit, Dupoux, Emmanuel
In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20ms- or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss with a contrastive loss, and multinomial sampling with k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete-unit GSLMs in terms of generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory-efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
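A minimal sketch of k-NN sampling over a bank of continuous lexical embeddings; the softmax-over-similarity sampling rule is an assumption about the exact procedure:

```python
import torch
import torch.nn.functional as F

def knn_sample(predicted_embedding, lexicon_embeddings, k=10, temperature=1.0):
    """Replace multinomial sampling over a discrete vocabulary with k-NN sampling
    in a continuous lexical embedding space (sketch).

    predicted_embedding: (D,) word-sized embedding predicted by the LM.
    lexicon_embeddings:  (N, D) bank of stored lexical embeddings.
    Returns one of the k nearest entries, sampled with probability proportional
    to a softmax over its cosine similarity to the prediction (assumed rule).
    """
    sims = F.cosine_similarity(predicted_embedding.unsqueeze(0), lexicon_embeddings, dim=-1)
    top_sims, top_idx = sims.topk(k)
    probs = F.softmax(top_sims / temperature, dim=-1)
    choice = torch.multinomial(probs, 1)
    return lexicon_embeddings[top_idx[choice]].squeeze(0)
```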