AITopics | Kharitonov, Eugene

Kharitonov, Eugene

Gemma 3 Technical Report

Gemma Team, Kamath, Aishwarya, Ferret, Johan, Pathak, Shreya, Vieillard, Nino, Merhej, Ramona, Perrin, Sarah, Matejovicova, Tatiana, Ramé, Alexandre, Rivière, Morgane, Rouillard, Louis, Mesnard, Thomas, Cideron, Geoffrey, Grill, Jean-Bastien, Ramos, Sabela, Yvinec, Edouard, Casbon, Michelle, Pot, Etienne, Penchev, Ivo, Liu, Gaël, Visin, Francesco, Kenealy, Kathleen, Beyer, Lucas, Zhai, Xiaohai, Tsitsulin, Anton, Busa-Fekete, Robert, Feng, Alex, Sachdeva, Noveen, Coleman, Benjamin, Gao, Yi, Mustafa, Basil, Barr, Iain, Parisotto, Emilio, Tian, David, Eyal, Matan, Cherry, Colin, Peter, Jan-Thorsten, Sinopalnikov, Danila, Bhupatiraju, Surya, Agarwal, Rishabh, Kazemi, Mehran, Malkin, Dan, Kumar, Ravin, Vilar, David, Brusilovsky, Idan, Luo, Jiaming, Steiner, Andreas, Friesen, Abe, Sharma, Abhanshu, Sharma, Abheesht, Gilady, Adi Mayrav, Goedeckemeyer, Adrian, Saade, Alaa, Feng, Alex, Kolesnikov, Alexander, Bendebury, Alexei, Abdagic, Alvin, Vadi, Amit, György, András, Pinto, André Susano, Das, Anil, Bapna, Ankur, Miech, Antoine, Yang, Antoine, Paterson, Antonia, Shenoy, Ashish, Chakrabarti, Ayan, Piot, Bilal, Wu, Bo, Shahriari, Bobak, Petrini, Bryce, Chen, Charlie, Lan, Charline Le, Choquette-Choo, Christopher A., Carey, CJ, Brick, Cormac, Deutsch, Daniel, Eisenbud, Danielle, Cattle, Dee, Cheng, Derek, Paparas, Dimitris, Sreepathihalli, Divyashree Shivakumar, Reid, Doug, Tran, Dustin, Zelle, Dustin, Noland, Eric, Huizenga, Erwin, Kharitonov, Eugene, Liu, Frederick, Amirkhanyan, Gagik, Cameron, Glenn, Hashemi, Hadi, Klimczak-Plucińska, Hanna, Singh, Harman, Mehta, Harsh, Lehri, Harshal Tushar, Hazimeh, Hussein, Ballantyne, Ian, Szpektor, Idan, Nardini, Ivan, Pouget-Abadie, Jean, Chan, Jetha, Stanton, Joe, Wieting, John, Lai, Jonathan, Orbay, Jordi, Fernandez, Joseph, Newlan, Josh, Ji, Ju-yeong, Singh, Jyotinder, Black, Kat, Yu, Kathy, Hui, Kevin, Vodrahalli, Kiran, Greff, Klaus, Qiu, Linhai, Valentine, Marcella, Coelho, Marina, Ritter, Marvin, Hoffman, Matt, Watson, Matthew, Chaturvedi, Mayank, Moynihan, Michael, Ma, Min, Babar, Nabila, Noy, Natasha, Byrd, Nathan, Roy, Nick, Momchev, Nikola, Chauhan, Nilay, Sachdeva, Noveen, Bunyan, Oskar, Botarda, Pankil, Caron, Paul, Rubenstein, Paul Kishan, Culliton, Phil, Schmid, Philipp, Sessa, Pier Giuseppe, Xu, Pingmei, Stanczyk, Piotr, Tafti, Pouya, Shivanna, Rakesh, Wu, Renjie, Pan, Renke, Rokni, Reza, Willoughby, Rob, Vallu, Rohith, Mullins, Ryan, Jerome, Sammy, Smoot, Sara, Girgin, Sertan, Iqbal, Shariq, Reddy, Shashir, Sheth, Shruti, Põder, Siim, Bhatnagar, Sijal, Panyam, Sindhu Raghuram, Eiger, Sivan, Zhang, Susan, Liu, Tianqi, Yacovone, Trevor, Liechty, Tyler, Kalra, Uday, Evci, Utku, Misra, Vedant, Roseberry, Vincent, Feinberg, Vlad, Kolesnikov, Vlad, Han, Woohyun, Kwon, Woosuk, Chen, Xi, Chow, Yinlam, Zhu, Yuvein, Wei, Zichuan, Egyed, Zoltan, Cotruta, Victor, Giang, Minh, Kirk, Phoebe, Rao, Anand, Black, Kat, Babar, Nabila, Lo, Jessica, Moreira, Erica, Martins, Luiz Gustavo, Sanseviero, Omar, Gonzalez, Lucas, Gleicher, Zach, Warkentin, Tris, Mirrokni, Vahab, Senter, Evan, Collins, Eli, Barral, Joelle, Ghahramani, Zoubin, Hadsell, Raia, Matias, Yossi, Sculley, D., Petrov, Slav, Fiedel, Noah, Shazeer, Noam, Vinyals, Oriol, Dean, Jeff, Hassabis, Demis, Kavukcuoglu, Koray, Farabet, Clement, Buchatskaya, Elena, Alayrac, Jean-Baptiste, Anil, Rohan, Lepikhin, Dmitry, Borgeaud, Sebastian, Bachem, Olivier, Joulin, Armand, Andreev, Alek, Hardin, Cassidy, Dadashi, Robert, Hussenot, Léonard

arXiv.org Artificial Intelligence, Mar-25-2025

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, wider language coverage, and a longer context of at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers and keeping the span of local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction-finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
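The KV-cache saving described above can be illustrated with a back-of-the-envelope count. This is a minimal sketch, not Gemma 3 code: it assumes a repeating pattern of local (sliding-window) layers followed by one global layer, and the 48-layer depth, 5:1 local-to-global ratio and 1024-token window are illustrative assumptions rather than figures stated in this abstract. It counts cached token positions summed over layers and ignores heads, head dimension and numeric precision.

```python
def kv_cache_positions(num_layers: int, context_len: int,
                       local_per_global: int, window: int) -> int:
    """Total cached token positions summed over all layers.

    Illustrative assumption: layers repeat as `local_per_global` sliding-window
    layers followed by one global layer.
    """
    period = local_per_global + 1
    total = 0
    for layer in range(num_layers):
        is_global = layer % period == local_per_global
        # A global layer caches keys/values for the whole context; a local
        # layer only needs the most recent `window` positions.
        total += context_len if is_global else min(window, context_len)
    return total


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    layers, context, ratio, window = 48, 128 * 1024, 5, 1024
    mixed = kv_cache_positions(layers, context, ratio, window)
    all_global = layers * context
    print(f"{mixed / all_global:.3f}")  # → 0.173
```

Under these assumptions the cache is roughly a sixth of an all-global stack; the shorter the local window, the closer the total gets to the global layers' share alone.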

arxiv preprint arxiv, large language model, machine learning

arXiv: 2503.19786

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MAD Speech: Measures of Acoustic Diversity of Speech

Futeral, Matthieu, Agostinelli, Andrea, Tagliasacchi, Marco, Zeghidour, Neil, Kharitonov, Eugene

arXiv.org Artificial Intelligence, Apr-16-2024

Generative spoken language models produce speech in a wide range of voices, prosody, and recording conditions, seemingly approaching the diversity of natural speech. However, the extent to which generated speech is acoustically diverse remains unclear due to a lack of appropriate metrics. We address this gap by developing lightweight metrics of acoustic diversity, which we collectively refer to as MAD Speech. We focus on measuring five facets of acoustic diversity: voice, gender, emotion, accent, and background noise. We construct the metrics as a composition of specialized, per-facet embedding models and an aggregation function that measures diversity within the embedding space. Next, we build a series of datasets with a priori known diversity preferences for each facet. Using these datasets, we demonstrate that our proposed metrics achieve a stronger agreement with the ground-truth diversity than baselines. Finally, we showcase the applicability of our proposed metrics across several real-life evaluation scenarios. MAD Speech will be made publicly accessible.
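The abstract describes each metric as a per-facet embedding model followed by an aggregation function over the embedding space. As a minimal sketch of the second half only, the snippet below uses mean pairwise cosine distance as the aggregation; that choice, and the toy two-dimensional vectors standing in for real per-facet embeddings, are assumptions for illustration, and the aggregation used by MAD Speech may differ.

```python
import itertools
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def mean_pairwise_cosine_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average distance over all unordered pairs; higher means more diverse."""
    pairs = list(itertools.combinations(embeddings, 2))
    if not pairs:
        return 0.0  # A single sample carries no diversity signal.
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical "voice" embeddings: one set repeats a voice, one spreads out.
same_voice = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
varied_voices = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mean_pairwise_cosine_distance(same_voice)
      < mean_pairwise_cosine_distance(varied_voices))  # → True
```

Because the embedding model and the aggregation are separate pieces, swapping either one (a different facet encoder, a different diversity function) leaves the other untouched.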

artificial intelligence, machine learning, natural language

arXiv: 2404.10419

Country:

Europe (0.28)
North America > United States > Minnesota (0.14)
North America > United States > California (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)

AudioLM: a Language Modeling Approach to Audio Generation

Borsos, Zalán, Marinier, Raphaël, Vincent, Damien, Kharitonov, Eugene, Pietquin, Olivier, Sharifi, Matt, Roblek, Dominik, Teboul, Olivier, Grangier, David, Tagliasacchi, Marco, Zeghidour, Neil

arXiv.org Artificial Intelligence, Jul-25-2023

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
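AudioLM's long-term-structure tokens come from discretizing the activations of a pre-trained masked language model. A common way to do this, assumed here rather than taken from the abstract, is to assign each frame-level activation vector to its nearest centroid in a codebook learned beforehand (for example with k-means). The sketch below shows only that assignment step, with a tiny hand-written codebook and frames as placeholders.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Replace each activation frame with the index of its nearest centroid."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


# Placeholder 2-D codebook and activations; real models use far larger ones.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, 0.2], [0.9, 1.2], [4.0, 6.0], [0.0, 0.0]]
print(quantize(frames, codebook))  # → [0, 1, 2, 0]
```

The resulting integer sequence is what a language model can then be trained on; per the abstract, AudioLM pairs such tokens with neural codec codes to recover acoustic detail.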

artificial intelligence, machine learning, natural language

arXiv: 2209.03143

Country:

Europe > France (0.14)
Asia > China (0.14)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (0.93)
Media > Music (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.61)

AudioPaLM: A Large Language Model That Can Speak and Listen

Rubenstein, Paul K., Asawaroengchai, Chulayuth, Nguyen, Duc Dung, Bapna, Ankur, Borsos, Zalán, Quitry, Félix de Chaumont, Chen, Peter, Badawy, Dalia El, Han, Wei, Kharitonov, Eugene, Muckenhirn, Hannah, Padfield, Dirk, Qin, James, Rozenberg, Danny, Sainath, Tara, Schalkwyk, Johan, Sharifi, Matt, Ramanovich, Michelle Tadmor, Tagliasacchi, Marco, Tudor, Alexandru, Velimirović, Mihajlo, Vincent, Damien, Yu, Jiahui, Wang, Yongqiang, Zayats, Vicky, Zeghidour, Neil, Zhang, Yu, Zhang, Zhishuai, Zilka, Lukas, Frank, Christian

arXiv.org Artificial Intelligence, Jun-22-2023

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.
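One simple way to let a single decoder read and emit both text and speech is to give discrete audio tokens their own ID range after the text vocabulary, so one embedding table covers both. The sketch below assumes that layout; the vocabulary sizes are hypothetical, and it only shows the ID bookkeeping, not the model or how AudioPaLM itself is wired.

```python
from typing import List, Tuple

TEXT_VOCAB_SIZE = 32_000   # hypothetical text vocabulary size
AUDIO_VOCAB_SIZE = 1_024   # hypothetical number of discrete audio tokens


def audio_to_joint(audio_tokens: List[int]) -> List[int]:
    """Shift audio token IDs past the text IDs so the two ranges never collide."""
    for t in audio_tokens:
        if not 0 <= t < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio token out of range: {t}")
    return [TEXT_VOCAB_SIZE + t for t in audio_tokens]


def split_joint(ids: List[int]) -> Tuple[List[int], List[int]]:
    """Separate a mixed sequence back into text IDs and (unshifted) audio IDs."""
    text = [i for i in ids if i < TEXT_VOCAB_SIZE]
    audio = [i - TEXT_VOCAB_SIZE for i in ids if i >= TEXT_VOCAB_SIZE]
    return text, audio


# A prompt with text IDs followed by audio IDs, as one flat input sequence.
sequence = [17, 256] + audio_to_joint([3, 1023])
print(sequence)               # → [17, 256, 32003, 33023]
print(split_joint(sequence))  # → ([17, 256], [3, 1023])
```

Under this layout the embedding table would need TEXT_VOCAB_SIZE + AUDIO_VOCAB_SIZE rows, and initializing the first TEXT_VOCAB_SIZE rows from a text-only model is what lets text pretraining carry over.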

artificial intelligence, natural language, translation

arXiv: 2306.12925

Country:

Europe (0.92)
Asia > Japan > Honshū (0.14)
North America > United States > Texas (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Long-term Effects of Temperature Variations on Economic Growth: A Machine Learning Approach

Kharitonov, Eugene, Zakharchuk, Oksana, Mei, Lin

arXiv.org Artificial Intelligence, Jun-17-2023

This study investigates the long-term effects of temperature variations on economic growth using a data-driven approach. Leveraging machine learning techniques, we analyze global land surface temperature data from Berkeley Earth and economic indicators, including GDP and population data, from the World Bank. Our analysis reveals a significant relationship between average temperature and GDP growth, suggesting that climate variations can substantially impact economic performance. This research underscores the importance of incorporating climate factors into economic planning and policymaking, and it demonstrates the utility of machine learning in uncovering complex relationships in climate-economy studies.

artificial intelligence, economic growth, machine learning

arXiv: 2308.06265

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Banking & Finance > Economy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

SoundStorm: Efficient Parallel Audio Generation

Borsos, Zalán, Sharifi, Matt, Vincent, Damien, Kharitonov, Eugene, Zeghidour, Neil, Tagliasacchi, Marco

arXiv.org Artificial Intelligence, May-16-2023

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
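Confidence-based parallel decoding can be sketched in the style of MaskGIT: start with every position masked, let the model predict all masked positions at once, commit only the most confident predictions, and repeat with fewer masked positions each round. The cosine schedule, the `predict` callable and the toy predictor below are assumptions for illustration; the sketch ignores how SoundStorm organizes the codec's tokens and replaces its bidirectional model with a placeholder.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical model interface: given the partially masked sequence (None = masked),
# return a (token, confidence) pair for every position.
Predictor = Callable[[List[Optional[int]]], Sequence[Tuple[int, float]]]


def masked_after_step(n: int, step: int, num_steps: int) -> int:
    """Cosine schedule: how many positions stay masked after `step` (0-indexed)."""
    return math.floor(n * math.cos(math.pi / 2 * (step + 1) / num_steps))


def parallel_decode(n: int, num_steps: int, predict: Predictor) -> List[int]:
    seq: List[Optional[int]] = [None] * n
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok is None]
        if not masked:
            break
        preds = predict(seq)
        # Fill enough positions to reach the schedule, and always at least one.
        num_to_fill = max(len(masked) - masked_after_step(n, step, num_steps), 1)
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:num_to_fill]:
            seq[i] = preds[i][0]
    # The schedule reaches zero at the last step, so nothing is left masked.
    return [tok for tok in seq if tok is not None]


def toy_predict(seq: List[Optional[int]]) -> List[Tuple[int, float]]:
    # Placeholder "model": token = 10 * position, confidence grows with position.
    return [(10 * i, i / len(seq)) for i in range(len(seq))]


print(parallel_decode(8, 4, toy_predict))  # → [0, 10, 20, 30, 40, 50, 60, 70]
```

With 8 positions and 4 steps this schedule commits 1, 2, 2 and then 3 tokens, so the sequence is produced in 4 model calls instead of the 8 an autoregressive decoder would need.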

artificial intelligence, machine learning, natural language

arXiv: 2305.09636

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)

Generative Spoken Dialogue Language Modeling

Nguyen, Tu Anh, Kharitonov, Eugene, Copet, Jade, Adi, Yossi, Hsu, Wei-Ning, Elkahky, Ali, Tomasello, Paden, Algayres, Robin, Sagot, Benoit, Mohamed, Abdelrahman, Dupoux, Emmanuel

arXiv.org Artificial Intelligence, Nov-22-2022

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.

artificial intelligence, machine learning, natural language

arXiv: 2203.16502

Country:

Europe (0.67)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.84)

Textless Speech Emotion Conversion using Decomposed and Discrete Representations

Kreuk, Felix, Polyak, Adam, Copet, Jade, Kharitonov, Eugene, Nguyen, Tu-Anh, Rivière, Morgane, Hsu, Wei-Ning, Mohamed, Abdelrahman, Dupoux, Emmanuel, Adi, Yossi

arXiv.org Artificial Intelligence, Nov-14-2021

Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples and code will be publicly available under the following link: https://speechbot.github.io/emotion.

artificial intelligence, machine learning, natural language

arXiv: 2111.07402

Country: Asia (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN

Chaabouni, Rahma, Dessì, Roberto, Kharitonov, Eugene

arXiv.org Artificial Intelligence, Jul-3-2021

Despite their practical success, modern seq2seq architectures are unable to generalize systematically on several SCAN tasks. Hence, it is not clear whether SCAN-style compositional generalization is useful in realistic NLP tasks. In this work, we study the benefit that such compositionality brings to several machine translation tasks. We present several focused modifications of the Transformer that greatly improve generalization capabilities on SCAN and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance in low-resource settings and on a newly introduced distribution-shifted English-French translation task. Overall, we find that the improvements of a SCAN-capable model do not directly transfer to the resource-rich MT setup. In contrast, in the low-resource setup, general modifications lead to an improvement of up to 13.1% BLEU score over a vanilla Transformer. Similarly, an improvement of 14% in an accuracy-based metric is achieved on the introduced compositional English-French translation task. This provides experimental evidence that the compositional generalization assessed in SCAN is particularly useful in resource-starved and domain-shifted scenarios.

deep learning, neural network, transformer

arXiv: 2107.01366

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Interpretable agent communication from scratch (with a generic visual processor emerging on the side)

Dessì, Roberto, Kharitonov, Eugene, Baroni, Marco

arXiv.org Artificial Intelligence, Jun-8-2021

As deep networks begin to be deployed as autonomous agents, the issue of how they can communicate with each other becomes important. Here, we train two deep nets from scratch to perform realistic referent identification through unsupervised emergent communication. We show that the largely interpretable emergent protocol allows the nets to successfully communicate even about object types they did not see at training time. The visual representations induced as a by-product of our training regime, moreover, show comparable quality, when re-used as generic visual features, to a recent self-supervised learning model. Our results provide concrete evidence of the viability of (interpretable) emergent deep net communication in a more realistic scenario than previously considered, as well as establishing an intriguing link between this field and self-supervised visual learning.

artificial intelligence, natural language, proceedings

arXiv: 2106.04258

Country:

Europe (1.00)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Florida (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)