AITopics | Kutuzov, Andrey

Collaborating Authors

Kutuzov, Andrey

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Burchell, Laurie, de Gibert, Ona, Arefyev, Nikolay, Aulamo, Mikko, Bañón, Marta, Chen, Pinzhen, Fedorova, Mariia, Guillou, Liane, Haddow, Barry, Hajič, Jan, Helcl, Jindřich, Henriksson, Erik, Klimaszewski, Mateusz, Komulainen, Ville, Kutuzov, Andrey, Kytöniemi, Joona, Laippala, Veronika, Mæhlum, Petter, Malik, Bhavitvya, Mehryary, Farrokh, Mikhailov, Vladislav, Moghe, Nikita, Myntti, Amanda, O'Brien, Dayyán, Oepen, Stephan, Pal, Proyag, Piha, Jousia, Pyysalo, Sampo, Ramírez-Sánchez, Gema, Samuel, David, Stepachev, Pavel, Tiedemann, Jörg, Variš, Dušan, Vojtěchová, Tereza, Zaragoza-Bernabeu, Jaume

arXiv.org Artificial IntelligenceMar-14-2025

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

artificial intelligence, machine translation, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.10267

Country:

Europe (1.00)
Asia > Middle East (0.92)
South America (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology (0.67)
Government (0.46)
Education (0.46)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

de la Rosa, Javier, Mikhailov, Vladislav, Zhang, Lemei, Wetjen, Freddy, Samuel, David, Liu, Peng, Braaten, Rolv-Arild, Mæhlum, Petter, Birkenes, Magnus Breder, Kutuzov, Andrey, Enstad, Tita, Brygfjeld, Svein Arne, Gulla, Jon Atle, Oepen, Stephan, Velldal, Erik, Østgulen, Wilfred, Øvrelid, Liljia, Myhre, Aslak Sira

arXiv.org Artificial IntelligenceDec-12-2024

The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2412.0946

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Media > News (0.38)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Small Languages, Big Models: A Study of Continual Training on Languages of Norway

Samuel, David, Mikhailov, Vladislav, Velldal, Erik, Øvrelid, Lilja, Charpentier, Lucas Georges Gabriel, Kutuzov, Andrey

arXiv.org Artificial IntelligenceDec-9-2024

This method vast amounts of data, posing a challenge enables us to train an 11.4B parameter model that for less widely spoken languages like Norwegian achieves state-of-the-art performance across Norwegian and even more so for truly lowresource language tasks while maintaining strong languages like Sámi. To address capabilities in Northern Sámi. The three main research this issue, we present a novel three-stage contributions of this paper can be summarized continual training approach. We also experiment as follows: with combining causal and masked 1. Novel training method for data-constrained language modeling to get more flexible language models We propose a three-stage models. Based on our findings, we train, training method for efficient adaptation of existing evaluate, and openly release a new large language models to lower-resource languages.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.06484

Country:

Europe (1.00)
Asia (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.88)

Add feedback

AXOLOTL'24 Shared Task on Multilingual Explainable Semantic Change Modeling

Fedorova, Mariia, Mickus, Timothee, Partanen, Niko, Siewert, Janine, Spaziani, Elena, Kutuzov, Andrey

arXiv.org Artificial IntelligenceJul-4-2024

This paper describes the organization and findings of AXOLOTL'24, the first multilingual explainable semantic change modeling shared task. We present new sense-annotated diachronic semantic change datasets for Finnish and Russian which were employed in the shared task, along with a surprise test-only German dataset borrowed from an existing source. The setup of AXOLOTL'24 is new to the semantic change modeling field, and involves subtasks of identifying unknown (novel) senses and providing dictionary-like definitions to these senses. The methods of the winning teams are described and compared, thus paving a path towards explainability in computational approaches to historical change of meaning.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2407.04079

Country:

Europe > Middle East > Malta (0.14)
North America > United States > New Mexico (0.14)
North America > United States > Maryland (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

Add feedback

Definition generation for lexical semantic change detection

Fedorova, Mariia, Kutuzov, Andrey, Scherrer, Yves

arXiv.org Artificial IntelligenceJun-20-2024

We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as `senses', and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and allows to inspect the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.

artificial intelligence, natural language, text processing, (16 more...)

arXiv.org Artificial Intelligence

2406.14167

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

Add feedback

Enriching Word Usage Graphs with Cluster Definitions

Fedorova, Mariia, Kutuzov, Andrey, Arefyev, Nikolay, Schlechtweg, Dominik

arXiv.org Artificial IntelligenceMar-26-2024

We present a dataset of word usage graphs (WUGs), where the existing WUGs for multiple languages are enriched with cluster labels functioning as sense definitions. They are generated from scratch by fine-tuned encoder-decoder language models. The conducted human evaluation has shown that these definitions match the existing clusters in WUGs better than the definitions chosen from WordNet by two baseline systems. At the same time, the method is straightforward to use and easy to extend to new languages. The resulting enriched datasets can be extremely helpful for moving on to explainable semantic change modeling.

artificial intelligence, natural language, text processing, (15 more...)

arXiv.org Artificial Intelligence

2403.18024

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.49)

Add feedback

A New Massive Multilingual Dataset for High-Performance Language Technologies

de Gibert, Ona, Nail, Graeme, Arefyev, Nikolay, Bañón, Marta, van der Linde, Jelmer, Ji, Shaoxiong, Zaragoza-Bernabeu, Jaume, Aulamo, Mikko, Ramírez-Sánchez, Gema, Kutuzov, Andrey, Pyysalo, Sampo, Oepen, Stephan, Tiedemann, Jörg

arXiv.org Artificial IntelligenceMar-20-2024

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

artificial intelligence, natural language, text processing, (16 more...)

arXiv.org Artificial Intelligence

2403.14009

Country: Europe > Finland (0.46)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)

Add feedback

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

Chen, Pinzhen, Ji, Shaoxiong, Bogoychev, Nikolay, Kutuzov, Andrey, Haddow, Barry, Heafield, Kenneth

arXiv.org Artificial IntelligenceJan-30-2024

Foundational large language models (LLMs) can be instruction-tuned to perform open-domain question answering, facilitating applications like chat assistants. While such efforts are often carried out in a single language, we empirically analyze cost-efficient strategies for multilingual scenarios. Our study employs the Alpaca dataset and machine translations of it to form multilingual data, which is then used to tune LLMs through either low-rank adaptation or full-parameter training. Under a controlled computation budget, comparisons show that multilingual tuning is on par or better than tuning a model for each language. Furthermore, multilingual tuning with downsampled data can be as powerful and more robust. Our findings serve as a guide for expanding language support through instruction tuning.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2309.08958

Country: Europe > Finland (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)

Add feedback

Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis

Giulianelli, Mario, Luden, Iris, Fernandez, Raquel, Kutuzov, Andrey

arXiv.org Artificial IntelligenceJul-25-2023

We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations. Given a collection of usage examples for a target word, and the corresponding data-driven usage clusters (i.e., word senses), a definition is generated for each usage with a specialised Flan-T5 language model, and the most prototypical definition in a usage cluster is chosen as the sense label. We demonstrate how the resulting sense labels can make existing approaches to semantic change analysis more interpretable, and how they can allow users -- historical linguists, lexicographers, or social scientists -- to explore and intuitively explain diachronic trajectories of word meaning. Semantic change analysis is only one of many possible applications of the `definitions as representations' paradigm. Beyond being human-readable, contextualised definitions also outperform token or usage sentence embeddings in word-in-context semantic similarity judgements, making them a new promising type of lexical representation for NLP.

artificial intelligence, natural language, text processing, (18 more...)

arXiv.org Artificial Intelligence

2305.11993

Country:

Europe (1.00)
Asia (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)

Add feedback

Trained on 100 million words and still in shape: BERT meets British National Corpus

Samuel, David, Kutuzov, Andrey, Øvrelid, Lilja, Velldal, Erik

arXiv.org Artificial IntelligenceMay-5-2023

While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2303.09859

Country:

Europe (1.00)
Asia (0.67)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
(2 more...)

Add feedback