
 Variš, Dušan


An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

arXiv.org Artificial Intelligence

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual corpora, both monolingual and parallel. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
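The released corpora consist of monolingual documents and parallel sentence pairs; the exact file formats are described in the HPLT v2 documentation. As a rough, non-authoritative illustration of working with parallel data of this kind, the Python sketch below streams gzipped tab-separated source-target files and counts sentence pairs per language pair. The directory layout, file naming, and field order are assumptions made for this example, not the actual HPLT release format.

import gzip
from collections import Counter
from pathlib import Path

def iter_sentence_pairs(path):
    """Yield (source, target) pairs from a gzipped, tab-separated file.

    The 'one pair per line, source<TAB>target' layout is an assumption made
    for this sketch; consult the HPLT v2 release notes for the real format.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2 and all(parts):
                yield parts[0], parts[1]

def count_pairs(corpus_dir):
    """Count sentence pairs per language pair, assuming files named like en-cs.tsv.gz."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.tsv.gz"):
        lang_pair = path.name.removesuffix(".tsv.gz")
        counts[lang_pair] = sum(1 for _ in iter_sentence_pairs(path))
    return counts

if __name__ == "__main__":
    # "hplt_v2_parallel" is a placeholder directory name for this illustration.
    for pair, n in count_pairs("hplt_v2_parallel").most_common():
        print(pair, n, sep="\t")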


Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers

arXiv.org Artificial Intelligence

The Transformer model has a tendency to overfit various aspects of the training data, such as the overall sequence length. We study elementary string edit functions using a defined set of error indicators to interpret the behaviour of the sequence-to-sequence Transformer. We show that generalization to shorter sequences is often possible, but confirm that longer sequences are highly problematic, although partially correct answers are often obtained. Additionally, we find that other structural characteristics of the sequences, such as subsegment length, may be equally important.
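To make the experimental setup concrete, the following Python sketch generates data for one such elementary function, string reversal, with training and validation lengths restricted to a narrow range and test lengths lying strictly outside it. The specific length ranges, alphabet, and the exact-match metric are illustrative assumptions, not the paper's exact configuration.

import random
import string

def make_reversal_examples(n, min_len, max_len, alphabet=string.ascii_lowercase, seed=0):
    """Generate (input, target) pairs for the string reversal task with
    sequence lengths drawn uniformly from [min_len, max_len]."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        s = "".join(rng.choice(alphabet) for _ in range(length))
        examples.append((s, s[::-1]))
    return examples

# Length-based out-of-distribution split: train/validation on a limited length
# range, test on strictly longer (and shorter) strings the model never saw.
train = make_reversal_examples(10_000, 5, 20, seed=1)
valid = make_reversal_examples(1_000, 5, 20, seed=2)
test_longer = make_reversal_examples(1_000, 21, 40, seed=3)
test_shorter = make_reversal_examples(1_000, 1, 4, seed=4)

def exact_match(predict, examples):
    """Fraction of examples where the predicted string equals the reference;
    `predict` stands for whatever sequence-to-sequence model is being probed."""
    return sum(predict(src) == tgt for src, tgt in examples) / len(examples)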


Negative Lexical Constraints in Neural Machine Translation

arXiv.org Artificial Intelligence

This paper explores negative lexical constraining in English-to-Czech neural machine translation. Negative lexical constraining is used to prohibit certain words or expressions in the translation produced by the neural translation model. We compared various methods based on modifying either the decoding process or the training data. The comparison was performed on two tasks: paraphrasing and feedback-based translation refinement. We also studied to what extent these methods "evade" the constraints presented to the model (usually in dictionary form) by generating a different surface form of a given constraint. We propose a way to mitigate this issue by training with stemmed negative constraints, countering the model's ability to produce a variety of surface forms of a word and thereby bypass the constraint. We demonstrate that our method improves the constraining, although the problem still persists in many cases.
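As a small illustration of why stemming matters here, the Python sketch below checks a hypothesis against negative constraints by comparing stems instead of surface forms, so an inflected variant of a prohibited word is still caught. The toy suffix-stripping stemmer and the example words are placeholders only; the paper applies proper stemming to the constraints used during training.

def crude_stem(word, suffixes=("ami", "ách", "ové", "ům", "y", "u", "e", "a")):
    """Toy suffix stripper standing in for a real Czech stemmer (placeholder only)."""
    word = word.lower()
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def violates_negative_constraints(hypothesis, constraints):
    """Return the constraints whose stem appears among the stems of the hypothesis tokens."""
    hyp_stems = {crude_stem(tok) for tok in hypothesis.split()}
    return [c for c in constraints if crude_stem(c) in hyp_stems]

# Surface-form matching on the constraint "řeka" would miss the inflected form
# "řeky" in the output; stem matching flags it (both reduce to "řek" here).
print(violates_negative_constraints("Nové domy stojí u řeky", ["řeka"]))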