
 Variš, Dušan


An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

arXiv.org Artificial Intelligence

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual corpora, both monolingual and parallel. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
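The released corpora consist of monolingual documents and parallel sentence pairs; the exact file formats are described in the HPLT v2 documentation. As a rough, non-authoritative illustration of working with parallel data of this kind, the Python sketch below streams gzipped tab-separated source-target files and counts sentence pairs per language pair. The directory layout, file naming, and field order are assumptions made for this example, not the actual HPLT release format.

import gzip
from collections import Counter
from pathlib import Path

def iter_sentence_pairs(path):
    """Yield (source, target) pairs from a gzipped, tab-separated file.

    The 'one pair per line, source<TAB>target' layout is an assumption made
    for this sketch; consult the HPLT v2 release notes for the real format.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2 and all(parts):
                yield parts[0], parts[1]

def count_pairs(corpus_dir):
    """Count sentence pairs per language pair, assuming files named like en-cs.tsv.gz."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.tsv.gz"):
        lang_pair = path.name.removesuffix(".tsv.gz")
        counts[lang_pair] = sum(1 for _ in iter_sentence_pairs(path))
    return counts

if __name__ == "__main__":
    # "hplt_v2_parallel" is a placeholder directory name for this illustration.
    for pair, n in count_pairs("hplt_v2_parallel").most_common():
        print(pair, n, sep="\t")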


Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers

arXiv.org Artificial Intelligence

The Transformer model has a tendency to overfit various aspects of the training data, such as the overall sequence length. We study elementary string edit functions using a defined set of error indicators to interpret the behaviour of the sequence-to-sequence Transformer. We show that generalization to shorter sequences is often possible, but confirm that longer sequences are highly problematic, although partially correct answers are often obtained. Additionally, we find that other structural characteristics of the sequences, such as subsegment length, may be equally important.
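To make the experimental setup concrete, the following Python sketch generates data for one such elementary function, string reversal, with training and validation lengths restricted to a narrow range and test lengths lying strictly outside it. The specific length ranges, alphabet, and the exact-match metric are illustrative assumptions, not the paper's exact configuration.

import random
import string

def make_reversal_examples(n, min_len, max_len, alphabet=string.ascii_lowercase, seed=0):
    """Generate (input, target) pairs for the string reversal task with
    sequence lengths drawn uniformly from [min_len, max_len]."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        s = "".join(rng.choice(alphabet) for _ in range(length))
        examples.append((s, s[::-1]))
    return examples

# Length-based out-of-distribution split: train/validation on a limited length
# range, test on strictly longer (and shorter) strings the model never saw.
train = make_reversal_examples(10_000, 5, 20, seed=1)
valid = make_reversal_examples(1_000, 5, 20, seed=2)
test_longer = make_reversal_examples(1_000, 21, 40, seed=3)
test_shorter = make_reversal_examples(1_000, 1, 4, seed=4)

def exact_match(predict, examples):
    """Fraction of examples where the predicted string equals the reference;
    `predict` stands for whatever sequence-to-sequence model is being probed."""
    return sum(predict(src) == tgt for src, tgt in examples) / len(examples)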


Negative Lexical Constraints in Neural Machine Translation

arXiv.org Artificial Intelligence

This paper explores negative lexical constraining in English-to-Czech neural machine translation. Negative lexical constraining is used to prohibit certain words or expressions in the translation produced by the neural translation model. We compared various methods based on modifying either the decoding process or the training data. The comparison was performed on two tasks: paraphrasing and feedback-based translation refinement. We also studied to what extent these methods "evade" the constraints presented to the model (usually in dictionary form) by generating a different surface form of a given constraint. We propose a way to mitigate this issue by training with stemmed negative constraints, countering the model's ability to produce a variety of surface forms of a word and thereby bypass the constraint. We demonstrate that our method improves the constraining, although the problem still persists in many cases.
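As a small illustration of why stemming matters here, the Python sketch below checks a hypothesis against negative constraints by comparing stems instead of surface forms, so an inflected variant of a prohibited word is still caught. The toy suffix-stripping stemmer and the example words are placeholders only; the paper applies proper stemming to the constraints used during training.

def crude_stem(word, suffixes=("ami", "ách", "ové", "ům", "y", "u", "e", "a")):
    """Toy suffix stripper standing in for a real Czech stemmer (placeholder only)."""
    word = word.lower()
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def violates_negative_constraints(hypothesis, constraints):
    """Return the constraints whose stem appears among the stems of the hypothesis tokens."""
    hyp_stems = {crude_stem(tok) for tok in hypothesis.split()}
    return [c for c in constraints if crude_stem(c) in hyp_stems]

# Surface-form matching on the constraint "řeka" would miss the inflected form
# "řeky" in the output; stem matching flags it (both reduce to "řek" here).
print(violates_negative_constraints("Nové domy stojí u řeky", ["řeka"]))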