
Morphological evaluation of subwords vocabulary used by BETO language model

García-Sierra, Óscar, Cesteros, Ana Fernández-Pampillón, Ortega-Martín, Miguel

arXiv.org Artificial Intelligence

Subword tokenization algorithms used by Large Language Models are significantly more efficient than word-level tokenization and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models' performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures (relevance, cohesion, and morphological accuracy) and a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, along with our previous results, helped us conclude that its vocabulary has a low morphological quality, and we also found that training the tokenizer on a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helped clarify which algorithm the tokenizer actually uses, namely Wordpiece, given the inconsistencies between the authors' claims and the model's configuration.
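The evaluation described in this abstract rests on comparing the subword splits a tokenizer produces against a word's true morpheme boundaries. A minimal sketch of the idea, assuming a greedy longest-match WordPiece-style tokenizer and a simple boundary-overlap score (the toy vocabulary, words, and morpheme segmentations below are invented for illustration and are not the paper's actual measures):

```python
# Toy vocabulary; "##" marks word-internal continuation pieces, as in
# WordPiece. The entries are chosen so one word splits on a morpheme
# boundary and the other does not.
TOY_VOCAB = {"gat", "##os", "ca", "##ntamos"}

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def morpheme_overlap(tokens, morphemes):
    """Fraction of subword cut points that coincide with morpheme boundaries
    (a toy stand-in for the paper's quality measures)."""
    def cut_points(parts):
        pos, cuts = 0, set()
        for part in parts[:-1]:          # internal boundaries only
            pos += len(part.lstrip("#"))
            cuts.add(pos)
        return cuts
    tok_cuts = cut_points(tokens)
    if not tok_cuts:                     # single-token word: nothing to judge
        return 1.0 if not cut_points(morphemes) else 0.0
    return len(tok_cuts & cut_points(morphemes)) / len(tok_cuts)

# "gatos" = gat + os: the split matches the morphemes.
print(wordpiece_tokenize("gatos", TOY_VOCAB))      # ['gat', '##os']
print(morpheme_overlap(["gat", "##os"], ["gat", "os"]))          # 1.0

# "cantamos" = cant + a + mos: the greedy split cuts in the wrong place.
print(wordpiece_tokenize("cantamos", TOY_VOCAB))   # ['ca', '##ntamos']
print(morpheme_overlap(["ca", "##ntamos"], ["cant", "a", "mos"]))  # 0.0
```

The second word shows the phenomenon the paper studies: a statistically learned vocabulary can produce splits with no morphological basis, which the quality measures are designed to detect.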


Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models

Sarker, Shraboni, Hamad, Ahmad Tamim, Alshammari, Hulayyil, Grieco, Viviana, Rao, Praveen

arXiv.org Artificial Intelligence

Large language models have gained tremendous popularity in domains such as e-commerce, finance, healthcare, and education. Fine-tuning is a common approach to customize an LLM on a domain-specific dataset for a desired downstream task. In this paper, we present a valuable resource for fine-tuning LLMs developed for the Spanish language to perform a variety of tasks such as classification, masked language modeling, clustering, and others. Our resource is a collection of handwritten notary records from the seventeenth century obtained from the National Archives of Argentina. This collection contains a combination of original images and transcribed text (and metadata) of 160+ pages that were handwritten by two notaries, namely, Estenban Agreda de Vergara and Nicolas de Valdivia y Brisuela, nearly 400 years ago. Through empirical evaluation, we demonstrate that our collection can be used to fine-tune Spanish LLMs for tasks such as classification and masked language modeling, and that the fine-tuned models can outperform pre-trained Spanish models and ChatGPT-3.5/ChatGPT-4o. Our collection will be an invaluable resource for historical text analysis and is publicly available on GitHub.


Spanish Language Models

Gutiérrez-Fandiño, Asier, Armengol-Estapé, Jordi, Pàmies, Marc, Llop-Palao, Joan, Silveira-Ocampo, Joaquín, Carrino, Casimiro Pio, Gonzalez-Agirre, Aitor, Armentano-Oller, Carme, Rodriguez-Penagos, Carlos, Villegas, Marta

arXiv.org Artificial Intelligence

This paper presents the Spanish RoBERTa-base and RoBERTa-large models, as well as the corresponding performance evaluations. Both models were pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the National Library of Spain from 2009 to 2019. We extended the current evaluation datasets with an extractive Question Answering dataset, and our models outperform the existing Spanish models across tasks and settings.


BETO: Spanish BERT

#artificialintelligence

Transformer-based models are creating tremendous impact in the space of NLP, as they have proven to be effective in a wide range of tasks such as POS tagging, machine translation, named-entity recognition, and a series of text classification tasks. This year saw the introduction of a whole family of transformer-based language models such as BERT, Transformer-XL, and GPT-2, among others. Language models, in general, offer desirable properties that can be leveraged in a transfer learning setting, where a model is trained on large-scale data to learn the properties of language in an unsupervised manner. The resulting model and weights can then be fine-tuned and applied in low-resourced regimes to address different NLP tasks. In particular, it's exciting to see the use of BERT in different domains such as text classification, text summarization, text generation, and information retrieval.