
Morphological evaluation of subwords vocabulary used by BETO language model

García-Sierra, Óscar, Cesteros, Ana Fernández-Pampillón, Ortega-Martín, Miguel

arXiv.org Artificial Intelligence

Subword tokenization algorithms used by Large Language Models are significantly more efficient than word-level tokenization and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models' performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures (relevance, cohesion, and morphological accuracy) and a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, along with our previous results, helped us conclude that its vocabulary has a low morphological quality, and we also found that training the tokenizer on a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helped clarify which algorithm the tokenizer actually uses, namely Wordpiece, given the inconsistencies between the authors' claims and the model's configuration.
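The evaluation described in this abstract rests on comparing the subword splits a tokenizer produces against a word's true morpheme boundaries. A minimal sketch of the idea, assuming a greedy longest-match WordPiece-style tokenizer and a simple boundary-overlap score (the toy vocabulary, words, and morpheme segmentations below are invented for illustration and are not the paper's actual measures):

```python
# Toy vocabulary; "##" marks word-internal continuation pieces, as in
# WordPiece. The entries are chosen so one word splits on a morpheme
# boundary and the other does not.
TOY_VOCAB = {"gat", "##os", "ca", "##ntamos"}

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def morpheme_overlap(tokens, morphemes):
    """Fraction of subword cut points that coincide with morpheme boundaries
    (a toy stand-in for the paper's quality measures)."""
    def cut_points(parts):
        pos, cuts = 0, set()
        for part in parts[:-1]:          # internal boundaries only
            pos += len(part.lstrip("#"))
            cuts.add(pos)
        return cuts
    tok_cuts = cut_points(tokens)
    if not tok_cuts:                     # single-token word: nothing to judge
        return 1.0 if not cut_points(morphemes) else 0.0
    return len(tok_cuts & cut_points(morphemes)) / len(tok_cuts)

# "gatos" = gat + os: the split matches the morphemes.
print(wordpiece_tokenize("gatos", TOY_VOCAB))      # ['gat', '##os']
print(morpheme_overlap(["gat", "##os"], ["gat", "os"]))          # 1.0

# "cantamos" = cant + a + mos: the greedy split cuts in the wrong place.
print(wordpiece_tokenize("cantamos", TOY_VOCAB))   # ['ca', '##ntamos']
print(morpheme_overlap(["ca", "##ntamos"], ["cant", "a", "mos"]))  # 0.0
```

The second word shows the phenomenon the paper studies: a statistically learned vocabulary can produce splits with no morphological basis, which the quality measures are designed to detect.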


Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models

Sarker, Shraboni, Hamad, Ahmad Tamim, Alshammari, Hulayyil, Grieco, Viviana, Rao, Praveen

arXiv.org Artificial Intelligence

Large language models have gained tremendous popularity in domains such as e-commerce, finance, healthcare, and education. Fine-tuning is a common approach to customize an LLM on a domain-specific dataset for a desired downstream task. In this paper, we present a valuable resource for fine-tuning LLMs developed for the Spanish language to perform a variety of tasks such as classification, masked language modeling, clustering, and others. Our resource is a collection of handwritten notary records from the seventeenth century obtained from the National Archives of Argentina. This collection contains a combination of original images and transcribed text (and metadata) of 160+ pages that were handwritten by two notaries, namely, Estenban Agreda de Vergara and Nicolas de Valdivia y Brisuela, nearly 400 years ago. Through empirical evaluation, we demonstrate that our collection can be used to fine-tune Spanish LLMs for tasks such as classification and masked language modeling, and that the fine-tuned models can outperform pre-trained Spanish models and ChatGPT-3.5/ChatGPT-4o. Our collection will be an invaluable resource for historical text analysis and is publicly available on GitHub.


Spanish Language Models

Gutiérrez-Fandiño, Asier, Armengol-Estapé, Jordi, Pàmies, Marc, Llop-Palao, Joan, Silveira-Ocampo, Joaquín, Carrino, Casimiro Pio, Gonzalez-Agirre, Aitor, Armentano-Oller, Carme, Rodriguez-Penagos, Carlos, Villegas, Marta

arXiv.org Artificial Intelligence

This paper presents the Spanish RoBERTa-base and RoBERTa-large models, as well as the corresponding performance evaluations. Both models were pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the National Library of Spain from 2009 to 2019. We extended the current evaluation datasets with an extractive Question Answering dataset, and our models outperform the existing Spanish models across tasks and settings.


BETO: Spanish BERT

#artificialintelligence

Transformer-based models are creating tremendous impact in the space of NLP, as they have proven to be effective in a wide range of tasks such as POS tagging, machine translation, named-entity recognition, and a series of text classification tasks. This year saw the introduction of a whole family of transformer-based language models such as BERT, Transformer-XL, and GPT-2, among others. Language models, in general, offer desirable properties that can be leveraged in a transfer learning setting, where a model is trained on large-scale data to learn the properties of language in an unsupervised manner. The resulting model and weights can then be fine-tuned and applied in low-resourced regimes to address different NLP tasks. In particular, it's exciting to see the use of BERT in different domains such as text classification, text summarization, text generation, and information retrieval.