Spanish language
RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware
Gómez, Gonzalo Santamaría, Subies, Guillem García, Ruiz, Pablo Gutiérrez, Valero, Mario González, Fuertes, Natàlia, Zamorano, Helena Montoro, Sanz, Carmen Muñoz, Plaza, Leire Rosado, García, Nuria Aldama, Sánchez, David Betancur, Sushkova, Kateryna, Nieto, Marta Guerrero, Jiménez, Álvaro Barbero
Large Language Models (LLMs) have become a key element of modern artificial intelligence, demonstrating the ability to address a wide range of language processing tasks at unprecedented levels of accuracy without the need to collect problem-specific data. However, these versatile models face a significant challenge: both their training and inference processes require substantial computational resources, time, and memory. Consequently, optimizing models of this kind to minimize these requirements is crucial. In this article, we demonstrate that, with minimal resources and in a remarkably short time, it is possible, using a relatively small pretrained LLM as a basis, to enhance a state-of-the-art model for a given language task without compromising its overall capabilities. Specifically, we present our use case, RigoChat 2, illustrating how LLMs can be adapted to achieve superior results in Spanish-language tasks.
- Europe > Spain (0.14)
- Europe > Netherlands (0.14)
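The abstract above describes adapting a pretrained LLM with minimal resources. One common way to do this is a low-rank update to the frozen base weights, in the spirit of LoRA; the toy sketch below (pure Python, hypothetical matrices, not the RigoChat 2 code) shows the core arithmetic: only the small factors A and B are trained, and the adapted weight is W + A @ B.

```python
# Minimal sketch of parameter-efficient adaptation: rather than
# retraining the full weight matrix W, learn a small low-rank
# update A @ B and add it to the frozen base weights.

def matmul(a, b):
    """Multiply two matrices represented as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def adapt(W, A, B, alpha=1.0):
    """Return W + alpha * (A @ B), the adapted weight matrix."""
    delta = matmul(A, B)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 2x2 base weight adapted with rank-1 factors: only the 4 numbers
# in A and B are trained, a saving that scales dramatically for the
# large square matrices inside a real LLM.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5], [0.25]]   # 2x1
B = [[0.1, 0.2]]      # 1x2
W_adapted = adapt(W, A, B)
```

In a real model the same delta is applied to attention and feed-forward projections of dimension in the thousands, which is why a rank of 8 or 16 can adapt billions of parameters while training only a few million.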
Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail?
Mayor-Rocher, Marina, Melero, Nina, Merino-Gómez, Elena, Grandury, María, Conde, Javier, Reviriego, Pedro
Large Language Models (LLMs) have been extensively evaluated on their ability to answer questions on many topics and on their performance on different natural language understanding tasks. Those tests are usually conducted in English, but most LLM users are not native English speakers. Therefore, it is of interest to analyze how LLMs understand other languages at different levels: from paragraphs to morphemes. In this paper, we evaluate the performance of state-of-the-art LLMs on TELEIA, a recently released benchmark with questions similar to those of Spanish exams for foreign students, covering topics such as reading comprehension, word formation, meaning and compositional semantics, and grammar. The results show that LLMs perform well at understanding Spanish but are still far from achieving the level of a native speaker in terms of grammatical competence.
- Europe > Spain > Galicia > Madrid (0.04)
- South America (0.04)
- North America > Central America (0.04)
- Government (0.46)
- Education > Assessment & Standards > Student Performance (0.35)
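Benchmarks like the one described above typically reduce to comparing a model's chosen option on each multiple-choice item against an answer key. The sketch below (an assumed setup, not the TELEIA evaluation code) shows this scoring step in its simplest form.

```python
# Score a multiple-choice exam: the fraction of items where the
# model's predicted option letter matches the answer key.

def score(predictions, answer_key):
    """Return accuracy over paired (prediction, gold answer) items."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Hypothetical model outputs for four exam items; 3 of 4 match.
acc = score(["b", "a", "c", "a"], ["b", "a", "d", "a"])
```

Real evaluations add detail on top of this, for example per-topic breakdowns (reading comprehension vs. grammar) and prompt templates to extract the option letter from free-form model output, but the headline number is this ratio.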
CNER: A tool Classifier of Named-Entity Relationships
Torres, Jefferson A. Peña, De Piñerez, Raúl E. Gutiérrez
Spanish is occasionally adopted as the focus language for research, and as a result multiple projects are conducted in Spanish to explore language-specific nuances and challenges in NLP applications. Named-Entity Recognition [1], Machine Translation [2], and Semantic Relation Extraction [3], among other tasks, have been studied with a focus on Spanish-language data, allowing for a more nuanced understanding of the intricacies involved. In this context, language technologies and natural language processing (NLP) tools can support the identification of useful information in text and promote its understanding. In this paper we present Classifier for Named Entities Recognized (CNER), a linguistically aware online service that offers the possibility to test two main NLP tasks for the Spanish language: Named Entity Recognition (NER) and Relation Extraction (RE). This service, together with other projects on the Spanish language, has been evaluated and adapted as a web service. Specifically, CNER i) identifies mentions following the ACE standard, with entity types including Person (PER), Organisation (ORG), Facility (FAC), Location (LOC), Geographical/Political (GPE), Vehicle (VEH), and Weapon (WEA) [4], [5]; ii) displays the output of three different NER tools as a previous step to the RE task; and iii) offers entity relationship information through the tags GPE-AFF, PHYS, DISC, EMP-ORG, ART, and NON-REL, representing the relations between two entities [6].
- South America > Colombia > Valle del Cauca Department > Cali (0.05)
- North America > United States > California > Santa Clara County > Palo Alto (0.05)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
- Europe > Portugal > Lisbon > Lisbon (0.05)
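To make the ACE-style entity typing above concrete, the toy sketch below (a deliberately simplified gazetteer lookup, not the CNER implementation, which uses trained NER tools) labels tokens with the same type tags the service exposes.

```python
# Toy mention tagger: assign ACE-style entity types (PER, ORG, GPE, ...)
# via a small gazetteer lookup; everything else gets the "outside" tag O.

ACE_GAZETTEER = {
    "Madrid": "GPE",   # geographical/political entity
    "ONU": "ORG",      # organisation
    "García": "PER",   # person
}

def tag_mentions(tokens):
    """Return (token, entity_type) pairs for a tokenized sentence."""
    return [(t, ACE_GAZETTEER.get(t, "O")) for t in tokens]

tags = tag_mentions(["García", "viajó", "a", "Madrid"])
```

A relation-extraction step like CNER's would then look at pairs of tagged mentions (here the PER and GPE) and classify the link between them into tags such as PHYS or GPE-AFF.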
Spanish Pre-trained BERT Model and Evaluation Data
Cañete, José, Chaperon, Gabriel, Fuentes, Rodrigo, Ho, Jou-Hui, Kang, Hojin, Pérez, Jorge
The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository, much in the spirit of the GLUE benchmark. By fine-tuning our pretrained Spanish model, we obtain better results than other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state of the art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Chile (0.05)
- South America > Paraguay > Asunción > Asunción (0.04)
EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing
de la Iglesia, Iker, Atutxa, Aitziber, Gojenola, Koldo, Barrena, Ander
The utilization of clinical reports for various secondary purposes, including health research and treatment monitoring, is crucial for enhancing patient care. Natural Language Processing (NLP) tools have emerged as valuable assets for extracting and processing relevant information from these reports. However, the availability of specialized language models for the clinical domain in Spanish has been limited. In this paper, we introduce EriBERTa, a bilingual domain-specific language model pre-trained on extensive medical and clinical corpora. We demonstrate that EriBERTa outperforms previous Spanish language models in the clinical domain, showcasing its superior capabilities in understanding medical texts and extracting meaningful information. Moreover, EriBERTa exhibits promising transfer learning abilities, allowing for knowledge transfer from one language to another. This aspect is particularly beneficial given the scarcity of Spanish clinical data.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- North America > Montserrat (0.04)
Regionalized models for Spanish language variations based on Twitter
Tellez, Eric S., Moctezuma, Daniela, Miranda, Sabino, Graff, Mario, Ruiz, Guillermo
Spanish is one of the most spoken languages in the world, but it is not necessarily written and spoken the same way in different countries. Understanding local language variations can help improve model performance on regional tasks, both by capturing local structures and by better interpreting a message's content. For instance, consider a machine learning engineer who automates some language classification task for a particular region, or a social scientist trying to understand a regional event with echoes on social media; both can take advantage of dialect-based language models to understand what is happening with more contextual information and hence more precision. This manuscript presents and describes a set of regionalized resources for the Spanish language built on four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantic similarities, as well as examples of using the regional resources on message classification tasks.
- North America > United States (0.14)
- South America > Argentina (0.05)
- North America > Cuba (0.04)
- Information Technology > Services (0.93)
- Health & Medicine (0.68)
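The cross-region lexical comparison mentioned in the abstract above can be illustrated with a much simpler statistic than the paper's embedding-based analysis: Jaccard similarity over regional vocabularies. The sketch below uses a hypothetical handful of words per region purely for illustration.

```python
# Compare two regional Spanish vocabularies with Jaccard similarity:
# the size of the shared vocabulary over the size of the combined one.

def jaccard(vocab_a, vocab_b):
    """Jaccard similarity between two word collections, in [0, 1]."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)

# Toy vocabularies: regional variants for "bus"/"sidewalk" differ,
# while "coche" is shared, so the overlap is small but nonzero.
mx = ["camión", "banqueta", "coche"]   # Mexico (illustrative)
es = ["autobús", "acera", "coche"]     # Spain (illustrative)
sim = jaccard(mx, es)
```

Embedding-based comparisons refine this idea: instead of exact word overlap, they measure whether the same word occupies a similar position in each region's FastText space, which also captures shared words whose meanings drift between dialects.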
Learning about Spanish dialects through Twitter
Gonçalves, Bruno, Sánchez, David
This paper maps the large-scale variation of the Spanish language by employing a corpus based on geographically tagged Twitter messages. Lexical dialects are extracted from an analysis of variants of tens of concepts. The resulting maps show linguistic variation on an unprecedented scale across the globe. We discuss the properties of the main dialects within a machine learning approach and find that varieties spoken in urban areas have an international character, in contrast to rural areas, where dialects show greater regional uniformity.
- North America > Mexico (0.14)
- Europe > Spain (0.05)
- South America > Colombia (0.05)