Spanish Pre-trained BERT Model and Evaluation Data
José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez
arXiv.org Artificial Intelligence
Spanish is one of the five most widely spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository, much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results than other BERT-based models pre-trained on multilingual corpora on most of the tasks, even achieving a new state of the art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
Aug-5-2023
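For readers who want to try the released checkpoint, below is a minimal sketch of loading the Spanish BERT for fine-tuning on a classification task with the Hugging Face transformers library. The hub id dccuchile/bert-base-spanish-wwm-cased and the binary classification head are assumptions made for illustration; check the authors' repository for the canonical checkpoint name and task setup.

```python
# Minimal sketch: load the released Spanish BERT for sequence classification.
# The hub id below is an assumption, not stated in the abstract.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels depends on the downstream Spanish benchmark task;
# 2 is a placeholder for a binary classification setup.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer(
    "El modelo fue preentrenado exclusivamente con datos en español.",
    return_tensors="pt",
)
outputs = model(**inputs)  # outputs.logits holds the classification scores
```

From here, the model can be fine-tuned on any of the compiled Spanish benchmark tasks with a standard training loop or the transformers Trainer API.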