Spanish Pre-trained BERT Model and Evaluation Data
José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez
arXiv.org Artificial Intelligence
Spanish is one of the five most widely spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository, much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results than other BERT-based models pre-trained on multilingual corpora on most of the tasks, even achieving a new state of the art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
Aug-5-2023
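For readers who want to try the released checkpoint, below is a minimal sketch of loading the Spanish BERT for fine-tuning on a classification task with the Hugging Face transformers library. The hub id dccuchile/bert-base-spanish-wwm-cased and the binary classification head are assumptions made for illustration; check the authors' repository for the canonical checkpoint name and task setup.

```python
# Minimal sketch: load the released Spanish BERT for sequence classification.
# The hub id below is an assumption, not stated in the abstract.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels depends on the downstream Spanish benchmark task;
# 2 is a placeholder for a binary classification setup.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer(
    "El modelo fue preentrenado exclusivamente con datos en español.",
    return_tensors="pt",
)
outputs = model(**inputs)  # outputs.logits holds the classification scores
```

From here, the model can be fine-tuned on any of the compiled Spanish benchmark tasks with a standard training loop or the transformers Trainer API.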