Sabiá-3 Technical Report

Abonizio, Hugo; Almeida, Thales Sales; Laitz, Thiago; Junior, Roseval Malaquias; Bonás, Giovana Kerche; Nogueira, Rodrigo; Pires, Ramon

arXiv.org Artificial Intelligence 

This technical report presents the details of the development and evaluation of the Sabiá-3 and Sabiazinho-3 models. We trained them on a large corpus of documents written in Portuguese, with a special focus on Brazil-related resources. During training, the models were exposed to information relevant to Brazilian culture, history, and context. The main objective was to build a specialized model that is aware of the linguistic nuances, societal norms, and regional variations unique to the country. Throughout this report, we show that this specialization allows the models to perform better on knowledge-intensive tasks. We applied a continual learning approach: starting from a "generalist" model that had already acquired some level of language understanding and reasoning ability, we further trained it on our corpus of high-quality data relevant to the Brazilian context. The development consisted of two main phases: (1) the pre-training phase, in which we further train a pre-trained model on specialized data with a self-supervised strategy that optimizes the next-token prediction objective, and (2) the post-training phase, in which the model is tuned to follow instructions and aligned with human preferences. Compared to our previous release, Sabiá-2 [5], we have collected a significantly larger volume of data for pre-training.
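As an illustration of the next-token prediction objective named in phase (1), the following is a minimal sketch of the average cross-entropy loss over one token sequence. It is not the report's training code, which is not described here; the function name `next_token_loss` and the toy inputs are hypothetical and chosen only for illustration.

```python
import math


def next_token_loss(logits, token_ids):
    """Average next-token cross-entropy for one sequence (illustrative only).

    logits[t] holds unnormalized scores over the vocabulary for the token
    that follows position t; token_ids is the training sequence itself.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")

    total = 0.0
    # Position t predicts token t + 1, so the last position has no target.
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        target = token_ids[t + 1]
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the true next token.
        total += log_z - scores[target]
    return total / (len(token_ids) - 1)


# Toy example (hypothetical data): a 4-token vocabulary and uniform scores,
# so every next token has probability 1/4 and the loss equals log(4).
uniform_loss = next_token_loss([[0.0] * 4 for _ in range(3)], [0, 1, 2])
```

In continued pre-training as described above, the same objective would simply be minimized starting from the weights of the already pre-trained model rather than from a random initialization.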