Igea: a Decoder-Only Language Model for Biomedical Text Generation in Italian
Buonocore, Tommaso Mario, Rancati, Simone, Parimbelli, Enea
–arXiv.org Artificial Intelligence
The advent of probabilistic language models has revolutionized various domains, with biomedical natural language processing (NLP) standing out due to its significant impact on healthcare provision and medical research. The ability of these models to understand, process, and generate text from vast biomedical corpora has led to improvements in tasks such as entity recognition, relation extraction, and question answering. However, the majority of this progress has been focused on English-language texts, creating a notable disparity for other languages with fewer resources, such as Italian. In the Italian context, the scarcity of large and diverse training datasets presents a substantial challenge. General language models like Minerva and Maestrale have made strides in Italian NLP, but they lack the specialization required to handle the nuances of biomedical terminology effectively. Addressing this gap is crucial, as the precision and clarity needed in medical communications are paramount for clinical and research applications in such a high-stakes domain. In this paper we introduce Igea, a biomedical language model (BLM) built from the ground-up on the Italian language, and that is effective in handling Italian native biomedical text while maintaining its efficiency in terms of computational resources. We built upon the foundation model Minerva, which we then continually trained on Italian native biomedical text, while employing proper provisions to avoid disruption of what was learned during pre-training.
arXiv.org Artificial Intelligence
Jul-8-2024