Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models

Sarker, Shraboni; Hamad, Ahmad Tamim; Alshammari, Hulayyil; Grieco, Viviana; Rao, Praveen

Jun-9-2024 – arXiv.org Artificial Intelligence

Large language models (LLMs) have gained tremendous popularity in domains such as e-commerce, finance, healthcare, and education. Fine-tuning is a common approach to customizing an LLM on a domain-specific dataset for a desired downstream task. In this paper, we present a resource for fine-tuning Spanish-language LLMs on tasks such as classification, masked language modeling, and clustering. The resource is a collection of handwritten notary records from the seventeenth century, obtained from the National Archives of Argentina. It combines original images with transcribed text (and metadata) for 160+ pages handwritten nearly 400 years ago by two notaries, Estenban Agreda de Vergara and Nicolas de Valdivia y Brisuela. Through empirical evaluation, we demonstrate that the collection can be used to fine-tune Spanish LLMs for tasks such as classification and masked language modeling, and that the fine-tuned models can outperform pre-trained Spanish models and ChatGPT-3.5/ChatGPT-4o. The collection should be valuable for historical text analysis and is publicly available on GitHub.

language modeling, sanrlite, spanish american notary record

Country:
- South America > Argentina (0.26)
- North America > United States
  - Missouri
    - Boone County > Columbia (0.05)
    - Jackson County > Kansas City (0.04)
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
  - Massachusetts > Middlesex County
    - Cambridge (0.04)
- Europe
  - Italy > Tuscany
    - Florence (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
- Asia
  - Taiwan (0.04)
  - China > Zhejiang Province
    - Hangzhou (0.04)

Genre:
- Research Report (0.82)

Industry:
- Information Technology (0.48)
- Health & Medicine (0.35)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
