Ground Truth Generation for Multilingual Historical NLP using LLMs
Clovis Gladstone, Zhao Fang, Spencer Dean Stewart
arXiv.org Artificial Intelligence
Historical and low-resource NLP remains challenging due to limited annotated data and domain mismatches with modern, web-sourced corpora. This paper outlines our work in using large language models (LLMs) to create ground-truth annotations for historical French (16th-20th centuries) and Chinese (1900-1950) texts. By leveraging LLM-generated ground truth on a subset of our corpus, we were able to fine-tune spaCy to achieve significant gains on period-specific tests for part-of-speech (POS) annotations, lemmatization, and named entity recognition (NER). Our results underscore the importance of domain-specific models and demonstrate that even relatively limited amounts of synthetic data can improve NLP tools for under-resourced corpora in computational humanities research.
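The paper does not specify the format of its LLM-generated annotations, but the workflow it describes, converting model output into training examples for spaCy, can be sketched as follows. The JSON schema, field names, and the `to_spacy_example` helper below are illustrative assumptions, not the authors' actual pipeline; the output dict mirrors the layout spaCy's `Example.from_dict` accepts (words, POS tags, lemmas, and character-offset entity spans).

```python
"""Hypothetical sketch: turning LLM token annotations into
spaCy-style training examples. Schema and helper names are
assumptions for illustration, not the paper's actual pipeline."""
import json

def to_spacy_example(record: dict) -> dict:
    """Convert one LLM-annotated sentence into a dict with the
    fields spaCy training expects: words, coarse POS tags, lemmas,
    and (start, end, label) entity spans in character offsets."""
    tokens = record["tokens"]
    return {
        "words": [t["text"] for t in tokens],
        "pos": [t["pos"] for t in tokens],
        "lemmas": [t["lemma"] for t in tokens],
        "entities": [(e["start"], e["end"], e["label"])
                     for e in record.get("entities", [])],
    }

# Toy record in the assumed schema, as an LLM might return it for a
# historical French sentence ("estoit" is an archaic form of "était").
llm_output = json.loads("""
{"tokens": [
   {"text": "Paris", "pos": "PROPN", "lemma": "Paris"},
   {"text": "estoit", "pos": "VERB", "lemma": "être"},
   {"text": "grande", "pos": "ADJ", "lemma": "grand"}],
 "entities": [{"start": 0, "end": 5, "label": "LOC"}]}
""")

example = to_spacy_example(llm_output)
print(example["words"])     # ['Paris', 'estoit', 'grande']
print(example["entities"])  # [(0, 5, 'LOC')]
```

In spaCy itself, each such dict would be paired with a tokenized `Doc` to build an `Example` and serialized into a `DocBin` for `spacy train`; the sketch stops short of that to stay self-contained.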
Nov-19-2025