Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
Song, Seyoung; Kim, Nawon; Chae, Songeun; Park, Kiwoong; Jin, Jiho; Yoo, Haneul; Cho, Kyunghyun; Oh, Alice
arXiv.org Artificial Intelligence
The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
Oct-29-2025
- Technology: