When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun
Seyoung Song, Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh
arXiv.org Artificial Intelligence
Historical and linguistic connections within the Sinosphere have led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within $\pm{}0.0068$ F1-score for sequence labeling tasks and up to $+0.84$ BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These mixed results emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.
Nov-7-2024
- Country:
- North America
- Dominican Republic (0.04)
- United States
- Washington > King County
- Seattle (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- New York > New York County
- New York City (0.04)
- Canada > Ontario
- Toronto (0.04)
- Europe
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Netherlands > South Holland
- Leiden (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Vietnam (0.04)
- China (0.04)
- East Asia (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- South Korea > Daejeon
- Daejeon (0.04)
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.04)
- Japan > Honshū
- Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (0.93)
- Technology: