CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts
Nguyen, Hoang H., Zhang, Chenwei, Liu, Ye, Parde, Natalie, Rohrbaugh, Eugene, Yu, Philip S.
arXiv.org Artificial Intelligence
Naively assuming English as a source language may hinder cross-lingual transfer for many languages by failing to consider the importance of language contact. Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages; for many languages, the set of closely related languages does not include English. In this work, we study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language. We also construct a novel benchmark dataset for close contact Chinese-Japanese-Korean-Vietnamese (CJKV) languages to further encourage in-depth studies of language contact. To comprehensively capture contact between these languages, we propose to integrate Romanized transcription beyond textual scripts via Contrastive Learning objectives, leading to enhanced cross-lingual representations and effective zero-shot cross-lingual transfer.
Apr-19-2024
- Country:
  - Asia
    - Japan (0.04)
    - Middle East > Jordan (0.04)
    - Singapore (0.04)
  - Europe
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - United States
      - California > Santa Clara County
        - Palo Alto (0.04)
      - Illinois > Cook County
        - Chicago (0.04)
      - Pennsylvania > Dauphin County
        - Harrisburg (0.04)
      - Virginia (0.04)
      - Washington > King County
        - Seattle (0.04)
- Genre:
  - Research Report (0.50)
- Technology: