Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From
Gao, Changjiang, Lin, Hankun, Huang, Xin, Han, Xue, Feng, Junlan, Deng, Chao, Chen, Jiajun, Huang, Shujian
arXiv.org Artificial Intelligence
Cross-lingual context retrieval (extracting contextual information in one language based on requests in another) is a fundamental aspect of cross-lingual alignment, but its performance and mechanism in large language models (LLMs) remain unclear. In this paper, we evaluate the cross-lingual context retrieval of over 40 LLMs across 12 languages, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that post-trained open LLMs exhibit strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and that their estimated oracle performance improves greatly after post-training. Our mechanism analysis shows that the cross-lingual context retrieval process can be divided into two main phases, question encoding and answer retrieval, which are formed in pre-training and post-training respectively. The stability of this phasing correlates with xMRC performance, and the xMRC bottleneck lies in the last model layers of the second phase, where the effect of post-training is clearly observable. Our results also indicate that larger-scale pre-training cannot improve xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential.
Oct-21-2025