Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models
Lu, Hongyuan, Huang, Haoyang, Ma, Shuming, Zhang, Dongdong, Lam, Wai, Wei, Furu
–arXiv.org Artificial Intelligence
Despite the success of multilingual sequence-to-sequence pre-training, most existing approaches rely on document-level monolingual corpora in many different languages, sentence-level bilingual corpora,\footnote{In this paper, we use `bilingual corpora' to denote parallel corpora with `bilingual translation pairs' in many different language pairs, each consisting of two sentences/documents with the same meaning written in different languages. We use `trilingual corpora' to denote parallel corpora with `trilingual translation pairs' in many different language combinations, each consisting of three sentences/documents.} and sometimes synthetic document-level bilingual corpora. This hampers the performance with cross-lingual document-level tasks such as document-level translation. Therefore, we propose to mine and leverage document-level trilingual parallel corpora to improve sequence-to-sequence multilingual pre-training. We present \textbf{Tri}angular Document-level \textbf{P}re-training (\textbf{TRIP}), which is the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting. Experiments show that TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements by up to 3.11 d-BLEU points and 8.9 ROUGE-L points.
arXiv.org Artificial Intelligence
May-13-2023
- Country:
- Oceania > Australia
- North America > United States
- California (0.04)
- Washington > King County
- Seattle (0.14)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Europe
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Italy > Tuscany
- Florence (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Spain > Catalonia
- Asia
- Genre:
- Research Report (1.00)
- Technology: