Smart Bilingual Focused Crawling of Parallel Documents
García-Romero, Cristian, Esplà-Gomis, Miquel, Sánchez-Martínez, Felipe
The availability of large text corpora is especially relevant in the field of machine translation, where the state-of-the-art approach to neural machine translation (Vaswani et al., 2017) requires large amounts of parallel texts, i.e., texts in one language and their translation into another language. Parallel texts have also proven useful for building pre-trained language models with cross-lingual capabilities (Conneau et al., 2020; Kale et al., 2021; Reid and Artetxe, 2022), and in translation-memory tools (Bowker, 2002) that assist professional translators. The limited availability of parallel documents, particularly for low-resource language pairs, is fuelling a growing interest in web mining, which has enabled the construction of some of the largest parallel corpora to date (El-Kishky et al., 2020; Bañón et al., 2020; Schwenk et al., 2021; Bañón et al., 2022). State-of-the-art tools for harvesting parallel data from the Internet, like Bitextor (Bañón et al., 2020; Esplà-Gomis et al., 2016) and ILSP-FocusedCrawler (Papavassiliou et al., 2018), use a web crawler to automatically browse the web and collect textual data. Web crawlers start with a list of seed URLs. The corresponding documents are downloaded and parsed, and any new URLs linked from them are added to a list of pending downloads.
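The crawling loop described above (start from seed URLs, download and parse each document, enqueue newly discovered links) can be sketched as a simple breadth-first frontier. The `fetch` and `extract_links` callables below are hypothetical stand-ins for the downloading and parsing stages, which real tools like Bitextor implement with considerably more machinery (politeness delays, robots.txt handling, language identification):

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl starting from a list of seed URLs.

    fetch(url) returns the raw document (or None on failure);
    extract_links(url, doc) returns the URLs linked from that document.
    Both are supplied by the caller.
    """
    pending = deque(seed_urls)   # frontier: URLs awaiting download
    visited = set(seed_urls)     # URLs already scheduled, to avoid revisits
    documents = {}

    while pending and len(documents) < max_pages:
        url = pending.popleft()
        doc = fetch(url)
        if doc is None:
            continue
        documents[url] = doc
        for link in extract_links(url, doc):
            if link not in visited:
                visited.add(link)
                pending.append(link)
    return documents
```

A focused crawler of the kind the paper describes would additionally score each pending URL and pop the most promising one instead of following strict FIFO order.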
A Novel Two-Step Method for Cross Language Representation Learning
Cross language text classification is an important learning task in natural language processing. A critical challenge of cross language learning arises from the fact that words of different languages are in disjoint feature spaces. In this paper, we propose a two-step representation learning method to bridge the feature spaces of different languages by exploiting a set of parallel bilingual documents. Specifically, we first formulate a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induce a low dimensional cross-lingual document representation by applying latent semantic indexing on the obtained matrix. We use a projected gradient descent algorithm to solve the formulated matrix completion problem with convergence guarantees. The proposed method is evaluated by conducting a set of experiments with cross language sentiment classification tasks on Amazon product reviews. The experimental results demonstrate that the proposed learning method outperforms a number of other cross language representation learning methods, especially when the number of parallel bilingual documents is small.
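The two-step method can be illustrated with a small NumPy sketch: matrix completion by projected gradient descent (a gradient step on the squared error over observed entries, followed by projection onto rank-k matrices via truncated SVD), then latent semantic indexing on the completed matrix. The unit step size, rank, iteration count, and toy data below are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def complete_matrix(M, mask, rank, steps=300):
    """Low-rank completion of a partially observed document-term matrix.

    M: observed matrix (zeros where unobserved); mask: 1 for observed
    entries, 0 otherwise. With unit step size, the gradient step on
    0.5*||mask*(X - M)||^2 simply re-imposes the observed entries
    before each rank projection.
    """
    X = M.copy().astype(float)
    for _ in range(steps):
        X = X - mask * (X - M)          # gradient step on observed entries
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[rank:] = 0.0                  # projection onto rank-`rank` matrices
        X = (U * s) @ Vt
    return X

def lsi(X, dim):
    """Latent semantic indexing: low-dimensional document vectors
    from a truncated SVD of the completed document-term matrix."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim] * s[:dim]
```

In the bilingual setting, the rows would be documents in both languages and the unobserved block would be each document's term counts in the other language; completing that block is what bridges the disjoint feature spaces.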
Adapting Large Language Models for Document-Level Machine Translation
Wu, Minghao, Vu, Thuy-Trang, Qu, Lizhen, Foster, George, Haffari, Gholamreza
Large language models (LLMs) have made significant strides in various natural language processing (NLP) tasks. Recent research shows that moderately-sized LLMs often outperform their larger counterparts after task-specific fine-tuning. In this work, we delve into the process of adapting LLMs to specialize in document-level machine translation (DocMT) for a specific language pair. Firstly, we explore how prompt strategies affect downstream translation performance. Then, we conduct extensive experiments with two fine-tuning methods, three LLM backbones, and 18 translation tasks across nine language pairs. Our findings indicate that in some cases these specialized models even surpass GPT-4 in translation performance, while in others they still significantly suffer from the off-target translation issue, even when fine-tuned exclusively on bilingual parallel documents. Furthermore, we provide an in-depth analysis of these LLMs tailored for DocMT, exploring aspects such as translation errors, the scaling law of parallel documents, out-of-domain generalization, and the impact of zero-shot cross-lingual transfer. The findings of this research not only shed light on the strengths and limitations of LLM-based DocMT models but also provide a foundation for future research in DocMT.
LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models
Gong, Hongyu, Chaudhary, Vishrav, Tang, Yuqing, Guzmán, Francisco
Cross-lingual document representations enable language understanding in multilingual contexts and allow transfer learning from high-resource to low-resource languages at the document level. Recently large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks. It is tempting to apply these cross-lingual models to document representation learning. However, there are two challenges: (1) these models impose high costs on long document processing and thus many of them have strict length limit; (2) model fine-tuning requires extra data and computational resources, which is not practical in resource-limited settings. In this work, we address these challenges by proposing unsupervised Language-Agnostic Weighted Document Representations (LAWDR). We study the geometry of pre-trained sentence embeddings and leverage it to derive document representations without fine-tuning. Evaluated on cross-lingual document alignment, LAWDR demonstrates comparable performance to state-of-the-art models on benchmark datasets.
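The core idea of deriving a document vector from pre-trained sentence embeddings without any fine-tuning can be sketched as weighted pooling plus a standard anisotropy correction. The uniform/weighted averaging and first-principal-component removal below are generic post-processing choices used for illustration, not the exact LAWDR weighting scheme:

```python
import numpy as np

def document_representation(sent_embs, weights=None):
    """Pool per-sentence embeddings into a single document vector.

    sent_embs: (n_sentences, dim) array from any pre-trained encoder;
    weights: optional per-sentence weights, uniform if omitted.
    """
    n = sent_embs.shape[0]
    if weights is None:
        w = np.ones(n) / n
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    return w @ sent_embs

def remove_common_component(doc_vecs):
    """Subtract each vector's projection onto the first principal
    direction, a common correction for the anisotropy of pre-trained
    embedding spaces."""
    u = np.linalg.svd(doc_vecs - doc_vecs.mean(0), full_matrices=False)[2][0]
    return doc_vecs - np.outer(doc_vecs @ u, u)
```

Because pooling operates on fixed-size sentence embeddings rather than on full token sequences, it sidesteps the length limits of models like BERT or XLM-RoBERTa that the abstract mentions.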