Smart Bilingual Focused Crawling of Parallel Documents

García-Romero, Cristian, Esplà-Gomis, Miquel, Sánchez-Martínez, Felipe

arXiv.org Artificial Intelligence 

The availability of large text corpora is especially relevant in the field of machine translation, where the state-of-the-art approach to neural machine translation (Vaswani et al., 2017) requires large amounts of parallel texts, i.e., texts in one language and their translation into another language. Parallel texts have also proven useful for building pre-trained language models with cross-lingual capabilities (Conneau et al., 2020; Kale et al., 2021; Reid and Artetxe, 2022), and in translation-memory tools (Bowker, 2002) that assist professional translators. The limited availability of parallel documents, particularly for low-resource language pairs, is fuelling a growing interest in web mining, which has made it possible to build some of the largest parallel corpora to date (El-Kishky et al., 2020; Bañón et al., 2020; Schwenk et al., 2021; Bañón et al., 2022). State-of-the-art tools for harvesting parallel data from the Internet, such as Bitextor (Bañón et al., 2020; Esplà-Gomis et al., 2016) and ILSP-FocusedCrawler (Papavassiliou et al., 2018), use a web crawler to automatically browse the web and collect textual data. A web crawler starts from a list of seed URLs: it downloads and parses the corresponding documents, and adds any new URLs linked from them to a list of pending downloads.
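
The generic crawling loop described above (seed URLs, download, parse, enqueue newly discovered links) can be sketched as follows. This is a minimal illustration and not the implementation used by Bitextor or ILSP-FocusedCrawler. The names `LinkExtractor`, `crawl`, `fetch`, `max_documents` and `FAKE_WEB` are assumptions introduced for the example, and the "web" is a small in-memory dictionary so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_documents=100):
    """Breadth-first crawl starting from seed_urls.

    `fetch(url)` is a caller-supplied function (an assumption of this
    sketch) that returns the HTML of a page, or None if it is unavailable.
    """
    pending = deque(seed_urls)   # list of pending downloads
    seen = set(seed_urls)        # URLs already queued, to avoid repeats
    documents = {}               # url -> downloaded HTML

    while pending and len(documents) < max_documents:
        url = pending.popleft()
        html = fetch(url)
        if html is None:
            continue
        documents[url] = html

        # Parse the document and queue every link not seen before.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                pending.append(absolute)
    return documents


# Hypothetical in-memory "web" standing in for real HTTP downloads.
FAKE_WEB = {
    "https://example.org/": '<a href="/en/">English</a> <a href="/es/">Español</a>',
    "https://example.org/en/": '<a href="/es/">Spanish version</a>',
    "https://example.org/es/": '<a href="/en/#top">Versión inglesa</a>',
}

pages = crawl(["https://example.org/"], FAKE_WEB.get)
print(sorted(pages))
```

The first-in, first-out queue used here visits pages in the order they are discovered. A focused crawler would replace it with some ordering that prioritises the pending URLs judged most promising.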
