Smart Bilingual Focused Crawling of Parallel Documents

García-Romero, Cristian, Esplà-Gomis, Miquel, Sánchez-Martínez, Felipe

May-23-2024–arXiv.org Artificial Intelligence

The availability of large text corpora is especially relevant in the field of machine translation, where the state-of-the-art approach to neural machine translation (Vaswani et al., 2017) requires large amounts of parallel texts, i.e., texts in one language and their translation into another language. Parallel texts have also proven useful for building pre-trained language models with cross-lingual capabilities (Conneau et al., 2020; Kale et al., 2021; Reid and Artetxe, 2022), and in translation-memory tools (Bowker, 2002) that assist professional translators. The limited availability of parallel documents, particularly for low-resource language pairs, is fuelling a growing interest in web mining, which has made it possible to build some of the largest parallel corpora to date (El-Kishky et al., 2020; Bañón et al., 2020; Schwenk et al., 2021; Bañón et al., 2022). State-of-the-art tools for harvesting parallel data from the Internet, such as Bitextor (Bañón et al., 2020; Esplà-Gomis et al., 2016) and ILSP-FocusedCrawler (Papavassiliou et al., 2018), use a web crawler to automatically browse the web and collect textual data. Web crawlers start with a list of seed URLs: the corresponding documents are downloaded and parsed, and any new URLs linked from them are added to a list of pending downloads.
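The generic crawling loop described above (seed URLs, download, parse, queue newly found links) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by Bitextor or ILSP-FocusedCrawler: the names `crawl`, `LinkExtractor`, `fetch`, and `max_documents` are hypothetical, and downloading is delegated to a caller-supplied `fetch` function so the sketch stays independent of any particular HTTP library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_documents=100):
    """Visit documents in first-in, first-out order, starting from the seeds.

    `fetch(url)` is an assumed callable that returns the HTML text of a
    document, or None if it cannot be downloaded.
    """
    pending = deque(seed_urls)   # list of pending downloads
    seen = set(seed_urls)        # URLs already queued, to avoid revisits
    downloaded = []

    while pending and len(downloaded) < max_documents:
        url = pending.popleft()
        html = fetch(url)
        if html is None:
            continue
        downloaded.append(url)

        # Parse the document and queue every link not seen before.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                pending.append(absolute)

    return downloaded
```

Here the pending list is a plain first-in, first-out queue, so documents are visited in the order they are discovered. For experimentation, `fetch` can be as simple as the `get` method of a dictionary that maps URLs to HTML strings.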

computational linguistic, parallel document, proceedings
