Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Sep-29-2015 – arXiv.org Machine Learning

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores a methodology for mining such data from previously obtained comparable corpora. The task is highly practical because non-parallel multilingual data exist in far greater quantities than parallel corpora, while parallel sentences are the much more useful resource. We propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences by filtering them out of noisy or merely comparable sentence pairs. We describe our implementation of a specialized tool for this task, as well as the training and adaptation of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.
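The filtering idea in the abstract (translate one side of a candidate pair, then measure how similar the translation is to the other side) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' tool: `filter_parallel`, `jaccard`, the `0.5` threshold, and `toy_translate`, a word-for-word dictionary lookup that merely stands in for a trained machine translation system. The abstract does not state which similarity measure the filter uses; token-set overlap is swapped in as the simplest choice, and the language pair in the toy data is arbitrary.

```python
from typing import Callable, Iterable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two sentences, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def filter_parallel(
    pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Keep candidate pairs whose translated source resembles the target."""
    kept = []
    for src, tgt in pairs:
        score = jaccard(translate(src), tgt)
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept


# Hypothetical stand-in for an MT system: word-for-word dictionary lookup.
TOY_DICT = {"kot": "cat", "pije": "drinks", "mleko": "milk",
            "pies": "dog", "śpi": "sleeps"}


def toy_translate(sentence: str) -> str:
    """Translate known words via TOY_DICT; unknown words pass through."""
    return " ".join(TOY_DICT.get(w, w) for w in sentence.lower().split())


candidates = [
    ("kot pije mleko", "cat drinks milk"),       # truly parallel
    ("pies śpi", "the weather is nice today"),   # merely co-occurring
]
result = filter_parallel(candidates, toy_translate, threshold=0.5)
print(result)
```

In this toy run only the first pair survives the filter. A real pipeline would replace `toy_translate` with the output of a trained translation system and would likely need a more robust similarity score than raw token overlap.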

alignment, corpora, translation, (16 more...)

Country:
- Europe > Poland
  - Masovia Province > Warsaw (0.04)
- Africa > Middle East
  - Egypt > Giza Governorate > Giza (0.05)

Genre:
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
