Building and Aligning Comparable Corpora
Saad, Motaz, Langlois, David, Smaili, Kamel
arXiv.org Artificial Intelligence
A comparable corpus is a set of topic-aligned documents in multiple languages that are not necessarily translations of each other. Such documents are useful for multilingual natural language processing when no parallel text is available for a given domain or language. Comparable documents are also informative because they show what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French, and Arabic. We then experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two such measures: the first is based on a bilingual dictionary, and the second on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary-based measure. Finally, we collect English news documents from the British Broadcasting Corporation (BBC) and Arabic news documents from the ALJAZEERA (JSC) news website, and use the CL-LSI similarity measure to automatically align comparable BBC and JSC documents. The evaluation of the alignment shows that CL-LSI can align cross-lingual documents not only at the topic level but also at the event level.
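The general idea behind cross-lingual LSI alignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy English-French training pairs, the helper names (`tokens`, `vectorize`, `project`, `align`), the raw term-count weighting, and the choice of `k = 3` latent dimensions are all hypothetical. Each aligned training pair is concatenated into one bilingual document, a truncated SVD of the resulting term-document matrix defines a shared latent space, and new monolingual documents are projected into that space and compared by cosine similarity.

```python
import numpy as np

# Hypothetical toy training data: topic-aligned (English, French) pairs.
train_pairs = [
    ("football match goal team", "football match but equipe"),
    ("election vote president government", "election vote president gouvernement"),
    ("bank market economy price", "banque marche economie prix"),
]

def tokens(text, lang):
    """Tag each token with its language so the two vocabularies stay distinct."""
    return [f"{lang}:{w}" for w in text.lower().split()]

# Joint vocabulary over both languages.
vocab = sorted({t for en, fr in train_pairs
                for t in tokens(en, "en") + tokens(fr, "fr")})
index = {t: i for i, t in enumerate(vocab)}

def vectorize(toks):
    """Raw term counts over the joint vocabulary (unknown terms are ignored)."""
    v = np.zeros(len(vocab))
    for t in toks:
        if t in index:
            v[index[t]] += 1.0
    return v

# One column per concatenated bilingual training document.
A = np.column_stack([vectorize(tokens(en, "en") + tokens(fr, "fr"))
                     for en, fr in train_pairs])

# Truncated SVD: keep the top-k left singular vectors as the latent space.
k = 3  # assumed value for this toy example
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]

def project(text, lang):
    """Fold a monolingual document into the shared latent space."""
    return U_k.T @ vectorize(tokens(text, lang))

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def align(source_docs, target_docs, src_lang="en", tgt_lang="fr"):
    """For each source document, return the index of the most similar target."""
    targets = [project(d, tgt_lang) for d in target_docs]
    matches = []
    for d in source_docs:
        p = project(d, src_lang)
        matches.append(int(np.argmax([cosine(p, t) for t in targets])))
    return matches

english_docs = ["team goal", "president vote"]
french_docs = ["gouvernement election", "equipe but", "prix economie"]
matches = align(english_docs, french_docs)
```

Here `matches[i]` is the index in `french_docs` of the document judged most similar to `english_docs[i]`. Note that the test documents share no surface tokens across languages; the similarity comes entirely from the latent space learned from the aligned pairs, which is the property that distinguishes this approach from a dictionary lookup.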
Aug-5-2025
- Country:
  - Africa
    - Middle East (0.04)
    - North Africa (0.04)
  - Asia
    - Middle East
      - Jordan (0.04)
      - Palestine > Gaza Strip
        - Gaza Governorate > Gaza (0.04)
      - Republic of Türkiye > Istanbul Province
        - Istanbul (0.04)
    - Pakistan (0.04)
  - Europe
    - Bulgaria (0.04)
    - France (0.14)
    - Iceland > Capital Region
      - Reykjavik (0.04)
    - Italy > Trentino-Alto Adige/Südtirol
      - Trentino Province > Trento (0.04)
    - Middle East
      - Malta (0.04)
      - Republic of Türkiye > Istanbul Province
        - Istanbul (0.04)
    - Poland (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
      - Lancashire > Lancaster (0.04)
  - North America > United States
    - California > San Francisco County
      - San Francisco (0.14)
    - Maryland > Prince George's County
      - College Park (0.04)
    - Washington > King County
      - Seattle (0.04)
- Genre:
  - Research Report (1.00)
- Technology: