Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
Ranathunga, Surangika, de Silva, Nisansa, Velayuthan, Menan, Fernando, Aloka, Rathnayake, Charitha
arXiv.org Artificial Intelligence
We conducted a detailed analysis of the quality of web-mined corpora for two low-resource languages, covering three language pairs: English-Sinhala, English-Tamil, and Sinhala-Tamil. We ranked each corpus according to a similarity measure and carried out intrinsic and extrinsic evaluations on different portions of the ranked corpus. We show that there are significant quality differences between portions of web-mined corpora and that quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained on the highest-ranked 25k portion can be on par with models trained on human-curated datasets.
Feb-12-2024
- Country:
  - Asia
    - Indonesia > Bali (0.04)
    - Middle East
      - Israel > Jerusalem District
        - Jerusalem (0.04)
      - Republic of Türkiye > Istanbul Province
        - Istanbul (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
    - Singapore (0.04)
    - Sri Lanka (0.04)
  - Europe
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Germany > Berlin (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - Middle East > Republic of Türkiye
      - Istanbul Province > Istanbul (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Spain > Valencian Community
      - Valencia Province > Valencia (0.04)
  - North America
    - Dominican Republic (0.04)
    - United States
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Pennsylvania (0.04)
      - Texas > Dallas County
        - Dallas (0.04)
  - Oceania
    - Australia > Victoria
      - Melbourne (0.04)
    - New Zealand > North Island
      - Manawatū-Whanganui > Palmerston North (0.04)
- Genre:
  - Research Report > New Finding (0.46)
- Technology: