High-Resource Methodological Bias in Low-Resource Investigations
ter Hoeve, Maartje; Grangier, David; Schluter, Natalie
arXiv.org Artificial Intelligence
The central bottleneck for low-resource NLP is typically taken to be the quantity of accessible data, a view that overlooks the contribution of data quality. This is especially evident when low-resource systems are developed and evaluated by downsampling high-resource language data. In this work we investigate the validity of this approach, focusing our empirical investigation on two well-known NLP tasks: POS tagging and machine translation. We show that downsampling from a high-resource language yields datasets whose properties differ from those of genuinely low-resource datasets, and that this affects model performance for both POS tagging and machine translation. Based on these results, we conclude that naive downsampling of datasets gives a biased view of how well these systems work in a low-resource scenario.
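As a rough illustration of the setup the abstract questions, the sketch below uniformly downsamples a tokenized corpus and reports a few surface properties that could then be compared against a genuinely low-resource dataset. The function names (`downsample`, `corpus_stats`), the toy corpus, and the chosen statistics are assumptions made for illustration; they are not the procedure or the measures used in the paper.

```python
import random
from collections import Counter


def downsample(sentences, n, seed=0):
    """Naive downsampling: draw n sentences uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(sentences, n)


def corpus_stats(sentences):
    """Compute simple surface properties of a tokenized corpus.

    These statistics are illustrative choices, not the paper's measures.
    """
    tokens = [tok for sent in sentences for tok in sent]
    counts = Counter(tokens)
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }


# Hypothetical toy corpus standing in for a high-resource dataset.
high_resource = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "dog", "barked"],
    ["the", "dog", "chased", "the", "cat"],
    ["birds", "sing"],
]

subset = downsample(high_resource, 2)
print(corpus_stats(high_resource))
print(corpus_stats(subset))
```

Comparing the statistics of `subset` with those of a real low-resource corpus of the same size is one simple way to check whether the downsampled data is a reasonable proxy.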
Nov-14-2022
- Country:
- North America
- Dominican Republic (0.04)
- United States
- New York (0.04)
- Maryland > Baltimore (0.04)
- Washington > King County
- Seattle (0.04)
- Ohio > Franklin County
- Columbus (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Michigan > Washtenaw County
- Ann Arbor (0.04)
- California > Los Angeles County
- Long Beach (0.04)
- Canada
- Quebec > Montreal (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Europe
- Czechia > Prague (0.04)
- Netherlands > North Holland
- Amsterdam (0.04)
- Italy > Tuscany
- Florence (0.04)
- Germany
- Denmark > Capital Region
- Copenhagen (0.04)
- Sweden > Uppsala County
- Uppsala (0.04)
- Bulgaria > Sofia City Province
- Sofia (0.04)
- Greece > Attica
- Athens (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- United Kingdom
- Scotland > City of Edinburgh
- Edinburgh (0.04)
- England > Cambridgeshire
- Cambridge (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- China > Hong Kong (0.04)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- Middle East
- Japan > Honshū
- Kansai > Kyoto Prefecture > Kyoto (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Technology: