Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding
Tan, Samson, Joty, Shafiq, Varshney, Lav R., Kan, Min-Yen
–arXiv.org Artificial Intelligence
Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
arXiv.org Artificial Intelligence
Nov-18-2020
- Country:
- Oceania > Australia
- Victoria > Melbourne (0.04)
- New South Wales > Sydney (0.04)
- North America
- United States
- Louisiana (0.04)
- Washington > King County
- Seattle (0.04)
- Pennsylvania > Allegheny County
- Pittsburgh (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Massachusetts > Middlesex County
- Malden (0.04)
- Illinois > Cook County
- Chicago (0.04)
- California > San Diego County
- San Diego (0.04)
- Canada > British Columbia
- United States
- Europe
- Germany > Berlin (0.04)
- Czechia > Prague (0.04)
- United Kingdom > England
- Oxfordshire > Oxford (0.04)
- Cambridgeshire > Cambridge (0.04)
- Spain > Valencian Community
- Valencia Province > Valencia (0.04)
- Italy > Tuscany
- Florence (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Africa > Ethiopia
- Addis Ababa > Addis Ababa (0.04)
- Oceania > Australia
- Genre:
- Research Report (1.00)
- Technology: