Correcting diacritics and typos with ByT5 transformer model
Stankevičius, Lukas, Lukoševičius, Mantas, Kapočiūtė-Dzikienė, Jurgita, Briedienė, Monika, Krilavičius, Tomas
Due to the fast pace of life and online communications, and the prevalence of English and the QWERTY keyboard, people tend to forgo diacritics and make typographical errors (typos) when typing. Restoring diacritics and correcting spelling is important for proper language use and for disambiguating texts, both for humans and for downstream algorithms. However, these two problems are typically addressed separately: state-of-the-art diacritics restoration methods do not tolerate other typos. In this work, we tackle both problems at once by employing the newly developed ByT5 byte-level transformer models. Our simultaneous diacritics restoration and typo correction approach demonstrates near state-of-the-art performance in 13 languages, reaching >96% alpha-word accuracy. We also perform diacritics restoration alone on 12 benchmark datasets, plus an additional one for the Lithuanian language. The experimental investigation shows that our approach achieves results (>98%) comparable to those previously reported, despite being trained on less data. Our approach is also able to restore diacritics in words not seen during training with >76% accuracy. We also show that the accuracies improve further with longer training. Together, these results show great real-world potential for applying our suggested methods to more data, languages, and error classes.
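To make the task setup concrete, the following is a minimal sketch, not the authors' code. It shows one way to produce diacritic-free input from correct text, what the same text looks like as UTF-8 bytes (the granularity at which a byte-level model reads it), and a simple word-level accuracy restricted to words that contain at least one letter. The function names are hypothetical, and the metric definition is an assumption about what "alpha-word accuracy" means here. The actual ByT5 tokenizer may add special tokens and ID offsets that this sketch ignores.

```python
import unicodedata
from typing import List


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Letters without a canonical decomposition (for example "ł" or "ø")
    are left unchanged by this simple approach.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_byte_ids(text: str) -> List[int]:
    """Return the raw UTF-8 byte values of the text as integers."""
    return list(text.encode("utf-8"))


def alpha_word_accuracy(predicted: str, reference: str) -> float:
    """Share of exactly matching words, counting only reference words
    that contain at least one alphabetic character (assumed definition)."""
    pred_words = predicted.split()
    ref_words = reference.split()
    if len(pred_words) != len(ref_words):
        raise ValueError("predicted and reference must have the same word count")
    pairs = [
        (p, r)
        for p, r in zip(pred_words, ref_words)
        if any(ch.isalpha() for ch in r)
    ]
    if not pairs:
        return 1.0  # nothing to score
    return sum(p == r for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    target = "ačiū , labas"
    source = strip_diacritics(target)  # diacritic-free model input
    print(source)
    print(to_byte_ids("č"))  # one letter, two bytes
    print(alpha_word_accuracy(source, target))
```

In this sketch the punctuation token "," is skipped by the metric, so an unrestored input scores one correct word out of two. Simulated typos (insertions, deletions, swaps) would be applied to the source text in the same spirit, but are left out to keep the example short.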
Jan-31-2022
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Education (0.46)