Revisiting non-English Text Simplification: A Unified Multilingual Benchmark
Ryan, Michael J., Naous, Tarek, Xu, Wei
arXiv.org Artificial Intelligence
Recent advancements in high-quality, large-scale English resources have pushed the frontier of English Automatic Text Simplification (ATS) research. However, less work has been done on multilingual text simplification due to the lack of a diverse evaluation benchmark that covers complex-simple sentence pairs in many languages. This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs. This benchmark will encourage research in developing more effective multilingual text simplification models and evaluation metrics. Our experiments using MultiSim with pre-trained multilingual language models reveal exciting performance improvements from multilingual training in non-English settings. We observe strong performance from Russian in zero-shot cross-lingual transfer to low-resource languages. We further show that few-shot prompting with BLOOM-176b achieves comparable quality to reference simplifications, outperforming fine-tuned models in most languages. We validate these findings through human evaluation.
May-24-2023
- Country:
- South America > Chile
- Oceania > Australia
- North America
- United States
- Pennsylvania (0.04)
- Ohio (0.04)
- California (0.04)
- Texas > Travis County
- Austin (0.04)
- New York > New York County
- New York City (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Colorado > Denver County
- Denver (0.04)
- Canada > British Columbia
- Europe
- Austria (0.04)
- Ukraine (0.04)
- Czechia > Prague (0.04)
- Netherlands > North Holland
- Amsterdam (0.04)
- Bulgaria
- Sofia City Province > Sofia (0.04)
- Varna Province > Varna (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Spain
- Galicia > Madrid (0.04)
- Catalonia > Barcelona Province
- Barcelona (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Italy
- Lombardy > Milan (0.04)
- Trentino-Alto Adige/Südtirol > Trentino Province
- Trento (0.04)
- Middle East
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Cyprus > Nicosia
- Nicosia (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Pakistan (0.04)
- Taiwan (0.04)
- China > Hong Kong (0.04)
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Africa > Middle East
- Morocco (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Media > News (1.00)
- Government (0.92)
- Health & Medicine > Therapeutic Area
- Neurology (0.46)
- Education > Educational Setting
- K-12 Education (0.45)
- Technology: