Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification
Protasov, Vitaly, Babakov, Nikolay, Dementieva, Daryna, Panchenko, Alexander
arXiv.org Artificial Intelligence
Despite recent progress in large language models (LLMs), evaluation of text generation tasks such as text style transfer (TST) remains a significant challenge. Recent studies (Dementieva et al., 2024; Pauli et al., 2025) revealed a substantial gap between automatic metrics and human judgments. Moreover, most prior work focuses exclusively on English, leaving multilingual TST evaluation largely unexplored. In this paper, we perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. Drawing inspiration from machine translation, we assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipelines in the text detoxification case.
Jul-22-2025
- Country:
  - Asia
    - India > Goa (0.04)
    - Indonesia > Bali (0.04)
    - Middle East > UAE
      - Abu Dhabi Emirate > Abu Dhabi (0.14)
  - Europe
    - France > Auvergne-Rhône-Alpes
    - Germany > Bavaria
      - Upper Bavaria > Munich (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Spain > Galicia
      - A Coruña Province > Santiago de Compostela (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - Dominican Republic (0.04)
    - Mexico > Mexico City
      - Mexico City (0.04)
    - United States
      - Florida > Miami-Dade County
        - Miami (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
  - Oceania > Australia
  - South America > Chile
- Genre:
  - Research Report > New Finding (1.00)
- Technology: