CantoNLU: A benchmark for Cantonese natural language understanding
Min, Junghyun; Ng, York Hay; Chan, Sophia; Zhao, Helena Shunhua; Lee, En-Shiun Annie
arXiv.org Artificial Intelligence
Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address the scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). The benchmark spans seven tasks covering syntax and semantics: word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we report baseline performance for four models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training of a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.
Oct-24-2025
- Country:
- Africa > Middle East
- Egypt > Cairo Governorate > Cairo (0.04)
- Asia
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Sweden > Östergötland County
- Linköping (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Poland > Podlaskie Province
- Bialystok (0.04)
- Netherlands (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Italy > Tuscany
- Pisa Province > Pisa (0.04)
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Slovenia > Central Slovenia
- Municipality of Ljubljana > Ljubljana (0.04)
- Austria > Vienna (0.14)
- North America
- Canada > Ontario
- Toronto (0.86)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- New Mexico > Bernalillo County
- Albuquerque (0.04)
- Virginia (0.04)
- Washington > King County
- Bellevue (0.04)
- Genre:
- Research Report > New Finding (0.66)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.68)
- Natural Language
- Chatbot (0.68)
- Grammars & Parsing (0.87)
- Large Language Model (0.94)
- Understanding (0.61)