Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties
Zixin Tang, Chieh-Yang Huang, Tsung-Chi Li, Ho Yin Sam Ng, Hen-Hsen Huang, Ting-Hao 'Kenneth' Huang
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmarking model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, such as reviews for the same hotel with the same rating, written in the same language (e.g., Mandarin Chinese) but in different varieties (e.g., Taiwan Mandarin, Mainland Mandarin). As a proof of concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs on a sentiment analysis task. Our results show that the LLMs consistently underperform on Taiwan Mandarin.
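The abstract describes pairing reviews that share a hotel, rating, and language but differ in variety, then scoring each model per variety. A minimal sketch of that alignment and evaluation loop, assuming a simple review record and a placeholder sentiment classifier (the field names, the rating-to-label threshold, and the `classify_sentiment` stub are illustrative assumptions, not the authors' pipeline), might look like this:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Review:
    hotel_id: str
    rating: int      # platform rating for the stay (assumed scale)
    variety: str     # "zh-TW" (Taiwan Mandarin) or "zh-CN" (Mainland Mandarin)
    text: str

def align_reviews(reviews):
    """Group reviews by (hotel, rating); keep contexts that cover both varieties."""
    contexts = defaultdict(lambda: {"zh-TW": [], "zh-CN": []})
    for r in reviews:
        contexts[(r.hotel_id, r.rating)][r.variety].append(r)
    return {k: v for k, v in contexts.items() if v["zh-TW"] and v["zh-CN"]}

def classify_sentiment(text: str) -> str:
    """Placeholder for an LLM sentiment call; swap in the model under test."""
    return "positive"  # trivial stand-in so the sketch runs end to end

def accuracy_by_variety(aligned, threshold=7):
    """Score the classifier per variety, using the rating as a proxy gold label."""
    correct, total = defaultdict(int), defaultdict(int)
    for (_, rating), groups in aligned.items():
        gold = "positive" if rating >= threshold else "negative"
        for variety, reviews in groups.items():
            for r in reviews:
                total[variety] += 1
                correct[variety] += int(classify_sentiment(r.text) == gold)
    return {variety: correct[variety] / total[variety] for variety in total}
```

Because every aligned context holds reviews in both varieties for the same hotel and rating, a gap in the per-variety accuracies is more plausibly attributable to the language variety itself than to differences in topic or sentiment strength.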
arXiv.org Artificial Intelligence
Feb-12-2025
- Country:
  - Asia > Taiwan (0.70)
  - Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
  - North America > Mexico > Mexico City (0.14)
- Genre:
  - Research Report > New Finding (1.00)