Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages
Pranida, Salsabila Zahirah; Genadi, Rifo Ahmad; Koto, Fajri
arXiv.org Artificial Intelligence
Quantifying reasoning capability in low-resource languages remains a challenge in NLP due to data scarcity and limited access to annotators. While LLM-assisted dataset construction has proven useful for medium- and high-resource languages, its effectiveness in low-resource languages, particularly for commonsense reasoning, is still unclear. In this paper, we compare three dataset creation strategies: (1) LLM-assisted dataset generation, (2) machine translation, and (3) human-written data by native speakers, to build a culturally nuanced story comprehension dataset. We focus on Javanese and Sundanese, two major local languages in Indonesia, and evaluate the effectiveness of open-weight and closed-weight LLMs in assisting dataset creation through extensive manual validation. To assess the utility of synthetic data, we fine-tune language models on classification and generation tasks using this data and evaluate performance on a human-written test set. Our findings indicate that LLM-assisted data creation outperforms machine translation.
Feb-18-2025
- Country:
  - Africa > Ethiopia
    - Addis Ababa > Addis Ababa (0.04)
  - Asia
    - Indonesia > Java
      - Central Java (0.04)
      - Jakarta > Jakarta (0.04)
      - Yogyakarta > Yogyakarta (0.05)
    - Middle East
      - Jordan (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.14)
    - Southeast Asia (0.04)
  - Europe
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - Monaco (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Spain
      - Catalonia > Barcelona Province
        - Barcelona (0.04)
      - Valencian Community > Valencia Province
        - Valencia (0.04)
  - North America
    - Canada > Quebec
      - Montreal (0.04)
    - Mexico > Mexico City
      - Mexico City (0.04)
    - United States
      - California > San Diego County
        - San Diego (0.04)
      - Florida > Miami-Dade County
        - Miami (0.04)
      - Michigan (0.04)
- Genre:
  - Research Report > New Finding (0.48)
- Industry:
  - Health & Medicine (0.67)
  - Leisure & Entertainment (0.93)
- Technology: