Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages

Pranida, Salsabila Zahirah; Genadi, Rifo Ahmad; Koto, Fajri

Feb-18-2025, arXiv.org Artificial Intelligence

Quantifying reasoning capability in low-resource languages remains a challenge in NLP due to data scarcity and limited access to annotators. While LLM-assisted dataset construction has proven useful for medium- and high-resource languages, its effectiveness in low-resource languages, particularly for commonsense reasoning, is still unclear. In this paper, we compare three dataset creation strategies: (1) LLM-assisted dataset generation, (2) machine translation, and (3) human-written data by native speakers, to build a culturally nuanced story comprehension dataset. We focus on Javanese and Sundanese, two major local languages in Indonesia, and evaluate the effectiveness of open-weight and closed-weight LLMs in assisting dataset creation through extensive manual validation. To assess the utility of synthetic data, we fine-tune language models on classification and generation tasks using this data and evaluate performance on a human-written test set. Our findings indicate that LLM-assisted data creation outperforms machine translation.

large language model, machine learning, natural language

