Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Laiyk, Nurkhan, Orel, Daniil, Joshi, Rituraj, Goloburda, Maiya, Wang, Yuxia, Nakov, Preslav, Koto, Fajri
arXiv.org Artificial Intelligence
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entry of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
Feb-19-2025
- Country:
- Africa (0.04)
- Asia
- Central Asia (0.04)
- Indonesia > Bali (0.04)
- Kazakhstan
- Akmola Region > Astana (0.04)
- Almaty Region > Almaty (0.04)
- East Kazakhstan Region (0.04)
- Kyzylorda Region > Karmakshy District
- Baikonur (0.04)
- Mangystau Region > Mangistau (0.04)
- Ongtustik Qazaqstan > Shymkent (0.04)
- Kyrgyzstan (0.04)
- Middle East
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Russia (0.14)
- Thailand > Bangkok
- Bangkok (0.04)
- Uzbekistan > Toshkent Shahri
- Tashkent (0.04)
- Europe
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Poland > Masovia Province
- Warsaw (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- North America
- Genre:
- Personal (1.00)
- Research Report (1.00)
- Industry:
- Banking & Finance (0.93)
- Education > Educational Setting (0.46)
- Government > Regional Government
- Asia Government (0.67)
- Health & Medicine (1.00)
- Law (1.00)
- Leisure & Entertainment > Sports (0.46)
- Media > Film (0.68)
- Transportation
- Air (0.46)
- Infrastructure & Services (0.46)
- Technology: