Kazakhstan
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh
Koto, Fajri, Joshi, Rituraj, Mukhituly, Nurdaulet, Wang, Yuxia, Xie, Zhuohan, Pal, Rahul, Orel, Daniil, Mullah, Parvez, Turmakhan, Diana, Goloburda, Maiya, Kamran, Mohammed, Ghosh, Samujjwal, Jia, Bokang, Mansurov, Jonibek, Togmanov, Mukhammed, Banerjee, Debopriyo, Laiyk, Nurkhan, Sakip, Akhmed, Han, Xudong, Kochmar, Ekaterina, Aji, Alham Fikri, Singh, Aaryamonvikram, Jadhav, Alok Anil, Katipomu, Satheesh, Kamboj, Samta, Choudhury, Monojit, Gosal, Gurpreet, Ramakrishnan, Gokul, Mishra, Biswajit, Chandran, Sarath, Sheinin, Avraham, Vassilieva, Natalia, Sengupta, Neha, Murray, Larry, Nakov, Preslav
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
- Asia > Kazakhstan (0.15)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Russia (0.05)
- (19 more...)
Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts
Goloburda, Maiya, Laiyk, Nurkhan, Turmakhan, Diana, Wang, Yuxia, Togmanov, Mukhammed, Mansurov, Jonibek, Sametov, Askhat, Mukhituly, Nurdaulet, Wang, Minghan, Orel, Daniil, Mujahid, Zain Muhammad, Koto, Fajri, Baldwin, Timothy, Nakov, Preslav
Large language models (LLMs) can generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorgau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.
- Government (1.00)
- Law (0.94)
- Media (0.69)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Laiyk, Nurkhan, Orel, Daniil, Joshi, Rituraj, Goloburda, Maiya, Wang, Yuxia, Nakov, Preslav, Koto, Fajri
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entry in our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
- North America > United States (0.14)
- Asia > Russia (0.14)
- Asia > Kazakhstan > Akmola Region > Astana (0.04)
- (18 more...)
- Research Report (1.00)
- Personal (1.00)
- Law (1.00)
- Health & Medicine (1.00)
- Banking & Finance (0.93)
- (6 more...)
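The instruction-following samples described above would typically be rendered into prompt/response training strings before supervised fine-tuning. A minimal sketch, assuming hypothetical "instruction", "input", and "output" fields (the paper does not specify its schema):

```python
# Sketch of instruction-tuning data preparation. The field names
# ("instruction", "input", "output") and the prompt template are
# illustrative assumptions, not the dataset's actual format.
def format_ift_sample(sample: dict) -> str:
    """Render one instruction-following sample as a single training string."""
    prompt = f"### Instruction:\n{sample['instruction']}\n"
    if sample.get("input"):
        prompt += f"### Input:\n{sample['input']}\n"
    prompt += f"### Response:\n{sample['output']}"
    return prompt

sample = {
    "instruction": "What documents are required to register a business in Kazakhstan?",
    "input": "",
    "output": "An application form, the founding charter, and proof of payment of the registration fee.",
}
print(format_ift_sample(sample))
```

The formatted strings would then be tokenized and fed to a standard causal-LM fine-tuning loop.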
KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan
Togmanov, Mukhammed, Mukhituly, Nurdaulet, Turmakhan, Diana, Mansurov, Jonibek, Goloburda, Maiya, Sakip, Akhmed, Xie, Zhuohan, Wang, Yuxia, Syzdykov, Bekassyl, Laiyk, Nurkhan, Aji, Alham Fikri, Kochmar, Ekaterina, Nakov, Preslav, Koto, Fajri
Despite having a population of twenty million, Kazakhstan's culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress for the Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan's bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings underscore significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs. Data and code will be made available upon acceptance.
- Asia > Kazakhstan (0.83)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Middle East > Jordan (0.04)
- (5 more...)
- Education > Curriculum > Subject-Specific Education (0.68)
- Education > Educational Setting > K-12 Education (0.48)
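Scoring an MMLU-style benchmark like the one described above reduces to accuracy over multiple-choice answers. A minimal sketch (the letter choices here are illustrative, not drawn from KazMMLU):

```python
# MMLU-style multiple-choice scoring: accuracy is the fraction of
# questions where the model's chosen option matches the gold answer.
def mc_accuracy(predictions: list[str], gold: list[str]) -> float:
    assert len(predictions) == len(gold), "prediction/gold length mismatch"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["A", "C", "B", "D"]
gold  = ["A", "B", "B", "D"]
print(mc_accuracy(preds, gold))  # 0.75
```

In practice the predicted letter is usually obtained by comparing the model's log-likelihood of each option, or by parsing its generated answer.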
Did Russian air defence down the Azerbaijani plane in Kazakhstan?
Kyiv, Ukraine – Russian air defence forces may well have struck an Azerbaijani passenger jet over Chechnya after panicking during a Ukrainian drone attack, analysts and experts from Ukraine, Kazakhstan and Azerbaijan have told Al Jazeera. Moscow might have also compounded what one expert described as a "crime" by not letting the damaged plane land nearby and instead forcing it to fly to Kazakhstan. The analysis by these experts comes amid mounting reports quoting unnamed Azerbaijani officials and other analysts pointing fingers at Russia for the crash, in which at least 38 people were killed. The Kremlin claimed that AZAL flight 8432, with 67 passengers on board, hit a flock of birds early Wednesday after it entered Russian airspace to land in Grozny, Chechnya's administrative capital. But within hours, photos and videos of the plane surfaced, apparently showing deep holes and multiple pockmarks on its tail.
- Asia > Kazakhstan (1.00)
- Asia > Russia (0.80)
- Asia > Azerbaijan (0.58)
- (5 more...)
- Transportation > Passenger (1.00)
- Transportation > Air (1.00)
- Government > Regional Government > Europe Government > Russia Government (0.39)
- Government > Regional Government > Asia Government > Russia Government (0.39)
Russian air defenses downed Azerbaijan Airlines flight, sources say
Russian air defenses downed an Azerbaijan Airlines plane that crashed in Kazakhstan, killing 38 people, four sources with knowledge of the preliminary findings of Azerbaijan's investigation into the disaster said on Thursday. Flight J2-8243 crashed on Wednesday in a ball of fire near the city of Aktau in Kazakhstan after diverting from an area of southern Russia, where Moscow has repeatedly used air defense systems against Ukrainian drone strikes. The Embraer passenger jet had flown from Azerbaijan's capital Baku to Grozny, in Russia's southern Chechnya region, before veering off hundreds of miles across the Caspian Sea. It crashed on the opposite shore of the Caspian after what Russia's aviation watchdog said was an emergency that may have been caused by a bird strike. Officials did not explain why it had crossed the sea.
- Asia > Russia (1.00)
- Asia > Azerbaijan (1.00)
- Asia > Kazakhstan (0.47)
- (6 more...)
- Transportation > Air (1.00)
- Government > Military (1.00)
- Government > Regional Government > Europe Government > Russia Government (0.32)
- Government > Regional Government > Asia Government > Russia Government (0.32)
Azerbaijan Airlines plane crashes in Kazakhstan, killing 38
An Embraer passenger jet crashed near the city of Aktau in Kazakhstan on Wednesday, killing 38 people, after diverting from an area of Russia that Moscow has recently defended against Ukrainian drone attacks. Twenty-nine survivors received hospital treatment. Azerbaijan Airlines flight J2-8243 had flown hundreds of miles off its scheduled route from Azerbaijan to Russia to crash on the opposite shore of the Caspian Sea, after what Russia's aviation watchdog said was an emergency that may have been caused by a bird strike. But an aviation expert suggested that cause seemed unlikely.
- Asia > Azerbaijan (1.00)
- Asia > Russia (0.95)
- Asia > Kazakhstan (0.70)
- (2 more...)
RIFF: Learning to Rephrase Inputs for Few-shot Fine-tuning of Language Models
Pre-trained Language Models (PLMs) can be accurately fine-tuned for downstream text processing tasks. Recently, researchers have introduced several parameter-efficient fine-tuning methods that optimize input prompts or adjust a small number of model parameters (e.g., LoRA). In this study, we explore the impact of altering the input text of the original task in conjunction with parameter-efficient fine-tuning methods. To most effectively rewrite the input text, we train a few-shot paraphrase model with a Maximum-Marginal Likelihood objective. Using six few-shot text classification datasets, we show that enriching data with paraphrases at train and test time enhances the performance beyond what can be achieved with parameter-efficient fine-tuning alone. The code used for our experiments can be found at https://github.com/SaeedNajafi/RIFF.
- North America > United States > California (0.16)
- North America > Canada > Alberta (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- (14 more...)
- Overview (1.00)
- Research Report > New Finding (0.66)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)
- (2 more...)
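One common way to exploit paraphrases at test time, as the RIFF abstract describes, is to classify each paraphrase of an input and aggregate the predictions. A minimal sketch of that idea with a hypothetical stand-in classifier (not the paper's actual pipeline):

```python
from collections import Counter

# Test-time paraphrase ensembling: classify every paraphrase of an
# input and take a majority vote over the predicted labels.
def vote_over_paraphrases(paraphrases, classify):
    labels = [classify(p) for p in paraphrases]
    return Counter(labels).most_common(1)[0][0]

def classify(text):
    # Toy keyword-based classifier standing in for a fine-tuned PLM.
    return "positive" if ("good" in text or "great" in text) else "negative"

paraphrases = ["The film was good.", "A great movie.", "The movie was bad."]
print(vote_over_paraphrases(paraphrases, classify))  # positive
```

Averaging class probabilities instead of hard voting is an equally common aggregation choice.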
Enhancing Traffic Sign Recognition with Tailored Data Augmentation: Addressing Class Imbalance and Instance Scarcity
Alsiyeu, Ulan, Duisebekov, Zhasdauren
This paper tackles critical challenges in traffic sign recognition (TSR), which is essential for road safety -- specifically, class imbalance and instance scarcity in datasets. We introduce tailored data augmentation techniques, including synthetic image generation, geometric transformations, and a novel obstacle-based augmentation method to enhance dataset quality for improved model robustness and accuracy. Our methodology incorporates diverse augmentation processes to accurately simulate real-world conditions, thereby expanding the training data's variety and representativeness. Our findings demonstrate substantial improvements in TSR model performance, offering significant implications for traffic sign recognition systems. This research not only addresses dataset limitations in TSR but also proposes a model for similar challenges across different regions and applications, marking a step forward in the field of computer vision and traffic sign recognition systems.
- Asia > Kazakhstan (0.08)
- Asia > China > Tianjin Province > Tianjin (0.04)
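The geometric transformations mentioned in the TSR abstract can be illustrated with plain nested lists standing in for image arrays (real pipelines would operate on tensors via an imaging library):

```python
# Sketch of simple geometric augmentations for scarce image classes,
# using nested lists in place of real pixel arrays.
def rotate90(img):
    """Rotate an HxW grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    """Mirror an image left-to-right."""
    return [row[::-1] for row in img]

img = [[1, 2],
       [3, 4]]
print(rotate90(img))  # [[3, 1], [4, 2]]
print(hflip(img))     # [[2, 1], [4, 3]]
```

Note that for traffic signs, flips and large rotations must be applied selectively: mirroring a directional sign (e.g., "turn left") changes its meaning, which is one reason the paper's augmentations are described as tailored.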
KazQAD: Kazakh Open-Domain Question Answering Dataset
Yeshpanov, Rustem, Efimov, Pavel, Boytsov, Leonid, Shalkarbayuli, Ardak, Braslavski, Pavel
We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we believe there is still ample room for improvement. We also show that OpenAI's current ChatGPT (v3.5) is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.
- Asia > Russia (0.14)
- North America > United States (0.14)
- Asia > Kazakhstan > Akmola Region > Astana (0.04)
- (20 more...)
- Research Report (0.64)
- Overview (0.46)
- Education (1.00)
- Information Technology (0.88)
- Leisure & Entertainment > Sports (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.86)
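The EM and F1 figures reported for KazQAD are the standard extractive-QA metrics: exact string match after normalization, and token-level overlap F1. A minimal sketch (normalization details such as punctuation stripping vary between evaluation scripts and are simplified here):

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (simplified normalization)."""
    return re.sub(r"\s+", " ", text.lower().strip())

def exact_match(pred: str, gold: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answer spans."""
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Astana", "astana"))                    # 1.0
print(round(f1_score("the city of Aktau", "Aktau city"), 2))  # 0.67
```

Corpus-level EM and F1 are then averages of these per-question scores, usually taking the maximum over multiple gold answers when available.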