Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Choudhury, Monojit; Chauhan, Shivam; Das, Rocktim Jyoti; Sahnan, Dhruv; Han, Xudong; Li, Haonan; Singh, Aaryamonvikram; Jadhav, Alok Anil; Agarwal, Utkarsh; Choudhary, Mukund; Banerjee, Debopriyo; Koto, Fajri; Bhat, Junaid; Shukla, Awantika; Ghosh, Samujjwal; Kamboj, Samta; Pandit, Onkar; Pradhan, Lalit; Pal, Rahul; Sahu, Sunil; Doraiswamy, Soundar; Mullah, Parvez; Filali, Ali El; Sengupta, Neha; Ramakrishnan, Gokul; Joshi, Rituraj; Gosal, Gurpreet; Sheinin, Avraham; Vassilieva, Natalia; Nakov, Preslav
arXiv.org Artificial Intelligence
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
Apr-9-2025
- Country:
- Asia
- India (0.04)
- Indonesia > Bali (0.04)
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Middle East
- Saudi Arabia > Asir Province
- Abha (0.04)
- UAE (0.14)
- Singapore (0.04)
- Southeast Asia (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- North America
- Canada (0.04)
- Dominican Republic (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Washington > King County
- Seattle (0.04)
- Genre:
- Research Report (1.00)
- Technology: