Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA
Sandipan Majhi, Paheli Bhattacharya
arXiv.org Artificial Intelligence
Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large LLMs (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.
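The multi-stage finetuning strategy described above can be sketched as a data-curriculum builder: adapt on the plentiful synthetic QA pairs first, then refine on the small original set. The two-stage schedule, the prompt template, and the replay ratio below are illustrative assumptions, not details taken from the paper.

```python
import random

def format_qa_prompt(question, context, answer):
    # Instruction-style training example for tourism QA.
    # (Template is illustrative, not the paper's exact format.)
    return f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"

def build_stages(original, synthetic, seed=0):
    """Assumed two-stage curriculum: stage 1 adapts the small model on
    LLM-generated synthetic pairs; stage 2 refines on the original data,
    replaying a slice of synthetic data to limit forgetting."""
    rng = random.Random(seed)
    stage1 = list(synthetic)
    rng.shuffle(stage1)
    # Replay ~10% of the synthetic pool alongside the original set.
    replay = stage1[: max(1, len(stage1) // 10)]
    stage2 = list(original) + replay
    rng.shuffle(stage2)
    return stage1, stage2
```

In this sketch, each stage is simply an ordered list of (question, context, answer) triples to be formatted with `format_qa_prompt` and fed to a standard supervised finetuning loop; the replay slice is one common way to mix scarce original data with abundant synthetic data.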
Oct-30-2025