Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang
arXiv.org Artificial Intelligence
Large language models (LLMs) have achieved remarkable success across a wide range of tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, vast amounts of similar general-domain data exist, and our initial findings reveal that such data can serve as auxiliary supervision for domain enhancement. This observation leads to our central research question: how can we select the most valuable auxiliary data to maximize domain-specific performance, particularly when traditional methods are inapplicable due to the lack of a large in-domain data pool or validation set? To address this, we propose NTK-Selector, a principled and efficient framework for selecting general-domain auxiliary data to enhance domain-specific performance via neural tangent kernels (NTK). Our method tackles the two challenges of directly applying NTK analysis to LLMs (unmet theoretical assumptions and prohibitive computational cost) by empirically demonstrating stable NTK-like behavior in LLMs during LoRA fine-tuning and by proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone yields only +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching the training set with 9,000 auxiliary samples selected by NTK-Selector leads to substantial gains of +8.7 and +5.1 points, a 10.9x and 5.7x improvement over the domain-only setting. In each task, the 1K domain samples are augmented with 9K auxiliary samples selected from the CoT Collection by Random, LESS, or NTK-Selector.
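To make the selection idea concrete, here is a minimal sketch of NTK-based scoring, not the authors' actual algorithm: the empirical NTK entry between two samples is the inner product of their per-sample gradient features (e.g., gradients of the loss with respect to LoRA parameters, assumed precomputed here), and auxiliary samples are ranked by their mean kernel similarity to the in-domain set. The function names `ntk_scores` and `select_auxiliary` are hypothetical.

```python
import numpy as np

def ntk_scores(domain_grads, aux_grads):
    """Score auxiliary samples by mean empirical-NTK similarity to the
    in-domain set. Rows are per-sample gradient features; the kernel
    entry K[i, j] is the inner product of aux sample i and domain
    sample j's gradients."""
    kernel = aux_grads @ domain_grads.T          # (n_aux, n_domain)
    return kernel.mean(axis=1)                   # average over domain set

def select_auxiliary(domain_grads, aux_grads, k):
    """Return indices of the k auxiliary samples with the highest scores."""
    scores = ntk_scores(domain_grads, aux_grads)
    return np.argsort(scores)[::-1][:k]

# Toy example: 4 in-domain and 6 auxiliary samples with 8-dim features.
rng = np.random.default_rng(0)
domain = rng.normal(size=(4, 8))
aux = rng.normal(size=(6, 8))
picked = select_auxiliary(domain, aux, k=3)
print(picked)
```

In practice the gradient features for an LLM are far too large to materialize per sample, which is precisely the computational obstacle the paper's Jacobian-free approximation is designed to avoid; this sketch only illustrates the kernel-similarity ranking itself.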
The emergence of large language models (LLMs) has led to remarkable advancements across a wide spectrum of natural language processing tasks (Touvron et al., 2023; Chowdhery et al., 2023; Yang et al., 2025). However, their formidable capabilities are predominantly anchored in the availability of immense, high-quality pre-training and instruction-tuning datasets.
Nov-11-2025