Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang
arXiv.org Artificial Intelligence
Large language models (LLMs) have achieved remarkable success across a wide range of tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, vast amounts of similar general-domain data exist, and our initial findings reveal that such data can serve as auxiliary supervision for domain enhancement. This observation leads to our central research question: how can we select the most valuable auxiliary data to maximize domain-specific performance, particularly when traditional methods are inapplicable due to the lack of a large in-domain data pool or validation set? To address this, we propose NTK-Selector, a principled and efficient framework for selecting general-domain auxiliary data to enhance domain-specific performance via neural tangent kernels (NTK). Our method tackles the two challenges of directly applying NTK analysis to LLMs (unmet theoretical assumptions and prohibitive computational cost) by empirically demonstrating stable NTK-like behavior in LLMs during LoRA fine-tuning and by proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone yields only +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching the training set with 9,000 auxiliary samples selected by NTK-Selector leads to substantial gains of +8.7 and +5.1 points, a 10.9x and 5.7x improvement over the domain-only setting. In each task, the 1K domain samples are augmented with 9K auxiliary samples selected from the CoT Collection by Random, LESS, or NTK-Selector.
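To make the selection idea concrete, here is a minimal sketch of NTK-based scoring, not the authors' actual algorithm: the empirical NTK entry between two samples is the inner product of their per-sample gradient features (e.g., gradients of the loss with respect to LoRA parameters, assumed precomputed here), and auxiliary samples are ranked by their mean kernel similarity to the in-domain set. The function names `ntk_scores` and `select_auxiliary` are hypothetical.

```python
import numpy as np

def ntk_scores(domain_grads, aux_grads):
    """Score auxiliary samples by mean empirical-NTK similarity to the
    in-domain set. Rows are per-sample gradient features; the kernel
    entry K[i, j] is the inner product of aux sample i and domain
    sample j's gradients."""
    kernel = aux_grads @ domain_grads.T          # (n_aux, n_domain)
    return kernel.mean(axis=1)                   # average over domain set

def select_auxiliary(domain_grads, aux_grads, k):
    """Return indices of the k auxiliary samples with the highest scores."""
    scores = ntk_scores(domain_grads, aux_grads)
    return np.argsort(scores)[::-1][:k]

# Toy example: 4 in-domain and 6 auxiliary samples with 8-dim features.
rng = np.random.default_rng(0)
domain = rng.normal(size=(4, 8))
aux = rng.normal(size=(6, 8))
picked = select_auxiliary(domain, aux, k=3)
print(picked)
```

In practice the gradient features for an LLM are far too large to materialize per sample, which is precisely the computational obstacle the paper's Jacobian-free approximation is designed to avoid; this sketch only illustrates the kernel-similarity ranking itself.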
The emergence of large language models (LLMs) has led to remarkable advancements across a wide spectrum of natural language processing tasks (Touvron et al., 2023; Chowdhery et al., 2023; Yang et al., 2025). However, their formidable capabilities are predominantly anchored in the availability of immense, high-quality pre-training and instruction-tuning datasets.
Nov-11-2025