Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation

arXiv.org Artificial Intelligence

Training student models on synthetic data generated by strong teacher models is a promising way to distill the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new "Route then Generate" paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional "Generate then Select" paradigm, where all teachers must generate parallel responses for the entire prompt set before the final dataset is constructed. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruction tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers additional insights for future research.
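The "Route then Generate" idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the `learnability` and `quality` scoring callables, and the linear mixing weight `alpha` are hypothetical stand-ins for whatever the actual router uses.

```python
from collections import defaultdict


def route_prompts(prompts, teachers, learnability, quality, alpha=0.5):
    """Assign each prompt to the teacher with the highest combined score.

    `learnability` and `quality` are hypothetical callables of the form
    (prompt, teacher) -> float. `alpha` is an assumed mixing weight.
    """
    assignment = defaultdict(list)
    for prompt in prompts:
        best = max(
            teachers,
            key=lambda t: alpha * learnability(prompt, t)
            + (1 - alpha) * quality(prompt, t),
        )
        assignment[best].append(prompt)
    return dict(assignment)


def synthesize(assignment, generate):
    """Each teacher generates responses only for its assigned prompts.

    `generate` is a hypothetical callable (teacher, prompt) -> response.
    """
    return [
        (prompt, teacher, generate(teacher, prompt))
        for teacher, prompts in assignment.items()
        for prompt in prompts
    ]
```

Under this sketch the number of generation calls equals the number of prompts, whereas a "Generate then Select" pipeline would need one call per prompt per teacher.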


GoSGD: Distributed Optimization for Deep Learning with Gossip Exchange

arXiv.org Machine Learning

With deep convolutional neural networks (CNNs) introduced by [1] and [2], computer vision tasks, and more specifically image classification, have improved greatly in the years following [3]. CNN performance benefits greatly from large collections of annotated images such as [4] or [5]. CNNs are trained by optimizing a loss function with gradient descent steps computed on random mini-batches, according to [6]. This method, called stochastic gradient descent (SGD), has proved very efficient for training neural networks in general. However, current CNN structures are extremely deep, like the 100-layer ResNet of [7], and contain many parameters (around 60M for AlexNet [3] and 130M for VGG [8]). These structures involve heavy gradient computation, which makes training on big datasets very slow. Computation on GPUs accelerates training but requires huge local memory caches. Nevertheless, mini-batch optimization seems suitable for distributing the training over several threads. Many methods have been proposed, such as [9, 10], which distribute the batches over different threads, called workers, that periodically exchange information via a central thread to synchronize their models.
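The centralized scheme described in the last sentence (workers training on their own mini-batches and periodically synchronizing through a central thread) can be sketched as follows. This is a toy, single-process simulation under stated assumptions, not the method of [9, 10] and not GoSGD itself: the one-parameter linear model, the plain averaging rule, and all names and default values are hypothetical.

```python
import random


def local_sgd_step(w, batch, lr):
    """One SGD step on mean squared error for the 1-D model y = w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad


def train_with_central_sync(data, n_workers=4, steps=200, sync_every=10,
                            lr=0.2, batch_size=8, seed=0):
    """Simulate workers that average their models via a central copy."""
    rng = random.Random(seed)
    central = 0.0
    workers = [central] * n_workers
    for step in range(1, steps + 1):
        # Each worker takes a local step on its own random mini-batch.
        workers = [
            local_sgd_step(w, rng.sample(data, batch_size), lr)
            for w in workers
        ]
        if step % sync_every == 0:
            # Central thread averages the models and broadcasts the result.
            central = sum(workers) / n_workers
            workers = [central] * n_workers
    return central
```

For example, on noise-free data generated from `y = 3 * x`, calling `train_with_central_sync(data)` should return a weight close to 3. A smaller `sync_every` means more communication with the central copy, which is the cost that decentralized exchange schemes aim to reduce.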