AITopics | teacher network

Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Neural Information Processing Systems, Apr-30-2026, 08:09:11 GMT

Any continuous function f can be approximated arbitrarily well by a neural network with sufficiently many neurons k. We consider the case when f itself is a neural network with one hidden layer and k neurons. Approximating f with a neural network with n < k neurons can thus be seen as fitting an under-parameterized "student" network with n neurons to a "teacher" network with k neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the n student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when n-1 student neurons each copy one teacher neuron and the n-th student neuron averages the remaining k-n+1 teacher neurons. For the student network with n = 1 neuron, we additionally provide a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of under-parameterized networks has a universal structure.
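
As a rough illustration of the copy-average setup described in this abstract, the Python sketch below builds a student in which n-1 neurons copy teacher neurons and the n-th neuron averages the remaining k-n+1, then evaluates the population squared loss under standard Gaussian inputs. It relies on the standard closed-form Gaussian integral for products of erf units rather than on anything stated in the listing. The function names (`erf_kernel`, `population_loss`, `copy_average_student`) are hypothetical, and the scale chosen for the averaging neuron (mean of the incoming vectors, sum of the outgoing weights) is an illustrative assumption, not the optimal norm and weight derived in the paper.

```python
import numpy as np

def erf_kernel(U, V):
    """E_x[erf(u.x) erf(v.x)] for x ~ N(0, I), for all rows u of U and v of V.

    Uses the known identity
    (2/pi) * arcsin( 2 u.v / sqrt((1 + 2|u|^2)(1 + 2|v|^2)) ).
    """
    gram = 2.0 * U @ V.T
    nu = 1.0 + 2.0 * np.sum(U**2, axis=1)
    nv = 1.0 + 2.0 * np.sum(V**2, axis=1)
    ratio = gram / np.sqrt(np.outer(nu, nv))
    return (2.0 / np.pi) * np.arcsin(np.clip(ratio, -1.0, 1.0))

def population_loss(W, a, V, b):
    """Expected squared error between student (W, a) and teacher (V, b)."""
    return float(
        a @ erf_kernel(W, W) @ a
        - 2.0 * a @ erf_kernel(W, V) @ b
        + b @ erf_kernel(V, V) @ b
    )

def copy_average_student(V, b, n):
    """n-1 student neurons copy teacher neurons; the n-th averages the rest.

    Assumption: the averaging neuron uses the mean incoming vector and the
    summed outgoing weight. This is a simple heuristic, not the optimum.
    """
    W = np.vstack([V[: n - 1], V[n - 1 :].mean(axis=0, keepdims=True)])
    a = np.concatenate([b[: n - 1], [b[n - 1 :].sum()]])
    return W, a

# Teacher with orthonormal incoming vectors and unit outgoing weights.
k, n = 4, 2
V, b = np.eye(k), np.ones(k)
W, a = copy_average_student(V, b, n)
print("copy-average loss:", population_loss(W, a, V, b))
```

With n = k the construction reduces to copying every teacher neuron, so the loss is zero; for n < k the printed value gives a baseline against which other student configurations could be compared.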

activation function, artificial intelligence, machine learning, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

466473650870501e3600d9a1b4ee5d44-Paper.pdf

Neural Information Processing Systems, Apr-25-2026, 16:36:20 GMT

artificial intelligence, machine learning, perturbation, (17 more...)

Neural Information Processing Systems

Country: Asia > South Korea (0.28)

Industry: Education (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

1f96b24df4b06f5d68389845a9a13ed9-Supplemental-Conference.pdf

Neural Information Processing Systems, Apr-25-2026, 00:49:16 GMT

artificial intelligence, machine learning, statistics, (17 more...)

Neural Information Processing Systems

Industry: Education (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

Discovering and Overcoming Limitations of Noise-engineered Data-free Knowledge Distillation

Neural Information Processing Systems, Apr-25-2026, 00:49:11 GMT

Distillation in neural networks using only samples randomly drawn from a Gaussian distribution is possibly the most straightforward solution one can think of for the complex problem of knowledge transfer from one network (teacher) to another (student). If done successfully, it can eliminate the need for the teacher's training data in knowledge distillation and avoid the privacy concerns that often arise in sensitive applications such as healthcare. There have been some recent attempts at Gaussian noise-based data-free knowledge distillation; however, none of them offers a consistent or reliable solution. We identify the shift in the distribution of hidden layer activations as the key limiting factor, which occurs when Gaussian noise is fed to the teacher network instead of the accustomed training data. We propose a simple solution to mitigate this shift and show that for vision tasks, such as classification, it is possible to achieve performance close to the teacher's by using only samples randomly drawn from a Gaussian distribution.
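
To make the baseline setting in this abstract concrete, the following Python sketch distills a teacher into a student using only Gaussian noise as input. It is a toy under stated assumptions: both networks are hypothetical linear softmax classifiers, the names `softmax`, `kl_teacher_student` and `distill_on_noise` are invented for illustration, and the loss is the plain KL divergence between teacher and student outputs. It does not reproduce the paper's mitigation for the activation shift, which the abstract does not describe, and a linear model has no hidden layers in which such a shift could occur.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_teacher_student(p, q, eps=1e-12):
    """Mean KL(p || q) over a batch of probability rows."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def distill_on_noise(W_teacher, steps=200, batch=256, lr=0.5, seed=0):
    """Train a linear student to match a linear teacher on Gaussian noise only."""
    rng = np.random.default_rng(seed)
    d, c = W_teacher.shape
    W_student = np.zeros((d, c))

    # Fixed noise batch used only to measure the loss before and after training.
    x_eval = rng.standard_normal((1024, d))
    p_eval = softmax(x_eval @ W_teacher)
    before = kl_teacher_student(p_eval, softmax(x_eval @ W_student))

    for _ in range(steps):
        x = rng.standard_normal((batch, d))   # fresh Gaussian samples, no real data
        p = softmax(x @ W_teacher)            # teacher soft labels
        q = softmax(x @ W_student)            # student predictions
        # Gradient of the mean KL(p || q) with respect to the student weights.
        W_student -= lr * x.T @ (q - p) / batch

    after = kl_teacher_student(p_eval, softmax(x_eval @ W_student))
    return W_student, before, after

W_teacher = np.random.default_rng(1).standard_normal((16, 5))
_, before, after = distill_on_noise(W_teacher)
print("KL before:", before, "KL after:", after)
```

In this toy the input distribution seen by the teacher matters little, which is exactly what fails to hold for deep vision models according to the abstract.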

artificial intelligence, gaussian noise, machine learning, (14 more...)

Neural Information Processing Systems

Industry:

Education (0.94)
Information Technology > Security & Privacy (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

1d6408264d31d453d556c60fe7d0459e-Paper.pdf

Neural Information Processing Systems, Apr-25-2026, 00:10:47 GMT

artificial intelligence, dataset, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report > Promising Solution (0.68)

Industry:

Education (0.93)
Government > Regional Government (0.68)
Health & Medicine (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

Bocchi, Dario, Regimbeau, Theotime, Lucibello, Carlo, Saglietti, Luca, Cammarota, Chiara

arXiv.org Machine Learning, Apr-6-2026

We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher--student framework. In the high-dimensional regime, where the input dimension $N$ and the number of samples $M$ diverge at fixed ratio $\alpha = M/N$, and for finite hidden widths $(p,p^*)$ of the student and teacher, respectively, we study the low-dimensional ordinary differential equations that govern the evolution of the student--teacher and student--student overlap matrices. We show that overparameterization ($p>p^*$) only modestly accelerates escape from a plateau of poor generalization by modifying the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions for $p>1$. From this manifold the dynamics consistently selects the closest solution to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau and the solution manifold correspond to saddles with at least one negative eigenvalue and to marginal minima in the population-loss geometry, respectively.

artificial intelligence, machine learning, matrix, (18 more...)

arXiv.org Machine Learning

2604.03068

Country:

Europe > Italy > Lombardy > Milan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Italy > Lazio > Rome (0.04)

Genre: Research Report (0.82)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Paraphrasing Complex Network: Network Compression via Factor Transfer

Neural Information Processing Systems, Mar-16-2026, 21:56:10 GMT

Many researchers have sought model compression methods that reduce the size of a deep neural network (DNN) with minimal performance degradation, so that DNNs can be used in embedded systems. Among these methods, knowledge transfer trains a student network with the help of a stronger teacher network. In this paper, we propose a novel knowledge transfer method which uses convolutional operations to paraphrase the teacher's knowledge and to translate it for the student. This is done by two convolutional modules, which are called a paraphraser and a translator. The paraphraser is trained in an unsupervised manner to extract the teacher factors, which are defined as paraphrased information of the teacher network. The translator, located at the student network, extracts the student factors and helps to translate the teacher factors by mimicking them. We observed that our student network trained with the proposed factor transfer method outperforms the ones trained with conventional knowledge transfer methods.

artificial intelligence, machine learning, proceedings, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.61)

Paraphrasing Complex Network: Network Compression via Factor Transfer

Jangho Kim, Seonguk Park, Nojun Kwak

Neural Information Processing Systems, Mar-15-2026, 15:12:08 GMT

Many researchers have sought model compression methods that reduce the size of a deep neural network (DNN) with minimal performance degradation, so that DNNs can be used in embedded systems. Among these methods, knowledge transfer trains a student network with the help of a stronger teacher network. In this paper, we propose a novel knowledge transfer method which uses convolutional operations to paraphrase the teacher's knowledge and to translate it for the student. This is done by two convolutional modules, which are called a paraphraser and a translator. The paraphraser is trained in an unsupervised manner to extract the teacher factors, which are defined as paraphrased information of the teacher network. The translator, located at the student network, extracts the student factors and helps to translate the teacher factors by mimicking them. We observed that our student network trained with the proposed factor transfer method outperforms the ones trained with conventional knowledge transfer methods.

artificial intelligence, machine learning, student network, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)

f5ccb3ab757131a93586ef61ec701533-Paper-Conference.pdf

Neural Information Processing Systems, Feb-18-2026, 00:02:39 GMT

artificial intelligence, machine learning, optimization problem, (19 more...)

Neural Information Processing Systems

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.70)

How a student becomes a teacher: learning and forgetting through Spectral methods

Neural Information Processing Systems, Feb-16-2026, 20:30:25 GMT

The above scheme proves particularly relevant when the student network is overparameterized (namely, when larger layer sizes are employed) as compared to the underlying teacher network. Under these operating conditions, it is tempting to speculate that the student's ability to handle the given task could eventually be stored in a sub-portion of the whole network.

artificial intelligence, machine learning, matrix, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Belgium > Wallonia > Namur Province > Namur (0.04)

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
