Review for NeurIPS paper: Agree to Disagree: Adaptive Ensemble Knowledge Distillation in Gradient Space

The motivation of AE-KD is to have the optimization direction of the student guided equally by all the teachers. However, when the ensemble contains weak teachers (with low generalization accuracy), why should these weak teachers be treated the same as the strong teachers in gradient space? Intuitively, the guidance of the student should favor the strong teachers and stay away from the weak ones; how does the method distinguish between them?

3. How are the weights \alpha_m in Eq. (11) optimized? Are they optimized end-to-end together with the student?
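To make question 3 concrete, below is a minimal sketch (not the authors' code) of the naive end-to-end option the question alludes to: parameterize \alpha_m via a softmax so the weights stay on the simplex, and backpropagate the weighted distillation loss into both the student and \alpha_m. Everything here is an assumption for illustration: the toy student/teacher modules are hypothetical stand-ins, and Eq. (11) is read as a convex combination \sum_m \alpha_m L_m of per-teacher distillation losses.

```python
# Hedged sketch: jointly optimizing the student and the teacher weights
# alpha_m. Hypothetical setup; the paper's Eq. (11) may be solved differently.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy student and frozen teachers (stand-ins for the real networks).
student = torch.nn.Linear(8, 4)
teachers = [torch.nn.Linear(8, 4) for _ in range(3)]
for t in teachers:
    t.requires_grad_(False)

# Learnable logits for alpha_m; softmax keeps the weights on the simplex,
# so they can be optimized end-to-end alongside the student.
alpha_logits = torch.zeros(len(teachers), requires_grad=True)
opt = torch.optim.SGD([*student.parameters(), alpha_logits], lr=0.1)

x = torch.randn(16, 8)
for step in range(5):
    opt.zero_grad()
    alpha = torch.softmax(alpha_logits, dim=0)
    s_log_probs = F.log_softmax(student(x), dim=1)
    # Weighted sum of per-teacher KL distillation losses: sum_m alpha_m * L_m.
    loss = sum(
        a * F.kl_div(s_log_probs, F.softmax(t(x), dim=1), reduction="batchmean")
        for a, t in zip(alpha, teachers)
    )
    loss.backward()
    opt.step()
    print(f"step {step}: loss={loss.item():.4f}, alpha={alpha.detach().numpy()}")
```

Note that this naive joint optimization tends to collapse \alpha onto whichever teacher yields the smallest distillation loss, regardless of that teacher's generalization quality, which is exactly why it matters whether AE-KD learns \alpha_m by gradient descent or instead solves for it in gradient space (e.g., via a min-norm/Pareto-style solver, as the title suggests).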