distillation model
Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation
Jiang, Chenyang, Zhao, Hang, Zhang, Xinyu, Li, Zhengcen, Shan, Qiben, Wu, Shaocong, Su, Jingyong
Dataset distillation compresses large-scale datasets into compact, highly informative synthetic data, significantly reducing storage and training costs. However, existing research primarily focuses on balanced datasets and struggles to perform under real-world long-tailed distributions. In this work, we emphasize the critical role of soft labels in long-tailed dataset distillation and uncover the underlying mechanisms contributing to performance degradation. Specifically, we derive an imbalance-aware generalization bound for models trained on distilled datasets. We then identify two primary sources of soft-label bias, which originate from the distillation model and the distilled images, through systematic perturbation of the data imbalance levels. To address this, we propose ADSA, an Adaptive Soft-label Alignment module that calibrates the entangled biases. This lightweight module integrates seamlessly into existing distillation pipelines and consistently improves performance. On ImageNet-1k-LT with EDC and IPC=50, ADSA improves tail-class accuracy by up to 11.8% and raises overall accuracy to 41.4%. Extensive experiments demonstrate that ADSA provides a robust and generalizable solution under limited label budgets and across a range of distillation techniques. Code is available at: https://github.com/j-cyoung/ADSA_DD.git.
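To make the idea of soft-label bias under class imbalance concrete, the following is a minimal sketch of a generic prior-based logit adjustment. It is not the ADSA module described in the abstract, whose details are not given here; the function names, the `tau` parameter, and the class counts are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prior_adjusted_soft_label(teacher_logits, class_counts, tau=1.0):
    """Subtract tau * log(class prior) from the teacher logits, then normalize.

    Assumption: the teacher was trained on long-tailed data, so its logits
    favor head classes roughly in proportion to the class prior.
    This is a generic logit adjustment, NOT the ADSA method.
    """
    total = sum(class_counts)
    log_prior = [math.log(c / total) for c in class_counts]
    adjusted = [z - tau * lp for z, lp in zip(teacher_logits, log_prior)]
    return softmax(adjusted)


# Hypothetical three-class example: head, medium, tail.
counts = [900, 90, 10]
logits = [2.0, 1.0, 0.0]  # teacher output that leans toward the head class
raw_label = softmax(logits)
adjusted_label = prior_adjusted_soft_label(logits, counts)
```

With `tau=0` the adjustment is disabled and the raw teacher soft label is returned; larger values of `tau` shift more probability mass toward rare classes.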
Ensemble Distribution Distillation for Self-Supervised Human Activity Recognition
Nolan, Matthew, Yao, Lina, Davidson, Robert
Human Activity Recognition (HAR) has seen significant advancements with the adoption of deep learning techniques, yet challenges remain in terms of data requirements, reliability and robustness. This paper explores a novel application of Ensemble Distribution Distillation (EDD) within a self-supervised learning framework for HAR aimed at overcoming these challenges. By leveraging unlabeled data and a partially supervised training strategy, our approach yields an increase in predictive accuracy, robust estimates of uncertainty, and substantial increases in robustness against adversarial perturbation, thereby significantly improving reliability in real-world scenarios without increasing computational complexity at inference. We demonstrate this with an evaluation on several publicly available datasets. The contributions of this work include the development of a self-supervised EDD framework, an innovative data augmentation technique designed for HAR, and empirical validation of the proposed method's effectiveness in increasing robustness and reliability.
Do GNN-based QEC Decoders Require Classical Knowledge? Evaluating the Efficacy of Knowledge Distillation from MWPM
Quantum computers hold the potential to outperform classical computers on specific computational problems, but their realization is hindered by the fragility of qubits due to decoherence. Quantum Error Correction (QEC) is an essential technology to overcome this challenge, enabling the detection and correction of errors by redundantly encoding a single logical qubit into multiple physical qubits. The performance of QEC is critically dependent on the classical "decoder" algorithm, which interprets the error syndrome to deduce the appropriate correction operation. The standard decoder for the surface code, Minimum-Weight Perfect Matching (MWPM) [1], performs well under a simplified noise model where errors are assumed to be independent and identically distributed (i.i.d.). However, noise in real quantum devices exhibits complex spatio-temporal correlations, and the discrepancy between the theoretical model and reality can degrade the decoder's performance. To address this, decoders based on machine learning, such as Graph Neural Networks (GNNs), have emerged as a promising alternative [2, 3]. GNNs have the ability to learn error patterns directly from data. It is generally anticipated that injecting physical prior knowledge into a GNN should improve its performance. Specifically, "knowledge distillation" [4], which transfers the knowledge of theoretical error structures from MWPM to a GNN, is considered a concrete method to realize this hypothesis.
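As a rough illustration of what a knowledge distillation objective looks like, the sketch below computes the common temperature-scaled KL divergence between teacher and student outputs. It assumes the teacher exposes per-class scores (logits); how soft targets would be derived from an MWPM decoder is not stated in the abstract, so the inputs here are hypothetical placeholders, as are the function names.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures. Teacher logits are an assumption:
    a decoder that only emits hard decisions would need another source
    of soft targets.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical two-class decoding example (e.g. logical flip vs. no flip).
teacher = [3.0, -1.0]
matched_loss = distillation_loss([3.0, -1.0], teacher)
mismatched_loss = distillation_loss([-1.0, 3.0], teacher)
```

In practice this term is usually combined with a standard cross-entropy loss on ground-truth labels, with a weighting coefficient chosen by validation.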
Interview with Yuki Mitsufuji: Text-to-sound generation
Earlier this year, we spoke to Yuki Mitsufuji, Lead Research Scientist at Sony AI, about work concerning different aspects of image generation. Yuki and his team have since extended their work to sound generation, presenting work at ICLR 2025 entitled: SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation. We caught up with Yuki to find out more. Creating sounds for different types of multimedia, such as video games and movies, takes a lot of experimenting, as artists try to match sounds to their evolving creative ideas. New high-quality diffusion-based Text-to-Sound (T2S) generative models can help with this process, but they are often slow, which makes it harder for creators to experiment quickly.
Robust Planning for Autonomous Vehicles with Diffusion-Based Failure Samplers
Wang, Juanran, Schlichting, Marc R., Kochenderfer, Mykel J.
High-risk traffic zones such as intersections are a major cause of collisions. This study leverages deep generative models to enhance the safety of autonomous vehicles in an intersection context. We train a 1000-step denoising diffusion probabilistic model to generate collision-causing sensor noise sequences for an autonomous vehicle navigating a four-way intersection based on the current relative position and velocity of an intruder. Using the generative adversarial architecture, the 1000-step model is distilled into a single-step denoising diffusion model which demonstrates fast inference speed while maintaining similar sampling quality. We demonstrate one possible application of the single-step model in building a robust planner for the autonomous vehicle. The planner uses the single-step model to efficiently sample potential failure cases based on the currently measured traffic state to inform its decision-making. Through simulation experiments, the robust planner demonstrates significantly lower failure rate and delay rate compared with the baseline Intelligent Driver Model controller.
A Survey on Pre-Trained Diffusion Model Distillations
Fan, Xuhui, Wu, Zhangkai, Wu, Hongyu
Diffusion Models (DMs) have emerged as the dominant approach in Generative Artificial Intelligence (GenAI), owing to their remarkable performance in tasks such as text-to-image synthesis. However, practical DMs, such as Stable Diffusion, are typically trained on massive datasets and thus usually require large storage. At the same time, many steps may be required, i.e., recursively evaluating the trained neural network, to generate a high-quality image, which results in significant computational costs during sample generation. As a result, distillation methods on pre-trained DMs have become widely adopted practices to develop smaller, more efficient models capable of rapid, few-step generation in low-resource environments. Because these distillation methods have been developed from different perspectives, there is an urgent need for a systematic survey, particularly from a methodological perspective. In this survey, we review distillation methods through three aspects: output loss distillation, trajectory distillation and adversarial distillation. We also discuss current challenges and outline future research directions in the conclusion.
Inference-Time Diffusion Model Distillation
Park, Geon Yeong, Lee, Sang Wan, Ye, Jong Chul
Diffusion distillation models effectively accelerate reverse sampling by compressing the process into fewer steps. However, these models still exhibit a performance gap compared to their pre-trained diffusion model counterparts, exacerbated by distribution shifts and accumulated errors during multi-step sampling. To address this, we introduce Distillation++, a novel inference-time distillation framework that reduces this gap by incorporating teacher-guided refinement during sampling. Inspired by recent advances in conditional sampling, our approach recasts student model sampling as a proximal optimization problem with a score distillation sampling (SDS) loss. To this end, we integrate distillation optimization during reverse sampling, which can be viewed as teacher guidance that drives the student sampling trajectory towards the clean manifold using pre-trained diffusion models. Thus, Distillation++ improves the denoising process in real time without additional source data or fine-tuning. Distillation++ demonstrates substantial improvements over state-of-the-art distillation baselines, particularly in early sampling stages, positioning itself as a robust guided sampling process crafted for diffusion distillation models. Code: https://github.com/geonyeong-park/inference_distillation.
MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning
Li, Yichuan, Ma, Xiyao, Lu, Sixing, Lee, Kyumin, Liu, Xiaohu, Guo, Chenlei
Large language models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities, where an LLM makes predictions for a given test input together with a few input-output pairs (demonstrations). Nevertheless, the inclusion of demonstrations leads to a quadratic increase in the computational overhead of the self-attention mechanism. Existing solutions attempt to distill lengthy demonstrations into compact vectors. However, they often require task-specific retraining or compromise the LLM's in-context learning performance. To mitigate these challenges, we present Meta dEmonstratioN Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit knowledge distillation to enhance alignment between MEND and the LLM, achieving both efficiency and effectiveness simultaneously. MEND is endowed with the meta-knowledge of distilling demonstrations through a two-stage training process, which includes meta-distillation pretraining and fine-tuning. Comprehensive evaluations across seven diverse ICL task partitions using decoder-only (GPT-2) and encoder-decoder (T5) models attest to MEND's prowess. It not only matches but often outperforms the Vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing the computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of large language models.
Boosting the Cross-Architecture Generalization of Dataset Distillation through an Empirical Study
Zhao, Lirui, Zhang, Yuxin, Lin, Mingbao, Chao, Fei, Ji, Rongrong
The poor cross-architecture generalization of dataset distillation greatly weakens its practical significance. This paper attempts to mitigate this issue through an empirical study, which suggests that the synthetic datasets carry an inductive bias towards the distillation model. As a result, the evaluation model is strictly confined to architectures similar to that of the distillation model. We propose a novel method of EvaLuation with distillation Feature (ELF), which utilizes features from intermediate layers of the distillation model for the cross-architecture evaluation. In this manner, the evaluation model learns from bias-free knowledge, so its architecture becomes unfettered while performance is retained. Through extensive experiments, we demonstrate that ELF effectively enhances the cross-architecture generalization of current DD methods. Code of this project is at https://github.com/Lirui-Zhao/ELF.
XAI for time-series classification leveraging image highlight methods
Makridis, Georgios, Fatouros, Georgios, Koukos, Vasileios, Kotios, Dimitrios, Kyriazis, Dimosthenis, Soldatos, Ioannis
Although much work has been done on explainability in the computer vision and natural language processing (NLP) fields, there is still much work to be done to explain methods applied to time series, because time series by nature cannot be understood at first sight. In this paper, we present a Deep Neural Network (DNN) in a teacher-student architecture (distillation model) that offers interpretability in time-series classification tasks. The explainability of our approach is based on transforming the time series to 2D plots and applying image highlight methods (such as LIME and GradCam), making the predictions interpretable. At the same time, the proposed approach offers increased accuracy relative to the baseline model, with the trade-off of increased training time.