Goto

Collaborating Authors

 An, Zhulin


Online Policy Distillation with Decision-Attention

arXiv.org Artificial Intelligence

Policy Distillation (PD) has become an effective method to improve deep reinforcement learning tasks. The core idea of PD is to distill policy knowledge from a teacher agent to a student agent. However, the teacher-student framework requires a well-trained teacher model which is computationally expensive.In the light of online knowledge distillation, we study the knowledge transfer between different policies that can learn diverse knowledge from the same environment.In this work, we propose Online Policy Distillation (OPD) with Decision-Attention (DA), an online learning framework in which different policies operate in the same environment to learn different perspectives of the environment and transfer knowledge to each other to obtain better performance together. With the absence of a well-performance teacher policy, the group-derived targets play a key role in transferring group knowledge to each student policy. However, naive aggregation functions tend to cause student policies quickly homogenize. To address the challenge, we introduce the Decision-Attention module to the online policies distillation framework. The Decision-Attention module can generate a distinct set of weights for each policy to measure the importance of group members. We use the Atari platform for experiments with various reinforcement learning algorithms, including PPO and DQN. In different tasks, our method can perform better than an independent training policy on both PPO and DQN algorithms. This suggests that our OPD-DA can transfer knowledge between different policies well and help agents obtain more rewards.


Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition

arXiv.org Artificial Intelligence

The teacher-free online Knowledge Distillation (KD) aims to train an ensemble of multiple student models collaboratively and distill knowledge from each other. Although existing online KD methods achieve desirable performance, they often focus on class probabilities as the core knowledge type, ignoring the valuable feature representational information. We present a Mutual Contrastive Learning (MCL) framework for online KD. The core idea of MCL is to perform mutual interaction and transfer of contrastive distributions among a cohort of networks in an online manner. Our MCL can aggregate cross-network embedding information and maximize the lower bound to the mutual information between two networks. This enables each network to learn extra contrastive knowledge from others, leading to better feature representations, thus improving the performance of visual recognition tasks. Beyond the final layer, we extend MCL to intermediate layers and perform an adaptive layer-matching mechanism trained by meta-optimization. Experiments on image classification and transfer learning to visual recognition tasks show that layer-wise MCL can lead to consistent performance gains against state-of-the-art online KD approaches. The superiority demonstrates that layer-wise MCL can guide the network to generate better feature representations. Our code is publicly avaliable at https://github.com/winycg/L-MCL.


Lifelong Generative Learning via Knowledge Reconstruction

arXiv.org Artificial Intelligence

Generative models often incur the catastrophic forgetting problem when they are used to sequentially learning multiple tasks, i.e., lifelong generative learning. Although there are some endeavors to tackle this problem, they suffer from high time-consumptions or error accumulation. In this work, we develop an efficient and effective lifelong generative model based on variational autoencoder (VAE). Unlike the generative adversarial network, VAE enjoys high efficiency in the training process, providing natural benefits with few resources. We deduce a lifelong generative model by expending the intrinsic reconstruction character of VAE to the historical knowledge retention. Further, we devise a feedback strategy about the reconstructed data to alleviate the error accumulation. Experiments on the lifelong generating tasks of MNIST, FashionMNIST, and SVHN verified the efficacy of our approach, where the results were comparable to SOTA.


GHFP: Gradually Hard Filter Pruning

arXiv.org Artificial Intelligence

Filter pruning is widely used to reduce the computation of deep learning, enabling the deployment of Deep Neural Networks (DNNs) in resource-limited devices. Conventional Hard Filter Pruning (HFP) method zeroizes pruned filters and stops updating them, thus reducing the search space of the model. On the contrary, Soft Filter Pruning (SFP) simply zeroizes pruned filters, keeping updating them in the following training epochs, thus maintaining the capacity of the network. However, SFP, together with its variants, converges much slower than HFP due to its larger search space. Our question is whether SFP-based methods and HFP can be combined to achieve better performance and speed up convergence. Firstly, we generalize SFP-based methods and HFP to analyze their characteristics. Then we propose a Gradually Hard Filter Pruning (GHFP) method to smoothly switch from SFP-based methods to HFP during training and pruning, thus maintaining a large search space at first, gradually reducing the capacity of the model to ensure a moderate convergence speed. Experimental results on CIFAR-10/100 show that our method achieves the state-of-the-art performance.


Softer Pruning, Incremental Regularization

arXiv.org Artificial Intelligence

Network pruning is widely used to compress Deep Neural Networks (DNNs). The Soft Filter Pruning (SFP) method zeroizes the pruned filters during training while updating them in the next training epoch. Thus the trained information of the pruned filters is completely dropped. To utilize the trained pruned filters, we proposed a SofteR Filter Pruning (SRFP) method and its variant, Asymptotic SofteR Filter Pruning (ASRFP), simply decaying the pruned weights with a monotonic decreasing parameter. Our methods perform well across various networks, datasets and pruning rates, also transferable to weight pruning. On ILSVRC-2012, ASRFP prunes 40% of the parameters on ResNet-34 with 1.63% top-1 and 0.68% top-5 accuracy improvement. In theory, SRFP and ASRFP are an incremental regularization of the pruned filters. Besides, We note that SRFP and ASRFP pursue better results while slowing down the speed of convergence.


EENA: Efficient Evolution of Neural Architecture

arXiv.org Machine Learning

Latest algorithms for automatic neural architecture search perform remarkable but are basically directionless in search space and computational expensive in training of every intermediate architecture. In this paper, we propose a method for efficient architecture search called EENA (Efficient Evolution of Neural Architecture). Due to the elaborately designed mutation and crossover operations, the evolution process can be guided by the information have already been learned. Therefore, less computational effort will be required while the searching and training time can be reduced significantly. On CIFAR-10 classification, EENA using minimal computational resources (0.65 GPU days) can design highly effective neural architecture which achieves 2.56% test error with 8.47M parameters. Furthermore, the best architecture discovered is also transferable for CIFAR-100.