Liu, Ding, Yao, Zekun, Zhang, Quan

Tensor networks (TN) have found wide use in machine learning, and TN and deep learning in particular bear striking similarities. In this work, we propose quantum-classical hybrid tensor networks (HTN), which combine tensor networks with classical neural networks in a uniform deep learning framework to overcome the limitations of regular tensor networks in machine learning. We first analyze the limitations of regular tensor networks in machine learning applications with respect to representation power and architecture scalability. We conclude that regular tensor networks are in fact not suited to be the basic building blocks of deep learning. Then, we discuss the performance of HTN, which overcome all the deficiencies of regular tensor networks for machine learning. In this sense, we are able to train HTN in the deep learning way, that is, with the standard combination of algorithms such as Back Propagation and Stochastic Gradient Descent. We finally provide two applicable cases to show the potential applications of HTN: quantum state classification and a quantum-classical autoencoder. These cases also demonstrate the great potential for designing various HTN in the deep learning way.
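As a rough illustration of what combining a tensor network with a classical neural network can look like, the sketch below contracts a small matrix product state (MPS) with an encoded input and passes the result to a dense softmax layer. The abstract does not specify the architecture, so every detail here is an assumption made for illustration: the MPS layout, the cosine/sine feature map, and the names `feature_map`, `init_mps`, `tn_layer`, and `dense_softmax` are all hypothetical. Only the forward pass is shown; in practice the gradients for Back Propagation would come from an automatic differentiation framework.

```python
import math
import random


def feature_map(x):
    """Encode a scalar in [0, 1] as a 2-component local vector (assumed encoding)."""
    return [math.cos(math.pi * x / 2), math.sin(math.pi * x / 2)]


def init_mps(n_sites, bond_dim, out_dim, rng):
    """Random MPS cores indexed as core[left][physical][right].

    The first left index has size 1; the last right index is the output
    dimension handed to the classical layer.
    """
    dims = [1] + [bond_dim] * (n_sites - 1) + [out_dim]
    return [[[[rng.gauss(0, 0.5) for _ in range(dims[i + 1])]
              for _ in range(2)]
             for _ in range(dims[i])]
            for i in range(n_sites)]


def tn_layer(cores, xs):
    """Contract the MPS with the encoded inputs, sweeping left to right."""
    vec = [1.0]
    for core, x in zip(cores, xs):
        phi = feature_map(x)
        right = len(core[0][0])
        vec = [sum(vec[l] * phi[p] * core[l][p][r]
                   for l in range(len(vec)) for p in range(2))
               for r in range(right)]
    return vec


def dense_softmax(weights, bias, h):
    """Classical part: one dense layer followed by a numerically stable softmax."""
    logits = [sum(w * v for w, v in zip(row, h)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
cores = init_mps(n_sites=4, bond_dim=3, out_dim=5, rng=rng)
W = [[rng.gauss(0, 0.5) for _ in range(5)] for _ in range(2)]  # 2 classes
b = [0.0, 0.0]

hidden = tn_layer(cores, [0.1, 0.9, 0.4, 0.7])
probs = dense_softmax(W, b, hidden)
print(probs)
```

Because the tensor-network contraction and the dense layer are both differentiable functions of their parameters, the whole pipeline could in principle be trained end to end with Stochastic Gradient Descent.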

Changyong, Shu, Peng, Li, Yuan, Xie, Yanyun, Qu, Longquan, Dai, Lizhuang, Ma

Deep network compression has achieved notable progress via knowledge distillation, where a teacher-student learning scheme is adopted using a predetermined loss. Recently, more attention has shifted to employing adversarial training to minimize the discrepancy between the output distributions of the two networks. However, these methods emphasize result-oriented learning while neglecting process-oriented learning, leading to the loss of rich information contained in the whole network pipeline. Motivated by the assumption that a small network cannot perfectly mimic a large one due to the huge gap in network scale, we propose a knowledge transfer method, involving effective intermediate supervision, under the adversarial training framework to learn the student network. To achieve a powerful but highly compact intermediate information representation, the squeezed knowledge is realized by a task-driven attention mechanism. The knowledge transferred from the teacher network can then accommodate the size of the student network. As a result, the proposed method integrates the merits of both process-oriented and result-oriented learning. Extensive experimental results on three typical benchmark datasets, i.e., CIFAR-10, CIFAR-100, and ImageNet, demonstrate that our method achieves superior performance compared with other state-of-the-art methods.
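For context on the "predetermined loss" used in conventional knowledge distillation, the sketch below shows one common choice: the KL divergence between temperature-softened teacher and student outputs. This is the generic soft-target loss, not the adversarial, attention-based method proposed in this abstract; the function names, the temperature value, and the example logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by temperature**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi))
             for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [5.0, 1.0, -2.0]          # made-up logits
close_student = [4.5, 1.2, -1.8]    # roughly agrees with the teacher
far_student = [-2.0, 1.0, 5.0]      # ranks the classes in reverse

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

A loss of this kind only compares final outputs, which is the result-oriented supervision the abstract contrasts with supervision of intermediate representations.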

I'm going to do a comparison of recent (or at least lesser-known) gradient optimization methods. The ones I've encountered up to now are the following: I would be particularly interested in approaches not using any hyperparameters at all (such as number 3 - COCOB), however I will consider all of the interesting and promising methods. Are you aware of any new or lesser-known approaches? I previously posted the question in r/MLQuestions but haven't received any feedback, so I'm posting it here. I hope it's not violating any rules.

Ephrath, Jonathan, Ruthotto, Lars, Haber, Eldad, Treister, Eran

Convolutional Neural Networks (CNNs) filter the input data using a series of spatial convolution operators with compact stencils and point-wise non-linearities. Commonly, the convolution operators couple features from all channels, which leads to immense computational cost in the training of and prediction with CNNs. To improve the efficiency of CNNs, we introduce lean convolution operators that reduce the number of parameters

In recent years there has been an effort to reduce the number of parameters in CNNs. Among the first approaches are the methods of pruning (Hassibi & Stork, 1992; Han et al., 2015; Li et al., 2017) and sparsity (Wen et al., 2016; Changpinyo et al., 2017; Han et al., 2016) that have been typically applied to already trained full networks. It has been shown that once a network is trained, a large portion of its weights can be removed without hampering its efficiency by much.
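To make the cost of coupling all channels concrete, the short sketch below counts the weights of a fully coupled convolution and compares it with a depthwise convolution followed by a 1x1 point-wise convolution. The text above is cut off before it describes the lean operators, so the depthwise-plus-pointwise layout is only an assumed, widely used stand-in and not necessarily the authors' construction. The function names and channel counts are hypothetical, and bias terms are omitted.

```python
def full_conv_params(c_in, c_out, k):
    """Weights of a k x k convolution that couples every input and output channel."""
    return c_in * c_out * k * k


def depthwise_plus_pointwise_params(c_in, c_out, k):
    """Weights of a per-channel k x k stencil plus a 1 x 1 channel-mixing step."""
    return c_in * k * k + c_in * c_out


c_in, c_out, k = 256, 256, 3  # illustrative layer size
full = full_conv_params(c_in, c_out, k)
lean = depthwise_plus_pointwise_params(c_in, c_out, k)

print(full)   # 589824
print(lean)   # 67840
print(round(full / lean, 2))
```

For this layer size, the fully coupled operator needs 589,824 weights while the assumed lean layout needs 67,840, which shows how strongly the channel coupling dominates the parameter count.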