Zhou, Pan, Yuan, Xiaotong, Xu, Huan, Yan, Shuicheng, Feng, Jiashi

We address the problem of meta-learning, which learns a prior over hypotheses from a sample of meta-training tasks for fast adaptation on meta-testing tasks. A particularly simple yet successful paradigm for this research is model-agnostic meta-learning (MAML). Implementation and analysis of MAML, however, can be tricky: a first-order approximation is usually adopted to avoid directly computing the Hessian matrix, but as a result the convergence and generalization guarantees of MAML remain largely mysterious. To remedy this deficiency, in this paper we propose a minibatch proximal update based meta-learning approach for efficient hypothesis transfer. The principle is to learn a prior hypothesis shared across tasks such that minibatch risk minimization, biased by regularization toward this prior, can quickly converge to the optimal hypothesis in each training task.
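One way to read the principle above is sketched below on toy scalar tasks. This is an illustration under stated assumptions, not the paper's implementation: each task loss is taken to be $\frac{1}{2}(w - c)^2$, the regularizer is assumed to be $\frac{\lambda}{2}(w - \theta)^2$, and the names `adapt` and `meta_train`, the step sizes, and the meta-update rule are all hypothetical choices.

```python
import random

def adapt(theta, center, lam, steps=50, lr=0.1):
    """Approximately minimize 0.5*(w - center)**2 + 0.5*lam*(w - theta)**2
    by plain gradient descent, starting from the prior theta."""
    w = theta
    for _ in range(steps):
        grad = (w - center) + lam * (w - theta)
        w -= lr * grad
    return w

def meta_train(centers, lam=1.0, meta_lr=0.5, rounds=200, batch=2, seed=0):
    """Learn a shared prior theta from toy tasks (assumed update rule)."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(rounds):
        # Draw a minibatch of tasks and adapt to each, regularized toward theta.
        tasks = rng.sample(centers, batch)
        adapted = [adapt(theta, c, lam) for c in tasks]
        mean_w = sum(adapted) / len(adapted)
        # Move the prior toward the average adapted hypothesis.
        theta -= meta_lr * lam * (theta - mean_w)
    return theta

if __name__ == "__main__":
    print(meta_train([1.0, 2.0, 3.0, 4.0], batch=4))
```

With every task in each minibatch, the prior in this toy setting settles at the mean of the task optima, so a single regularized adaptation step starts close to every task's solution.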

Xue, Zeyue, Luo, Shuang, Wu, Chao, Zhou, Pan, Bian, Kaigui, Du, Wei

Peer-to-peer knowledge transfer in distributed environments has emerged as a promising method in deep reinforcement learning, since it can accelerate learning and improve team-wide performance without relying on pre-trained teachers. However, traditional peer-to-peer methods such as action advising have difficulty expressing knowledge and advice efficiently. We therefore propose a new solution that reuses experiences and transfers value functions among multiple students via model distillation. Transferring the Q-function directly remains challenging, however, since it is unstable and unbounded. To address this issue in existing work, we adopt the Categorical Deep Q-Network. We also describe how to design an efficient communication protocol to exploit heterogeneous knowledge among multiple distributed agents. Our proposed framework, Learning and Teaching Categorical Reinforcement (LTCR), shows promising performance in stabilizing and accelerating learning progress, with improved team-wide reward in four typical experimental environments.

Huang, Huawei, Lin, Kangying, Guo, Song, Zhou, Pan, Zheng, Zibin

Federated Learning (FL) is viewed as a promising technique for future distributed machine learning. It permits a large number of mobile devices to participate collaboratively in the training of a global model without having to expose their local private data. Although the challenge of the network connection will be much relieved in the 5G/B5G era, the training latency is still an obstacle preventing FL from being widely adopted. One of the most fundamental causes of large training latency is the poor selection of FL participants. To the best of our knowledge, the existing candidate-selection algorithms are all reactive. Under such reactive selection, the FL parameter server only knows the currently observed resources of all candidates. In a dynamic FL environment, the mobile devices selected by reactive candidate-selection algorithms may well fail to complete the training and reporting phases of FL. To this end, we study proactive candidate selection for FL in this paper. We first let each candidate device locally predict the qualities of both its training and reporting phases using an LSTM network. Then, the proposed candidate-selection algorithm is implemented in a Deep Reinforcement Learning (DRL) framework, which can adapt to the dynamically varying factors in the metropolitan edge computing environment. Finally, real-world trace-driven experiments show that the proposed proactive approach outperforms the existing reactive algorithms with respect to the ratio of valid participants and the test accuracy of the aggregated global FL model.

Federated Learning (FL) [1], [2] is a branch of distributed machine learning that enables a group of distributed devices to train their individual local models using their local datasets. Thus, FL is a promising computing paradigm in our future intelligent life, especially under the fifth generation (5G) and beyond (B5G) communications networks. For example, the FederatedAveraging (FedAvg) algorithm [1] can help mobile users predict the next word when they are typing with Google's Gboard [3] on their smartphones.
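The server-side aggregation step commonly associated with FedAvg can be sketched as a sample-size-weighted average of the client parameters. This is a minimal illustration only and does not cover the candidate-selection method described in the abstract; the function name `fedavg` and the representation of a model as a flat list of floats are assumptions made for the example.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by the
    number of local training samples it used (assumed representation:
    one flat list of floats per client)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]

if __name__ == "__main__":
    # Two clients holding 1 and 3 samples respectively.
    print(fedavg([[1.0, 0.0], [3.0, 4.0]], [1, 3]))
```

A client that fails to report in a round would simply be absent from both lists, which is why the ratio of valid participants matters for the quality of the aggregate.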

Lu, Canyi, Zhou, Pan

This work studies the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to exactly recover the low-rank and sparse components from their sum. Our model is motivated by the recently proposed tensor-tensor product and tensor SVD based on linear transforms. We define a new transform-dependent tensor rank and the corresponding tensor nuclear norm. Then we solve the TRPCA problem by convex optimization whose objective is a weighted combination of the new tensor nuclear norm and the $\ell_1$-norm. In theory, we show that under certain incoherence conditions, the convex program exactly recovers the underlying low-rank and sparse components. Notably, our new TRPCA model generalizes existing works. In particular, if the studied tensor reduces to a matrix, our TRPCA model reduces to the known matrix RPCA. Our new TRPCA, which allows general linear transforms, can be regarded as an extension of our former TRPCA work, which uses the discrete Fourier transform, but the proofs of the two recovery guarantees differ. Numerical experiments verify our results, and the application to image recovery demonstrates the superiority of our method.
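For orientation, a convex program of the kind described above can be written as follows. The symbols are illustrative assumptions rather than the paper's notation: $\mathcal{X}$ is the observed tensor, $\mathcal{L}$ and $\mathcal{S}$ are the low-rank and sparse components, $\|\cdot\|_*$ stands for the transform-dependent tensor nuclear norm, and $\lambda > 0$ is the weight.

$$\min_{\mathcal{L},\,\mathcal{S}} \; \|\mathcal{L}\|_* + \lambda \|\mathcal{S}\|_1 \quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S}.$$

When the tensors are matrices, this has the same form as the classical matrix RPCA program, which matches the reduction stated in the abstract.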

Xu, Shuangjie, Liu, Daizong, Bao, Linchao, Liu, Wei, Zhou, Pan

We address the problem of semi-supervised video object segmentation (VOS), where the masks of objects of interest are given in the first frame of an input video. To deal with challenging cases where objects are occluded or missing, previous work relies on greedy data association strategies that make decisions for each frame individually. In this paper, we propose a novel approach that defers the decision for a target object in each frame until a global view can be established with the entire video taken into consideration. Our approach is in the same spirit as Multiple Hypotheses Tracking (MHT) methods, with several critical adaptations for the VOS problem. We employ the bounding box (bbox) hypothesis for tracking tree formation, and the multiple hypotheses are spawned by propagating the preceding bbox into the detected bbox proposals within a gated region, starting from the initial object mask in the first frame. The gated region is determined by a gating scheme that takes into account a more comprehensive motion model than the simple Kalman filtering model in traditional MHT. To tailor the algorithm further to VOS, we develop a novel mask propagation score to replace the appearance similarity score, which can be brittle under large deformations. The mask propagation score, together with the motion score, determines the affinity between the hypotheses during tree pruning. Finally, a novel mask merging strategy is employed to handle mask conflicts between objects. Extensive experiments on challenging datasets demonstrate the effectiveness of the proposed method, especially when objects go missing.

Shen, Zebang, Zhou, Pan, Fang, Cong, Ribeiro, Alejandro

We target the problem of finding a local minimum in non-convex finite-sum minimization. Towards this goal, we first prove that the trust region method with inexact gradient and Hessian estimation can achieve a convergence rate of order $\mathcal{O}(1/{k^{2/3}})$ as long as those differential estimations are sufficiently accurate. Combining this result with a novel Hessian estimator, we propose the sample-efficient stochastic trust region (STR) algorithm, which finds an $(\epsilon, \sqrt{\epsilon})$-approximate local minimum within $\mathcal{O}({\sqrt{n}}/{\epsilon^{1.5}})$ stochastic Hessian oracle queries. This improves on the state-of-the-art result by $\mathcal{O}(n^{1/6})$. Experiments verify the theoretical conclusions and the efficiency of STR.
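As a reminder of the usual meaning of the term, and assuming the objective is $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ (notation introduced here for illustration), a point $x$ is commonly called an $(\epsilon, \sqrt{\epsilon})$-approximate local minimum when

$$\|\nabla F(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}\!\left(\nabla^2 F(x)\right) \ge -\sqrt{\epsilon}.$$

Some papers scale the second condition by a problem constant such as the Hessian Lipschitz parameter, so the exact form used in this work may differ.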

Zhou, Pan, Yuan, Xiaotong, Feng, Jiashi

Stochastic gradient hard thresholding methods have recently been shown to work favorably in solving large-scale empirical risk minimization problems under sparsity or rank constraints. Despite the improved iteration complexity over full gradient methods, the gradient evaluation and hard thresholding complexity of the existing stochastic algorithms usually scales linearly with data size, which can still be expensive when the data is huge, and the hard thresholding step can be as expensive as a singular value decomposition in rank-constrained problems. To address these deficiencies, we propose an efficient hybrid stochastic gradient hard thresholding (HSG-HT) method that provably has sample-size-independent gradient evaluation and hard thresholding complexity bounds. Specifically, we prove that the stochastic gradient evaluation complexity of HSG-HT scales linearly with the inverse of the sub-optimality and that its hard thresholding complexity scales logarithmically. By applying the heavy ball acceleration technique, we further propose an accelerated variant of HSG-HT that has an improved dependence on the restricted condition number. Numerical results confirm our theoretical findings and demonstrate the computational efficiency of the proposed methods.
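The general idea of combining growing minibatches with hard thresholding can be sketched for sparse least squares as below. This is a toy illustration under assumptions, not the algorithm as specified in the paper: the geometric batch growth factor, the fixed step size, the least-squares loss, and the names `hard_threshold` and `hsg_ht` are all hypothetical choices, and the heavy ball variant is not shown.

```python
import random

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def hsg_ht(A, y, k, step=0.5, iters=30, growth=1.5, seed=0):
    """Minimize the average of 0.5*(a_i . x - y_i)**2 subject to x being
    k-sparse, using minibatches whose size grows geometrically."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    batch = 1.0
    for _ in range(iters):
        m = min(n, int(batch))
        idx = rng.sample(range(n), m)  # sampled without replacement
        grad = [0.0] * d
        for i in idx:
            residual = sum(A[i][j] * x[j] for j in range(d)) - y[i]
            for j in range(d):
                grad[j] += residual * A[i][j] / m
        # Gradient step followed by projection onto k-sparse vectors.
        x = hard_threshold([x[j] - step * grad[j] for j in range(d)], k)
        batch *= growth  # early steps are cheap, later steps are accurate
    return x

if __name__ == "__main__":
    A = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0],
         [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
    y = [6.0, 0.0, 0.0, -4.0]
    print(hsg_ht(A, y, k=2))
```

Early iterations touch only a few samples, while later iterations use nearly the full gradient, which is the intuition behind keeping the total number of gradient evaluations and thresholding steps small.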

Zhou, Pan, Yuan, Xiaotong, Feng, Jiashi

As an incremental-gradient algorithm, hybrid stochastic gradient descent (HSGD) enjoys the merits of both stochastic and full gradient methods for finite-sum minimization problems. However, the existing rate-of-convergence analysis for HSGD is made under with-replacement sampling (WRS) and is restricted to convex problems. It is not clear whether HSGD still carries these advantages under the common practice of without-replacement sampling (WoRS) for non-convex problems. In this paper, we answer this open question affirmatively by showing that under WoRS, and for both convex and non-convex problems, it is still possible for HSGD (with constant step-size) to match full gradient descent in rate of convergence, while maintaining a sample-size-independent incremental first-order oracle complexity comparable to that of stochastic gradient descent. For a special class of finite-sum problems with linear prediction models, our convergence results can be further improved in some cases. Extensive numerical results confirm our theoretical findings and demonstrate the favorable efficiency of WoRS-based HSGD.