Feng, Jiashi
Hyperparameter Transfer Learning through Surrogate Alignment for Efficient Deep Neural Network Training
Ilievski, Ilija, Feng, Jiashi
Recently, several optimization methods have been successfully applied to the hyperparameter optimization of deep neural networks (DNNs). These methods work by modeling the joint distribution of hyperparameter values and the corresponding error. However, they become impractical for modern DNNs, whose training may take several days, so one cannot collect enough observations to model the distribution accurately. To address this challenging issue, we propose a method that learns to transfer the optimal hyperparameter values for a small source dataset to hyperparameter values with comparable performance on a dataset of interest. Unlike existing transfer learning methods, the proposed method does not rely on hand-designed features. Instead, it uses surrogates to model the hyperparameter-error distributions of the two datasets and trains a neural network to learn the transfer function. Extensive experiments on three computer vision benchmark datasets clearly demonstrate the efficiency of our method.
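The abstract only outlines the pipeline, so the Python sketch below is hypothetical: it fits a Gaussian-process surrogate to each dataset's hyperparameter-error observations and trains a small network as the transfer map. Pairing candidate settings by their surrogate-predicted error rank is an illustrative assumption, not the paper's actual alignment procedure.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor

def learn_transfer(source_hp, source_err, target_hp, target_err, candidates):
    """Hypothetical sketch: fit a surrogate of the hyperparameter-error
    distribution on each dataset, then train a small network as the
    source-to-target transfer map. Pairing candidate settings by
    surrogate-predicted error rank is an illustrative assumption."""
    src = GaussianProcessRegressor().fit(source_hp, source_err)
    tgt = GaussianProcessRegressor().fit(target_hp, target_err)
    # Rank a shared candidate pool under each surrogate and pair by rank
    order_src = np.argsort(src.predict(candidates))
    order_tgt = np.argsort(tgt.predict(candidates))
    transfer = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
    transfer.fit(candidates[order_src], candidates[order_tgt])
    return transfer
```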
Return of Frustratingly Easy Domain Adaptation
Sun, Baochen (University of Massachusetts, Lowell) | Feng, Jiashi (University of California, Berkeley and National University of Singapore) | Saenko, Kate (University of Massachusetts, Lowell)
Unlike human learning, machine learning often fails to handle changes between training (source) and test (target) input distributions. Such domain shifts, common in practical scenarios, severely damage the performance of conventional machine learning methods. Supervised domain adaptation methods have been proposed for the case when the target data have labels, including some that perform very well despite being "frustratingly easy" to implement. However, in practice, the target domain is often unlabeled, requiring unsupervised adaptation. We propose a simple, effective, and efficient method for unsupervised domain adaptation called CORrelation ALignment (CORAL). CORAL minimizes domain shift by aligning the second-order statistics of the source and target distributions, without requiring any target labels. Even though it is extraordinarily simple (it can be implemented in four lines of Matlab code), CORAL performs remarkably well in extensive evaluations on standard benchmark datasets.
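The whiten-then-re-color transform that the abstract compresses into four lines of Matlab carries over directly to NumPy/SciPy. A minimal sketch is below; the identity regularization term is a common stabilizer, assumed here rather than stated in the abstract.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def coral(source, target):
    """Align source features to the target domain by matching second-order
    statistics: whiten with the source covariance, re-color with the target's."""
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + np.eye(d)  # regularized source covariance
    cov_t = np.cov(target, rowvar=False) + np.eye(d)  # regularized target covariance
    whitened = source @ fractional_matrix_power(cov_s, -0.5)
    return whitened @ fractional_matrix_power(cov_t, 0.5)
```

The transformed source features can then be fed to any off-the-shelf classifier trained on the labeled source data and applied to the target domain.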
Deep Learning with S-Shaped Rectified Linear Activation Units
Jin, Xiaojie (National University of Singapore) | Xu, Chunyan (Nanjing University of Science and Technology) | Feng, Jiashi (National University of Singapore) | Wei, Yunchao (National University of Singapore) | Xiong, Junjun (Beijing Samsung Telecom) | Yan, Shuicheng (National University of Singapore)
Rectified linear activation units are important components of state-of-the-art deep convolutional networks. In this paper, we propose a novel S-shaped rectified linear activation unit (SReLU) to learn both convex and non-convex functions, imitating the multiple function forms given by two fundamental laws of psychophysics and neuroscience, namely the Weber-Fechner law and the Stevens law. Specifically, SReLU consists of three piecewise linear functions, which are formulated by four learnable parameters. SReLU is learned jointly with the whole deep network through backpropagation. During the training phase, to initialize SReLU in different layers, we propose a "freezing" method that degenerates SReLU into a predefined leaky rectified linear unit for the initial several training epochs and then adaptively learns good initial values. SReLU can be used universally in existing deep networks with negligible additional parameters and computation cost. Experiments with two popular CNN architectures, Network in Network and GoogLeNet, on benchmarks of various scales, including CIFAR10, CIFAR100, MNIST, and ImageNet, demonstrate that SReLU achieves remarkable improvement over other activation functions.
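A minimal NumPy sketch of the piecewise form described above, with the four parameters written as scalars for clarity (the paper learns them jointly with the network; the abstract does not spell out the per-layer parameterization):

```python
import numpy as np

def srelu(x, t_left, a_left, t_right, a_right):
    """S-shaped rectified linear unit: identity between the two thresholds,
    slope a_right above t_right, and slope a_left below t_left."""
    y = np.where(x >= t_right, t_right + a_right * (x - t_right), x)
    return np.where(x <= t_left, t_left + a_left * (x - t_left), y)
```

Note that with t_left = 0, a small a_left, and t_right large enough that the upper segment is never active, the unit reduces to a leaky ReLU, which is one way to read the "freezing" initialization described above.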
Distributed Robust Learning
Feng, Jiashi, Xu, Huan, Mannor, Shie
We propose a framework for distributed robust statistical learning on {\em big contaminated data}. The Distributed Robust Learning (DRL) framework can reduce the computational time of traditional robust learning methods by several orders of magnitude. We analyze the robustness property of DRL, showing that DRL not only preserves the robustness of the base robust learning method, but also tolerates contamination of a constant fraction of results from computing nodes (node failures). More precisely, even in the presence of the most adversarial outlier distribution over computing nodes, DRL still achieves a breakdown point of at least $ \lambda^*/2 $, where $ \lambda^* $ is the breakdown point of the corresponding centralized algorithm. This is in stark contrast with a naive division-and-averaging implementation, which may reduce the breakdown point by a factor of $ k $ when $ k $ computing nodes are used. We then specialize the DRL framework to two concrete cases: distributed robust principal component analysis and distributed robust regression. We demonstrate the efficiency and robustness advantages of DRL through comprehensive simulations and through predicting image tags on a large-scale image set.
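The abstract does not specify DRL's aggregation rule, so the coordinate-wise median in the sketch below is an illustrative stand-in. It shows why a robust aggregator, unlike plain averaging, lets a constant fraction of contaminated or failed nodes be tolerated:

```python
import numpy as np

def distributed_robust_fit(data_chunks, robust_estimator):
    """Run a robust base estimator on each node's data chunk, then combine
    the per-node estimates with a robust aggregator so that a constant
    fraction of contaminated or failed nodes cannot skew the final result.
    The coordinate-wise median used here is an illustrative choice, not
    necessarily the paper's exact rule."""
    node_estimates = np.stack([robust_estimator(chunk) for chunk in data_chunks])
    return np.median(node_estimates, axis=0)
```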
Robust Logistic Regression and Classification
Feng, Jiashi, Xu, Huan, Mannor, Shie, Yan, Shuicheng
We consider logistic regression with arbitrary outliers in the covariate matrix. We propose a new robust logistic regression algorithm, called RoLR, that estimates the parameter through a simple linear programming procedure. We prove that RoLR is robust to a constant fraction of adversarial outliers. To the best of our knowledge, this is the first result with performance guarantees for estimating a logistic regression model when the covariate matrix is arbitrarily corrupted. Beyond regression, we apply RoLR to solve binary classification problems in which a fraction of the training samples are corrupted.
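The abstract does not detail the linear program, so the sketch below is explicitly not RoLR; it is a simple trimmed-loss heuristic that illustrates the problem setting of fitting logistic regression when some covariate rows are adversarial:

```python
import numpy as np

def trimmed_logistic_fit(X, y, trim_frac=0.1, lr=0.1, epochs=200):
    """Trimmed-loss logistic regression heuristic (NOT the paper's RoLR
    linear program): at each step, drop the highest-loss samples, which
    adversarial covariate rows tend to produce. Labels y are in {-1, +1}."""
    n, d = X.shape
    beta = np.zeros(d)
    keep = n - int(trim_frac * n)
    for _ in range(epochs):
        margins = y * (X @ beta)
        losses = np.logaddexp(0.0, -margins)          # stable log(1 + e^{-m})
        idx = np.argsort(losses)[:keep]               # retain low-loss samples
        w = np.exp(-np.logaddexp(0.0, margins[idx]))  # sigmoid(-margin)
        grad = -(y[idx, None] * X[idx] * w[:, None]).mean(axis=0)
        beta -= lr * grad
    return beta
```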
Online Robust PCA via Stochastic Optimization
Feng, Jiashi, Xu, Huan, Yan, Shuicheng
Robust PCA methods are typically based on batch optimization and have to load all samples into memory, which prevents them from efficiently processing big data. In this paper, we develop an Online Robust Principal Component Analysis (OR-PCA) method that processes one sample per time instance, so its memory cost is independent of the data size, significantly improving computation and storage efficiency. The proposed method is based on stochastic optimization of an equivalent reformulation of the batch RPCA method. Indeed, we show that OR-PCA produces a sequence of subspace estimates converging to the optimum of its batch counterpart and is hence provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied to tracking dynamic subspaces. Comprehensive simulations on subspace recovery and tracking demonstrate the robustness and efficiency advantages of OR-PCA over online PCA and batch RPCA methods.
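A compressed sketch of what one stochastic step consistent with this description might look like: given a new sample, alternately solve for its coefficients and sparse error, then refresh the basis from running sufficient statistics. The inner alternating solver and the closed-form basis update are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def orpca_step(z, L, A, B, lam1, lam2, inner_iters=50):
    """One online update for a new sample z given basis L (d x r) and running
    statistics A (r x r), B (d x r). Illustrative sketch only."""
    d, r_dim = L.shape
    r, e = np.zeros(r_dim), np.zeros(d)
    for _ in range(inner_iters):
        # Ridge solve for coefficients given the current sparse-error estimate
        r = np.linalg.solve(L.T @ L + lam1 * np.eye(r_dim), L.T @ (z - e))
        # Soft-threshold the residual to update the sparse error
        resid = z - L @ r
        e = np.sign(resid) * np.maximum(np.abs(resid) - lam2, 0.0)
    # Accumulate sufficient statistics and refresh the basis in closed form
    A += np.outer(r, r)
    B += np.outer(z - e, r)
    L = B @ np.linalg.inv(A + lam1 * np.eye(r_dim))
    return L, A, B
```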
Online PCA for Contaminated Data
Feng, Jiashi, Xu, Huan, Mannor, Shie, Yan, Shuicheng
We consider online Principal Component Analysis (PCA) in which contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitivity to outliers, previous online PCA algorithms fail in this case, and their results can be arbitrarily skewed by the outliers. Here we propose an online robust PCA algorithm that is able to steadily improve the PC estimate from an initial one, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA suffers only an acceptable degradation from the optimum. In fact, under mild conditions, online RPCA achieves maximal robustness, with a 50% breakdown point. Moreover, online RPCA is shown to be efficient in both storage and computation, since it need not revisit previous samples as traditional robust PCA algorithms do.
Robust PCA in High-dimension: A Deterministic Approach
Feng, Jiashi, Xu, Huan, Yan, Shuicheng
We consider principal component analysis for contaminated data sets in the high-dimensional regime, where the dimensionality of each observation is comparable to, or even larger than, the number of observations. We propose a deterministic high-dimensional robust PCA algorithm that inherits all the theoretical properties of its randomized counterpart: it is tractable, robust to contaminated points, easily kernelizable, asymptotically consistent, and achieves maximal robustness with a breakdown point of 50%. More importantly, the proposed method exhibits significantly better computational efficiency, which makes it suitable for large-scale real applications.