Collaborating Authors

 tian and feng


Co-Regularization Enhances Knowledge Transfer in High Dimensions

Neural Information Processing Systems

Most existing transfer learning algorithms for high-dimensional models employ a two-step regularization framework, whose success heavily hinges on the assumption that the pre-trained model closely resembles the target. To relax this assumption, we propose a co-regularization process to directly exploit beneficial knowledge from the source domain for high-dimensional generalized linear models. The proposed method learns the target parameter by constraining the source parameters to be close to the target one, thereby preventing fine-tuning failures caused by significantly deviated pre-trained parameters. Our theoretical analysis demonstrates that the proposed method accommodates a broader range of sources than existing two-step frameworks, thus being more robust to less similar sources. Its effectiveness is validated through extensive empirical studies.
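The joint-estimation idea in this abstract can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes linear regression rather than a generalized linear model, and it replaces whatever penalties the authors use with a squared-distance coupling term plus a small ridge term so that the solution has a closed form. The function name `co_regularized_fit` and the parameters `mu` and `ridge` are hypothetical.

```python
import numpy as np

def co_regularized_fit(X_t, y_t, X_s, y_s, mu=1.0, ridge=1e-3):
    """Jointly estimate target (beta) and source (w) coefficients.

    Illustrative objective (an assumption, not the paper's):
        ||y_t - X_t beta||^2 / (2 n_t) + ||y_s - X_s w||^2 / (2 n_s)
        + (mu / 2) * ||w - beta||^2
        + (ridge / 2) * (||beta||^2 + ||w||^2)
    """
    n_t, p = X_t.shape
    n_s = X_s.shape[0]
    A_t = X_t.T @ X_t / n_t          # target Gram matrix
    A_s = X_s.T @ X_s / n_s          # source Gram matrix
    b_t = X_t.T @ y_t / n_t
    b_s = X_s.T @ y_s / n_s
    I = np.eye(p)
    # Setting both gradients to zero gives one linear system in (beta, w):
    #   (A_t + (mu + ridge) I) beta - mu w   = b_t
    #   -mu beta + (A_s + (mu + ridge) I) w  = b_s
    K = np.block([
        [A_t + (mu + ridge) * I, -mu * I],
        [-mu * I, A_s + (mu + ridge) * I],
    ])
    sol = np.linalg.solve(K, np.concatenate([b_t, b_s]))
    return sol[:p], sol[p:]
```

Because both parameter vectors are fitted together, the source estimate is pulled toward the target while the target borrows strength from the larger source sample, in contrast to a two-step scheme that first fixes a pre-trained source estimate and then fine-tunes it.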


Unified Transfer Learning Models for High-Dimensional Linear Regression

arXiv.org Machine Learning

Transfer learning plays a key role in modern data analysis when (1) the target data are scarce but the source data are sufficient, and (2) the distributions of the source and target data are heterogeneous. This paper develops an interpretable unified transfer learning model, termed UTrans, which can detect both transferable variables and transferable source data. More specifically, we establish estimation error bounds and prove that they are lower than the bounds obtained with target data only. In addition, we propose a source detection algorithm based on hypothesis testing to exclude nontransferable data. We evaluate UTrans against existing algorithms in multiple experiments. UTrans attains much lower estimation and prediction errors than the existing methods while preserving interpretability. Finally, we apply it to US intergenerational mobility data and compare our proposed algorithms with classical machine learning algorithms.


Penalised regression with multiple sources of prior effects

arXiv.org Machine Learning

In many high-dimensional prediction or classification tasks, complementary data on the features are available, e.g. prior biological knowledge on (epi)genetic markers. Here we consider tasks with numerical prior information that provides insight into the importance (weight) and the direction (sign) of the feature effects, e.g. regression coefficients from previous studies. We propose an approach for integrating multiple sources of such prior information into penalised regression. If suitable co-data are available, this improves the predictive performance, as shown by simulation and application. The proposed method is implemented in the R package 'transreg' (https://github.com/lcsb-bds/transreg).


RaSE: Random Subspace Ensemble Classification

arXiv.org Machine Learning

We propose a flexible ensemble classification framework, Random Subspace Ensemble (RaSE), for sparse classification. In the RaSE algorithm, we aggregate many weak learners, where each weak learner is a base classifier trained in a subspace optimally selected from a collection of random subspaces. To conduct subspace selection, we propose a new criterion, the ratio information criterion (RIC), based on weighted Kullback-Leibler divergence. The theoretical analysis covers the risk and Monte-Carlo variance of the RaSE classifier, establishes the screening consistency and weak consistency of RIC, and provides an upper bound for the misclassification rate of the RaSE classifier. In addition, we show that in a high-dimensional framework, the number of random subspaces needs to be very large to guarantee that a subspace covering the signals is selected. Therefore, we propose an iterative version of the RaSE algorithm and prove that, under some specific conditions, a smaller number of generated random subspaces is needed to find a desirable subspace through iteration. An array of simulations under various models and real-data applications demonstrates the effectiveness and robustness of the RaSE classifier and its iterative version in terms of low misclassification rate and accurate feature ranking. The RaSE algorithm is implemented in the R package RaSEn on CRAN.
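The aggregation scheme described in this abstract can be sketched roughly as follows. This is an illustration under several assumptions, not the RaSEn implementation: the base classifier is a nearest-centroid rule, hold-out error stands in for RIC as the subspace selection criterion, the vote threshold is fixed at 0.5, and both classes are assumed to appear in each data split. All function and parameter names are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    # Base classifier: class means for labels 0 and 1.
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict_centroids(model, X):
    m0, m1 = model
    d0 = ((X - m0) ** 2).sum(axis=1)
    d1 = ((X - m1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def rase_sketch(X, y, n_learners=50, n_candidates=20, max_dim=3, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    perm = rng.permutation(n)
    tr, va = perm[: n // 2], perm[n // 2:]
    learners = []
    for _ in range(n_learners):
        best_err, best_S = None, None
        for _ in range(n_candidates):
            # Draw a random subspace of random size between 1 and max_dim.
            d = rng.integers(1, max_dim + 1)
            S = rng.choice(p, size=d, replace=False)
            model = fit_centroids(X[tr][:, S], y[tr])
            # Hold-out error: a stand-in for RIC (assumption).
            err = np.mean(predict_centroids(model, X[va][:, S]) != y[va])
            if best_err is None or err < best_err:
                best_err, best_S = err, S
        # Refit the base classifier on all data in the selected subspace.
        learners.append((best_S, fit_centroids(X[:, best_S], y)))
    return learners

def rase_predict(learners, X):
    # Average the weak learners' votes and threshold at 0.5.
    votes = np.mean([predict_centroids(m, X[:, S]) for S, m in learners], axis=0)
    return (votes > 0.5).astype(int)

def selection_frequency(learners, p):
    # Fraction of weak learners whose subspace contains each feature.
    freq = np.zeros(p)
    for S, _ in learners:
        freq[S] += 1
    return freq / len(learners)
```

The selection frequencies give a simple feature ranking: features that carry signal should appear in the chosen subspaces more often than noise features.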