Zeng, Peng
Statistical Inference in Classification of High-dimensional Gaussian Mixture
Huang, Hanwen, Zeng, Peng
We consider the classification problem of a high-dimensional mixture of two Gaussians with general covariance matrices. Using the replica method from statistical physics, we investigate the asymptotic behavior of a general class of regularized convex classifiers in the high-dimensional limit, where both the sample size $n$ and the dimension $p$ approach infinity while their ratio $\alpha=n/p$ remains fixed. Our focus is on the generalization error and variable selection properties of the estimators. Specifically, based on the distributional limit of the classifier, we construct a de-biased estimator to perform variable selection through an appropriate hypothesis testing procedure. Using $L_1$-regularized logistic regression as an example, we conduct extensive computational experiments to confirm that our analytical findings are consistent with numerical simulations in finite-sized systems. We also explore the influence of the covariance structure on the performance of the de-biased estimator.
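A minimal Python sketch of the simulation setup described above: classify a two-component Gaussian mixture with $L_1$-regularized logistic regression at a fixed ratio $\alpha=n/p$. The identity covariance, sparse mean vector, and the values of p, alpha, and C below are illustrative assumptions, not the paper's settings, and the de-biasing and hypothesis-testing steps are not reproduced here.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p, alpha = 200, 2.0              # dimension and sample-to-dimension ratio
n = int(alpha * p)               # sample size n = alpha * p

# Sparse mean difference: only the first 10 coordinates are informative.
mu = np.zeros(p)
mu[:10] = 1.0

def sample(n_samples):
    """Draw labels y in {-1, +1} and features x ~ N(y * mu, I_p)."""
    y = rng.choice([-1.0, 1.0], size=n_samples)
    x = y[:, None] * mu[None, :] + rng.standard_normal((n_samples, p))
    return x, y

X_train, y_train = sample(n)
X_test, y_test = sample(10 * n)

# L1-regularized logistic regression (C is the inverse penalty strength).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)

print("test error:", np.mean(clf.predict(X_test) != y_test))
print("nonzero coefficients:", int(np.sum(clf.coef_ != 0)))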
Differentiable Genetic Programming for High-dimensional Symbolic Regression
Zeng, Peng, Song, Xiaotian, Lensen, Andrew, Ou, Yuwei, Sun, Yanan, Zhang, Mengjie, Lv, Jiancheng
Symbolic regression (SR) is the process of discovering hidden relationships in data in the form of mathematical expressions, and is considered an effective route to interpretable machine learning (ML). Genetic programming (GP) has been the dominant approach to solving SR problems. However, as the scale of SR problems increases, GP often performs poorly and cannot effectively address real-world high-dimensional problems. This limitation is mainly caused by the stochastic evolutionary nature of traditional GP in constructing trees. In this paper, we propose DGP, the first differentiable approach to constructing GP trees for high-dimensional SR. Specifically, a new data structure called the differentiable symbolic tree is proposed to relax the discrete tree structure into a continuous one, so that a gradient-based optimizer can be applied for efficient optimization. In addition, a sampling method is proposed to eliminate the discrepancy introduced by this relaxation and recover valid symbolic expressions. Furthermore, a diversification mechanism is introduced to help the optimizer escape local optima and reach globally better solutions. With these designs, the proposed DGP method can efficiently search for high-performing GP trees and is thus capable of handling high-dimensional SR. To demonstrate the effectiveness of DGP, we conduct extensive experiments against state-of-the-art methods based on both GP and deep neural networks. The experimental results show that DGP outperforms the chosen peer competitors on high-dimensional regression benchmarks with dimensions ranging from tens to thousands. In addition, on synthetic SR problems, the proposed DGP method achieves the best recovery rate under different noise levels. We believe this work can help establish SR as a powerful tool for interpretable ML on a broader range of real-world problems.
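A minimal sketch of the continuous-relaxation idea only, under strong assumptions: a single hypothetical tree node selects among candidate operators through a softmax over learnable logits, so the selection can be trained by gradient descent and then discretized. The operator set, target expression, and hyperparameters are illustrative; the actual DGP method (full differentiable symbolic trees, the sampling scheme, and the diversification mechanism) is not reproduced here.

import torch

torch.manual_seed(0)
ops = [torch.add, torch.sub, torch.mul]      # candidate binary operators
logits = torch.zeros(len(ops), requires_grad=True)

x1 = torch.randn(256)
x2 = torch.randn(256)
target = x1 * x2                             # ground-truth expression to recover

opt = torch.optim.Adam([logits], lr=0.1)
for step in range(200):
    w = torch.softmax(logits, dim=0)         # relaxed (continuous) operator choice
    out = sum(wi * op(x1, x2) for wi, op in zip(w, ops))
    loss = torch.mean((out - target) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Discretize: keep the operator with the largest learned weight.
print("selected operator:", ["add", "sub", "mul"][torch.argmax(logits).item()])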
Sparse Principal Component Analysis via Variable Projection
Erichson, N. Benjamin, Zeng, Peng, Manohar, Krithika, Brunton, Steven L., Kutz, J. Nathan, Aravkin, Aleksandr Y.
Sparse principal component analysis (SPCA) has emerged as a powerful technique for modern data analysis. We discuss a robust and scalable algorithm for computing SPCA. Specifically, we model SPCA as a matrix factorization problem with orthogonality constraints, and develop specialized optimization algorithms that partially minimize over a subset of the variables (variable projection). The framework incorporates a wide variety of sparsity-inducing regularizers for SPCA. We also extend the variable projection approach to robust SPCA, for any robust loss that can be expressed as the Moreau envelope of a simple function, with the Huber loss as the canonical example. Finally, randomized methods for linear algebra are used to extend the approach to the large-scale (big data) setting. The proposed algorithms are demonstrated on both synthetic and real-world data.
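A simplified sketch of the alternating scheme the abstract describes, assuming the objective 0.5*||X - X B A^T||_F^2 + lam*||B||_1 with orthonormal A: for fixed B, the optimal A is obtained in closed form by an orthogonal Procrustes step (the variable projection), and B is then updated by a proximal gradient step whose prox is soft-thresholding. The warm start, step size, and iteration count are illustrative choices, not the paper's; the robust and randomized extensions are omitted.

import numpy as np

def spca_varpro(X, k, lam=0.1, n_iter=100):
    """Sketch of SPCA via variable projection: returns orthonormal A and sparse loadings B."""
    n, p = X.shape
    G = X.T @ X
    B = np.linalg.svd(X, full_matrices=False)[2][:k].T   # warm start: PCA loadings
    step = 1.0 / np.linalg.norm(G, 2)                    # 1/L with L = spectral norm of G
    for _ in range(n_iter):
        # Procrustes step: closed-form orthonormal A for fixed B.
        U, _, Vt = np.linalg.svd(G @ B, full_matrices=False)
        A = U @ Vt
        # Proximal gradient step on B; the prox of lam*||.||_1 is soft-thresholding.
        Z = B - step * (G @ B - G @ A)
        B = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)
    return A, B

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
A, B = spca_varpro(X, k=3, lam=0.5)
print("nonzeros per sparse loading:", (B != 0).sum(axis=0))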