best subset selection
Continuous Optimization for Offline Change Point Detection and Estimation
Reimann, Hans, Moka, Sarat, Sofronov, Georgy
Change point detection and estimation are an incredibly diverse and widely scattered field in applied and mathematical statistics, with a large variety of applications. To provide a high-level intuition, change point detection may be understood as a signal processing tool for identifying abrupt changes in the generative parameters of a data sequence. While a strong line of work in change point detection is well established with Page's pioneering work (see Page [1954]) and rigorous results by Chernoff and Zacks [1964], Lorden [1971] and Sen and Srivastava [1975], many aspects of this problem are open and the general understanding of good solutions depends strongly on the problem at hand Niu et al. [2016], Truong et al. [2020], and Ma et al. [2020]. Among the open research questions, the simultaneous detection of multiple change points in large data sets is of major interest. Taking a machine learning and data scientific motivated approach, in this paper, we explore the applicability of recent advances in best subset selection of covariates in linear regression proposed by Moka et al. [2024]. This method, a continuous optimization approach for best subset selection, claims to offer faster performance compared to existing exhaustive search methods, while maintaining comparable accuracy.
Cost-Sensitive Best Subset Selection for Logistic Regression: A Mixed-Integer Conic Optimization Perspective
A key challenge in machine learning is to design interpretable models that can reduce their inputs to the best subset for making transparent predictions, especially in the clinical domain. In this work, we propose a certifiably optimal feature selection procedure for logistic regression from a mixed-integer conic optimization perspective that can take an auxiliary cost to obtain features into account. Based on an extensive review of the literature, we carefully create a synthetic dataset generator for clinical prognostic model research. This allows us to systematically evaluate different heuristic and optimal cardinality- and budget-constrained feature selection procedures. The analysis shows key limitations of the methods for the low-data regime and when confronted with label noise. Our paper not only provides empirical recommendations for suitable methods and dataset designs, but also paves the way for future research in the area of meta-learning.
A Consistent and Scalable Algorithm for Best Subset Selection in Single Index Models
Tang, Borui, Zhu, Jin, Zhu, Junxian, Wang, Xueqin, Zhang, Heping
Analysis of high-dimensional data has led to increased interest in both single index models (SIMs) and best subset selection. SIMs provide an interpretable and flexible modeling framework for high-dimensional data, while best subset selection aims to find a sparse model from a large set of predictors. However, best subset selection in high-dimensional models is known to be computationally intractable. Existing methods tend to relax the selection, but do not yield the best subset solution. In this paper, we directly tackle the intractability by proposing the first provably scalable algorithm for best subset selection in high-dimensional SIMs. Our algorithmic solution enjoys the subset selection consistency and has the oracle property with a high probability. The algorithm comprises a generalized information criterion to determine the support size of the regression coefficients, eliminating the model selection tuning. Moreover, our method does not assume an error distribution or a specific link function and hence is flexible to apply. Extensive simulation results demonstrate that our method is not only computationally efficient but also able to exactly recover the best subset in various settings (e.g., linear regression, Poisson regression, heteroscedastic models).