Solving FDR-Controlled Sparse Regression Problems with Five Million Variables on a Laptop
Scheidt, Fabian; Machkour, Jasin; Muma, Michael
There is an urgent demand for scalable multivariate, high-dimensional false discovery rate (FDR)-controlling variable selection methods to ensure the reproducibility of discoveries. However, among existing methods, only the recently proposed Terminating-Random Experiments (T-Rex) selector scales to problems with millions of variables, as encountered in, e.g., genomics research. The T-Rex selector is a new learning framework based on early terminated random experiments with computer-generated dummy variables. In this work, we propose the Big T-Rex, a new implementation of T-Rex that drastically reduces its Random Access Memory (RAM) consumption and thereby makes it possible to solve FDR-controlled sparse regression problems with millions of variables on a laptop. We incorporate advanced memory-mapping techniques to work with matrices that reside on a solid-state drive, and we introduce two new dummy generation strategies based on permutations of a reference matrix. Our numerical experiments demonstrate a drastic reduction in memory demand and computation time. We show that the Big T-Rex can efficiently solve FDR-controlled Lasso-type problems with five million variables on a laptop in thirty minutes. Our work empowers researchers without access to high-performance clusters to make reproducible discoveries in large-scale high-dimensional data.
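To illustrate the two ingredients named in the abstract (matrices that live on disk via memory mapping, and dummies obtained by permuting a reference matrix), here is a minimal Python sketch. It is not the authors' implementation: the function name `write_permuted_dummies`, the use of NumPy's `np.memmap`, the choice of row permutations, and all sizes are assumptions made for illustration, since the abstract does not specify the exact permutation scheme.

```python
import os
import tempfile

import numpy as np


def write_permuted_dummies(reference, path, num_blocks, seed=0):
    """Fill a disk-backed matrix with row-permuted copies of `reference`.

    Hypothetical helper: the permutation scheme is an assumption,
    not the scheme used in the paper.
    """
    n, p = reference.shape
    rng = np.random.default_rng(seed)
    # np.memmap backs the array with a file; pages are loaded on demand,
    # so the full dummy matrix never has to fit in RAM at once.
    dummies = np.memmap(path, dtype=reference.dtype, mode="w+",
                        shape=(n, num_blocks * p))
    for b in range(num_blocks):
        perm = rng.permutation(n)  # a fresh row permutation per block
        dummies[:, b * p:(b + 1) * p] = reference[perm, :]
    dummies.flush()  # push pending writes to the file
    return dummies


rng = np.random.default_rng(42)
reference = rng.standard_normal((50, 20))  # small in-RAM reference matrix

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummies.bin")
    dummies = write_permuted_dummies(reference, path, num_blocks=4)
    shape = dummies.shape
    # Every dummy column holds the same values as a reference column,
    # only in a different order.
    same_values = np.allclose(np.sort(dummies[:, :20], axis=0),
                              np.sort(reference, axis=0))
    del dummies  # release the mapping before the directory is removed
```

Only the small reference matrix is generated from random draws; each additional block of dummies costs one permutation and a write to disk, which is the kind of saving in memory and computation that the abstract alludes to.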
Sep-27-2024
- Country:
- Europe
- Germany > Hesse
- Darmstadt Region > Darmstadt (0.05)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Genre:
- Research Report (0.64)