 false discovery rate


Online control of the false discovery rate with decaying memory

Neural Information Processing Systems

In the online multiple testing problem, p-values corresponding to different null hypotheses are presented one by one, and the decision of whether to reject a hypothesis must be made immediately, after which the next p-value is presented. Alpha-investing algorithms to control the false discovery rate were first formulated by Foster and Stine and have since been generalized and applied to various settings, ranging from quality-preserving databases for science to multiple A/B tests for internet commerce. This paper improves the class of generalized alpha-investing (GAI) algorithms in four ways: (a) we show how to uniformly improve the power of the entire class of GAI procedures under independence by awarding more alpha-wealth for each rejection, giving a near win-win resolution to a dilemma raised by Javanmard and Montanari; (b) we demonstrate how to incorporate prior weights to indicate domain knowledge of which hypotheses are likely to be null or non-null; (c) we allow for differing penalties for false discoveries to indicate that some hypotheses may be more meaningful or important than others; and (d) we define a new quantity called the decaying memory false discovery rate, or mem-FDR, that may be more meaningful for applications with an explicit time component, using a discount factor to incrementally forget past decisions and alleviate some potential problems that we describe and name "piggybacking" and "alpha-death". Our GAI++ algorithms incorporate all four generalizations (a, b, c, d) simultaneously, and reduce to more powerful variants of earlier algorithms when the weights and decay are all set to unity.
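As a rough illustration of the "decaying memory" idea in (d), the sketch below computes a discounted false discovery proportion for a finished sequence of decisions. It assumes one natural formalization, in which both the false-rejection count and the total-rejection count are multiplied by a discount factor at every step before the new decision is added. The function name `decayed_fdp`, its arguments, and the convention that the ratio is 0 when nothing has been rejected are illustrative choices, not taken from the abstract, and the sketch is not the GAI++ procedure itself.

```python
def decayed_fdp(rejected, is_null, delta=0.99):
    """Discounted false discovery proportion over a decision sequence.

    rejected[t] is truthy if hypothesis t was rejected; is_null[t] is truthy
    if hypothesis t was truly null. delta in (0, 1] is the discount factor
    (hypothetical name); delta = 1 gives the ordinary, undiscounted FDP.
    """
    false_rej = 0.0  # discounted count of false rejections
    total_rej = 0.0  # discounted count of all rejections
    for rej, null in zip(rejected, is_null):
        # Forget a fraction of the past, then add the current decision.
        false_rej = delta * false_rej + (1.0 if (rej and null) else 0.0)
        total_rej = delta * total_rej + (1.0 if rej else 0.0)
    # Assumed convention: the ratio is 0 when there are no rejections.
    return false_rej / total_rej if total_rej > 0 else 0.0


decisions = [1, 1, 0, 1]   # reject, reject, accept, reject
nulls = [0, 1, 0, 0]       # only the second hypothesis is truly null
plain = decayed_fdp(decisions, nulls, delta=1.0)    # 1 false out of 3 rejections
faded = decayed_fdp(decisions, nulls, delta=0.5)    # old mistakes count less
```

With `delta=0.5` the single false rejection at the second step is discounted twice by the end of the sequence, so it weighs less than the most recent (true) rejection, which is the "forgetting" behaviour the abstract describes.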








Learning False Discovery Rate Control via Model-Based Neural Networks

Vilella, Arnau, Machkour, Jasin, Muma, Michael, Palomar, Daniel P.

arXiv.org Machine Learning

Controlling the false discovery rate (FDR) in high-dimensional variable selection requires balancing rigorous error control with statistical power. Existing methods with provable guarantees are often overly conservative, creating a persistent gap between the realized false discovery proportion (FDP) and the target FDR level. We introduce a learning-augmented enhancement of the T-Rex Selector framework that narrows this gap. Our approach replaces the analytical FDP estimator with a neural network trained solely on diverse synthetic datasets, enabling a substantially tighter and more accurate approximation of the FDP. This refinement allows the procedure to operate much closer to the desired FDR level, thereby increasing discovery power while maintaining effective approximate control. Through extensive simulations and a challenging synthetic genome-wide association study (GWAS), we demonstrate that our method achieves superior detection of true variables compared to existing approaches.
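To make the distinction between the realized FDP and the target FDR concrete, here is a minimal sketch that computes the false discovery proportion and the true positive proportion of one selected variable set in a simulation where the true support is known. The helper name `fdp_and_tpp` and its arguments are hypothetical and are not part of the T-Rex Selector; the FDR is the expectation of this FDP over repeated experiments.

```python
def fdp_and_tpp(selected, true_support):
    """Return (FDP, TPP) for one selection, given the known true support.

    FDP = (# selected nulls) / max(1, # selected)
    TPP = (# selected true variables) / max(1, # true variables)
    """
    selected = set(selected)
    true_support = set(true_support)
    false_discoveries = len(selected - true_support)
    true_discoveries = len(selected & true_support)
    fdp = false_discoveries / max(1, len(selected))
    tpp = true_discoveries / max(1, len(true_support))
    return fdp, tpp


# Variables 0..4 are truly active; the procedure selected 0, 1, 2 and 7.
fdp, tpp = fdp_and_tpp(selected=[0, 1, 2, 7], true_support=[0, 1, 2, 3, 4])
```

A conservative procedure in the sense of the abstract would show an average `fdp` well below the target level across many simulated datasets, at the cost of a lower `tpp`.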


Conformal novelty detection with false discovery rate control at the boundary

Gao, Zijun, Roquain, Etienne, Xiang, Daniel

arXiv.org Machine Learning

Conformal novelty detection is a classical machine learning task for which uncertainty quantification is essential for providing reliable results. Recent work has shown that the BH procedure applied to conformal p-values controls the false discovery rate (FDR). Unfortunately, the BH procedure can lead to over-optimistic assessments near the rejection threshold, with an increase of false discoveries at the margin, as pointed out by Soloff et al. (2024). This issue is solved therein by the support line (SL) correction, which is proven to control the boundary false discovery rate (bFDR) in the independent, non-conformal setting. The present work extends the SL method to the conformal setting. First, we show that the SL procedure can violate bFDR control in this specific setting. Second, we propose several alternatives that provably control the bFDR in the conformal setting. Finally, numerical experiments with both synthetic and real data support our theoretical findings and show the relevance of the newly proposed procedures.
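The baseline that this abstract starts from, the BH procedure applied to conformal p-values, can be sketched as follows. The sketch assumes that a higher score means "more novel" and that calibration scores come from known inliers; the function names are illustrative, and the SL correction and the bFDR-controlling alternatives proposed in the paper are not implemented here.

```python
def conformal_pvalues(calib_scores, test_scores):
    """Conformal p-value of each test point against inlier calibration scores.

    Assumes higher score = more novel, so a test point is suspicious when
    few calibration scores are at least as large as its own.
    """
    n = len(calib_scores)
    return [(1 + sum(c >= s for c in calib_scores)) / (n + 1)
            for s in test_scores]


def benjamini_hochberg(pvals, alpha=0.1):
    """Indices rejected by the BH step-up procedure at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under the BH line.
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


calib = list(range(1, 10))           # nine inlier scores: 1, 2, ..., 9
test_points = [10, 9.5, 0]           # two extreme scores, one typical
pvals = conformal_pvalues(calib, test_points)
novelties = benjamini_hochberg(pvals, alpha=0.2)
```

In this toy run the two extreme test points receive the smallest attainable p-value, 1/(n+1), and both are flagged. The second rejection is exactly the kind of decision made at the rejection threshold, which is the region the bFDR is designed to scrutinize.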


Deep Direct Likelihood Knockoffs

Neural Information Processing Systems

Predictive modeling often uses black box machine learning methods, such as deep neural networks, to achieve state-of-the-art performance. In scientific domains, the scientist often wishes to discover which features are actually important for making the predictions. These discoveries may lead to costly follow-up experiments and as such it is important that the error rate on discoveries is not too high. Model-X knockoffs enable important features to be discovered with control of the false discovery rate (FDR). However, knockoffs require rich generative models capable of accurately modeling the knockoff features while ensuring they obey the so-called swap property. We develop Deep Direct Likelihood Knockoffs (DDLK), which directly minimizes the KL divergence implied by the knockoff swap property. DDLK consists of two stages: it first maximizes the explicit likelihood of the features, then minimizes the KL divergence between the joint distribution of features and knockoffs and any swap between them. To ensure that the generated knockoffs are valid under any possible swap, DDLK uses the Gumbel-Softmax trick to optimize the knockoff generator under the worst-case swap. We find DDLK has higher power than baselines while controlling the false discovery rate on a variety of synthetic and real benchmarks including a task involving the largest COVID-19 health record dataset in the United States.
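For context on how knockoffs yield FDR control once they have been generated, the sketch below applies the standard knockoff+ selection rule to a vector of feature statistics W, where a large positive W_j means the real feature looked more important than its knockoff. This is the generic filter step from the knockoffs literature, not the DDLK generator described above; how W is computed is left open, and the function name is illustrative.

```python
def knockoff_plus_select(W, q=0.2):
    """Select features with the knockoff+ threshold at target FDR level q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns the indices j with W_j >= t, or [] if no threshold qualifies.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        neg = sum(1 for w in W if w <= -t)   # knockoff beat the real feature
        pos = sum(1 for w in W if w >= t)    # real feature beat its knockoff
        if (1 + neg) / max(1, pos) <= q:
            return [j for j, w in enumerate(W) if w >= t]
    return []


# Hypothetical statistics for ten features; only index 5 is negative.
W = [5, 4, 3, 2.5, 2, -1, 0.5, 1.5, 6, 7]
selected = knockoff_plus_select(W, q=0.2)
```

Negative statistics act as an estimate of the number of false positives among the positive ones, which is why the validity of the knockoffs (the swap property) matters: it is what makes the signs of null statistics behave like fair coin flips.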