r2VIM: A new variable selection method for random forests in genome-wide association studies
In the last few years, more than one thousand single-nucleotide polymorphisms (SNPs) have been reproducibly associated with more than two hundred phenotypes and quantitative traits in genome-wide association studies (GWAS) [1]. These loci are usually identified by linear or logistic regression analysis which is performed separately for each SNP. The resulting p-values are then used to rank the SNPs and to select those with a p-value smaller than a pre-specified significance level which is adjusted for the large number of statistical tests. In such a scenario, comparable to analyses of other genomic data sets such as gene expression, p-values are not used in a confirmatory setting but rather as a screening tool to identify associated, i.e. important, SNPs while controlling the number of false positive findings. Nonparametric, model-free statistical learning machines provide a promising alternative to classical, model-based statistical methods for the selection of important variables in high dimensional data sets.
Feb-10-2017, 18:51:06 GMT