Finding Significant Combinations of Continuous Features

Sugiyama, Mahito, Borgwardt, Karsten M.

arXiv.org Machine Learning 

Feature selection is relevant in a broad range of applications, including natural language processing, statistical genetics, and healthcare. To date, this problem (Guyon and Elisseeff, 2003) has been extensively studied in machine learning, including recent advances in selective inference (Taylor and Tibshirani, 2015), a technique that can assess the statistical significance of features selected by linear models such as the Lasso (Lee et al., 2016). However, current approaches have a crucial limitation: they can only find single features or linear combinations of features, and it remains an open problem to find patterns, that is, combinations of features with a multiplicative effect. A relevant line of research towards this goal is significant pattern mining (Llinares-López et al., 2015; Papaxanthos et al., 2016; Terada et al., 2013), which tries to find statistically associated feature combinations while controlling the family-wise error rate (FWER), that is, the probability of detecting one or more false positive patterns. However, all existing methods for significant pattern mining apply only to combinations of binary or discrete features, and none of them can handle real-valued data, although such data is common in many applications. If we binarize the data beforehand in order to use significant pattern mining approaches, the resulting binarization-based method cannot distinguish correlated from uncorrelated features (see Figure 1 for an example). Subgroup discovery (Atzmueller, 2015; Herrera et al., 2011; Novak et al., 2009) has the same goal of finding associated feature combinations, but existing methods are also designed for discrete data, which means that binarization is required (Grosskreutz and Rüping, 2009) for real-valued data, and the above problem remains.
To date, there is no method that can find all combinations of continuous features that are significantly associated with an output variable and that accounts for the inherent multiple testing problem.
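
The binarization issue mentioned above can be illustrated with a small toy example. The sketch below is not the paper's Figure 1 or its method: the two datasets, the fixed threshold of 0.5, and the helper names `pearson` and `binarize` are all assumptions chosen for illustration. It builds one strongly correlated and one uncorrelated pair of continuous features that yield exactly the same binarized co-occurrence counts, so any method that sees only the binarized data cannot tell them apart.

```python
from collections import Counter
from math import sqrt


def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def binarize(points, threshold=0.5):
    """Map each continuous feature to 1 if it exceeds the threshold, else 0.

    The threshold 0.5 is an illustrative assumption, not a recommendation.
    """
    return [(int(x > threshold), int(y > threshold)) for x, y in points]


# Hypothetical dataset A: the two features are strongly correlated.
# Extreme points lie on the diagonal; off-diagonal points sit near the center.
correlated = [
    (0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0),
    (0.45, 0.55), (0.48, 0.52), (0.55, 0.45), (0.52, 0.48),
]

# Hypothetical dataset B: the two features are uncorrelated (symmetric grid).
uncorrelated = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)] * 2

# Co-occurrence counts of the binarized features, one entry per quadrant.
counts_a = Counter(binarize(correlated))
counts_b = Counter(binarize(uncorrelated))

print("correlation A:", pearson(correlated))
print("correlation B:", pearson(uncorrelated))
print("binarized counts identical:", counts_a == counts_b)
```

Both datasets place two points in each quadrant after thresholding, so the support of every binary pattern is the same, even though the continuous correlations differ substantially.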
