Goto


Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing

arXiv.org Machine Learning

We present a novel algorithm, Westfall-Young light, for detecting patterns, such as itemsets and subgraphs, which are statistically significantly enriched in one of two classes. Our method corrects rigorously for multiple hypothesis testing and correlations between patterns through the Westfall-Young permutation procedure, which empirically estimates the null distribution of pattern frequencies in each class via permutations. In our experiments, Westfall-Young light dramatically outperforms the current state-of-the-art approach in terms of both runtime and memory efficiency on popular real-world benchmark datasets for pattern mining. The key to this efficiency is that unlike all existing methods, our algorithm neither needs to solve the underlying frequent itemset mining problem anew for each permutation nor needs to store the occurrence list of all frequent patterns. Westfall-Young light opens the door to significant pattern mining on large datasets that previously led to prohibitive runtime or memory costs.
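
The permutation procedure itself is straightforward to sketch. The snippet below is a minimal, naive illustration of Westfall-Young permutation testing over a fixed set of patterns (not of the Westfall-Young light algorithm itself), assuming a precomputed binary occurrence matrix and Fisher's exact test as the per-pattern test; the function name and inputs are illustrative placeholders.

```python
import numpy as np
from scipy.stats import fisher_exact

def westfall_young_threshold(occurrence, labels, n_perm=1000, alpha=0.05, seed=0):
    """Estimate a family-wise-error-corrected significance threshold by
    permuting class labels and recording the smallest p-value per permutation.

    occurrence: (n_samples, n_patterns) binary matrix, 1 if the pattern occurs
    in the sample. labels: binary class labels of length n_samples.
    """
    rng = np.random.default_rng(seed)
    occurrence = np.asarray(occurrence, dtype=bool)
    labels = np.asarray(labels)
    min_pvals = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(labels)            # shuffle class labels
        pos = perm == 1
        best = 1.0
        for j in range(occurrence.shape[1]):
            occ = occurrence[:, j]
            # 2x2 table: pattern occurrence vs. permuted class label
            table = [[int(np.sum(occ & pos)), int(np.sum(occ & ~pos))],
                     [int(np.sum(~occ & pos)), int(np.sum(~occ & ~pos))]]
            best = min(best, fisher_exact(table)[1])
        min_pvals[b] = best                       # smallest p-value under this permutation
    # The alpha-quantile of the minimum p-values is the corrected threshold:
    # any pattern with a p-value below it is significant at FWER level alpha.
    return np.quantile(min_pvals, alpha)
```

Note that this naive version stores the full occurrence matrix and re-tests every pattern for every permutation; avoiding exactly that repeated work and storage is what the abstract credits for the runtime and memory gains of Westfall-Young light.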


Finding significant combinations of features in the presence of categorical covariates

Neural Information Processing Systems

In high-dimensional settings, where the number of features p is typically much larger than the number of samples n, methods that can systematically examine arbitrary combinations of features, a huge 2^p-dimensional space, have recently begun to be explored. However, none of the current methods can assess the association between feature combinations and a target variable while conditioning on a categorical covariate, in order to correct for potential confounding effects. We propose the Fast Automatic Conditional Search (FACS) algorithm, a significant discriminative itemset mining method which conditions on categorical covariates and scales only as O(k log k), where k is the number of states of the categorical covariate. FACS builds on the Cochran-Mantel-Haenszel test and demonstrates superior speed and statistical power on simulated and real-world datasets compared to the state of the art, opening the door to numerous applications in biomedicine.
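
For reference, the Cochran-Mantel-Haenszel statistic that FACS builds on can be computed directly from per-stratum 2x2 contingency tables. The following is a sketch of the plain CMH test (without continuity correction), not of the FACS search itself; the function name and input layout are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

def cmh_pvalue(tables):
    """Cochran-Mantel-Haenszel test across k strata of 2x2 tables.

    tables: array of shape (k, 2, 2); tables[h] counts feature-combination
    occurrence (rows) vs. class label (columns) within covariate state h.
    Returns the p-value of the CMH chi-squared statistic (1 degree of freedom).
    """
    tables = np.asarray(tables, dtype=float)
    a = tables[:, 0, 0]                      # observed (1, 1) cell per stratum
    row1 = tables[:, 0, :].sum(axis=1)       # row totals
    col1 = tables[:, :, 0].sum(axis=1)       # column totals
    n = tables.sum(axis=(1, 2))              # stratum totals (assumed > 1)
    expected = row1 * col1 / n
    var = row1 * (n - row1) * col1 * (n - col1) / (n ** 2 * (n - 1))
    stat = (a.sum() - expected.sum()) ** 2 / var.sum()
    return chi2.sf(stat, df=1)
```

Roughly, one such table is built per covariate state for a candidate feature combination versus the target, so the resulting p-value is conditioned on the covariate rather than pooled over it.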


Identifying Higher-order Combinations of Binary Features

arXiv.org Machine Learning

Finding statistically significant interactions between binary variables is computationally and statistically challenging in high-dimensional settings, due to the combinatorial explosion in the number of hypotheses. Terada et al. recently showed how to elegantly address this multiple testing problem by excluding non-testable hypotheses. Still, it remains unclear how their approach scales to large datasets. Here we propose strategies to speed up the approach by Terada et al. and evaluate them thoroughly on 11 real-world benchmark datasets. We observe that one approach, incremental search with early stopping, is orders of magnitude faster than the current state-of-the-art approach.
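
To make the idea of excluding non-testable hypotheses concrete, the sketch below illustrates Tarone-style testability with an increasing support threshold and an early stop, in the spirit of the incremental search mentioned above. It is a simplified illustration, not the exact strategies evaluated in the paper: pattern supports are assumed precomputed and not to exceed the minority class size.

```python
import numpy as np
from scipy.stats import hypergeom

def min_attainable_pvalue(x, n, n1):
    """Smallest Fisher exact-test p-value a pattern of support x can attain,
    given n samples of which n1 belong to the smaller class (Tarone's psi)."""
    a = min(x, n1)                       # most extreme table: all occurrences in one class
    return hypergeom.pmf(a, n, n1, x)

def testability_threshold(supports, n, n1, alpha=0.05):
    """Raise the support threshold sigma until m(sigma) * psi(sigma) <= alpha,
    where m(sigma) is the number of candidate patterns with support >= sigma.

    supports: supports of the candidate patterns (assumed precomputed here;
    in practice they come from a frequent pattern miner).
    Returns (sigma, corrected level), or (None, None) if nothing is testable.
    """
    supports = np.asarray(supports)
    for sigma in range(1, n1 + 1):
        m = int(np.sum(supports >= sigma))    # patterns still testable at this sigma
        if m == 0:
            return None, None
        if m * min_attainable_pvalue(sigma, n, n1) <= alpha:
            return sigma, alpha / m           # early stop: Bonferroni over testable patterns only
    return None, None
```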


Graph Classification Based on Skeleton and Component Features

arXiv.org Artificial Intelligence

In these areas, data can usually be represented as labeled graphs. For example, in bioinformatics, a protein molecule can be represented as a graph whose nodes correspond to atoms and whose edges indicate whether chemical bonds exist between atoms. Graphs are assigned different labels depending on whether they have a specific function. For this classification task, we usually rely on the common assumption that protein molecules with similar structure have similar functional properties. More recently, there has been a surge of approaches that seek to learn representations or embeddings that encode features of the graphs and then perform classification. The idea behind these learning approaches is to represent graph structure by learning a mapping that embeds nodes or entire (sub)graphs into a low-dimensional vector. Most of these methods fall into two categories: (1) neural network methods [4], which learn the large-scale structure of the target graph, and (2) kernel methods [5], which learn small-scale structures of the target graph. Different graph structures imply different features.
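
As a toy illustration of the graph representation described here (labeled nodes for atoms, edges for bonds, a class label per graph) and of mapping a graph to a fixed-length vector, the following sketch uses networkx and a crude degree-histogram embedding; it is purely illustrative and not the method proposed in the paper.

```python
import numpy as np
import networkx as nx

# Toy molecular graph: nodes carry atom labels, edges indicate chemical bonds,
# and the whole graph carries a class label (e.g. has a given function or not).
g = nx.Graph(label=1)
g.add_nodes_from([(0, {"atom": "C"}), (1, {"atom": "O"}),
                  (2, {"atom": "H"}), (3, {"atom": "H"})])
g.add_edges_from([(0, 1), (0, 2), (0, 3)])

def degree_histogram_embedding(graph, max_degree=5):
    """Embed a graph as a fixed-length vector of node-degree counts --
    a crude stand-in for the learned embeddings discussed in the text."""
    vec = np.zeros(max_degree + 1)
    for _, d in graph.degree():
        vec[min(d, max_degree)] += 1
    return vec

print(degree_histogram_embedding(g))   # [0. 3. 0. 1. 0. 0.]
```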


Finding Significant Combinations of Continuous Features

arXiv.org Machine Learning

We present an efficient feature selection method that can find all multiplicative combinations of continuous features that are statistically significantly associated with the class variable, while rigorously correcting for multiple testing. The key to overcoming the combinatorial explosion in the number of candidates is to derive a lower bound on the $p$-value for each feature combination, which enables us to prune the vast majority of combinations that can never be significant and thereby gain statistical power. While this problem has been addressed for binary features in the past, we here present the first solution for continuous features. In our experiments, our novel approach detects true feature combinations with higher precision and recall than competing methods that require a prior binarization of the data.
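
The pruning idea fits a simple depth-first search over feature combinations: if a lower bound on the $p$-value of a combination already exceeds the corrected threshold, the whole branch can be skipped. The sketch below shows only this generic search skeleton; the actual bound derived in the paper for multiplicative combinations of continuous features is not reproduced, so `pvalue`, `pvalue_lower_bound`, and `delta` are caller-supplied placeholders, and the bound is assumed to hold for a combination and all of its supersets, which is what makes the pruning valid.

```python
def significant_combinations(features, pvalue, pvalue_lower_bound, delta):
    """Depth-first enumeration of feature combinations with lower-bound pruning.

    pvalue(combo)            -> exact p-value of the combination `combo`.
    pvalue_lower_bound(combo)-> a bound no larger than the p-value of `combo`
                                and of any of its supersets (assumption).
    delta                    -> multiple-testing-corrected significance threshold.
    """
    significant = []

    def expand(combo, start):
        if combo:
            if pvalue_lower_bound(combo) > delta:
                return                        # no superset can be significant: prune branch
            if pvalue(combo) <= delta:
                significant.append(tuple(combo))
        for i in range(start, len(features)):
            expand(combo + [features[i]], i + 1)

    expand([], 0)
    return significant
```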