A broad spectrum of data from different modalities is generated in the healthcare domain every day, including scalar data (e.g., clinical measures collected at hospitals), tensor data (e.g., neuroimages analyzed by research institutes), graph data (e.g., brain connectivity networks), and sequence data (e.g., digital footprints recorded on smart sensors). The ability to model information from these heterogeneous data sources is potentially transformative for investigating disease mechanisms and informing therapeutic interventions. The work in this thesis aims to facilitate healthcare applications in the setting of broad learning, which focuses on fusing heterogeneous data sources for a variety of synergistic knowledge discovery and machine learning tasks. We are broadly interested in computer-aided diagnosis, precision medicine, and mobile health, which we approach by creating accurate user profiles that include important biomarkers, brain connectivity patterns, and latent representations. In particular, our work involves four data mining problems with applications to the healthcare domain: multi-view feature selection, subgraph pattern mining, brain network embedding, and multi-view sequence prediction.
Mining discriminative features for graph data has attracted much attention in recent years due to its important role in constructing graph classifiers, generating graph indices, etc. Most measures of interestingness for discriminative subgraph features are defined on certain graphs, where the structure of each graph object is certain and the binary edges within each graph represent the "presence" of linkages among the nodes. In many real-world applications, however, the linkage structure of the graphs is inherently uncertain. Therefore, existing measures of interestingness based upon certain graphs are unable to capture the structural uncertainty in these applications effectively. In this paper, we study the problem of discriminative subgraph feature selection from uncertain graphs. This problem is challenging and differs from conventional subgraph mining problems because both the structure of the graph objects and the discrimination score of each subgraph feature are uncertain. To address these challenges, we propose a novel discriminative subgraph feature selection method, DUG, which finds discriminative subgraph features in uncertain graphs based upon different statistical measures, including expectation, median, mode and phi-probability. We first compute the probability distribution of the discrimination scores for each subgraph feature using dynamic programming. We then propose a branch-and-bound algorithm to search for discriminative subgraphs efficiently. Extensive experiments on various neuroimaging applications (i.e., Alzheimer's Disease, ADHD and HIV) have been performed to analyze the gain in performance obtained by taking structural uncertainties into account when identifying discriminative subgraph features for graph classification.
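The dynamic-programming step mentioned above can be illustrated with a small sketch. It is not the DUG implementation: it only computes the distribution of how many uncertain graphs contain a given subgraph, which is one plausible building block of a discrimination score. It assumes that edges exist independently with known probabilities, that nodes are aligned across graphs (as with brain regions), and that a subgraph is contained in a graph exactly when all of its edges are present. The function names and the toy data are hypothetical.

```python
from math import prod

def containment_probability(edge_probs, subgraph_edges):
    """P(all edges of the subgraph exist) in one uncertain graph.

    Assumes independent edges; an edge missing from edge_probs has probability 0.
    """
    return prod(edge_probs.get(edge, 0.0) for edge in subgraph_edges)

def support_distribution(probs):
    """Return dist where dist[k] = P(exactly k graphs contain the subgraph).

    Standard Poisson-binomial dynamic program: graphs are added one at a
    time, and each either contains the subgraph (prob p) or does not.
    """
    dist = [1.0]  # with zero graphs processed, the support is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this graph does not contain it
            nxt[k + 1] += mass * p       # this graph contains it
        dist = nxt
    return dist

def expected_support(dist):
    """Expectation of the support count under the computed distribution."""
    return sum(k * mass for k, mass in enumerate(dist))

# Toy data (hypothetical): two uncertain graphs over the same node set.
graphs = [
    {("a", "b"): 0.9, ("b", "c"): 0.5},
    {("a", "b"): 0.4, ("b", "c"): 1.0},
]
pattern = [("a", "b"), ("b", "c")]

probs = [containment_probability(g, pattern) for g in graphs]
dist = support_distribution(probs)
```

Statistics such as the expectation, median, or mode can then be read directly off `dist`; a full discrimination score would combine such distributions across the positive and negative classes.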
Wu, Jia (University of Technology, Sydney) | Pan, Shirui (University of Technology, Sydney) | Zhu, Xingquan (Florida Atlantic University) | Cai, Zhihua (China University of Geosciences, Wuhan) | Zhang, Chengqi (University of Technology, Sydney)
In this paper, we propose a method to represent and classify complicated objects. To represent the objects, we propose a multi-graph-view model, which uses graphs constructed from multiple graph-views to represent an object. In addition, a bag-based multi-graph model is used to relax the labeling requirement by requiring only one label for a bag of graphs, which together represent one object. To learn classification models, we propose a multi-graph-view bag learning algorithm (MGVBL), which aims to explore subgraph features from multiple graph-views for learning. By enabling a joint regularization across multiple graph-views and enforcing labeling constraints at the bag and graph levels, MGVBL is able to discover the most effective subgraph features across all graph-views for learning. Experiments on real-world learning tasks demonstrate the performance of MGVBL for complicated object classification.
Graph representation learning has attracted increasing research attention. However, most existing studies fuse all structural features and node attributes to provide an overarching view of graphs, neglecting the semantics of finer substructures and suffering from poor interpretability. This paper presents a novel hierarchical subgraph-level selection and embedding based graph neural network for graph classification, namely SUGAR, to learn more discriminative subgraph representations and respond in an explanatory way. SUGAR reconstructs a sketched graph by extracting striking subgraphs as the representative part of the original graph to reveal subgraph-level patterns. To adaptively select striking subgraphs without prior knowledge, we develop a reinforcement pooling mechanism, which improves the generalization ability of the model. To differentiate subgraph representations among graphs, we present a self-supervised mutual information mechanism that encourages subgraph embeddings to be mindful of the global graph structural properties by maximizing their mutual information. Extensive experiments on six typical bioinformatics datasets demonstrate a significant and consistent improvement in model quality with competitive performance and interpretability.
In this work we propose gRegress, a new algorithm which, given a set of labeled graphs and a real value associated with each graph, extracts the complete set of subgraphs such that a) each subgraph in this set has a correlation with the real value above a user-specified threshold and b) each subgraph in this set has a correlation with any other subgraph in the set below a user-specified threshold. gRegress incorporates novel pruning mechanisms based on the correlation of a subgraph feature with the output and its correlation with other subgraph features. These pruning mechanisms lead to significant speedup. Experimental results indicate that, in terms of runtime, gRegress substantially outperforms gSpan, often by an order of magnitude, while the regression models produced by both approaches have comparable accuracy.
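The two correlation constraints above can be illustrated with a small sketch. It is not gRegress itself: it does not search the subgraph space or prune during mining, and it uses a simple greedy pass rather than the paper's extraction procedure. It assumes the candidate subgraphs have already been enumerated and encoded as 0/1 occurrence vectors (one entry per graph), and that "correlation" means the absolute Pearson correlation. The function names, thresholds, and toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def select_features(features, target, min_target_corr, max_pair_corr):
    """Greedily keep features that satisfy both correlation constraints.

    a) |corr(feature, target)| must exceed min_target_corr;
    b) |corr(feature, kept feature)| must stay below max_pair_corr.
    Features are visited from strongest to weakest target correlation.
    """
    ranked = sorted(
        features,
        key=lambda name: (-abs(pearson(features[name], target)), name),
    )
    selected = []
    for name in ranked:
        if abs(pearson(features[name], target)) <= min_target_corr:
            break  # all remaining features are weaker still
        if all(
            abs(pearson(features[name], features[kept])) < max_pair_corr
            for kept in selected
        ):
            selected.append(name)
    return selected

# Toy data (hypothetical): occurrence of three subgraphs in four graphs.
target = [1.0, 2.0, 3.0, 4.0]
features = {
    "g1": [0, 0, 1, 1],
    "g2": [0, 0, 1, 1],  # duplicate of g1, so it is redundant
    "g3": [1, 0, 1, 0],
}

selected = select_features(features, target, min_target_corr=0.4, max_pair_corr=0.8)
```

In this toy run, `g2` is dropped because it is perfectly correlated with the already selected `g1`, while `g3` is kept because it is uncorrelated with `g1` and still clears the target threshold.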