Wang, Minjie
Thresholded Graphical Lasso Adjusts for Latent Variables: Application to Functional Neural Connectivity
Wang, Minjie, Allen, Genevera I.
Emerging neuroscience technologies such as electrophysiology and calcium imaging can record from tens of thousands of neurons in the live animal brain while the animal responds to stimuli and behaves freely. Scientists often seek to understand how neurons communicate during certain stimuli or activities, something termed functional neural connectivity. To learn functional connections from large-scale neuroscience data, many have proposed using probabilistic graphical models (Yatsenko et al. 2015; Narayan et al. 2015; Chang et al. 2019), where each edge denotes a conditional dependency between nodes. Yet, applying such models in neuroscience poses a major challenge: only a small subset of neurons in the animal brain can be recorded at once, leading to abundant latent variables. Chandrasekaran et al. (2012) termed this the latent variable graphical model problem and proposed a convex program to solve it. While conceptually attractive, this approach poses several statistical, computational and practical challenges, discussed subsequently, for the task of learning functional neural connectivity from large-scale neuroscience data. Because of this, we are motivated to consider an incredibly simple solution to the latent variable graphical model problem: apply a hard thresholding operator to existing graph selection estimators. In this paper, we study this approach, showing that thresholding has more desirable theoretical properties as well as superior empirical performance.
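To make the idea concrete, below is a minimal sketch of thresholded graph selection, assuming the graphical lasso as the base estimator; the regularization level alpha and the threshold tau are illustrative placeholders, not the tuning used in the paper.

    # Hard-threshold a graphical lasso precision estimate to select a graph.
    # alpha and tau are illustrative placeholders, not the paper's choices.
    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))      # n samples x p nodes (e.g., neurons)

    est = GraphicalLasso(alpha=0.1).fit(X)  # base graph selection estimator
    Theta = est.precision_                  # estimated precision (inverse covariance)

    tau = 0.05                              # hard-thresholding level
    adj = np.abs(Theta) > tau               # keep only sufficiently large entries
    np.fill_diagonal(adj, False)            # drop self-loops; nonzeros are edges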
Supervised Convex Clustering
Wang, Minjie, Yao, Tianyi, Allen, Genevera I.
Clustering has long been a popular unsupervised learning approach for identifying groups of similar objects and discovering patterns in unlabeled data across many applications. Yet, coming up with meaningful interpretations of the estimated clusters is often challenging, precisely because of the method's unsupervised nature. Meanwhile, in many real-world scenarios there are noisy supervising auxiliary variables, for instance subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named Supervised Convex Clustering (SCC) that borrows strength from both information sources and guides the search towards more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's Disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's Disease that can potentially lead to a better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.
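To fix ideas, a joint convex fusion criterion of the kind described above could take the following form (a hedged sketch, not the paper's exact objective; the weights $w_{ij}$, the tuning parameters $\lambda, \gamma$, and the choice of norm are placeholders):

$$\min_{U,\,V}\ \tfrac{1}{2}\,\|X - U\|_F^2 \;+\; \tfrac{\lambda}{2}\,\|Y - V\|_F^2 \;+\; \gamma \sum_{i<j} w_{ij}\, \big\| (u_i, v_i) - (u_j, v_j) \big\|_2,$$

where the rows of $X$ are the unlabeled observations, the rows of $Y$ the noisy supervising auxiliary variables, and $u_i, v_i$ their cluster centroids. Because the fusion penalty acts jointly on $(u_i, v_i)$, centroids for both data sources merge together, so the recovered clusters reflect the unlabeled data and the supervision simultaneously.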
Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs
Wang, Minjie, Yu, Lingfan, Zheng, Da, Gan, Quan, Gai, Yu, Ye, Zihao, Li, Mufei, Zhou, Jinjing, Huang, Qi, Ma, Chao, Huang, Ziyue, Guo, Qipeng, Zhang, Hao, Lin, Haibin, Zhao, Junbo, Li, Jinyang, Smola, Alexander, Zhang, Zheng
DGL is platform-agnostic so that it can easily be integrated with tensor-oriented frameworks like PyTorch and MXNet. It is an open-source project under active development. Appendix A summarizes the models released in the DGL repository. In this paper, we compare DGL against state-of-the-art libraries on multiple standard GNN setups and show improvements in training speed and memory efficiency.

2 Framework Requirements of Deep Learning on Graphs

Message passing paradigm. Formally, we define a graph $G = (V, E)$. $V$ is the set of nodes, with $v_i$ being the feature vector associated with each node. $E$ is the set of edge tuples $(e_k, r_k, s_k)$, where $s_k \to r_k$ represents the edge from node $s_k$ to node $r_k$, and $e_k$ is the feature vector associated with the edge. DGNs are defined by the following edge-wise and node-wise computations:

Edge-wise: $m_k^{(t)} = \phi^e\big(e_k^{(t-1)}, v_{r_k}^{(t-1)}, v_{s_k}^{(t-1)}\big)$,
Node-wise: $v_i^{(t)} = \phi^v\big(v_i^{(t-1)}, \rho\big(\{m_k^{(t)} : \forall k \text{ s.t. } r_k = i\}\big)\big)$,

where $\phi^e$ and $\phi^v$ are the edge-wise and node-wise update functions and $\rho$ is a reduce function that aggregates the messages arriving at node $i$.
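As a minimal illustration of this paradigm, the snippet below runs one round of message passing with DGL's built-in message and reduce functions; the toy graph, feature size, and sum reducer are illustrative, and API details may differ slightly across DGL versions.

    # One round of message passing on a toy graph with DGL built-ins.
    import torch
    import dgl
    import dgl.function as fn

    g = dgl.graph(([0, 1, 2], [1, 2, 0]))   # directed edges 0->1, 1->2, 2->0
    g.ndata['h'] = torch.randn(3, 4)        # node feature vectors v_i

    # Edge-wise: copy the source feature as the message m_k.
    # Node-wise: reduce incoming messages with a sum to produce v_i^{(t)}.
    g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h_new'))
    print(g.ndata['h_new'].shape)           # torch.Size([3, 4])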
Learned Indexes for Dynamic Workloads
Tang, Chuzhe, Dong, Zhiyuan, Wang, Minjie, Wang, Zhaoguo, Chen, Haibo
The recent proposal of learned index structures opens up a new perspective on how traditional range indexes can be optimized. However, current learned indexes assume that the data distribution is relatively static and the access pattern is uniform, while real-world scenarios feature skewed query distributions and evolving data. In this paper, we demonstrate that neglecting access patterns and dynamic data distributions notably hinders the applicability of learned indexes. To this end, we propose solutions for learned indexes under dynamic workloads (called Doraemon). To improve latency for skewed queries, Doraemon augments the training data with access frequencies. To address slow model re-training when the data distribution shifts, Doraemon caches previously trained models and incrementally fine-tunes them for similar access patterns and data distributions. Our preliminary results show that Doraemon improves query latency by 45.1% and reduces model re-training time to 1/20.
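As a rough illustration of the frequency-augmentation idea, the sketch below fits a simple linear key-to-position model with access-frequency weights; the model, the hot-range setup, and the weighting scheme are hypothetical placeholders, not Doraemon's actual design.

    # Weight a learned index's training data by access frequency so that
    # hot keys get more accurate predicted positions (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    keys = np.sort(rng.uniform(0.0, 1e6, 10_000))
    pos = np.arange(len(keys), dtype=float)  # learned index maps key -> position
    freq = np.ones(len(keys))
    freq[:1_000] = 50.0                      # hypothetical hot range, queried often

    # Weighted least squares: minimize sum_i freq_i * (a*key_i + b - pos_i)^2.
    w = np.sqrt(freq)
    A = np.column_stack([keys, np.ones_like(keys)])
    a, b = np.linalg.lstsq(A * w[:, None], pos * w, rcond=None)[0]

    pred = a * keys + b                      # a real index corrects residual error
                                             # with a bounded local search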