Goto

Collaborating Authors

 Zhang, Weixiong


Transformer-based toxin-protein interaction analysis prioritizes airborne particulate matter components with potential adverse health effects

arXiv.org Artificial Intelligence

Air pollution, particularly airborne particulate matter (PM), poses a significant threat to public health globally. It is crucial to comprehend the association between PM-associated toxic components and their cellular targets in humans to understand the mechanisms by which air pollution impacts health and to establish causal relationships between air pollution and public health consequences. Current methods for modeling and analyzing these interactions are rudimentary, with experimental approaches offering limited throughput and comprehensiveness. Leveraging cutting-edge deep learning technologies, we developed tipFormer (toxin-protein interaction prediction based on transformer), a novel machine-learning approach for identifying toxic components capable of penetrating human cells and instigating pathogenic biological activities and signaling cascades. It incorporates dual pre-trained language models to derive encodings for protein sequences and chemicals. It employs a convolutional encoder to assimilate the sequential attributes of proteins and chemicals. It then introduces a novel learning module with a cross-attention mechanism to decode and elucidate the multifaceted interactions pivotal for the hotspots binding proteins and chemicals. Through thorough experimentation, tipFormer was shown to be proficient in capturing interactions between proteins and toxic components. This approach offers significant value to the air quality and toxicology research communities by enabling high-throughput, high-content identification and prioritization of hazards. Keywords: Air pollution, toxin-protein interaction, computational modeling, attention mechanisms 1. Introduction Air pollution has emerged as a critical global health concern, primarily driven by rapid economic, industrial and population growth and further exacerbated by climate change and other non-anthropogenic factors [1]. The World Health Organization estimates that approximately 7 million premature deaths occur every year due to air pollution exposure. The consequences of air pollution extend far beyond individual health implications and exacerbate the strain on societal and healthcare systems in numerous ways [2]. The health risks associated with airborne particulate matter (PM) are particularly concerning for public health [3].


dnaGrinder: a lightweight and high-capacity genomic foundation model

arXiv.org Artificial Intelligence

Foundation models (aka large language models) such as BERT [1] and GPT [2], have demonstrated their stellar performance in learning the complex characteristics and structures of natural languages, making them well-suited for a variety of subsequent applications, such as sentiment analysis, text generation, and translation [3]. These foundation models have recently been adapted to analyze biological sequences as their deep structure and large-scale parameters are well suited for dealing with the intricacy of biological sequences and structures [4, 5, 6, 7, 8, 9, 10, 11]. Biological sequences composed of nucleotides like DNA and RNA, as well as amino acids forming peptides and proteins, are regarded as natural languages of life and can be effectively leveraged by using the technology of foundation models to uncover the underlying patterns and functions they encode [12]. Typically, these foundation models build robust feature representations from biological sequences through a process known as pretraining. Encoder-based models like BERT perform such pretraining by using a method called Masked Language Modeling (MLM), where they predict the actual words of some masked or corrupted ones in given sequences. By pretraining on millions of biological sequences, foundation models gain a comprehensive contextual understanding of the given sequences. Once trained, they only need a few fine-tuning steps to be effectively applicable to specific downstream tasks [13], including prediction of epigenetic marks, gene expressions, protein folding structures, and more.


Heterogeneous Graph Neural Networks using Self-supervised Reciprocally Contrastive Learning

arXiv.org Artificial Intelligence

Heterogeneous graph neural network (HGNN) is a very popular technique for the modeling and analysis of heterogeneous graphs. Most existing HGNN-based approaches are supervised or semi-supervised learning methods requiring graphs to be annotated, which is costly and time-consuming. Self-supervised contrastive learning has been proposed to address the problem of requiring annotated data by mining intrinsic information hidden within the given data. However, the existing contrastive learning methods are inadequate for heterogeneous graphs because they construct contrastive views only based on data perturbation or pre-defined structural properties (e.g., meta-path) in graph data while ignore the noises that may exist in both node attributes and graph topologies. We develop for the first time a novel and robust heterogeneous graph contrastive learning approach, namely HGCL, which introduces two views on respective guidance of node attributes and graph topologies and integrates and enhances them by reciprocally contrastive mechanism to better model heterogeneous graphs. In this new approach, we adopt distinct but most suitable attribute and topology fusion mechanisms in the two views, which are conducive to mining relevant information in attributes and topologies separately. We further use both attribute similarity and topological correlation to construct high-quality contrastive samples. Extensive experiments on three large real-world heterogeneous graphs demonstrate the superiority and robustness of HGCL over state-of-the-art methods.


A Survey of Community Detection Approaches: From Statistical Modeling to Deep Learning

arXiv.org Artificial Intelligence

Community detection, a fundamental task for network analysis, aims to partition a network into multiple sub-structures to help reveal their latent functions. Community detection has been extensively studied in and broadly applied to many real-world network problems. Classical approaches to community detection typically utilize probabilistic graphical models and adopt a variety of prior knowledge to infer community structures. As the problems that network methods try to solve and the network data to be analyzed become increasingly more sophisticated, new approaches have also been proposed and developed, particularly those that utilize deep learning and convert networked data into low dimensional representation. Despite all the recent advancement, there is still a lack of insightful understanding of the theoretical and methodological underpinning of community detection, which will be critically important for future development of the area of network analysis. In this paper, we develop and present a unified architecture of network community-finding methods to characterize the state-of-the-art of the field of community detection. Specifically, we provide a comprehensive review of the existing community detection methods and introduce a new taxonomy that divides the existing methods into two categories, namely probabilistic graphical model and deep learning. We then discuss in detail the main idea behind each method in the two categories. Furthermore, to promote future development of community detection, we release several benchmark datasets from several problem domains and highlight their applications to various network analysis tasks. We conclude with discussions of the challenges of the field and suggestions of possible directions for future research.


Semi-supervised Semantic Segmentation of Organs at Risk on 3D Pelvic CT Images

arXiv.org Artificial Intelligence

Automated segmentation of organs-at-risk in pelvic computed tomography (CT) images can assist the radiotherapy treatment planning by saving time and effort of manual contouring and reducing intra-observer and inter-observer variation. However, training high-performance deep-learning segmentation models usually requires broad labeled data, which are labor-intensive to collect. Lack of annotated data presents a significant challenge for many medical imaging-related deep learning solutions. This paper proposes a novel end-to-end convolutional neural network-based semi-supervised adversarial method that can segment multiple organs-at-risk, including prostate, bladder, rectum, left femur, and right femur. New design schemes are introduced to enhance the baseline residual U-net architecture to improve performance. Importantly, new unlabeled CT images are synthesized by a generative adversarial network (GAN) that is trained on given images to overcome the inherent problem of insufficient annotated data in practice. A semi-supervised adversarial strategy is then introduced to utilize labeled and unlabeled 3D CT images. The new method is evaluated on a dataset of 100 training cases and 20 testing cases. Experimental results, including four metrics (dice similarity coefficient, average Hausdorff distance, average surface Hausdorff distance, and relative volume difference), show that the new method outperforms several state-of-the-art segmentation approaches.


Robust Detection of Link Communities in Large Social Networks by Exploiting Link Semantics

AAAI Conferences

Community detection has been extensively studied for various applications, focusing primarily on network topologies. Recent research has started to explore node contents to identify semantically meaningful communities and interpret their structures using selected words. However, links in real networks typically have semantic descriptions, e.g., comments and emails in social media, supporting the notion of communities of links. Indeed, communities of links can better describe multiple roles that nodes may play and provide a richer characterization of community behaviors than communities of nodes. The second issue in community finding is that most existing methods assume network topologies and descriptive contents to be consistent and to carry the compatible information of node group membership, which is generally violated in real networks. These methods are also restricted to interpret one community with one topic. The third problem is that the existing methods have used top ranked words or phrases to label topics when interpreting communities. However, it is often difficult to comprehend the derived topics using words or phrases, which may be irrelevant. To address these issues altogether, we propose a new unified probabilistic model that can be learned by a dual nested expectation-maximization algorithm. Our new method explores the intrinsic correlation between communities and topics to discover link communities robustly and extract adequate community summaries in sentences instead of words for topic labeling at the same time. It is able to derive more than one topical summary per community to provide rich explanations. We present experimental results to show the effectiveness of our new approach, and evaluate the quality of the results by a case study.


A Network-Specific Markov Random Field Approach to Community Detection

AAAI Conferences

Markov Random Field (MRF) is a powerful framework for developing probabilistic models of complex problems. MRF models possess rich structures to represent properties and constraints of a problem. It has been successful on many application problems, particularly those of computer vision and image processing, where data are structured, e.g., pixels are organized on grids. The problem of identifying communities in networks, which is essential for network analysis, is in principle analogous to finding objects in images. It is surprising that MRF has not yet been explored for network community detection. It is challenging to apply MRF to network analysis problems where data are organized on graphs with irregular structures. Here we present a network-specific MRF approach to community detection. The new method effectively encodes the structural properties of an irregular network in an energy function (the core of an MRF model) so that the minimization of the function gives rise to the best community structures. We analyzed the new MRF-based method on several synthetic benchmarks and real-world networks, showing its superior performance over the state-of-the-art methods for community identification.


Joint Identification of Network Communities and Semantics via Integrative Modeling of Network Topologies and Node Contents

AAAI Conferences

The objective of discovering network communities, an essential step in complex systems analysis, is two-fold: identification of functional modules and their semantics at the same time. However, most existing community-finding methods have focused on finding communities using network topologies, and the problem of extracting module semantics has not been well studied and node contents, which often contain semantic information of nodes and networks, have not been fully utilized. We considered the problem of identifying network communities and module semantics at the same time. We introduced a novel generative model with two closely correlated parts, one for communities and the other for semantics. We developed a co-learning strategy to jointly train the two parts of the model by combining a nested EM algorithm and belief propagation. By extracting the latent correlation between the two parts, our new method is not only robust for finding communities and semantics, but also able to provide more than one semantic explanation to a community. We evaluated the new method on artificial benchmarks and analyzed the semantic interpretability by a case study. We compared the new method with eight state-of-the-art methods on ten real-world networks, showing its superior performance over the existing methods.


Detect Overlapping Communities via Ranking Node Popularities

AAAI Conferences

Detection of overlapping communities has drawn much attention lately as they are essential properties of real complex networks. Despite its influence and popularity, the well studied and widely adopted stochastic model has not been made effective for finding overlapping communities. Here we extend the stochastic model method to detection of overlapping communities with the virtue of autonomous determination of the number of communities. Our approach hinges upon the idea of ranking node popularities within communities and using a Bayesian method to shrink communities to optimize an objective function based on the stochastic generative model. We evaluated the novel approach, showing its superior performance over five state-of-the-art methods, on large real networks and synthetic networks with ground-truths of overlapping communities.


Semantic Community Identification in Large Attribute Networks

AAAI Conferences

Identification of modular or community structures of a network is a key to understanding the semantics and functions of the network. While many network community detection methods have been developed, which primarily explore network topologies, they provide little semantic information of the communities discovered. Although structures and semantics are closely related, little effort has been made to discover and analyze these two essential network properties together. By integrating network topology and semantic information on nodes, e.g., node attributes, we study the problems of detection of communities and inference of their semantics simultaneously. We propose a novel nonnegative matrix factorization (NMF) model with two sets of parameters, the community membership matrix and community attribute matrix, and present efficient updating rules to evaluate the parameters with a convergence guarantee. The use of node attributes improves upon community detection and provides a semantic interpretation to the resultant network communities. Extensive experimental results on synthetic and real-world networks not only show the superior performance of the new method over the state-of-the-art approaches, but also demonstrate its ability to semantically annotate the communities.