glycan
Modeling All-Atom Glycan Structures via Hierarchical Message Passing and Multi-Scale Pre-training
Xu, Minghao, Song, Jiaze, Wu, Keming, Zhou, Xiangxin, Cui, Bin, Zhang, Wentao
Understanding the various properties of glycans with machine learning has shown some preliminary promise. However, previous methods mainly focused on modeling the backbone structure of glycans as graphs of monosaccharides (i.e., sugar units), while neglecting the atomic structures underlying each monosaccharide, which are in fact important indicators of glycan properties. We fill this gap by introducing the GlycanAA model for All-Atom-wise Glycan modeling. GlycanAA models a glycan as a heterogeneous graph with monosaccharide nodes representing its global backbone structure and atom nodes representing its local atomic-level structures. Based on such a graph, GlycanAA performs hierarchical message passing to capture interactions ranging from the local atomic level to the global monosaccharide level. To further enhance model capability, we pre-train GlycanAA on a high-quality unlabeled glycan dataset, deriving the PreGlycanAA model. We design a multi-scale mask prediction algorithm to endow the model with knowledge of dependencies at different levels of a glycan. Extensive benchmark results show the superiority of GlycanAA over existing glycan encoders and verify the further improvements achieved by PreGlycanAA. We maintain all resources at https://github.com/kasawa1234/GlycanAA
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Immunology (0.46)
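The hierarchical message passing described in the GlycanAA abstract above can be pictured with a small toy example. The following is a minimal sketch only, not the authors' implementation: the function names, the two-dimensional atom features, and the use of plain mean aggregation are all assumptions chosen for illustration. It runs one local round over atom nodes, pools atoms into their monosaccharide, then runs one global round over monosaccharide nodes.

```python
# Toy hierarchical message passing on a heterogeneous glycan graph.
# ASSUMPTIONS: all names, features, and the mean-aggregation rule are
# hypothetical; the GlycanAA paper defines its own layers.

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def message_pass(features, edges):
    """One round in which every node averages itself and its neighbors."""
    neighbors = {node: [node] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return {node: mean([features[m] for m in neighbors[node]])
            for node in features}

def hierarchical_pass(atom_feats, atom_edges, membership, mono_edges):
    # Step 1 (local): atoms exchange messages along atom-atom bonds.
    atom_h = message_pass(atom_feats, atom_edges)
    # Step 2 (bridge): each monosaccharide node pools its own atoms.
    mono_h = {mono: mean([atom_h[a] for a in atoms])
              for mono, atoms in membership.items()}
    # Step 3 (global): monosaccharides exchange messages along the backbone.
    mono_h = message_pass(mono_h, mono_edges)
    return atom_h, mono_h

# Two monosaccharides with two atoms each; features are [is_carbon, is_oxygen].
atom_feats = {"a0": [1.0, 0.0], "a1": [0.0, 1.0],
              "a2": [1.0, 0.0], "a3": [1.0, 0.0]}
atom_edges = [("a0", "a1"), ("a2", "a3"), ("a1", "a2")]  # last one links the units
membership = {"Glc": ["a0", "a1"], "Gal": ["a2", "a3"]}
mono_edges = [("Glc", "Gal")]

atom_h, mono_h = hierarchical_pass(atom_feats, atom_edges, membership, mono_edges)
print(mono_h)
```

In this sketch the monosaccharide representations depend on atom-level information only through the pooling step, which is one simple way to let local structure inform the global backbone.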
Higher-Order Message Passing for Glycan Representation Learning
Glycans are the most complex biological sequences, with monosaccharides forming extended, non-linear sequences. As post-translational modifications, they modulate protein structure, function, and interactions. Due to their diversity and complexity, predictive models of glycan properties and functions are still insufficient. Graph Neural Networks (GNNs) are deep learning models designed to process and analyze graph-structured data. These architectures leverage the connectivity and relational information in graphs to learn effective representations of nodes, edges, and entire graphs. By iteratively aggregating information from neighboring nodes, GNNs capture complex patterns within graph data, making them particularly well-suited for tasks such as link prediction or graph classification across domains. This work presents a new model architecture based on combinatorial complexes and higher-order message passing to extract features from glycan structures into a latent space representation. The architecture is evaluated on an improved GlycanML benchmark suite, establishing a new state-of-the-art performance. We envision that these improvements will spur further advances in computational glycosciences and reveal the roles of glycans in biology.
- Europe > Germany > Saarland > Saarbrücken (0.14)
- Europe > Sweden > Vaestra Goetaland > Gothenburg (0.05)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Greece (0.04)
GlycanML: A Multi-Task and Multi-Structure Benchmark for Glycan Machine Learning
Xu, Minghao, Geng, Yunteng, Zhang, Yihang, Yang, Ling, Tang, Jian, Zhang, Wentao
Glycans are basic biomolecules and perform essential functions within living organisms. The rapid increase of functional glycan data provides a good opportunity for machine learning solutions to glycan understanding. However, a standard machine learning benchmark for glycan function prediction is still lacking. In this work, we fill this gap by building a comprehensive benchmark for Glycan Machine Learning (GlycanML). The GlycanML benchmark consists of diverse types of tasks including glycan taxonomy prediction, glycan immunogenicity prediction, glycosylation type prediction, and protein-glycan interaction prediction. Glycans can be represented by both sequences and graphs in GlycanML, which enables us to extensively evaluate sequence-based models and graph neural networks (GNNs) on benchmark tasks. Furthermore, by concurrently performing eight glycan taxonomy prediction tasks, we introduce the GlycanML-MTL testbed for multi-task learning (MTL) algorithms. Experimental results show the superiority of modeling glycans with multi-relational GNNs and indicate that suitable MTL methods can further boost model performance. We provide all datasets and source codes at https://github.com/GlycanML/GlycanML and maintain a leaderboard at https://GlycanML.github.io/project
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Immunology (0.94)
- Education (0.68)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Artificial intelligence could revolutionise glycomics datasets
Researchers have created a tool that allows glycomics datasets to be analysed using artificial intelligence for early cancer diagnoses. A team at the University of California (UC) San Diego, US, have developed a tool called GlyCompare that enables researchers to analyse glycomics datasets using artificial intelligence (AI), potentially leading to early cancer diagnoses. GlyCompare takes a systems-level perspective that accounts for shared biosynthetic pathways of glycans within and across samples. According to the team, one of the keys to the GlyCompare approach is that it looks at the biological steps needed to synthesise the subunits that make up glycans, rather than looking only at the whole glycans themselves, thereby improving the accuracy of statistical analyses of glycomics data. To introduce their technology, the team demonstrated their ability to enhance comparisons of glycomics datasets by focusing on the hidden relationships between glycans in several contexts, including gastric cancer tissues.
Graph Convolutional Neural Networks to Analyze Complex Carbohydrates
Graph convolutional neural networks (GCNs) have attracted increasing amounts of attention over the last couple of years, with more and more disciplines finding use for them. This has also been extended into the life sciences, as GCNs have been used to analyze proteins, drugs, and of course biological networks. One key advantage of GCNs that has enabled this expansion is their ability to natively work with nonlinear data formats, in contrast to more linear data structures such as in natural languages. Because of this feature, we also implemented GCNs for our own topic of interest, the study of complex carbohydrates or glycans. Glycans are ubiquitous in biology, decorating every cell and playing key roles in processes such as viral infection or tumor immune evasion.
New AI model helps understand virus spread from animals to humans
The image shows a glimpse of glycan diversity, showcasing several classes of glycans from various kingdoms of life. A new model that applies artificial intelligence to carbohydrates improves the understanding of the infection process and could help predict which viruses are likely to spread from animals to humans. This is reported in a recent study led by researchers at the University of Gothenburg. Carbohydrates participate in nearly all biological processes - yet they are still not well understood. Referred to as glycans, these carbohydrates are crucial to making our body work the way it is supposed to.
Spotlight on AI: Latest Developments in the Field of Artificial Intelligence
Artificial intelligence is changing the course of our lives with its constant developments. Before the pandemic and now in the new normal, AI remains a key trend in the tech industry. It is reaching wider audiences as the years pass, and scientists, engineers, and entrepreneurs who work with modern technologies are reaping the benefits of AI and related technologies such as IoT and machine learning. Organizations that overlooked digital transformation and the power of artificial intelligence are picking up the pace of AI adoption. When COVID-19 was creating chaos across industries, it became evident that disruptive technologies and the automation that comes with them are more than crucial.
- Information Technology (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.52)
- Health & Medicine > Therapeutic Area > Immunology (0.52)
Chemistry-informed Macromolecule Graph Representation for Similarity Computation and Supervised Learning
Mohapatra, Somesh, An, Joyce, Gómez-Bombarelli, Rafael
Macromolecules are large, complex molecules composed of covalently bonded monomer units, existing in different stereochemical configurations and topologies. As a result of such chemical diversity, representing, comparing, and learning over macromolecules emerge as critical challenges. To address this, we developed a macromolecule graph representation, with monomers and bonds as nodes and edges, respectively. We captured the inherent chemistry of the macromolecule by using molecular fingerprints for node and edge attributes. For the first time, we demonstrated computation of chemical similarity between two macromolecules of varying chemistry and topology, using exact graph edit distances and graph kernels. We also trained graph neural networks for a variety of glycan classification tasks, achieving state-of-the-art results. Our work has two-fold implications - it provides a general framework for representation, comparison, and learning of macromolecules; and enables quantitative chemistry-informed decision-making and iterative design in the macromolecular chemical space. Macromolecules are ubiquitous and indispensable, from constituting what we are made up of to being present in almost everything we use. As biological macromolecules, they form the basis of life, serving as drivers of survival and growth functions. As synthetic macromolecules, they have been engineered by humans in composition and topology to design structural components, sensors, shape-memory materials, and drugs, to encode messages, and much more (Lutz et al., 2016; Romio et al., 2020; Boydston et al., 2020; Thompson & Korley, 2020).
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > California > Los Angeles County > Pasadena (0.04)
- Europe > France (0.04)
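To make the idea of a graph kernel over a macromolecule graph concrete, here is a minimal sketch. It is not the method of the paper above: it assumes plain string labels for monomers and bonds in place of molecular fingerprints, and it uses a label-histogram kernel, one of the simplest graph kernels, which ignores topology entirely. The graph layout, the label strings, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

# ASSUMPTION: a macromolecule is a dict with monomer labels on nodes and
# bond labels on edges. The paper uses fingerprints as attributes; plain
# strings are used here only to keep the example short.
glycan_a = {"nodes": {0: "Glc", 1: "Gal", 2: "Man"},
            "edges": [(0, 1, "b1-4"), (1, 2, "a1-3")]}
glycan_b = {"nodes": {0: "Glc", 1: "Gal", 2: "Gal"},
            "edges": [(0, 1, "b1-4"), (1, 2, "b1-4")]}

def label_histogram(graph):
    """Count node and edge labels, kept apart by an 'n:' or 'e:' prefix."""
    labels = ["n:" + label for label in graph["nodes"].values()]
    labels += ["e:" + bond for _, _, bond in graph["edges"]]
    return Counter(labels)

def histogram_kernel(graph_x, graph_y):
    """Inner product of the two label histograms."""
    hx, hy = label_histogram(graph_x), label_histogram(graph_y)
    return sum(hx[key] * hy[key] for key in hx.keys() & hy.keys())

def similarity(graph_x, graph_y):
    """Kernel value normalized so that a graph scores 1.0 against itself."""
    kxy = histogram_kernel(graph_x, graph_y)
    kxx = histogram_kernel(graph_x, graph_x)
    kyy = histogram_kernel(graph_y, graph_y)
    return kxy / sqrt(kxx * kyy)

sim_self = similarity(glycan_a, glycan_a)
sim_ab = similarity(glycan_a, glycan_b)
print(round(sim_self, 3), round(sim_ab, 3))
```

Because this kernel only counts labels, two macromolecules with the same monomers and bonds arranged differently would score as identical; topology-aware kernels or graph edit distance, both named in the abstract, are needed to separate such cases.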
Mining Massive Hierarchical Data Using a Scalable Probabilistic Graphical Model
AlJadda, Khalifeh, Korayem, Mohammed, Ortiz, Camilo, Grainger, Trey, Miller, John A., Rasheed, Khaled, Kochut, Krys J., York, William S., Ranzinger, Rene, Porterfield, Melody
Probabilistic Graphical Models (PGMs) are very useful in the fields of machine learning and data mining. The crucial limitation of those models, however, is scalability. The Bayesian Network, which is one of the most common PGMs used in machine learning and data mining, demonstrates this limitation when the training data consists of random variables, each of which has a large set of possible values. In the big data era, one would expect new extensions to the existing PGMs to handle the massive amount of data produced these days by computers, sensors and other electronic devices. With hierarchical data - data that is arranged in a treelike structure with several levels - one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian Networks become infeasible for representing the probability distributions. In this paper we introduce an extension to Bayesian Networks to handle massive sets of hierarchical data in a reasonable amount of time and space. The proposed model achieves perfect precision of 1.0 and high recall of 0.93 when it is used as a multi-label classifier for the annotation of mass spectrometry data. On another data set of 1.5 billion search logs provided by CareerBuilder.com the model was able to predict latent semantic relationships between search keywords with accuracy up to 0.80.
- North America > United States > Georgia > Clarke County > Athens (0.14)
- North America > United States > Indiana > Monroe County > Bloomington (0.04)
- North America > United States > California (0.04)
- Information Technology > Data Science > Data Mining > Big Data (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
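The precision and recall figures quoted in the abstract above are easier to interpret with a small worked example. The sketch below is an illustration under assumptions: the abstract does not say how the scores were averaged, so micro-averaging over all predicted labels is assumed here, and the label sets are made up. It shows how a multi-label classifier can reach a precision of 1.0 while its recall stays below 1.0: every label it predicts is correct, but it misses some true labels.

```python
# Micro-averaged precision and recall for a multi-label classifier.
# ASSUMPTION: micro-averaging and the toy label sets are illustrative;
# they are not taken from the paper.

def micro_precision_recall(true_sets, pred_sets):
    """Return (precision, recall) pooled over all samples and labels."""
    true_positive = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    n_predicted = sum(len(p) for p in pred_sets)
    n_true = sum(len(t) for t in true_sets)
    precision = true_positive / n_predicted if n_predicted else 0.0
    recall = true_positive / n_true if n_true else 0.0
    return precision, recall

# Three samples; the classifier never predicts a wrong label,
# but it misses label "B" on the first sample.
true_sets = [{"A", "B"}, {"C"}, {"A", "D"}]
pred_sets = [{"A"}, {"C"}, {"A", "D"}]

precision, recall = micro_precision_recall(true_sets, pred_sets)
print(precision, recall)
```

Here four labels are predicted and all four are correct, while one of the five true labels is missed, so precision is 4/4 and recall is 4/5.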
PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems
AlJadda, Khalifeh, Korayem, Mohammed, Ortiz, Camilo, Grainger, Trey, Miller, John A., York, William S.
In the big data era, scalability has become a crucial requirement for any useful computational model. Probabilistic graphical models are very useful for mining and discovering data insights, but they are not scalable enough to be suitable for big data problems. Bayesian Networks particularly demonstrate this limitation when their data is represented using few random variables, each of which has a massive set of values. With hierarchical data - data that is arranged in a treelike structure with several levels - one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian networks become infeasible for representing the probability distributions for the following reasons: i) Each level represents a single random variable with hundreds of thousands of values, ii) The number of levels is usually small, so there are also few random variables, and iii) The structure of the network is predefined since the dependency is modeled top-down from each parent to each of its child nodes, so the network would contain a single linear path for the random variables from each parent to each child node. In this paper we present a scalable probabilistic graphical model to overcome these limitations for massive hierarchical data. We believe the proposed model will lead to an easily scalable, more readable, and expressive implementation for problems that require probability-based solutions for massive amounts of hierarchical data. We successfully applied this model to solve two different challenging probability-based problems on massive hierarchical data sets for different domains, namely, bioinformatics and latent semantic discovery over search logs.
- North America > United States > Georgia > Clarke County > Athens (0.14)
- North America > United States > Indiana > Monroe County > Bloomington (0.04)
- North America > United States > California (0.04)
- Information Technology > Data Science > Data Mining > Big Data (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.96)