BioBERT and BioMegatron are Transformer models adapted for the biomedical domain based on publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility to support inference in cancer precision medicine, namely the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs and diseases. We show that these models do indeed encode biological knowledge, although some of this is lost in fine-tuning for specific tasks. Finally, we analyse how the models behave with regard to biases and imbalances in the dataset.
With new technologies in biomedicine, we are able to generate and collect data of various modalities, including genomics, epigenomics, transcriptomics, and proteomics (Figure 1A). Integrating heterogeneous features on a single set of observations provides a unique opportunity to gain a comprehensive understanding of an outcome of interest. It offers the potential for making discoveries that are hidden in data analyses of a single modality and achieving more accurate predictions of the outcome (Kristensen et al. 2014, Ritchie et al. 2015, Gligorijević et al. 2016, Karczewski & Snyder 2018, Ma et al. 2020). While "multi-view data analysis" can mean different things, we use it here in the context of supervised learning, where the goal is to fuse different data views to model an outcome of interest. To give a concrete example, assume that a researcher wants to predict cancer outcomes from RNA expression and DNA methylation measurements for a set of patients. The researcher suspects that: (1) both data views could potentially have prognostic value; (2) the two views share some underlying relationship with each other, as DNA methylation regulates gene expression and can repress the expression of tumor suppressor genes or promote the expression of oncogenes. Should the researcher use both data views for downstream prediction, or just use one view or the other?
There are many concerns about our increasing reliance on technology. Nevertheless, as a society, we continue to push technology to new heights. From how we order food to how we provide healthcare, machine learning and artificial intelligence continue to help us surpass our wildest dreams. This matters most in healthcare, where early detection and treatment can bring the best results: an algorithm that detects disease early can literally save lives. Some universities have found, by building databases, that their artificial intelligence can do as well as doctors in the diagnosis process and can do better in early detection. Artificial intelligence can also give diagnosis suggestions based on structured symptom data, give medication suggestions based on the diagnosis code, and predict adverse drug reactions based on other medications taken. Organizations such as Health Catalyst are using artificial intelligence to reduce hospital-acquired infections (HAIs). If we can detect these dangerous infections early, we can reduce the mortality and morbidity associated with them. With the help of machine learning, the company is working to better understand medical issues through crowdsourcing. Having a larger and more diverse database can help the research be more accurate.
In the age of Big Data in biology, data science and machine learning have flourished and benefitted from their interdisciplinary application to biology. As a graduate student in this discipline, I read a lot of papers to stay up to date on the literature (and still have a large reading list to catch up on!), and thought I would share some of the best papers I've read this year. In about 80–90% of the single-cell papers you'll encounter, depending on the research question, there will be at least one or two tSNE or UMAP plots to visualize the data the authors have collected, usually single-cell RNA-sequencing (scRNA-seq) data, where individual cells are profiled for their RNA abundance across the genome. These unsupervised dimensionality reduction methods have been more or less accepted as the status quo for data visualization in the world of single-cell genomics, so it took Academic Twitter by storm this summer when a new preprint boldly challenged that norm, arguing that these methods do little to preserve the latent structure of the data they seek to convey to our 3D minds. Starting from the extreme example of preserving equidistant cells in high-dimensional space, and later relaxing it to near-equidistance, the authors show how tSNE and UMAP distort the arrangement of groups of cells that are near-equidistant in the original space, clustering them with groups of cells that are evenly spread further apart.
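The near-equidistance argument can be illustrated with a minimal sketch. It is not the preprint's experiment: it assumes synthetic Gaussian "cells" (which are nearly equidistant in high dimensions) and swaps in plain PCA via NumPy's SVD as a stand-in for tSNE or UMAP. The names `distance_spread` and `pca_2d` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_spread(points):
    """Largest pairwise distance divided by the smallest (1.0 = equidistant)."""
    dist = pairwise_distances(points)
    off_diag = dist[~np.eye(len(points), dtype=bool)]
    return off_diag.max() / off_diag.min()

def pca_2d(points):
    """Project onto the top two principal components (PCA, not tSNE/UMAP)."""
    centered = points - points.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

# Assumption: 10 synthetic "cells" with 2000 Gaussian features each.
rng = np.random.default_rng(0)
cells = rng.standard_normal((10, 2000))

high_dim_spread = distance_spread(cells)        # close to 1: near-equidistant
low_dim_spread = distance_spread(pca_2d(cells)) # much larger after projection

print(f"spread in 2000-D: {high_dim_spread:.2f}")
print(f"spread in 2-D:    {low_dim_spread:.2f}")
```

Ten points cannot all be mutually equidistant in a plane, so any 2D layout has to stretch some distances and shrink others. The sketch only measures how large that distortion is for one simple linear projection.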
The TriRhenaTech alliance presents the accepted papers of the 'Upper-Rhine Artificial Intelligence Symposium' held on October 27th, 2021 in Kaiserslautern, Germany. Topics of the conference are applications of Artificial Intelligence in life sciences, intelligent systems, industry 4.0, mobility and others. The TriRhenaTech alliance is a network of universities in the Upper-Rhine Trinational Metropolitan Region comprising the German universities of applied sciences in Furtwangen, Kaiserslautern, Karlsruhe, Offenburg and Trier, the Baden-Wuerttemberg Cooperative State University Loerrach, the French university network Alsace Tech (comprising 14 'grandes écoles' in the fields of engineering, architecture and management) and the University of Applied Sciences and Arts Northwestern Switzerland. The alliance's common goal is to reinforce the transfer of knowledge, research, and technology, as well as the cross-border mobility of students.
Graph kernels have attracted a lot of attention during the last decade, and have evolved into a rapidly developing branch of learning on structured data. During the past 20 years, the considerable research activity that occurred in the field resulted in the development of dozens of graph kernels, each focusing on specific structural properties of graphs. Graph kernels have proven successful in a wide range of domains, ranging from social networks to bioinformatics. The goal of this survey is to provide a unifying view of the literature on graph kernels. In particular, we present a comprehensive overview of a wide range of graph kernels. Furthermore, we perform an experimental evaluation of several of those kernels on publicly available datasets, and provide a comparative study. Finally, we discuss key applications of graph kernels, and outline some challenges that remain to be addressed.
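To make the idea of a graph kernel concrete, here is a minimal sketch of one well-known member of the family, the Weisfeiler-Lehman subtree kernel. It is an illustration only and is not taken from the survey's own code or experiments; the function name, the adjacency-dictionary input format, and the toy graphs are assumptions.

```python
from collections import Counter

def wl_subtree_kernel(adj_a, labels_a, adj_b, labels_b, iterations=1):
    """Weisfeiler-Lehman subtree kernel between two labelled graphs.

    adj_*:    dict mapping each node to a list of its neighbours.
    labels_*: dict mapping each node to an initial hashable label.
    """
    graphs = [adj_a, adj_b]
    labels = [dict(labels_a), dict(labels_b)]
    kernel = 0
    for step in range(iterations + 1):
        # Compare the two graphs through their label histograms.
        counts = [Counter(lab.values()) for lab in labels]
        kernel += sum(counts[0][key] * counts[1][key] for key in counts[0])
        if step == iterations:
            break
        # Refine: new label = (own label, sorted multiset of neighbour labels).
        labels = [
            {
                node: (lab[node], tuple(sorted((lab[n] for n in adj[node]), key=repr)))
                for node in adj
            }
            for adj, lab in zip(graphs, labels)
        ]
    return kernel

# Hypothetical toy graphs: a 3-node path and a triangle, all nodes labelled "x".
path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
same_label = {0: "x", 1: "x", 2: "x"}

k_path_triangle = wl_subtree_kernel(path, same_label, triangle, same_label)
k_path_path = wl_subtree_kernel(path, same_label, path, same_label)
k_triangle_triangle = wl_subtree_kernel(triangle, same_label, triangle, same_label)
print(k_path_triangle, k_path_path, k_triangle_triangle)
```

After one refinement step only the middle node of the path looks like a triangle node, so the cross-graph value is smaller than either self-similarity. The resulting values can be arranged into a Gram matrix and passed to any kernel method, such as a support vector machine.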
Background: Survival analysis concerns the study of timeline data where the event of interest may remain unobserved (i.e., censored). Studies commonly record more than one type of event, but conventional survival techniques focus on a single event type. We set out to integrate both multiple independently censored time-to-event variables and missing observations. Methods: An energy-based approach is taken with a bi-partite structure between latent and visible states, commonly known as harmoniums (or restricted Boltzmann machines). Results: The present harmonium is shown, both theoretically and experimentally, to capture non-linear patterns between distinct time recordings. We illustrate on real world data that, for a single time-to-event variable, our model is on par with established methods. In addition, we demonstrate that discriminative predictions improve by leveraging an extra time-to-event variable. Conclusions: Multiple time-to-event variables can be successfully captured within the harmonium paradigm.
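For readers unfamiliar with censoring, the sketch below shows how a censored observation enters a conventional single-event estimate. It uses the standard Kaplan-Meier estimator, not the harmonium model described above, and the function name and the five toy observations are assumptions made for illustration.

```python
def kaplan_meier(times, observed):
    """Kaplan-Meier survival curve as a list of (event_time, survival) pairs.

    times:    follow-up time of each subject.
    observed: True if the event was seen at that time, False if censored.
    """
    data = sorted(zip(times, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Gather every subject whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if events:
            # Only observed events lower the survival estimate.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Censored subjects still leave the risk set after time t.
        at_risk -= leaving
    return curve

# Hypothetical data: five subjects, two of them censored.
toy_times = [2, 3, 3, 5, 8]
toy_observed = [True, True, False, True, False]
km_curve = kaplan_meier(toy_times, toy_observed)
print(km_curve)
```

A censored subject never causes a drop in the curve, but it shrinks the risk set for later event times. That is how the partial information "survived at least this long" is used.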
The use of multi-modal data, such as the combination of whole slide images (WSIs) and gene expression data, for survival analysis can lead to more accurate survival predictions. Previous multi-modal survival models are not able to efficiently extract the intrinsic information within each modality. Moreover, although experimental results show that WSIs provide more effective information than gene expression data, previous methods regard the information from different modalities as similarly important, so they cannot flexibly utilize the potential connection between the modalities. To address the above problems, we propose a new asymmetrical multi-modal method, termed AMMASurv. Specifically, we design an asymmetrical multi-modal attention mechanism (AMMA) in the Transformer encoder for multi-modal data to enable a more flexible multi-modal information fusion for survival prediction. Different from previous works, AMMASurv can effectively utilize the intrinsic information within every modality and flexibly adapt to modalities of different importance. Extensive experiments are conducted to validate the effectiveness of the proposed model. Encouraging results demonstrate the superiority of our method over other state-of-the-art methods.
Strong correlations between explanatory variables are problematic for high-dimensional regularized regression methods. Due to the violation of the Irrepresentable Condition, the popular LASSO method may suffer from false inclusions of inactive variables. In this paper, we propose pre-processing with orthogonal decompositions (PROD) for the explanatory variables in high-dimensional regressions. The PROD procedure is constructed based upon a generic orthogonal decomposition of the design matrix. We demonstrate through two concrete cases that the PROD approach can be effectively constructed for improving the performance of high-dimensional penalized regression. Our theoretical analysis reveals their properties and benefits for high-dimensional penalized linear regression with LASSO.
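The sketch below shows how a strongly correlated inactive variable violates the Irrepresentable Condition. It checks the standard form of the condition numerically and does not implement the PROD procedure. The function name and the small hand-built design matrices are assumptions for illustration.

```python
import numpy as np

def irrepresentable_value(X, active, signs):
    """Largest |X_j^T X_S (X_S^T X_S)^{-1} sign(beta_S)| over inactive columns j.

    The (strong) Irrepresentable Condition holds when this value is below 1.
    """
    active = np.asarray(active)
    inactive = np.setdiff1d(np.arange(X.shape[1]), active)
    X_s, X_n = X[:, active], X[:, inactive]
    weights = np.linalg.solve(X_s.T @ X_s, np.asarray(signs, dtype=float))
    return float(np.max(np.abs(X_n.T @ X_s @ weights)))

e1, e2, e3 = np.eye(3)

# Orthogonal design: the inactive column is unrelated to the active ones.
X_orthogonal = np.column_stack([e1, e2, e3])

# Correlated design: the inactive column has unit norm but correlates 0.6
# with each of the two active columns.
x3 = 0.6 * e1 + 0.6 * e2 + np.sqrt(0.28) * e3
X_correlated = np.column_stack([e1, e2, x3])

value_orthogonal = irrepresentable_value(X_orthogonal, [0, 1], [1, 1])
value_correlated = irrepresentable_value(X_correlated, [0, 1], [1, 1])
print(value_orthogonal, value_correlated)
```

In the correlated design the value is 0.6 + 0.6 = 1.2, which exceeds 1. The condition therefore fails, and LASSO may falsely include the third variable even though it is inactive.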
Data-driven artificial intelligence models fed with published scientific findings have been used to create powerful prediction engines for scientific and technological advance, such as the discovery of novel materials with desired properties and the targeted invention of new therapies and vaccines. These AI approaches typically ignore the distribution of human prediction engines -- scientists and inventors -- who continuously alter the landscape of discovery and invention. As a result, AI hypotheses are designed to substitute for human experts, failing to complement them for punctuated collective advance. Here we show that incorporating the distribution of human expertise into self-supervised models by training on inferences cognitively available to experts dramatically improves AI prediction of future human discoveries and inventions. Incorporating expert awareness into models that propose (a) valuable energy-relevant materials increases the precision of materials predictions by ~100%, (b) repurposing thousands of drugs to treat new diseases increases precision by 43%, and (c) COVID-19 vaccine candidates examined in clinical trials by 260%. These models succeed by predicting human predictions and the scientists who will make them. When AI is tuned to avoid the crowd, however, it generates scientifically promising "alien" hypotheses unlikely to be imagined or pursued without intervention, not only accelerating but punctuating scientific advance. By identifying and correcting for collective human bias, these models also suggest opportunities to improve human prediction by reformulating science education for discovery.