Machine learning-based methods have shown potential for optimizing existing molecules with more desirable properties, a critical step towards accelerating new chemical discovery. Here we propose QMO, a generic query-based molecule optimization framework that exploits latent embeddings from a molecule autoencoder. QMO improves the desired properties of an input molecule based on efficient queries, guided by a set of molecular property predictions and evaluation metrics. We show that QMO outperforms existing methods in the benchmark tasks of optimizing small organic molecules for drug-likeness and solubility under similarity constraints. We also demonstrate substantial property improvement using QMO on two new and challenging tasks that are also important in real-world discovery problems: (1) optimizing existing potential SARS-CoV-2 main protease inhibitors towards higher binding affinity and (2) improving known antimicrobial peptides towards lower toxicity. Results from QMO show high consistency with external validations, suggesting an effective means to facilitate material optimization problems with design constraints. Zeroth-order optimization is used on problems where no explicit gradient function is accessible, but single points can be queried. Hoffman et al. present here a molecular design method that uses zeroth-order optimization to deal with the discreteness of molecule sequences and to incorporate external guidance from property evaluations and design constraints.
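The zeroth-order idea described above can be illustrated with a small sketch: when an objective can only be queried for values, a gradient can be approximated by averaging finite differences along random directions. This is not the QMO implementation. The objective `toy_loss`, the function names, and every hyperparameter below are hypothetical stand-ins for a property loss evaluated on a latent embedding.

```python
import random


def toy_loss(z):
    """Hypothetical black-box objective: squared distance to a fixed target."""
    target = [1.0, -2.0, 0.5]
    return sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def zo_gradient(loss, z, num_queries=20, mu=1e-3, rng=None):
    """Estimate the gradient of `loss` at `z` using function queries only.

    Averages forward differences taken along random Gaussian directions.
    """
    rng = rng or random.Random(0)
    dim = len(z)
    base = loss(z)
    grad = [0.0] * dim
    for _ in range(num_queries):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z_shift = [zi + mu * ui for zi, ui in zip(z, u)]
        slope = (loss(z_shift) - base) / mu  # directional derivative estimate
        for i in range(dim):
            grad[i] += slope * u[i] / num_queries
    return grad


def zo_minimize(loss, z0, steps=300, lr=0.05, seed=0):
    """Plain gradient descent driven by the query-based gradient estimate."""
    rng = random.Random(seed)
    z = list(z0)
    for _ in range(steps):
        g = zo_gradient(loss, z, rng=rng)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


z_final = zo_minimize(toy_loss, [0.0, 0.0, 0.0])
```

In a molecule-optimization setting, `toy_loss` would presumably be replaced by a loss built from property predictors and similarity constraints evaluated on decoded sequences. That substitution is an assumption here and is not shown.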
That the sequential structure of genomes is important has been known since the discovery of DNA. In this paper we employ a statistics and stochastic process perspective on triplets of successive bases to address two important applications: identifying outliers in genome databases, and classifying reads in the metagenomic context of reference-guided assembly. From this stochastic process perspective, triplets are a second-order Markov chain specified by the distribution of each base conditional on its two immediate predecessors. To be sure, studying genomes via base sequence distributions is not novel. Previous papers have addressed genome signatures (Karlin et al., 1997; Campbell et al., 1999; Takashi et al., 2003), as well as frequentist (Rosen et al., 2008) and Bayesian (Wang et al., 2007) approaches to classification problems.
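To make the second-order Markov chain view concrete, the following minimal sketch estimates the distribution of each base conditional on its two immediate predecessors from raw triplet counts. It is an illustration only, not the authors' code. The function name and the toy sequence are hypothetical, and no smoothing is applied to unseen triplets.

```python
from collections import Counter, defaultdict


def triplet_transition_probs(sequence, alphabet="ACGT"):
    """Estimate P(base | two preceding bases) from overlapping triplet counts."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - 2):
        context, nxt = sequence[i:i + 2], sequence[i + 2]
        # Skip triplets containing symbols outside the alphabet (e.g. N).
        if all(ch in alphabet for ch in context + nxt):
            counts[context][nxt] += 1
    probs = {}
    for context, ctr in counts.items():
        total = sum(ctr.values())
        probs[context] = {b: ctr[b] / total for b in alphabet}
    return probs


# Toy sequence, made up for illustration.
probs = triplet_transition_probs("ACGACGACTACG")
print(probs["AC"])  # conditional distribution of the base that follows "AC"
```

Comparing such conditional distributions across genomes or reads is one plausible way to obtain features for outlier detection or classification. The specific statistics used in the paper are not reproduced here.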
Automatic diagnosis has attracted increasing attention but remains challenging due to multi-step reasoning. Recent works usually address it with reinforcement learning methods. However, these methods show low efficiency and require task-specific reward functions. Considering that the conversation between doctor and patient allows doctors to probe for symptoms and make diagnoses, the diagnosis process can naturally be seen as the generation of a sequence including symptoms and diagnoses. Inspired by this, we reformulate automatic diagnosis as a symptoms Sequence Generation (SG) task and propose a simple but effective automatic Diagnosis model based on Transformer (Diaformer). We first design a symptom attention framework to learn the generation of symptom inquiry and the disease diagnosis. To alleviate the discrepancy between sequential generation and the disorder of implicit symptoms, we further design three orderless training mechanisms. Experiments on three public datasets show that our model outperforms baselines on disease diagnosis by 1%, 6% and 11.5% with the highest training efficiency. Detailed analysis of symptom inquiry prediction demonstrates the potential of applying symptoms sequence generation to automatic diagnosis.
Protein-ligand interactions (PLIs) are fundamental to biochemical research, and their identification is crucial for estimating biophysical and biochemical properties for rational therapeutic design. Currently, experimental characterization of these properties is the most accurate method; however, it is very time-consuming and labor-intensive. A number of computational methods have been developed in this context, but most existing PLI prediction methods depend heavily on 2D protein sequence data. Here, we present a novel parallel graph neural network (GNN) that integrates knowledge representation and reasoning for PLI prediction, performing deep learning guided by expert knowledge and informed by 3D structural data. We develop two distinct GNN architectures: GNNF is the base implementation that employs distinct featurization to enhance domain-awareness, while GNNP is a novel implementation that can predict with no prior knowledge of the intermolecular interactions. A comprehensive evaluation demonstrated that the GNNs can successfully capture the binary interactions between a ligand and a protein's 3D structure, with a test accuracy of 0.979 for GNNF and 0.958 for GNNP in predicting the activity of a protein-ligand complex. These models are further adapted for regression tasks to predict experimental binding affinities and pIC50, which are crucial for drug potency and efficacy. We achieve Pearson correlation coefficients of 0.66 and 0.65 on experimental affinity and 0.50 and 0.51 on pIC50 with GNNF and GNNP, respectively, outperforming similar 2D sequence-based models. Our method can serve as an interpretable and explainable artificial intelligence (AI) tool for predicting the activity, potency, and biophysical properties of lead candidates. To this end, we show the utility of GNNP on SARS-CoV-2 protein targets by screening a large compound library and comparing our predictions with experimentally measured data.
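For readers unfamiliar with the regression metrics mentioned above, here is a minimal sketch of how pIC50 is obtained from an IC50 value (the negative base-10 logarithm of the molar concentration) and how a Pearson correlation coefficient is computed. The IC50 values and predictions below are invented for illustration and are not data from the paper. The function names are likewise hypothetical.

```python
import math


def pic50(ic50_molar):
    """Convert an IC50 given in mol/L to pIC50 = -log10(IC50)."""
    return -math.log10(ic50_molar)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Made-up measurements (mol/L) and made-up model outputs, for illustration only.
measured = [pic50(v) for v in (1e-6, 5e-7, 2e-8, 1e-9)]
predicted = [5.8, 6.5, 7.1, 8.6]
r = pearson_r(measured, predicted)
print(round(r, 3))
```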
For Artificial Intelligence to have a greater impact in biology and medicine, it is crucial that recommendations are both accurate and transparent. In other domains, a neurosymbolic approach of multi-hop reasoning on knowledge graphs has been shown to produce transparent explanations. However, there is a lack of research applying it to complex biomedical datasets and problems. In this paper, the approach is explored for drug discovery to draw solid conclusions on its applicability. For the first time, we systematically apply it to multiple biomedical datasets and recommendation tasks with fair benchmark comparisons. The approach is found to outperform the best baselines by 21.7% on average whilst producing novel, biologically relevant explanations.
High-dimensional classification and feature selection tasks are ubiquitous with the recent advancement in data acquisition technology. In several application areas such as biology, genomics and proteomics, the data are often functional in their nature and exhibit a degree of roughness and non-stationarity. These structures pose additional challenges to commonly used methods that rely mainly on a two-stage approach performing variable selection and classification separately. We propose in this work a novel Gaussian process discriminant analysis (GPDA) that combines these steps in a unified framework. Our model is a two-layer non-stationary Gaussian process coupled with an Ising prior to identify differentially-distributed locations. Scalable inference is achieved via developing a variational scheme that exploits advances in the use of sparse inverse covariance matrices. We demonstrate the performance of our methodology on simulated datasets and two proteomics datasets: breast cancer and SARS-CoV-2. Our approach distinguishes itself by offering explainability as well as uncertainty quantification in addition to low computational cost, which are crucial to increase trust and social acceptance of data-driven tools.
When patients develop acute respiratory failure, accurately identifying the underlying etiology is essential for determining the best treatment, but it can be challenging to differentiate between common diagnoses in clinical practice. Machine learning models could improve medical diagnosis by augmenting clinical decision making and play a role in the diagnostic evaluation of patients with acute respiratory failure. While machine learning models have been developed to identify common findings on chest radiographs (e.g. pneumonia), augmenting these approaches by also analyzing clinically relevant data from the electronic health record (EHR) could aid in the diagnosis of acute respiratory failure. Machine learning models were trained to predict the cause of acute respiratory failure (pneumonia, heart failure, and/or COPD) using chest radiographs and EHR data from patients within an internal cohort using diagnoses based on physician chart review. Models were also tested on patients in an external cohort using discharge diagnosis codes. A model combining chest radiographs and EHR data outperformed models based on each modality alone for pneumonia and COPD. For pneumonia, the combined model AUROC was 0.79 (0.78-0.79), image model AUROC was 0.73 (0.72-0.75), and EHR model AUROC was 0.73 (0.70-0.76); for COPD, combined: 0.89 (0.83-0.91), image: 0.85 (0.77-0.89), and EHR: 0.80 (0.76-0.84); for heart failure, combined: 0.80 (0.77-0.84), image: 0.77 (0.71-0.81), and EHR: 0.80 (0.75-0.82). In the external cohort, performance was consistent for heart failure and COPD, but declined slightly for pneumonia. Overall, machine learning models combining chest radiographs and EHR data can accurately differentiate between common causes of acute respiratory failure. Further work is needed to determine whether these models could aid clinicians in the diagnosis of acute respiratory failure in clinical settings.
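The AUROC values reported above can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes AUROC directly from that definition. The labels and scores are invented for illustration and have no relation to the study cohorts. The function name is hypothetical.

```python
def auroc(labels, scores):
    """AUROC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = diagnosis present) and model scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.8, 0.6]
auc = auroc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a small illustration. Rank-based formulations are usually preferred for large cohorts.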
During the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, polymerase chain reaction (PCR) tests were generally reported only as binary positive or negative outcomes. However, these test results contain a great deal more information than that. As viral load declines exponentially, the PCR cycle threshold (Ct) increases linearly. Hay et al. developed an approach for extracting epidemiological information out of the Ct values obtained from PCR tests used in surveillance for a variety of settings (see the Perspective by Lopman and McQuade). Although there are challenges to relying on single Ct values for individual-level decision-making, even a limited aggregation of data from a population can inform on the trajectory of the pandemic. Therefore, across a population, an increase in aggregated Ct values indicates that a decline in cases is occurring.
Science, abh0635, this issue p. eabh0635; see also abj4185.
### INTRODUCTION
Current approaches to epidemic monitoring rely on case counts, test positivity rates, and reported deaths or hospitalizations. These metrics, however, provide a limited and often biased picture as a result of testing constraints, unrepresentative sampling, and reporting delays. Random cross-sectional virologic surveys can overcome some of these biases by providing snapshots of infection prevalence but currently offer little information on the epidemic trajectory without sampling across multiple time points.
### RATIONALE
We develop a new method that uses information inherent in cycle threshold (Ct) values from reverse transcription quantitative polymerase chain reaction (RT-qPCR) tests to robustly estimate the epidemic trajectory from multiple or even a single cross section of positive samples. Ct values are related to viral loads, which depend on the time since infection; Ct values are generally lower when the time between infection and sample collection is short. Despite variation across individuals, samples, and testing platforms, Ct values provide a probabilistic measure of time since infection. We find that the distribution of Ct values across positive specimens at a single time point reflects the epidemic trajectory: A growing epidemic will necessarily have a high proportion of recently infected individuals with high viral loads, whereas a declining epidemic will have more individuals with older infections and thus lower viral loads. Because of these changing proportions, the epidemic trajectory or growth rate should be inferable from the distribution of Ct values collected in a single cross section, and multiple successive cross sections should enable identification of the longer-term incidence curve. Moreover, understanding the relationship between sample viral loads and epidemic dynamics provides additional insights into why viral loads from surveillance testing may appear higher for emerging viruses or variants and lower for outbreaks that are slowing, even absent changes in individual-level viral kinetics.
### RESULTS
Using a mathematical model for population-level viral load distributions calibrated to known features of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) viral load kinetics, we show that the median and skewness of Ct values in a random sample change over the course of an epidemic. By formalizing this relationship, we demonstrate that Ct values from a single random cross section of virologic testing can estimate the time-varying reproductive number of the virus in a population, which we validate using data collected from comprehensive SARS-CoV-2 testing in long-term care facilities. Using a more flexible approach to modeling infection incidence, we also develop a method that can reliably estimate the epidemic trajectory in even more-complex populations, where interventions may be implemented and relaxed over time. This method performed well in estimating the epidemic trajectory in the state of Massachusetts using routine hospital admissions RT-qPCR testing data, accurately replicating estimates from other sources for the entire state.
### CONCLUSION
This work provides a new method for estimating the epidemic growth rate and a framework for robust epidemic monitoring using RT-qPCR Ct values that are often simply discarded. By deploying single or repeated (but small) random surveillance samples and making the best use of the semiquantitative testing data, we can estimate epidemic trajectories in real time and avoid biases arising from nonrandom samples or changes in testing practices over time. Understanding the relationship between population-level viral loads and the state of an epidemic reveals important implications and opportunities for interpreting virologic surveillance data. It also highlights the need for such surveillance, as these results show how to use it most informatively.
Figure: Ct values reflect the epidemic trajectory and can be used to estimate incidence. (A and B) Whether an epidemic has rising or falling incidence will be reflected in the distribution of times since infection (A), which in turn affects the distribution of Ct values in a surveillance sample (B). (C) These values can be used to assess whether the epidemic is rising or falling and estimate the incidence curve.
Estimating an epidemic’s trajectory is crucial for developing public health responses to infectious diseases, but case data used for such estimation are confounded by variable testing practices. We show that the population distribution of viral loads observed under random or symptom-based surveillance, in the form of cycle threshold (Ct) values obtained from reverse transcription quantitative polymerase chain reaction testing, changes during an epidemic. Thus, Ct values from even limited numbers of random samples can provide improved estimates of an epidemic’s trajectory. Combining data from multiple such samples improves the precision and robustness of this estimation. We apply our methods to Ct values from surveillance conducted during the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic in a variety of settings and offer alternative approaches for real-time estimates of epidemic trajectories for outbreak management and response.
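The central reasoning, that the mix of recent and old infections in a cross section shifts with the epidemic growth rate and moves the Ct distribution with it, can be sketched deterministically. This is a toy illustration under strong assumptions and not the authors' model: viral load is assumed to peak at infection and decay exponentially so that Ct rises linearly with infection age, every infection is assumed detectable for 30 days, and all numeric values and function names are invented.

```python
import math


def age_weights(growth_rate, max_age=30):
    """Share of current infections that are `a` days old under exponential growth.

    Incidence `a` days ago is proportional to exp(-growth_rate * a).
    """
    raw = [math.exp(-growth_rate * a) for a in range(max_age + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def toy_ct(age_days, ct_at_peak=20.0, ct_per_day=0.7):
    """Hypothetical kinetics: Ct rises linearly as viral load decays exponentially."""
    return ct_at_peak + ct_per_day * age_days


def mean_ct(growth_rate, max_age=30):
    """Mean Ct in a random cross section of currently infected individuals."""
    weights = age_weights(growth_rate, max_age)
    return sum(w * toy_ct(a) for a, w in enumerate(weights))


growing = mean_ct(0.1)     # incidence rising
flat = mean_ct(0.0)        # incidence constant
declining = mean_ct(-0.1)  # incidence falling
print(growing, flat, declining)
```

Under these assumptions the mean Ct is lowest when incidence is rising and highest when it is falling, which matches the direction of the relationship described in the abstract. Inferring a growth rate from observed Ct values would require inverting a calibrated version of such a model, which is not attempted here.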
Accurate and trustworthy epidemic forecasting is an important problem that has impact on public health planning and disease mitigation. Most existing epidemic forecasting models disregard uncertainty quantification, resulting in miscalibrated predictions. Recent works in deep neural models for uncertainty-aware time-series forecasting also have several limitations; e.g., it is difficult to specify meaningful priors in Bayesian NNs, while methods like deep ensembling are computationally expensive in practice. In this paper, we fill this important gap. We model the forecasting task as a probabilistic generative process and propose a functional neural process model called EPIFNP, which directly models the probability density of the forecast value. EPIFNP leverages a dynamic stochastic correlation graph to model the correlations between sequences in a non-parametric way, and designs different stochastic latent variables to capture functional uncertainty from different perspectives. Our extensive experiments in a real-time flu forecasting setting show that EPIFNP significantly outperforms previous state-of-the-art models in both accuracy and calibration metrics, up to 2.5x in accuracy and 2.4x in calibration. Additionally, due to properties of its generative process, EPIFNP learns the relations between the current season and similar patterns of historical seasons, enabling interpretable forecasts. Beyond epidemic forecasting, EPIFNP can be of independent interest for advancing principled uncertainty quantification in deep sequential models for predictive analytics.
In this paper, we take a human-centered approach to interpretable machine learning. First, drawing inspiration from the study of explanation in philosophy, cognitive science, and the social sciences, we propose a list of design principles for machine-generated explanations that are meaningful to humans. Using the concept of weight of evidence from information theory, we develop a method for producing explanations that adhere to these principles. We show that this method can be adapted to handle high-dimensional, multi-class settings, yielding a flexible meta-algorithm for generating explanations. We demonstrate that these explanations can be estimated accurately from finite samples and are robust to small perturbations of the inputs. We also evaluate our method through a qualitative user study with machine learning practitioners, where we observe that the resulting explanations are usable despite some participants struggling with background concepts like prior class probabilities. Finally, we conclude by surfacing design implications for interpretability tools.
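As background for the weight-of-evidence concept mentioned above, the following minimal sketch shows the classical binary form: the log ratio of the likelihood of the evidence under a hypothesis to its likelihood under the alternative, which adds to the prior log-odds to give the posterior log-odds. It does not reproduce the paper's method for high-dimensional, multi-class settings. The function names and probabilities are hypothetical.

```python
import math


def weight_of_evidence(p_e_given_h, p_e_given_not_h):
    """Weight of evidence of e in favor of h: log P(e | h) / P(e | not h)."""
    return math.log(p_e_given_h / p_e_given_not_h)


def posterior_prob(prior, woe):
    """Posterior P(h | e) from the prior P(h) and the weight of evidence.

    Uses: posterior log-odds = prior log-odds + weight of evidence.
    """
    prior_log_odds = math.log(prior / (1.0 - prior))
    post_log_odds = prior_log_odds + woe
    return 1.0 / (1.0 + math.exp(-post_log_odds))


# Made-up likelihoods: the evidence is four times as likely under h.
woe = weight_of_evidence(0.8, 0.2)
post = posterior_prob(0.5, woe)
print(round(woe, 3), round(post, 3))
```

The additive form is part of what makes the quantity readable as an explanation: each piece of evidence shifts the log-odds by an amount that does not depend on the prior class probability.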