Science


Response to Comment on "Predicting reaction performance in C-N cross-coupling using machine learning"

Science

We demonstrate that the chemical-feature model described in our original paper is distinguishable from the nongeneralizable models introduced by Chuang and Keiser. Furthermore, the chemical-feature model significantly outperforms these models in out-of-sample predictions, justifying the use of chemical featurization from which machine learning models can extract meaningful patterns in the dataset, as originally described. In Ahneman et al. (1), we showed that a random forest (RF) algorithm built using computationally derived chemical descriptors for the components of a Pd-catalyzed C–N cross-coupling reaction (aryl halide, ligand, base, and potentially inhibitory isoxazole additive) could identify predictive and meaningful relationships in a multidimensional chemical dataset comprising 4608 reactions. Chuang and Keiser (2) built alternative models using random barcode features ("straw" models), wherein the chemical descriptors are replaced with random numbers selected from a standard normal distribution. One-hot encoded features, wherein each reagent acts as a categorical descriptor and is marked as absent or present, were also evaluated.
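
The two control featurizations named above can be illustrated with a minimal sketch. It is not the code used by either group: the reagent labels, function names, feature count, and seed are all hypothetical, and only the ideas (a present/absent categorical vector per reagent, and a vector of standard-normal random numbers per reagent) come from the text.

```python
import random


def one_hot_features(reagents):
    """Map each distinct reagent to a 0/1 vector with a single 1 marking it."""
    ordered = sorted(set(reagents))
    return {r: [1 if r == other else 0 for other in ordered] for r in ordered}


def random_barcode_features(reagents, n_features, seed=0):
    """Map each distinct reagent to a fixed vector of standard-normal draws.

    The vector carries no chemical information; it only tags the reagent.
    """
    rng = random.Random(seed)  # seeded so the barcodes are reproducible
    return {
        r: [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for r in sorted(set(reagents))
    }


# Hypothetical reagent labels, used only for illustration.
ligands = ["ligand_A", "ligand_B", "ligand_C"]
one_hot = one_hot_features(ligands)
barcodes = random_barcode_features(ligands, n_features=4)
```

In both schemes every reagent receives a unique identifier, but neither encodes any similarity between reagents, which is why the comparison with chemical descriptors hinges on out-of-sample predictions.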


Comment on "Predicting reaction performance in C-N cross-coupling using machine learning"

Science

Ahneman et al. (Reports, 13 April 2018) applied machine learning models to predict C–N cross-coupling reaction yields. The models use atomic, electronic, and vibrational descriptors as input features. However, the experimental design is insufficient to distinguish models trained on chemical features from those trained solely on random-valued features in retrospective and prospective test scenarios, thus failing classical controls in machine learning. A recent report by Ahneman et al. (1) describes a machine learning approach for modeling chemical reactions with data collected through ultrahigh-throughput experimentation. The Buchwald-Hartwig coupling (2) is used as a model reaction, with a Glorius interference approach (3) to study reaction poisoning by isoxazole additives.



Quantifying reputation and success in art

Science

Art appreciation is highly subjective. Fraiberger et al. used an extensive record of exhibition and auction data to study and model the career trajectory of individual artists relative to a network of galleries and museums. They observed a lock-in effect among highly reputed artists who started their career in high-prestige institutions and a long struggle for access to elite institutions among those who started their career at the network periphery. In areas of human activity where performance is difficult to quantify in an objective fashion, reputation and networks of influence play a key role in determining access to resources and rewards. To understand the role of these factors, we reconstructed the exhibition history of half a million artists, mapping out the coexhibition network that captures the movement of art between institutions.
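
One way to picture a coexhibition network is as a set of weighted, directed links between institutions, built from each artist's dated exhibition record. The sketch below is an assumption about how such a network could be assembled, not the authors' procedure: the edge definition (consecutive exhibitions by the same artist), the function name, and the toy records are all hypothetical.

```python
from collections import Counter


def coexhibition_edges(histories):
    """Count directed (earlier institution, later institution) pairs.

    `histories` maps an artist to a list of (year, institution) records.
    An edge is counted for each pair of consecutive exhibitions by one artist.
    """
    edges = Counter()
    for shows in histories.values():
        ordered = [institution for _, institution in sorted(shows)]
        for src, dst in zip(ordered, ordered[1:]):
            if src != dst:  # ignore repeat shows at the same institution
                edges[(src, dst)] += 1
    return edges


# Hypothetical toy records, used only for illustration.
histories = {
    "artist_1": [(2001, "gallery_X"), (2003, "museum_Y")],
    "artist_2": [(1999, "gallery_X"), (2000, "museum_Y"), (2004, "museum_Z")],
}
edges = coexhibition_edges(histories)
```

Here the edge from `gallery_X` to `museum_Y` gets weight 2 because both toy artists moved between those institutions in that order.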


The gut microbiota at the intersection of diet and human health

Science

Diet affects multiple facets of human health and is inextricably linked to chronic metabolic conditions such as obesity, type 2 diabetes, and cardiovascular disease. Dietary nutrients are essential not only for human health but also for the health and survival of the trillions of microbes that reside within the human intestines. Diet is a key component of the relationship between humans and their microbial residents; gut microbes use ingested nutrients for fundamental biological processes, and the metabolic outputs of those processes may have important impacts on human physiology. Studies in humans and animal models are beginning to unravel the underpinnings of this relationship, and increasing evidence suggests that it may underlie some of the broader effects of diet on human health and disease. Controversy regarding what constitutes a healthful diet has persisted since the advent of nutrition as a scientific discipline and establishment of government nutritional guidelines (1). The emergence of the gut microbiota as a key regulator of health and disease has further complicated this issue. A mutualistic relation exists between diet and the gut microbiota so that dietary factors are among the most potent modulators of microbiota composition and function. The human gut microbiota consists of trillions of microbial cells and thousands of bacterial species. The specific compositional features differ among individuals, and although the mature microbiota is fairly resilient, it can be altered within individuals by both internal and external stimuli.


Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes

Science

During outbreaks of mysterious infections, events can rapidly become dangerous and confusing. A combination of increasing experience with outbreaks and genome-sequencing technology now means the pathogen can often be identified within days. But for some of the most frightening viral pathogens, the originating hosts and possible vectors often remain obscure. Babayan et al. took sequence data from more than 500 single-stranded RNA viruses (see the Perspective by Woolhouse) and used machine-learning algorithms to extract evolutionary signals imprinted in the virus sequence that offer information about its original hosts and about whether an arthropod vector, and of what type, plays a part in the virus's natural ecology. Science, this issue p. 577; see also p. 524


Sources of human viruses

Science

Most emerging infectious diseases are caused by RNA viruses (1). Many of these that are newly found in humans have a natural mammal or bird reservoir; some are transmitted by arthropod vectors, such as mosquitos (2). If we do not know the reservoir host and/or vector, it is harder to identify individuals and populations at greatest risk of infection and to design an effective public health response. On page 577 of this issue, Babayan et al. (3) report their efforts to predict the reservoir hosts and vectors of human RNA viruses by applying machine-learning algorithms to virus genome sequence data.


The chromatin accessibility landscape of primary human cancers

Science

The Cancer Genome Atlas (TCGA) provides a high-quality resource of molecular data on a large variety of human cancers. Corces et al. used a recently modified assay to profile chromatin accessibility to determine the accessible chromatin landscape in 410 TCGA samples from 23 cancer types (see the Perspective by Taipale). When the data were integrated with other omics data available for the same tumor samples, inherited risk loci for cancer predisposition were revealed, transcription factors and enhancers driving molecular subtypes of cancer with patient survival differences were identified, and noncoding mutations associated with clinical prognosis were discovered. Science, this issue p. eaav1898; see also p. 401 Cancer is one of the leading causes of death worldwide. Although the 2% of the human genome that encodes proteins has been extensively studied, much remains to be learned about the noncoding genome and gene regulation in cancer. Genes are turned on and off in the proper cell types and cell states by transcription factor (TF) proteins acting on DNA regulatory elements that are scattered over the vast noncoding genome and exert long-range influences. The Cancer Genome Atlas (TCGA) is a global consortium that aims to accelerate the understanding of the molecular basis of cancer. TCGA has systematically collected DNA mutation, methylation, RNA expression, and other comprehensive datasets from primary human cancer tissue. TCGA has served as an invaluable resource for the identification of genomic aberrations, altered transcriptional networks, and cancer subtypes. Nonetheless, the gene regulatory landscapes of these tumors have largely been inferred through indirect means. A hallmark of active DNA regulatory elements is chromatin accessibility. Eukaryotic genomes are compacted in chromatin, a complex of DNA and proteins, and only the active regulatory elements are accessible by the cell's machinery such as TFs. 
ATAC-seq enables the genome-wide profiling of TF binding events that orchestrate gene expression programs and give a cell its identity. We generated high-quality ATAC-seq data in 410 tumor samples from TCGA, identifying diverse regulatory landscapes across 23 cancer types. These chromatin accessibility profiles identify cancer- and tissue-specific DNA regulatory elements that enable classification of tumor subtypes with newly recognized prognostic importance. We identify distinct TF activities in cancer based on differences in the inferred patterns of TF-DNA interaction and gene expression. Genome-wide correlation of gene expression and chromatin accessibility predicts tens of thousands of putative interactions between distal regulatory elements and gene promoters, including key oncogenes and targets in cancer immunotherapy, such as MYC, SRC, BCL2, and PDL1.
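
The idea of linking distal regulatory elements to genes by correlating accessibility with expression across samples can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the threshold, the function names, and the toy values are hypothetical, and a real analysis would also restrict candidate pairs by genomic distance and test for statistical significance.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def link_peaks_to_genes(accessibility, expression, threshold=0.8):
    """Return (peak, gene, r) for pairs whose correlation meets the threshold.

    Both inputs map a name to per-sample values in the same sample order.
    """
    links = []
    for peak, acc in accessibility.items():
        for gene, expr in expression.items():
            r = pearson(acc, expr)
            if r >= threshold:
                links.append((peak, gene, r))
    return links


# Hypothetical values for four samples, used only for illustration.
accessibility = {
    "peak_1": [1.0, 2.0, 3.0, 4.0],
    "peak_2": [4.0, 1.0, 3.0, 2.0],
}
expression = {"gene_A": [2.1, 3.9, 6.2, 8.0]}
links = link_peaks_to_genes(accessibility, expression)
```

In this toy input only `peak_1` tracks `gene_A` across samples, so it is the only putative link reported.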


Relationship of gender differences in preferences to economic development and gender equality

Science

Preferences concerning time, risk, and social interactions systematically shape human behavior and contribute to differential economic and social outcomes between women and men. We present a global investigation of gender differences in six fundamental preferences. Our data consist of measures of willingness to take risks, patience, altruism, positive and negative reciprocity, and trust for 80,000 individuals in 76 representative country samples. Gender differences in preferences were positively related to economic development and gender equality. This finding suggests that greater availability of and gender-equal access to material and social resources favor the manifestation of gender-differentiated preferences across countries. Fundamental preferences such as altruism, risk-taking, reciprocity, patience, or trust constitute the foundation of choice theories and govern human behavior.

Figure caption: The relationships are predicted from local polynomial regressions. Shaded areas indicate 95% confidence intervals.


Pan-tumor genomic biomarkers for PD-1 checkpoint blockade-based immunotherapy

Science

Clinical trial data can provide a wealth of information about how drugs work. Yet such information often belongs to pharmaceutical companies and is rarely accessible to the scientific community at large. Cristescu et al. provide exploratory analysis of a cancer genomics dataset, collected from four separate clinical trials of Merck's PD-1 immunotherapy drug, pembrolizumab. This informative public resource examines more than 300 patient samples representing 22 different tumor types. Two widely used signatures that currently predict immunotherapy response are tumor mutational burden and a "hot" T cell–inflamed microenvironment. The study analyzed these two proposed biomarkers in combination to see what predictive clinical utility they may hold. Immunotherapy targeting the programmed cell death protein–1 (PD-1) axis elicits durable antitumor responses in multiple cancer types. However, clinical responses vary, and biomarkers predictive of response may help to identify patients who will derive the greatest therapeutic benefit. Clinically validated biomarkers predictive of response to the anti–PD-1 monoclonal antibody pembrolizumab include PD-1 ligand 1 (PD-L1) expression in specific cancers and high microsatellite instability (MSI-H) regardless of tumor type. Tumor mutational burden (TMB) and T cell–inflamed gene expression profile (GEP) are emerging predictive biomarkers for pembrolizumab. Both PD-L1 and GEP are inflammatory biomarkers indicative of a T cell–inflamed tumor microenvironment (TME), whereas TMB and MSI-H are indirect measures of tumor antigenicity generated by somatic tumor mutations.