compound
scGeneScope: ATreatment-Matched Single Cell Imaging and Transcriptomics Dataset and Benchmark for Treatment Response Modeling
Understanding cellular responses to chemical interventions is critical to the discovery of effective therapeutics. Because individual biological techniques often measure only one axis of cellular response at a time, high-quality multimodal datasets are needed to unlock a holistic understanding of how cells respond to treatments and to advance computational methods that integrate modalities. However, many techniques destroy cells and thus preclude paired measurements, and attempts to match disparate unimodal datasets are often confounded by data being generated in incompatible experimental settings. Here we introduce scGeneScope, a multimodal single-cell RNA sequencing (scRNA-seq) and Cell Painting microscopy image dataset conditionally paired by chemical treatment, designed to facilitate the development and benchmarking of unimodal, multimodal, and multiple profile machine learning methods for cellular profiling.
ADetails on the models and benchmarks862
For regression on the dataset, we perform leave-one-out cross validation. For the single solvents,865 we leave out one solvent at a time. For the full data, we leave out one solvent ramp at a time. We866 measure the performance of the model on each leave-one-out data split, then take the mean of their867 performance across the dataset. We exclude any experiments involving acetonitrile and acetic acid,868 due to the observed side-reactions.
ADataset for Distilling Knowledge Priors from Literature for Therapeutic Design
AI-driven discovery can greatly reduce design time and enhance new therapeutics' effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis across diverse models on the GuacaMol benchmark using supervised classifiers, over 60% of molecules proposed had a high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts, and appropriate entity representations (i.e.
OligoGym: Curated Datasets and Benchmarks for Oligonucleotide Drug Discovery
Oligonucleotide therapeutics offer great potential to address previously undruggable targets and enable personalized medicine. However, their progress is often hindered by insufficient safety and efficacy profiles. Predictive modeling and machine learning could significantly accelerate oligonucleotide drug discovery by identifying suboptimal compounds early on, but their application in this area lags behind other modalities. A key obstacle to the adoption of machine learning in the field is the scarcity of readily accessible and standardized datasets for model development, as data are often scattered across diverse experiments with inconsistent molecular representations. To overcome this challenge, we introduce OligoGym, a curated collection of standardized, machine learning-ready datasets encompassing various oligonucleotide therapeutic modalities and endpoints. We used OligoGym to benchmark diverse classical and deep learning methods, establishing performance baselines for each dataset across different featurization techniques, model configurations, and splitting strategies. Our work represents a crucial first step in creating a more unified framework for oligonucleotide therapeutic dataset generation and model training.
Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model
Understanding molecules is key to understanding organisms and driving advances in drug discovery, requiring interdisciplinary knowledge across chemistry and biology. Although large molecular language models have achieved notable success in task transfer, they often struggle to accurately analyze molecular features due to limited knowledge and reasoning capabilities. To address this issue, we present Mol-LLaMA, a large molecular language model that grasps the general knowledge centered on molecules and exhibits explainability and reasoning ability. To this end, we design key data types that encompass the fundamental molecular features, taking into account the essential abilities for molecular reasoning. Further, to improve molecular understanding, we propose a module that integrates complementary information from different molecular encoders, leveraging the distinct advantages of molecular representations. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and providing informative responses, implying its potential as a general-purpose assistant for molecular analysis. Our project page is at https://mol-llama.github.io/.
ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions
Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks, and ensures response precision through tool planning & distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.
David Sinclair plans to test whole-body rejuvenation drugs in the XPrize competition
The outspoken longevity scientist David Sinclair has been predicting that one day, you'll go to the doctor and get a prescription that will make you 10 years younger. Now has learned that he has plans to launch human tests of an oral reprogramming drug as part of a $101 million competition organized by the XPrize Foundation. The foundation is offering cash awards to teams able to "restore" a person to an earlier apparent age, as measured by improvements in immune, cognitive, and muscle function. The grand prize goes to any team able to show a 10-year (or greater) relative improvement after one year of treatment. Reached by phone, Sinclair, a biologist at Harvard Medical School, confirmed that he plans to give an oral drug mixture to volunteers in a bid to seek "evidence for age restoration in humans."
Deployment-complete benchmarking
Mansouri, El Mustapha, Arai, Keigo
Benchmarks increasingly guide deployment, procurement and scientific screening, yet a score supports only the response it records, not necessarily the deployment action. We introduce deployment-complete benchmarking, which tests whether benchmark evidence determines a deployment action. A benchmark is complete for a claim exactly when the action is constant on each evidence fiber; mixed fibers expose missing deployment information, and completion curves quantify the evidence required to resolve ambiguity. In controlled response spaces, benchmark-channel conformal coverage of 94.98% transferred poorly to an unmeasured deployment channel (10.07%), whereas response-rank intervals achieved 94.91% coverage; even zero benchmark error certified only 45.4% of candidates at the largest residual size. Public audits revealed incompleteness, including 97.9% mixed Tox21 fibers and zero median certifiable fraction in main Matbench and JARVIS audits. In held-out replays, certify-then-acquire reduced false decisions from 1.19% to 0.027% in Tox21 and from 20.3% to 0.128% in JARVIS, while changing model choice and identifying deployment-relevant probes. Deployment-ready benchmarks should report evidence, supported actions, ambiguity and completion cost rather than scores alone.
Your next sunscreen could be made from E. coli
Science Biology Your next sunscreen could be made from E. coli A chemical compound inside the bacterium may offer an eco-friendly way to block harmful UV rays. More information Adding us as a Preferred Source in Google by using this link indicates that you would like to see more of our content in Google News results. Scientists are turning to nature for eco-friendly sunscreens. Breakthroughs, discoveries, and DIY tips sent six days a week. Let's face it, sunscreen is important to our health, but can really be a drag.
Path-Based Gradient Boosting for Graph-Level Prediction
Meggio, Claudio, Pensar, Johan, De Bin, Riccardo
We propose PathBoost, a gradient tree boosting method for graph-level classification and regression that learns discriminative path-based features directly from the input graph structure. Building on a previous work, which was tailored to a specific chemistry application, PathBoost introduces three key extensions: (i) adaptation to binary classification through gradient boosting with a logistic loss, (ii) incorporation of multiple node and edge attributes into the path feature space via a prefix-based decomposition, and (iii) automatic anchor node selection based on categorical attribute diversity, eliminating the need for the user to specify the starting point of the considered path features. We compared PathBoost to graph neural networks and graph kernel approaches on several benchmark datasets, obtaining better results in half of them, and comparable results in the rest. PathBoost shows better performances on graphs with larger average node counts. Overall, the results demonstrate that path-based boosting methods can be competitive with more complex black-box approaches.