allele
Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA
Huang, Mingyu, Zhou, Shasha, Li, Ke
Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from mutagensis data in diverse modalities (e.g., DNA, RNA, protein, and beyond) with up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. In addition, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the codes and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.
- North America > United States (0.14)
- Europe > United Kingdom > Wales (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- (6 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Overview (0.92)
- Europe > France (0.14)
- Africa > Middle East > Egypt (0.05)
- Europe > Sweden (0.05)
- (28 more...)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Transportation > Ground (0.68)
- Government > Regional Government (0.67)
- North America > United States > North Carolina (0.05)
- Antarctica (0.04)
- Europe (0.04)
- (16 more...)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.45)
- Health & Medicine (0.46)
- Education (0.46)
- Information Technology > Security & Privacy (0.45)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Generation of structure-guided pMHC-I libraries using Diffusion Models
Mares, Sergio, Weinberger, Ariel Espinoza, Ioannidis, Nilah M.
Personalized vaccines and T-cell immunotherapies depend critically on identifying peptide-MHC class I (pMHC-I) interactions capable of eliciting potent immune responses. However, current benchmarks and models inherit biases present in mass-spectrometry and binding-assay datasets, limiting discovery of novel peptide ligands. To address this issue, we introduce a structure-guided benchmark of pMHC-I peptides designed using diffusion models conditioned on crystal structure interaction distances. Spanning twenty high-priority HLA alleles, this benchmark is independent of previously characterized peptides yet reproduces canonical anchor residue preferences, indicating structural generalization without experimental dataset bias. Using this resource, we demonstrate that state-of-the-art sequence-based predictors perform poorly at recognizing the binding potential of these structurally stable designs, indicating allele-specific limitations invisible in conventional evaluations. Our geometry-aware design pipeline yields peptides with high predicted structural integrity and higher residue diversity than existing datasets, representing a key resource for unbiased model training and evaluation. Our code, and data are available at: https://github.com/sermare/struct-mhc-dev.
- North America > United States > California > Alameda County > Berkeley (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
deepNoC: A deep learning system to assign the number of contributors to a short tandem repeat DNA profile
Taylor, Duncan, Humphries, Melissa A.
A common task in forensic biology is to interpret and evaluate short tandem repeat DNA profiles. The first step in these interpretations is to assign a number of contributors to the profiles, a task that is most often performed manually by a scientist using their knowledge of DNA profile behaviour. Studies using constructed DNA profiles have shown that as DNA profiles become more complex, and the number of DNA-donating individuals increases, the ability for scientists to assign the target number. There have been a number of machine learning algorithms developed that seek to assign the number of contributors to a DNA profile, however due to practical limitations in being able to generate DNA profiles in a laboratory, the algorithms have been based on summaries of the available information. In this work we develop an analysis pipeline that simulates the electrophoretic signal of an STR profile, allowing virtually unlimited, pre-labelled training material to be generated. We show that by simulating 100 000 profiles and training a number of contributors estimation tool using a deep neural network architecture (in an algorithm named deepNoC) that a high level of performance is achieved (89% for 1 to 10 contributors). The trained network can then have fine-tuning training performed with only a few hundred profiles in order to achieve the same accuracy within a specific laboratory. We also build into deepNoC secondary outputs that provide a level of explainability to a user of algorithm, and show how they can be displayed in an intuitive manner.
- Oceania > Australia > South Australia > Adelaide (0.14)
- Oceania > New Zealand (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- North America > United States > California > Los Angeles County > Santa Monica (0.04)
A novel application of Shapley values for large multidimensional time-series data: Applying explainable AI to a DNA profile classification neural network
Elborough, Lauren, Taylor, Duncan, Humphries, Melissa
The application of Shapley values to high-dimensional, time-series-like data is computationally challenging - and sometimes impossible. For $N$ inputs the problem is $2^N$ hard. In image processing, clusters of pixels, referred to as superpixels, are used to streamline computations. This research presents an efficient solution for time-seres-like data that adapts the idea of superpixels for Shapley value computation. Motivated by a forensic DNA classification example, the method is applied to multivariate time-series-like data whose features have been classified by a convolutional neural network (CNN). In DNA processing, it is important to identify alleles from the background noise created by DNA extraction and processing. A single DNA profile has $31,200$ scan points to classify, and the classification decisions must be defensible in a court of law. This means that classification is routinely performed by human readers - a monumental and time consuming process. The application of a CNN with fast computation of meaningful Shapley values provides a potential alternative to the classification. This research demonstrates the realistic, accurate and fast computation of Shapley values for this massive task
- Research Report (0.64)
- Overview > Innovation (0.40)
PhyloLM : Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks
Yax, Nicolas, Oudeyer, Pierre-Yves, Palminteri, Stefano
This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metrics based on the similarity of LLMs' output. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.
- Europe > France > Île-de-France > Paris > Paris (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- Europe > France > Nouvelle-Aquitaine > Gironde > Bordeaux (0.04)
Bayesian Pedigree Analysis using Measure Factorization
Pedigrees, or family trees, are directed graphs used to identify sites of the genome that are correlated with the presence or absence of a disease. With the advent of genotyping and sequencing technologies, there has been an explosion in the amount of data available, both in the number of individuals and in the number of sites.
- North America > Canada > British Columbia (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Oregon > Benton County > Corvallis (0.04)
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
Zheng, Ge, Yang, Bin, Tang, Jiajin, Zhou, Hong-Yu, Yang, Sibei
A long-standing goal of AI systems is to perform complex multimodal reasoning like humans. Recently, large language models (LLMs) have made remarkable strides in such multi-step reasoning on the language modality solely by leveraging the chain of thought (CoT) to mimic human thinking. However, the transfer of these advancements to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and the limitations in terms of flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of these challenges posed by multimodality and presents two key insights: "keeping critical thinking" and "letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this study proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning learning, significantly outperforming state-of-the-art methods but also exhibit impressive generalizability and explainability.
- North America > United States > North Carolina (0.05)
- Antarctica (0.04)
- Europe (0.04)
- (16 more...)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.65)
- Health & Medicine (0.46)
- Education (0.46)
- Information Technology > Security & Privacy (0.45)
MultiQG-TI: Towards Question Generation from Multi-modal Sources
Wang, Zichao, Baraniuk, Richard
We study the new problem of automatic question generation (QG) from multi-modal sources containing images and texts, significantly expanding the scope of most of the existing work that focuses exclusively on QG from only textual sources. We propose a simple solution for our new problem, called MultiQG-TI, which enables a text-only question generator to process visual input in addition to textual input. Specifically, we leverage an image-to-text model and an optical character recognition model to obtain the textual description of the image and extract any texts in the image, respectively, and then feed them together with the input texts to the question generator. We only fine-tune the question generator while keeping the other components fixed. On the challenging ScienceQA dataset, we demonstrate that MultiQG-TI significantly outperforms ChatGPT with few-shot prompting, despite having hundred-times less trainable parameters. Additional analyses empirically confirm the necessity of both visual and textual signals for QG and show the impact of various modeling choices.
- Overview (0.86)
- Research Report (0.82)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- (2 more...)