Goto

Collaborating Authors

 allele


Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA

Huang, Mingyu, Zhou, Shasha, Li, Ke

arXiv.org Artificial Intelligence

Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from mutagensis data in diverse modalities (e.g., DNA, RNA, protein, and beyond) with up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. In addition, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the codes and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.


Ancient origin of an urban underground mosquito Science

Science

Understanding how life is adapting to urban environments represents an important challenge in evolutionary biology. In this work, we investigate a widely cited example of urban adaptation, Culex pi...



Generation of structure-guided pMHC-I libraries using Diffusion Models

Mares, Sergio, Weinberger, Ariel Espinoza, Ioannidis, Nilah M.

arXiv.org Artificial Intelligence

Personalized vaccines and T-cell immunotherapies depend critically on identifying peptide-MHC class I (pMHC-I) interactions capable of eliciting potent immune responses. However, current benchmarks and models inherit biases present in mass-spectrometry and binding-assay datasets, limiting discovery of novel peptide ligands. To address this issue, we introduce a structure-guided benchmark of pMHC-I peptides designed using diffusion models conditioned on crystal structure interaction distances. Spanning twenty high-priority HLA alleles, this benchmark is independent of previously characterized peptides yet reproduces canonical anchor residue preferences, indicating structural generalization without experimental dataset bias. Using this resource, we demonstrate that state-of-the-art sequence-based predictors perform poorly at recognizing the binding potential of these structurally stable designs, indicating allele-specific limitations invisible in conventional evaluations. Our geometry-aware design pipeline yields peptides with high predicted structural integrity and higher residue diversity than existing datasets, representing a key resource for unbiased model training and evaluation. Our code, and data are available at: https://github.com/sermare/struct-mhc-dev.


deepNoC: A deep learning system to assign the number of contributors to a short tandem repeat DNA profile

Taylor, Duncan, Humphries, Melissa A.

arXiv.org Artificial Intelligence

A common task in forensic biology is to interpret and evaluate short tandem repeat DNA profiles. The first step in these interpretations is to assign a number of contributors to the profiles, a task that is most often performed manually by a scientist using their knowledge of DNA profile behaviour. Studies using constructed DNA profiles have shown that as DNA profiles become more complex, and the number of DNA-donating individuals increases, the ability for scientists to assign the target number. There have been a number of machine learning algorithms developed that seek to assign the number of contributors to a DNA profile, however due to practical limitations in being able to generate DNA profiles in a laboratory, the algorithms have been based on summaries of the available information. In this work we develop an analysis pipeline that simulates the electrophoretic signal of an STR profile, allowing virtually unlimited, pre-labelled training material to be generated. We show that by simulating 100 000 profiles and training a number of contributors estimation tool using a deep neural network architecture (in an algorithm named deepNoC) that a high level of performance is achieved (89% for 1 to 10 contributors). The trained network can then have fine-tuning training performed with only a few hundred profiles in order to achieve the same accuracy within a specific laboratory. We also build into deepNoC secondary outputs that provide a level of explainability to a user of algorithm, and show how they can be displayed in an intuitive manner.


A novel application of Shapley values for large multidimensional time-series data: Applying explainable AI to a DNA profile classification neural network

Elborough, Lauren, Taylor, Duncan, Humphries, Melissa

arXiv.org Machine Learning

The application of Shapley values to high-dimensional, time-series-like data is computationally challenging - and sometimes impossible. For $N$ inputs the problem is $2^N$ hard. In image processing, clusters of pixels, referred to as superpixels, are used to streamline computations. This research presents an efficient solution for time-seres-like data that adapts the idea of superpixels for Shapley value computation. Motivated by a forensic DNA classification example, the method is applied to multivariate time-series-like data whose features have been classified by a convolutional neural network (CNN). In DNA processing, it is important to identify alleles from the background noise created by DNA extraction and processing. A single DNA profile has $31,200$ scan points to classify, and the classification decisions must be defensible in a court of law. This means that classification is routinely performed by human readers - a monumental and time consuming process. The application of a CNN with fast computation of meaningful Shapley values provides a potential alternative to the classification. This research demonstrates the realistic, accurate and fast computation of Shapley values for this massive task


PhyloLM : Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks

Yax, Nicolas, Oudeyer, Pierre-Yves, Palminteri, Stefano

arXiv.org Artificial Intelligence

This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metrics based on the similarity of LLMs' output. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.


Bayesian Pedigree Analysis using Measure Factorization

Neural Information Processing Systems

Pedigrees, or family trees, are directed graphs used to identify sites of the genome that are correlated with the presence or absence of a disease. With the advent of genotyping and sequencing technologies, there has been an explosion in the amount of data available, both in the number of individuals and in the number of sites.


DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Zheng, Ge, Yang, Bin, Tang, Jiajin, Zhou, Hong-Yu, Yang, Sibei

arXiv.org Artificial Intelligence

A long-standing goal of AI systems is to perform complex multimodal reasoning like humans. Recently, large language models (LLMs) have made remarkable strides in such multi-step reasoning on the language modality solely by leveraging the chain of thought (CoT) to mimic human thinking. However, the transfer of these advancements to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and the limitations in terms of flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of these challenges posed by multimodality and presents two key insights: "keeping critical thinking" and "letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this study proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning learning, significantly outperforming state-of-the-art methods but also exhibit impressive generalizability and explainability.


MultiQG-TI: Towards Question Generation from Multi-modal Sources

Wang, Zichao, Baraniuk, Richard

arXiv.org Artificial Intelligence

We study the new problem of automatic question generation (QG) from multi-modal sources containing images and texts, significantly expanding the scope of most of the existing work that focuses exclusively on QG from only textual sources. We propose a simple solution for our new problem, called MultiQG-TI, which enables a text-only question generator to process visual input in addition to textual input. Specifically, we leverage an image-to-text model and an optical character recognition model to obtain the textual description of the image and extract any texts in the image, respectively, and then feed them together with the input texts to the question generator. We only fine-tune the question generator while keeping the other components fixed. On the challenging ScienceQA dataset, we demonstrate that MultiQG-TI significantly outperforms ChatGPT with few-shot prompting, despite having hundred-times less trainable parameters. Additional analyses empirically confirm the necessity of both visual and textual signals for QG and show the impact of various modeling choices.