Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling
Egli, Eric, Manica, Matteo, Born, Jannis
Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. We therefore present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that enables training with context windows of $5$M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generation efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q\&A tasks and find that, despite operating on serialized images and lacking an encoder, an MBLM trained with pure next-token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: https://github.com/ai4sd/multiscale-byte-lm
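To make the hierarchy concrete, here is a minimal PyTorch sketch of a two-stage hierarchical byte decoder in the spirit of MBLM (not the released mblm API): a global Transformer contextualizes pooled patch embeddings, and a local Transformer predicts the bytes within each patch. Patch size, mean-pooling, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

def causal_mask(n, device):
    # Upper-triangular -inf mask so position i only attends to positions <= i.
    return torch.triu(torch.full((n, n), float("-inf"), device=device), diagonal=1)

class TwoStageByteDecoder(nn.Module):
    def __init__(self, patch_size=8, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(256, d_model)  # one embedding per byte value
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.global_stage = nn.TransformerEncoder(layer, n_layers)  # across patches
        self.local_stage = nn.TransformerEncoder(layer, n_layers)   # within a patch
        self.head = nn.Linear(d_model, 256)

    def forward(self, byte_ids):  # (B, S) byte ids, S divisible by patch_size
        B, S = byte_ids.shape
        P, N = self.patch_size, S // self.patch_size
        x = self.byte_embed(byte_ids)                      # (B, S, D)
        patches = x.view(B, N, P, -1).mean(dim=2)          # pooled patch embeddings
        g = self.global_stage(patches, mask=causal_mask(N, x.device))
        # Shift right: bytes in patch i may only see global context up to patch i-1.
        g = torch.cat([torch.zeros_like(g[:, :1]), g[:, :-1]], dim=1)
        local_in = x.view(B * N, P, -1) + g.reshape(B * N, 1, -1)
        h = self.local_stage(local_in, mask=causal_mask(P, x.device))
        return self.head(h).view(B, S, 256)                # next-byte logits

logits = TwoStageByteDecoder()(torch.randint(0, 256, (2, 64)))  # toy 64-byte streams

Because the expensive local stage attends only within a patch, the attention cost grows with patch size rather than with the full stream length, which is what makes million-byte context windows tractable.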
Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models
Zausinger, Jonas, Pennig, Lars, Chlodny, Kacper, Limbach, Vincent, Ketteler, Anna, Prein, Thorben, Singh, Vishwa Mohan, Danziger, Michael Morris, Born, Jannis
While language models excel at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetic. This is particularly relevant for scientific datasets, where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, we present two versions of a number token loss. The first is based on an $L_p$ loss between the ground-truth token value and the expected token value under the predicted distribution, i.e., the sum of numeric token values weighted by their predicted class probabilities. The second minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground-truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.
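A minimal sketch of the first ($L_p$-style) variant, assuming each number token $i$ carries a numeric value $v_i$; the function name, the $L_1$ default, and the mixing weight are illustrative, not the paper's exact implementation:

import torch
import torch.nn.functional as F

def number_token_loss(logits, target_values, token_values, p=1):
    # logits:        (B, V) raw model outputs over the vocabulary
    # target_values: (B,)   numeric value of the ground-truth number token
    # token_values:  (V,)   numeric value assigned to each vocabulary token
    #                       (non-number tokens would be masked out in practice)
    probs = F.softmax(logits, dim=-1)
    expected = probs @ token_values          # probability-weighted token value
    return (expected - target_values).abs().pow(p).mean()

# Used as an auxiliary term on top of cross-entropy, e.g.:
# loss = F.cross_entropy(logits, targets) + lam * number_token_loss(...)

Unlike CE, this term rewards the model for placing probability mass on numerically close tokens (predicting "4" when the target is "5" costs less than predicting "9").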
Quantum Theory and Application of Contextual Optimal Transport
Mariella, Nicola, Akhriev, Albert, Tacchino, Francesco, Zoufal, Christa, Gonzalez-Espitia, Juan Carlos, Harsanyi, Benedek, Koskin, Eugene, Tavernelli, Ivano, Woerner, Stefan, Rapsomaniki, Marianna, Zhuk, Sergiy, Born, Jannis
Optimal Transport (OT) has fueled machine learning (ML) across many domains. When paired data measurements $(\boldsymbol{\mu}, \boldsymbol{\nu})$ are coupled to covariates, a challenging conditional distribution learning setting arises. Existing approaches for learning a $\textit{global}$ transport map parameterized through a potentially unseen context utilize Neural OT and largely rely on Brenier's theorem. Here, we propose a first-of-its-kind quantum computing formulation for amortized optimization of contextualized transportation plans. We exploit a direct link between doubly stochastic matrices and unitary operators, thereby uncovering a natural connection between OT and quantum computation. We verify our method (QontOT) on synthetic and real data by predicting variations in cell type distributions conditioned on drug dosage. Importantly, we conduct a 24-qubit hardware experiment on a task that is challenging for classical computers and report performance that our classical neural OT approach cannot match. In sum, this is a first step toward learning to predict contextualized transportation plans through quantum computing.
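One concrete instance of the link between unitaries and doubly stochastic matrices: the elementwise squared moduli of any unitary matrix form a doubly stochastic (so-called unistochastic) matrix, so a parameterized unitary induces a valid transportation plan. A toy NumPy check of this property (the random construction below is illustrative, not the paper's circuit):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
U, _ = np.linalg.qr(A)                   # Q factor of a random complex matrix is unitary
T = np.abs(U) ** 2                       # elementwise squared moduli

print(np.allclose(T.sum(axis=0), 1.0))   # columns sum to 1 (orthonormal columns of U)
print(np.allclose(T.sum(axis=1), 1.0))   # rows sum to 1 (orthonormal rows of U)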
Language models in molecular discovery
Janakarajan, Nikita, Erdmann, Tim, Swaminathan, Sarath, Laino, Teodoro, Born, Jannis
The success of language models, especially transformer-based architectures, has trickled into other domains, giving rise to "scientific language models" that operate on small molecules, proteins, or polymers. In chemistry, language models contribute to accelerating the molecule discovery cycle, as evidenced by promising recent findings in early-stage drug discovery. Here, we review the role of language models in molecular discovery, underlining their strengths in de novo drug design, property prediction, and reaction chemistry. We highlight valuable open-source software assets, thus lowering the entry barrier to the field of scientific language modeling. Lastly, we sketch a vision for future molecular design that combines a chatbot interface with access to computational chemistry tools. Our contribution serves as a valuable resource for researchers, chemists, and AI enthusiasts interested in understanding how language models can and will be used to accelerate chemical discovery.
Unifying Molecular and Textual Representations via Multi-task Language Modelling
Christofidellis, Dimitrios, Giannone, Giorgio, Born, Jannis, Winther, Ole, Laino, Teodoro, Manica, Matteo
Recent advances in neural language models have been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to problem-specific fine-tuning that neglects task interrelations. The main obstacle in this field is the lack of a unified representation bridging natural language and chemical representations, which complicates and limits human-machine interaction. Here, we propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains. Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains markedly improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks yields large improvements on cross-domain tasks, whose magnitude increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in the physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.
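In practice, a single multi-domain seq2seq model of this kind can be queried with natural-language task instructions in both directions (molecule-to-text and text-to-molecule). A hedged sketch using the Hugging Face transformers API; the checkpoint name and prompt phrasing are placeholders, not necessarily the paper's released artifacts:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "some-org/multitask-text-chem-t5"   # placeholder checkpoint name
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# The same weights serve chemical and natural-language tasks concurrently.
prompt = "Caption the following SMILES: CC(=O)OC1=CC=CC=C1C(=O)O"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))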
Accelerating Material Design with the Generative Toolkit for Scientific Discovery
Manica, Matteo, Born, Jannis, Cadow, Joris, Christofidellis, Dimitrios, Dave, Ashish, Clarke, Dean, Teukam, Yves Gaetan Nana, Giannone, Giorgio, Hoffman, Samuel C., Buchan, Matthew, Chenthamarakshan, Vijil, Donovan, Timothy, Hsu, Hsiang Han, Zipoli, Federico, Schilter, Oliver, Kishimoto, Akihiro, Hamada, Lisa, Padhi, Inkit, Wehden, Karl, McHugh, Lauren, Khrabrov, Alexy, Das, Payel, Takeda, Seiji, Smith, John R.
The rapid technological progress of the last centuries has been largely fueled by the success of the scientific method. However, in some of the most important fields, such as material or drug discovery, productivity has been decreasing dramatically (Smietana et al., 2016): today it can take almost a decade to discover a new material, at a cost upwards of $10-$100 million. One of the most daunting challenges in materials discovery is hypothesis generation. The reservoir of natural products and their derivatives has been largely emptied (Atanasov et al., 2021), and bottom-up, human-driven hypothesis generation has proven extremely challenging for identifying and selecting novel and useful candidates in search spaces that are overwhelming in size, e.g., the chemical space of drug-like molecules is estimated to contain $>10^{60}$ structures.
Domain-agnostic and Multi-level Evaluation of Generative Models
Tadesse, Girmaw Abebe, Born, Jannis, Cintas, Celia, Ogallo, William, Zubarev, Dmitry, Manica, Matteo, Weldemariam, Komminist
Machine Learning (ML) methods, particularly generative models, are effective in addressing critical problems across different domains, including the material sciences. Examples include the design of novel molecules by combining data-driven techniques with domain knowledge to efficiently search the space of all plausible molecules and generate new, valid ones [1, 2, 3, 4]. Traditional high-throughput wet-lab experiments, physics-based simulations, and bioinformatics tools for the molecular design process depend heavily on human expertise. These processes require significant resource expenditure to propose, synthesize, and test new molecules, thereby limiting the exploration space [5, 6, 7]. For example, generative models have been applied to facilitate the material discovery process by framing it as an inverse molecular design problem. This approach transforms the conventional, slow discovery process by mapping a desired set of properties to a set of structures. The generative process is then optimized to encourage the generation of molecules with those selected properties. Countless approaches have been suggested for such tasks, most prominently VAEs with different sampling techniques [8, 9, 10], GANs [11, 12], diffusion models [13], flow networks [14], and Transformers [15].
TITAN: T Cell Receptor Specificity Prediction with Bimodal Attention Networks
Weber, Anna, Born, Jannis, Martínez, María Rodríguez
Motivation: The activity of the adaptive immune system is governed by T-cells and their specific T-cell receptors (TCRs), which selectively recognize foreign antigens. Recent advances in experimental techniques have enabled sequencing of TCRs and their antigenic targets (epitopes), making it possible to study the missing link between TCR sequence and epitope binding specificity. Scarcity of data and a large sequence space make this task challenging, and to date only models limited to a small set of epitopes have achieved good performance. Here, we establish a k-nearest-neighbor (K-NN) classifier as a strong baseline and then propose TITAN (Tcr epITope bimodal Attention Networks), a bimodal neural network that explicitly encodes both TCR sequences and epitopes to enable the independent study of generalization capabilities to unseen TCRs and/or epitopes. Results: By encoding epitopes at the atomic level with SMILES sequences, we leverage transfer learning and data augmentation to enrich the input data space and boost performance. TITAN achieves high performance in predicting the specificity of unseen TCRs (ROC-AUC 0.87 in 10-fold CV) and surpasses the current state of the art (ImRex) by a large margin. Notably, our Levenshtein-distance-based K-NN classifier also exhibits competitive performance on unseen TCRs. While generalization to unseen epitopes remains challenging, we report two major breakthroughs. First, by dissecting the attention heatmaps, we demonstrate that the sparsity of available epitope data favors an implicit treatment of epitopes as classes. This may be a general problem that limits unseen-epitope performance for sufficiently complex models. Second, we show that TITAN nevertheless exhibits significantly improved performance on unseen epitopes and is capable of focusing attention on chemically meaningful molecular structures.
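A hedged sketch of a Levenshtein-distance K-NN baseline of the kind used here: classify an unseen TCR by majority vote among its k closest training sequences under edit distance. Function names and the choice of k are illustrative.

from collections import Counter

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def knn_predict(tcr: str, train: list[tuple[str, str]], k: int = 5) -> str:
    # train: list of (tcr_sequence, epitope_label) pairs
    nearest = sorted(train, key=lambda pair: levenshtein(tcr, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]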
PaccMann$^{RL}$ on SARS-CoV-2: Designing antiviral candidates with conditional generative models
Born, Jannis, Manica, Matteo, Cadow, Joris, Markert, Greta, Mill, Nil Adell, Filipavicius, Modestas, Martínez, María Rodríguez
With the rapid development of COVID-19 into a global pandemic, scientists around the globe are desperately searching for effective antiviral therapeutic agents. Bridging systems biology and drug discovery, we propose a deep learning framework for conditional de novo design of antiviral candidate drugs tailored against given protein targets. First, we train a multimodal ligand-protein binding affinity model to predict affinities of antiviral compounds to target proteins and couple this model with pharmacological toxicity predictors. Using this multi-objective score as the reward function of a conditional molecular generator (consisting of two VAEs), we showcase a framework that navigates the chemical space toward regions with more antiviral molecules. Specifically, we explore the challenging setting of generating ligands against unseen protein targets by performing leave-one-out cross-validation on 41 SARS-CoV-2-related target proteins. Using deep RL, we demonstrate that in 35 out of 41 cases, generation is biased towards sampling more binding ligands, with an average increase of 83% compared to an unbiased VAE. We present a case study of a potential Envelope-protein inhibitor and perform a synthetic accessibility assessment of the best generated molecules, outlining a viable roadmap towards rapid in-vitro evaluation of potential SARS-CoV-2 inhibitors.
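A minimal sketch of how such a multi-objective reward might combine predicted binding affinity with a toxicity penalty; the predictor objects, their interfaces, and the weighting are illustrative stand-ins for the paper's trained models:

def reward(smiles: str, target_protein: str, affinity_model, toxicity_model,
           tox_weight: float = 0.5) -> float:
    # Higher affinity is better; toxicity enters as a penalty term.
    affinity = affinity_model.predict(smiles, target_protein)
    toxicity = toxicity_model.predict(smiles)
    return affinity - tox_weight * toxicity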
Reinforcement learning-driven de-novo design of anticancer compounds conditioned on biomolecular profiles
Born, Jannis, Manica, Matteo, Oskooei, Ali, Martínez, María Rodríguez
With the advent of deep generative models in computational chemistry, in silico anticancer drug design has undergone an unprecedented transformation. While state-of-the-art deep learning approaches have shown potential in generating compounds with desired chemical properties, they entirely overlook the genetic profile and properties of the target disease. In the case of cancer, this is problematic, since cancer is a highly genetic disease in which the biomolecular profile of target cells determines the response to therapy. Here, we introduce the first deep generative model capable of generating anticancer compounds given a target biomolecular profile. Using a reinforcement learning framework, the transcriptomic profile of cancer cells serves as a context in which anticancer molecules are generated and optimized to obtain effective compounds for the given profile. Our molecule generator combines two pretrained variational autoencoders (VAEs) with a multimodal efficacy predictor: the first VAE generates transcriptomic profiles, while the second, conditional VAE generates novel molecular structures conditioned on a given transcriptomic profile. The efficacy predictor optimizes the generated molecules through a reward determined by the predicted IC50 drug sensitivity for the generated molecule and the target profile. We demonstrate how molecule generation can be biased towards compounds with a high inhibitory effect against individual cell lines or specific cancer sites. We verify our approach by generating candidate drugs against specific cancer types and analyzing their structural similarity to existing compounds with known efficacy against these cancer types. We envision our approach transforming in silico anticancer drug design by increasing success rates in lead compound discovery via leveraging the biomolecular characteristics of the disease.
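A hedged sketch of one RL biasing step in this style of framework: sample molecules from a conditional generator given a transcriptomic profile, score them with the efficacy predictor, and reinforce high-reward samples. All model objects, their interfaces, and the reward shaping are illustrative placeholders, not the paper's exact implementation.

import torch

def rl_step(generator, efficacy_predictor, profile, optimizer, batch_size=32):
    # Sample molecules conditioned on the profile, with their log-probabilities.
    smiles, log_probs = generator.sample(profile, batch_size)
    ic50 = efficacy_predictor(smiles, profile)        # predicted drug sensitivity
    reward = -torch.log(ic50)                         # low IC50 -> high reward
    loss = -(reward.detach() * log_probs).mean()      # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Repeating this step shifts the generator's distribution toward molecules the predictor scores as effective for the given biomolecular profile.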