Bhatnagar, Aadyot
Conditional Enzyme Generation Using Protein Language Models with Adapters
Yang, Jason, Bhatnagar, Aadyot, Ruffolo, Jeffrey A., Madani, Ali
The conditional generation of proteins with desired functions and/or properties is a key goal for generative models. Existing methods based on prompting of language models can generate proteins conditioned on a target functionality, such as a desired enzyme family. However, these methods are limited to simple, tokenized conditioning and have not been shown to generalize to unseen functions. In this study, we propose ProCALM (Protein Conditionally Adapted Language Model), an approach for the conditional generation of proteins using adapters to protein language models. Our specific implementation of ProCALM involves finetuning ProGen2 to incorporate conditioning representations of enzyme function and taxonomy. ProCALM matches existing methods at conditionally generating sequences from target enzyme families. Impressively, it can also generate within the joint distribution of enzymatic function and taxonomy, and it can generalize to rare and unseen enzyme families and taxonomies. Overall, ProCALM is a flexible and computationally efficient approach, and we expect that it can be extended to a wide range of generative language models.

Proteins, sequences of amino acids, are important molecules in all living organisms and have many industrial applications. Protein sequences can be modified or designed to have desired function(s) or optimized properties so that they are more useful for applications ranging from greener chemical synthesis to gene-editing for disease treatment (Buller et al., 2023).
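As a rough illustration of the adapter-based conditioning described above, the sketch below adds a small bottleneck adapter that mixes a conditioning embedding (e.g., of enzyme function or taxonomy) into a transformer layer's hidden states via a residual update. This is not ProCALM's actual implementation; the module name `ConditionAdapter`, the concatenation scheme, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of adapter-based conditioning for a decoder-only protein language model.
# Not ProCALM's code: names and design choices here are assumptions for illustration.
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    """Bottleneck adapter that injects a conditioning vector into a transformer layer's hidden states."""
    def __init__(self, hidden_dim: int, cond_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim + cond_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim); cond: (batch, cond_dim), e.g. an enzyme-family embedding
        cond_tiled = cond.unsqueeze(1).expand(-1, hidden.size(1), -1)
        delta = self.up(self.act(self.down(torch.cat([hidden, cond_tiled], dim=-1))))
        # Residual update: with the base language model frozen, only the adapter parameters are trained.
        return hidden + delta
```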
Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion
Morehead, Alex, Ruffolo, Jeffrey, Bhatnagar, Aadyot, Madani, Ali
Generative models of macromolecules carry abundant and impactful implications for industrial and biomedical efforts in protein engineering. However, existing methods are currently limited to modeling protein structures or sequences, independently or jointly, without regard to the interactions that commonly occur between proteins and other macromolecules. In this work, we introduce MMDiff, a generative model that jointly designs sequences and structures of nucleic acid and protein complexes, independently or in complex, using joint SE(3)-discrete diffusion noise. Such a model has important implications for emerging areas of macromolecular design including structure-based transcription factor design and design of noncoding RNA sequences. We demonstrate the utility of MMDiff through a rigorous new design benchmark for macromolecular complex generation that we introduce in this work. Our results demonstrate that MMDiff is able to successfully generate micro-RNA and single-stranded DNA molecules while being modestly capable of jointly modeling DNA and RNA molecules in interaction with multi-chain protein complexes. Source code: https://github.com/Profluent-Internships/MMDiff.
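The joint noising idea can be sketched in a few lines: a continuous Gaussian corruption of 3D coordinates standing in for full SE(3) frame diffusion, paired with a uniform-categorical corruption of sequence tokens as a simple discrete-diffusion choice. This is not MMDiff's actual forward process; the function name and noise schedules are assumptions for illustration only.

```python
# Illustrative joint forward-noising step (simplified stand-in for SE(3)-discrete diffusion).
import torch

def joint_forward_noise(coords, tokens, t, num_token_types, betas):
    """coords: (N, 3) float coordinates; tokens: (N,) integer residue/base types;
    t: timestep index; betas: (T,) noise schedule."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    # Continuous branch: DDPM-style Gaussian corruption of coordinates.
    noisy_coords = torch.sqrt(alpha_bar) * coords + torch.sqrt(1.0 - alpha_bar) * torch.randn_like(coords)
    # Discrete branch: with probability (1 - alpha_bar), resample each token uniformly at random.
    corrupt_mask = torch.rand(tokens.shape) < (1.0 - alpha_bar)
    random_tokens = torch.randint(0, num_token_types, tokens.shape)
    noisy_tokens = torch.where(corrupt_mask, random_tokens, tokens)
    return noisy_coords, noisy_tokens
```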
Improved Online Conformal Prediction via Strongly Adaptive Online Learning
Bhatnagar, Aadyot, Wang, Huan, Xiong, Caiming, Bai, Yu
We study the problem of uncertainty quantification via prediction sets, in an online setting where the data distribution may vary arbitrarily over time. Recent work develops online conformal prediction techniques that leverage regret minimization algorithms from the online learning literature to learn prediction sets with approximately valid coverage and small regret. However, standard regret minimization could be insufficient for handling changing environments, where performance guarantees may be desired not only over the full time horizon but also in all (sub-)intervals of time. We develop new online conformal prediction methods that minimize the strongly adaptive regret, which measures the worst-case regret over all intervals of a fixed length. We prove that our methods achieve near-optimal strongly adaptive regret for all interval lengths simultaneously, and approximately valid coverage. Experiments show that our methods consistently obtain better coverage and smaller prediction sets than existing methods on real-world tasks, such as time series forecasting and image classification under distribution shift.
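For intuition, the base learner underlying online conformal prediction can be written as a one-line online update of the prediction-set radius; strongly adaptive methods of the kind studied here then run and aggregate many such learners over intervals of different lengths. The sketch below is a simplified, adaptive-conformal-inference-style update, not the paper's exact algorithm, and the names are assumptions.

```python
# Simplified online conformal update: one gradient step on the pinball loss of the
# prediction-set radius. Strongly adaptive methods maintain such learners over many
# interval lengths and aggregate them; that aggregation step is omitted here.
def update_radius(radius: float, covered: bool, alpha: float, lr: float) -> float:
    # Grow the radius after a miscoverage event, shrink it slightly after coverage,
    # so the long-run miscoverage rate tracks the target level alpha.
    err = 0.0 if covered else 1.0
    return radius + lr * (err - alpha)
```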
Momentum Contrastive Autoencoder: Using Contrastive Learning for Latent Space Distribution Matching in WAE
Arpit, Devansh, Bhatnagar, Aadyot, Wang, Huan, Xiong, Caiming
Wasserstein autoencoder (WAE) shows that matching two distributions is equivalent to minimizing a simple autoencoder (AE) loss under the constraint that the latent space of this AE matches a pre-specified prior distribution. This latent space distribution matching is a core component of WAE, and a challenging task. In this paper, we propose to use the contrastive learning framework that has been shown to be effective for self-supervised representation learning as a means to resolve this problem. We do so by exploiting the fact that contrastive learning objectives optimize the latent space distribution to be uniform over the unit hyper-sphere, which can be easily sampled from. We show that using the contrastive learning framework to optimize the WAE loss achieves faster convergence and more stable optimization compared with existing popular algorithms for WAE. This is also reflected in the FID scores on CelebA and CIFAR-10 datasets, and the realistic generated image quality on the CelebA-HQ dataset.

The main goal of generative modeling is to learn a good approximation of the underlying data distribution from finite data samples, while facilitating an efficient way to draw samples. Popular algorithms such as variational autoencoders (VAE, Kingma & Welling (2013); Rezende et al. (2014)) and generative adversarial networks (GAN, Goodfellow et al. (2014)) are theoretically-grounded models designed to meet this goal. However, they come with some challenges. For instance, VAEs suffer from the posterior collapse problem (Chen et al., 2016; Zhao et al., 2017; Van Den Oord et al., 2017), and a mismatch between the posterior and prior distribution (Kingma et al., 2016; Tomczak & Welling, 2018; Dai & Wipf, 2019; Bauer & Mnih, 2019).
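The core idea, using a contrastive objective to push autoencoder latents toward the uniform distribution on the unit hypersphere (which is easy to sample from), can be sketched as below. This is a simplified stand-in, not the paper's implementation: it omits the momentum encoder and uses a generic NT-Xent contrastive loss plus an MSE reconstruction loss with illustrative names.

```python
# Sketch: autoencoder reconstruction loss + NT-Xent contrastive loss on L2-normalized latents,
# which encourages the latent distribution to spread uniformly over the unit hypersphere.
# Not the paper's actual method (no momentum encoder); names are illustrative.
import torch
import torch.nn.functional as F

def contrastive_ae_loss(x, x_aug, encoder, decoder, temperature=0.1, lam=1.0):
    z1 = F.normalize(encoder(x), dim=-1)       # latents on the unit hypersphere
    z2 = F.normalize(encoder(x_aug), dim=-1)   # latents of augmented views (positives)
    recon_loss = F.mse_loss(decoder(z1), x)
    # NT-Xent: each sample's augmented view is its positive; other samples in the batch are negatives.
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(logits, labels)
    return recon_loss + lam * contrastive
```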
Merlion: A Machine Learning Library for Time Series
Bhatnagar, Aadyot, Kassianik, Paul, Liu, Chenghao, Lan, Tian, Yang, Wenzhuo, Cassius, Rowan, Sahoo, Doyen, Arpit, Devansh, Subramanian, Sri, Woo, Gerald, Saha, Amrita, Jagota, Arun Kumar, Gopalakrishnan, Gokulakrishnan, Singh, Manpreet, Krithika, K C, Maddineni, Sukumar, Cho, Daeki, Zong, Bo, Zhou, Yingbo, Xiong, Caiming, Savarese, Silvio, Hoi, Steven, Wang, Huan
We introduce Merlion, an open-source machine learning library for time series. It features a unified interface for many commonly used models and datasets for anomaly detection and forecasting on both univariate and multivariate time series, along with standard pre/post-processing layers. It has several modules to improve ease-of-use, including visualization, anomaly score calibration to improve interpretability, AutoML for hyperparameter tuning and model selection, and model ensembling. Merlion also provides a unique evaluation framework that simulates the live deployment and re-training of a model in production. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific time series needs and benchmark them across multiple time series datasets. In this technical report, we highlight Merlion's architecture and major functionalities, and we report benchmark numbers across different baseline models and ensembles.
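As a usage illustration, the snippet below follows Merlion's documented quickstart pattern: wrap pandas data as a `TimeSeries`, train a default anomaly detector, and obtain post-processed anomaly labels. Module paths and class names follow the public documentation but may vary across library versions, and the input file name is hypothetical.

```python
# Quickstart-style sketch based on Merlion's documented API (paths may differ by version).
import pandas as pd
from merlion.utils import TimeSeries
from merlion.models.defaults import DefaultDetector, DefaultDetectorConfig

# Load a time-indexed DataFrame and split it into train/test halves ("my_timeseries.csv" is hypothetical).
df = pd.read_csv("my_timeseries.csv", index_col=0, parse_dates=True)
train, test = df.iloc[: len(df) // 2], df.iloc[len(df) // 2 :]

# Train a default anomaly detector and get calibrated, post-processed anomaly labels on the test split.
model = DefaultDetector(DefaultDetectorConfig())
model.train(train_data=TimeSeries.from_pd(train))
anomaly_labels = model.get_anomaly_label(time_series=TimeSeries.from_pd(test))
```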