
Collaborating Authors

Fox, Emily


Building Trust: Foundations of Security, Safety and Transparency in AI

arXiv.org Artificial Intelligence

This paper explores the rapidly evolving ecosystem of publicly available AI models and their potential implications for the security and safety landscape. As AI models become increasingly prevalent, understanding their potential risks and vulnerabilities is crucial. We review the current security and safety scenarios while highlighting challenges such as tracking issues, remediation, and the apparent absence of AI model lifecycle and ownership processes. Comprehensive strategies to enhance security and safety for both model developers and end-users are proposed. This paper aims to provide some of the foundational pieces for more standardized security, safety, and transparency in the development and operation of AI models and the larger open ecosystems and communities forming around them. Generative AI, a branch of artificial intelligence focused on AI production of content such as text, images, and video, has seen significant advancements since the introduction of generative adversarial networks (GANs) in 2014 (Goodfellow et al., 2014), which improved data generation but faced issues like training instability. The development of transformers and self-attention mechanisms in 2017 (Vaswani et al., 2017) facilitated further improvements in natural language processing, leading to large language models (LLMs) like GPT (Radford et al., 2018) with highly advanced text generation capabilities. Diffusion models (Sohl-Dickstein et al., 2015) have also seen rapid advancement in image and video generation. This rapid advancement in technological capability has been matched by an equally rapid uptake in adoption. As with any new technology, it is worth noting that the industry is still identifying new and valuable uses for AI, and these market predictions may fluctuate as use cases are tested in real-world environments against real-world problems. For clarity, we use the term public model for a model that is publicly available for download and use. LLMs are the next evolution of data science, a field focused on math and data. Unlike traditional systems and applications, which rely on logic and programming for a specified outcome, large language model development typically consists of architecture research and design, which is then coded.


KinDEL: DNA-Encoded Library Dataset for Kinase Inhibitors

arXiv.org Artificial Intelligence

DNA-Encoded Libraries (DEL) are combinatorial small molecule libraries that offer an efficient way to characterize diverse chemical spaces. Selection experiments using DELs are pivotal to drug discovery efforts, enabling high-throughput screens for hit finding. However, limited availability of public DEL datasets hinders the advancement of computational techniques designed to utilize such data. To bridge this gap, we present KinDEL, one of the first large, publicly available DEL datasets on two kinases: Mitogen-Activated Protein Kinase 14 (MAPK14) and Discoidin Domain Receptor Tyrosine Kinase 1 (DDR1). Interest in this data modality is growing due to its ability to generate extensive supervised chemical data that densely samples around select molecular structures. Demonstrating one such application of the data, we benchmark different machine learning techniques to develop predictive models for hit identification; in particular, we highlight recent structure-based probabilistic approaches. Finally, we provide biophysical assay data, both on- and off-DNA, to validate our models on a smaller subset of molecules. Data and code for our benchmarks can be found at https://github.com/insitro/kindel. DNA-Encoded Libraries (DEL) have emerged as a powerful tool in drug discovery, enabling highly efficient screens of small molecule libraries against therapeutically relevant targets (Yuen & Franzini, 2017; Gironda-Martínez et al., 2021; Kunig et al., 2021; Peterson & Liu, 2023). These massive libraries are efficiently constructed through combinatorial synthesis of chemical building blocks, or synthons, with each resulting molecule being assigned a DNA barcode. DELs are then used in selection experiments against proteins of interest, wherein multiple rounds of washing are conducted to remove any weak binders, and the DNA tags of surviving molecules are sequenced as a measure of binding affinity. Despite the highly efficient throughput of DELs, data generated through these experiments are intrinsically noisy, with various sources of bias arising from the DEL synthesis and selection processes, necessitating modern machine learning methods to learn signal from the data. Unfortunately, there is still a lack of large, publicly available DEL datasets and benchmarking tasks to drive this important research area.
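As an illustrative sketch of the hit-identification setup described above, the snippet below fits a simple regression from Morgan fingerprints to log-transformed sequencing counts. The toy table, column names, and model choice are placeholder assumptions, not the actual KinDEL schema or the paper's benchmarked methods; see the linked repository for the real data format.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for a DEL selection table: one row per library member,
# a SMILES string plus its post-selection sequencing count (a noisy
# readout of binding). These column names are hypothetical.
df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCl", "c1ccncc1"],
    "count":  [12, 340, 5, 27, 8, 150],
})

def fingerprint(smiles, n_bits=2048):
    """Morgan fingerprint of a molecule as a dense numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp)

X = np.stack([fingerprint(s) for s in df["smiles"]])
y = np.log1p(df["count"])  # log-counts tame the heavy-tailed readout

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(X[:2]))  # sanity check on the toy data
```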


Learning Insulin-Glucose Dynamics in the Wild

arXiv.org Machine Learning

We develop a new model of insulin-glucose dynamics for forecasting blood glucose in type 1 diabetics. We augment an existing biomedical model by introducing time-varying dynamics driven by a machine learning sequence model. Our model maintains a physiologically plausible inductive bias and clinically interpretable parameters -- e.g., insulin sensitivity -- while inheriting the flexibility of modern pattern recognition algorithms. Critical to modeling success are flexible but structured representations of subject variability captured with a sequence model. In contrast, less constrained models like the LSTM fail to provide reliable or physiologically plausible forecasts. We conduct an extensive empirical study and show that allowing biomedical model dynamics to vary in time improves forecasting at long time horizons, up to six hours, and produces forecasts consistent with the physiological effects of insulin and carbohydrates.
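A minimal sketch of the central idea, time-varying parameters inside a mechanistic model, is given below. The ODE, rate constants, and sensitivity trajectory are all invented for illustration; in the paper, the insulin-sensitivity trajectory is produced by a learned sequence model fit to patient data, not a fixed schedule.

```python
import numpy as np

def simulate(glucose0, insulin, carbs, sensitivity, dt=5.0):
    """Forward-Euler rollout of a toy glucose ODE:
        dG/dt = -S(t) * I(t) * G + k_c * C(t) + k_b * (G_b - G)
    where S(t) is a time-varying insulin sensitivity. In the paper this
    trajectory comes from a learned sequence model; here it is a fixed
    array, purely for illustration."""
    k_c, k_b, G_b = 0.05, 0.01, 100.0  # made-up rate constants (mg/dL scale)
    G = glucose0
    trace = [G]
    for I, C, S in zip(insulin, carbs, sensitivity):
        dG = -S * I * G + k_c * C + k_b * (G_b - G)
        G = max(G + dt * dG, 0.0)
        trace.append(G)
    return np.array(trace)

T = 72  # six hours at 5-minute steps
insulin = np.zeros(T); insulin[6:12] = 0.02   # a bolus-like pulse
carbs = np.zeros(T);   carbs[4:10] = 30.0     # a meal
sensitivity = 0.5 + 0.3 * np.sin(np.linspace(0, np.pi, T))  # stand-in for the sequence model output
print(simulate(120.0, insulin, carbs, sensitivity)[-1])
```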


Large-Scale Stochastic Sampling from the Probability Simplex

Neural Information Processing Systems

Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous-time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, the time-discretization error can dominate when we are near the boundary of the space. We demonstrate that, because of this, current SGMCMC methods for the simplex struggle with sparse simplex spaces, that is, when many of the components are close to zero. Unfortunately, many popular large-scale Bayesian models, such as network or topic models, require inference on sparse simplex spaces. To avoid the biases caused by this discretization error, we propose the stochastic Cox-Ingersoll-Ross process (SCIR), which removes all discretization error, and we prove that samples from the SCIR process are asymptotically unbiased. We discuss how this idea can be extended to target other constrained spaces. Use of the SCIR process within an SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.
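The exact-transition idea at the heart of SCIR can be sketched in a few lines. The CIR diffusion has a known transition law (a scaled noncentral chi-squared), so each Gamma-distributed component can be advanced without Euler discretization, and normalizing the components yields a simplex sample. The sketch below covers only this static case; the full SGMCMC algorithm, which folds stochastic gradients over minibatches into the process, is omitted.

```python
import numpy as np

def cir_exact_step(x, alpha, dt, rng):
    """One exact transition of the Cox-Ingersoll-Ross process
        dx = (alpha - x) dt + sqrt(2 x) dW,
    whose stationary distribution is Gamma(alpha, 1). The transition
    law is a scaled noncentral chi-squared, so no Euler step (and no
    discretization error) is ever taken."""
    c = 0.5 * (1.0 - np.exp(-dt))
    noncentrality = x * np.exp(-dt) / c
    return c * rng.noncentral_chisquare(2.0 * alpha, noncentrality)

rng = np.random.default_rng(0)
alpha = np.array([0.1, 0.1, 5.0])   # a sparse Dirichlet target
x = rng.gamma(alpha)                # start each component at stationarity
for _ in range(1000):
    x = cir_exact_step(x, alpha, dt=0.1, rng=rng)
theta = x / x.sum()                 # normalized gammas ~ Dirichlet(alpha)
print(theta)
```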


Neural Granger Causality for Nonlinear Time Series

arXiv.org Machine Learning

While most classical approaches to Granger causality detection assume linear dynamics, many interactions in applied domains, like neuroscience and genomics, are inherently nonlinear. In these cases, using linear models may lead to inconsistent estimation of Granger causal interactions. We propose a class of nonlinear methods by applying structured multilayer perceptrons (MLPs) or recurrent neural networks (RNNs) combined with sparsity-inducing penalties on the weights. By encouraging specific sets of weights to be zero---in particular through the use of convex group-lasso penalties---we can extract the Granger causal structure. To further contrast with traditional approaches, our framework naturally enables us to efficiently capture long-range dependencies between series either via our RNNs or through an automatic lag selection in the MLP. We show that our neural Granger causality methods outperform state-of-the-art nonlinear Granger causality methods on the DREAM3 challenge data. This data consists of nonlinear gene expression and regulation time courses with only a limited number of time points. The successes we show on this challenging dataset provide a powerful example of how deep learning can be useful in cases that go beyond prediction on large datasets. We likewise demonstrate our methods in detecting nonlinear interactions in a human motion capture dataset.
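To make the weight-grouping idea concrete, here is a schematic PyTorch sketch of a per-target MLP with a convex group-lasso penalty tying together all lags of each candidate driving series. Shapes and hyperparameters are assumptions for illustration, not the authors' released implementation; in practice the penalty is optimized with proximal gradient steps so that whole groups are driven exactly to zero.

```python
import torch
import torch.nn as nn

class cMLP(nn.Module):
    """One MLP per target series; the Granger structure is read off the
    first layer, whose columns are grouped by candidate driving series."""
    def __init__(self, n_series, lag, hidden=16):
        super().__init__()
        self.n_series, self.lag = n_series, lag
        self.fc1 = nn.Linear(n_series * lag, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, x):
        # Assumes x is laid out series-major: [s0_lag0..s0_lagL, s1_lag0, ...]
        return self.fc2(torch.relu(self.fc1(x)))

    def group_lasso(self):
        """Sum of L2 norms of first-layer weight groups, one group per
        driving series (all of its lags together)."""
        W = self.fc1.weight.view(-1, self.n_series, self.lag)
        return W.norm(dim=(0, 2)).sum()

model = cMLP(n_series=5, lag=3)
x, y = torch.randn(32, 5 * 3), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y) + 0.1 * model.group_lasso()
loss.backward()  # a proximal step would zero out whole groups exactly
```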


Interpretable VAEs for nonlinear group factor analysis

arXiv.org Machine Learning

Deep generative models have recently yielded encouraging results in producing subjectively realistic samples of complex data. Far less attention has been paid to making these generative models interpretable. In many scenarios, ranging from scientific applications to finance, the observed variables have a natural grouping. It is often of interest to understand systems of interaction amongst these groups, and latent factor models (LFMs) are an attractive approach. However, traditional LFMs are limited by assuming a linear correlation structure. We present an output-interpretable VAE (oi-VAE) for grouped data that models complex, nonlinear latent-to-observed relationships. We combine a structured VAE composed of group-specific generators with a sparsity-inducing prior. We demonstrate that oi-VAE yields meaningful notions of interpretability in the analysis of motion capture and MEG data. We further show that in these situations, the regularization inherent to oi-VAE can actually lead to improved generalization and learned generative processes.
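A schematic sketch of the oi-VAE decoder structure appears below: a shared latent code feeds group-specific generators through per-group linear maps, and sparsity on those maps exposes which latent dimensions drive which group. The paper uses a hierarchical sparsity-inducing prior within a collapsed variational objective; the plain group-lasso term here is a simple convex stand-in, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class OIVAEDecoder(nn.Module):
    """Shared latent code z feeds one generator per observation group.
    Sparsity on the per-group latent-to-hidden maps W_g reveals which
    latent dimensions drive which group."""
    def __init__(self, latent_dim, group_dims, hidden=32):
        super().__init__()
        self.latent_to_group = nn.ModuleList(
            [nn.Linear(latent_dim, hidden, bias=False) for _ in group_dims])
        self.generators = nn.ModuleList(
            [nn.Sequential(nn.Tanh(), nn.Linear(hidden, d)) for d in group_dims])

    def forward(self, z):
        return [g(W(z)) for W, g in zip(self.latent_to_group, self.generators)]

    def sparsity_penalty(self):
        # One group per (latent dimension, observation group) pair:
        # the L2 norm of the corresponding column of W_g.
        return sum(W.weight.norm(dim=0).sum() for W in self.latent_to_group)

dec = OIVAEDecoder(latent_dim=4, group_dims=[10, 10, 7])
outs = dec(torch.randn(8, 4))     # one reconstruction per group
reg = dec.sparsity_penalty()      # added to the training objective
```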


A Unified Framework for Long Range and Cold Start Forecasting of Seasonal Profiles in Time Series

arXiv.org Machine Learning

Providing long-range forecasts is a fundamental challenge in time series modeling, which is only compounded by the challenge of having to form such forecasts when a time series has never previously been observed. The latter challenge is the time series version of the cold-start problem seen in recommender systems which, to our knowledge, has not been directly addressed in previous work. In addition, modern time series datasets are often plagued by missing data. We focus on forecasting seasonal profiles---or baseline demand---for periods on the order of a year long, even in the cold-start setting or with otherwise missing data. Traditional time series approaches that perform iterated step-ahead methods struggle to provide accurate forecasts on such problems, let alone in the missing data regime. We present a computationally efficient framework which combines ideas from high-dimensional regression and matrix factorization on a carefully constructed data matrix. Key to our formulation and resulting performance is (1) leveraging repeated patterns over fixed periods of time and across series, and (2) metadata associated with the individual series. We provide analyses of our framework on large, messy real-world datasets.
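A minimal sketch of the core construction, on synthetic data with invented names, is shown below: stack each series' year-long profile as a row of a matrix, factor the matrix, and regress the row factors on series metadata so that a never-before-seen series can be forecast from metadata alone. The paper's actual estimator couples these pieces in a single regularized objective; the two-stage SVD-plus-least-squares version here is only illustrative.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)
n_series, n_weeks, rank = 50, 52, 3

# Synthetic seasonal profiles: each row is one series' weekly baseline.
meta = rng.normal(size=(n_series, 4))   # metadata per series
B = rng.normal(size=(4, rank))          # metadata -> factor map
V = rng.normal(size=(n_weeks, rank))    # shared seasonal basis
Y = meta @ B @ V.T + 0.1 * rng.normal(size=(n_series, n_weeks))

# Low-rank factorization via truncated SVD: Y ~ U @ Vhat.T
U_svd, s, Vt = np.linalg.svd(Y, full_matrices=False)
U = U_svd[:, :rank] * s[:rank]
Vhat = Vt[:rank].T

# Regress row factors on metadata to obtain the cold-start map.
Bhat, *_ = lstsq(meta, U, rcond=None)

# Cold start: forecast a never-before-seen series from metadata alone.
meta_new = rng.normal(size=(1, 4))
profile_new = (meta_new @ Bhat) @ Vhat.T   # 52-week seasonal forecast
print(profile_new.shape)
```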


A Complete Recipe for Stochastic Gradient MCMC

Neural Information Processing Systems

Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable variants that subsample the data and use stochastic gradients in place of full-data gradients in the dynamic simulations. However, such stochastic gradient MCMC samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered, since proving convergence in the presence of the stochastic gradient noise is non-trivial. Even with simple dynamics, significant physical intuition is often required to modify the dynamical system to account for the stochastic gradient noise. In this paper, we provide a general recipe for constructing MCMC samplers--including stochastic gradient versions--based on continuous Markov processes specified via two matrices. We constructively prove that the framework is complete. That is, any continuous Markov process that provides samples from the target distribution can be written in our framework. We show how previous continuous-dynamic samplers can be trivially reinvented in our framework, avoiding the complicated sampler-specific proofs. We likewise use our recipe to straightforwardly propose a new state-adaptive sampler: stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC). Our experiments on simulated data and a streaming Wikipedia analysis demonstrate that the proposed SGRHMC sampler inherits the benefits of Riemann HMC, with the scalability of stochastic gradient methods.
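For reference, the recipe writes any such sampler in the form dz = [-(D(z) + Q(z)) grad H(z) + Gamma(z)] dt + sqrt(2 D(z)) dW(t), where D(z) is positive semi-definite, Q(z) is skew-symmetric, and Gamma_i(z) = sum_j d/dz_j (D_ij(z) + Q_ij(z)). Setting D = I and Q = 0 recovers stochastic gradient Langevin dynamics (SGLD), sketched below on a toy Gaussian target with an exact gradient standing in for the minibatch estimate.

```python
import numpy as np

def sgld(grad_log_post, z0, step, n_iter, rng):
    """Stochastic gradient Langevin dynamics: the simplest instance of
    the recipe, with D(z) = I (identity diffusion) and Q(z) = 0, so the
    correction term Gamma vanishes.

    grad_log_post(z) should return a (possibly stochastic) estimate of
    grad log p(z | data), e.g. computed from a minibatch."""
    z = z0.copy()
    samples = []
    for _ in range(n_iter):
        noise = rng.normal(size=z.shape)
        z = z + step * grad_log_post(z) + np.sqrt(2.0 * step) * noise
        samples.append(z.copy())
    return np.array(samples)

# Toy target: standard Gaussian, so grad log p(z) = -z.
rng = np.random.default_rng(0)
samples = sgld(lambda z: -z, np.zeros(2), step=1e-2, n_iter=5000, rng=rng)
print(samples[1000:].mean(axis=0), samples[1000:].var(axis=0))  # ~0, ~1
```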