 Regression


Accelerating Distributed SGD for Linear Regression using Iterative Pre-Conditioning

arXiv.org Machine Learning

This paper considers the multi-agent distributed linear least-squares problem. The system comprises multiple agents, each agent with a locally observed set of data points, and a common server with whom the agents can interact. The agents' goal is to compute a linear model that best fits the collective data points observed by all the agents. In the server-based distributed settings, the server cannot access the data points held by the agents. The recently proposed Iteratively Pre-conditioned Gradient-descent (IPG) method has been shown to converge faster than other existing distributed algorithms that solve this problem. In the IPG algorithm, the server and the agents perform numerous iterative computations. Each of these iterations relies on the entire batch of data points observed by the agents for updating the current estimate of the solution. Here, we extend the idea of iterative pre-conditioning to the stochastic settings, where the server updates the estimate and the iterative pre-conditioning matrix based on a single randomly selected data point at every iteration. We show that our proposed Iteratively Pre-conditioned Stochastic Gradient-descent (IPSG) method converges linearly in expectation to a proximity of the solution. Importantly, we empirically show that the proposed IPSG method's convergence rate compares favorably to prominent stochastic algorithms for solving the linear least-squares problem in server-based networks.
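The general idea of iterative pre-conditioning can be illustrated with a small, centralized sketch. It is not the paper's IPG or IPSG algorithm: the function name `preconditioned_gd`, the parameters `beta` and `delta`, and both update rules are assumptions chosen for illustration, and the abstract does not give the actual updates or the agent/server split. The sketch uses full-batch gradients and refines a matrix `K` towards the inverse of a regularized Hessian while using it to scale each gradient step.

```python
import numpy as np

def preconditioned_gd(A, b, beta=1e-3, delta=1.0, iters=500):
    """Least squares via gradient descent with an iteratively refined pre-conditioner.

    Illustrative only: the update rules below are assumptions, not the paper's.
    """
    n, d = A.shape
    H = A.T @ A / n                       # Hessian of (1/2n) * ||A x - b||^2
    M = H + beta * np.eye(d)              # regularized Hessian that K should invert
    alpha = 1.0 / np.linalg.norm(M, 2)    # step size 1 / largest eigenvalue of M
    K = np.zeros((d, d))                  # pre-conditioner estimate, starts at zero
    x = np.zeros(d)                       # solution estimate
    for _ in range(iters):
        g = A.T @ (A @ x - b) / n         # full-batch gradient at the current x
        x = x - delta * K @ g             # pre-conditioned gradient step
        K = K - alpha * (M @ K - np.eye(d))  # Richardson step moving K towards inv(M)
    return x
```

As `K` approaches the inverse of the regularized Hessian, each step behaves more like a Newton step, which is one intuition for why pre-conditioning can speed up convergence on ill-conditioned problems.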


How Lasso Regression Works in Machine Learning

#artificialintelligence

Regularization addresses the problem of overfitting, which occurs when the model learns the noise in the training set along with the underlying data. Noise is random variation in the training set that does not represent the actual properties of the data. In the linear regression equation, Y represents the dependent variable, X represents the independent variables, and C represents the coefficient estimates for the different variables. Model fitting involves a loss function known as the sum of squares.
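As a rough illustration of how lasso adds regularization to the sum-of-squares loss, the sketch below minimizes (1/2n)·||Y − XC||² + lam·||C||₁ by coordinate descent with soft-thresholding. The scaling of the loss, the name `lam`, and the choice of coordinate descent are assumptions made for this example; the excerpt itself does not specify them.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, Y, lam, sweeps=200):
    """Minimize (1/2n) * ||Y - X C||^2 + lam * ||C||_1 one coefficient at a time."""
    n, d = X.shape
    C = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n     # per-feature curvature x_j^T x_j / n
    for _ in range(sweeps):
        for j in range(d):
            # Residual with feature j's current contribution added back in.
            residual = Y - X @ C + X[:, j] * C[j]
            rho = X[:, j] @ residual / n  # correlation of feature j with that residual
            C[j] = soft_threshold(rho, lam) / col_sq[j]
    return C
```

With `lam=0` this reduces to ordinary least squares. Larger values of `lam` shrink the coefficients and set weakly relevant ones exactly to zero, which is how the penalty limits overfitting.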


Linear regression and gradient descent for absolute beginners

#artificialintelligence

In machine learning terminology, the sum of squared errors is called the "cost". This equation is therefore roughly the "sum of squared errors", as it computes the sum of (predicted value minus actual value) squared. The factor 1/(2m) "averages" the squared error over the number of data points m, so that the number of data points doesn't affect the function. The extra division by 2 is a convenience: it cancels the factor of 2 that appears when the squared term is differentiated. In gradient descent, the goal is to minimize the cost function. We do this by trying different values of slope and intercept.
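A minimal sketch of this procedure in plain Python is shown below. The function names, the learning rate `lr`, and the number of `steps` are illustrative assumptions rather than values from the article.

```python
def cost(slope, intercept, xs, ys):
    """Cost J = (1 / 2m) * sum((prediction - actual)^2) over the m data points."""
    m = len(xs)
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Fit y = slope * x + intercept by repeatedly stepping against the gradient of J."""
    m = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # The 1/2 in J cancels the 2 from differentiating the square.
        grad_slope = sum(e * x for e, x in zip(errors, xs)) / m
        grad_intercept = sum(errors) / m
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept
```

Rather than guessing slope and intercept values at random, each step moves them a little in the direction that lowers the cost, so the cost shrinks as the line approaches the best fit.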


Randomized Transferable Machine

arXiv.org Artificial Intelligence

Feature-based transfer is one of the most effective methodologies for transfer learning. Existing studies usually assume that the learned new feature representation is truly domain-invariant, and thus directly train a transfer model $\mathcal{M}$ on the source domain. In this paper, we consider a more realistic scenario where the new feature representation is suboptimal and a small divergence still exists across domains. We propose a new learning strategy with a transfer model called Randomized Transferable Machine (RTM). More specifically, we work on source data with the new feature representation learned from existing feature-based transfer methods. The key idea is to enlarge the source training data populations by randomly corrupting the source data with noise, and then train a transfer model $\widetilde{\mathcal{M}}$ that performs well on all the corrupted source data populations. In principle, the more corruptions are made, the higher the probability that the target data are covered by the constructed source populations, and thus the better the transfer performance achieved by $\widetilde{\mathcal{M}}$. The ideal case uses infinitely many corruptions, which is infeasible in practice. We develop a marginalized solution with a linear regression model and dropout noise. With a marginalization trick, we can train an RTM in a way that is equivalent to training on infinitely many noisy source populations without actually performing any corruption. More importantly, such an RTM has a closed-form solution, which enables very fast and efficient training. Extensive experiments on various real-world transfer tasks show that RTM is a promising transfer model.
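The marginalization trick can be illustrated with ordinary linear regression under dropout noise. The sketch below is not the RTM objective itself, which the abstract does not spell out. It assumes unbiased dropout, where each feature is zeroed with probability `q` and otherwise rescaled by 1/(1 − q), together with a plain squared loss. Under those assumptions the expected loss over all corruptions has the closed-form minimizer w = (XᵀX + q/(1 − q)·diag(XᵀX))⁻¹ Xᵀy, so no corrupted copies are ever generated. The function name is hypothetical.

```python
import numpy as np

def marginalized_dropout_regression(X, y, q):
    """Minimizer of the expected squared loss under unbiased dropout with rate q.

    Assumes each feature is dropped independently with probability q and kept
    features are rescaled by 1 / (1 - q); this is an illustrative setting only.
    """
    gram = X.T @ X
    # Dropout leaves off-diagonal second moments unchanged and inflates the
    # diagonal ones, which yields a data-dependent ridge-style penalty.
    penalty = (q / (1.0 - q)) * np.diag(np.diag(gram))
    return np.linalg.solve(gram + penalty, X.T @ y)
```

Setting `q=0` recovers ordinary least squares, while larger `q` shrinks the weights more strongly on features with large second moments.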


Knowledge transfer across cell lines using Hybrid Gaussian Process models with entity embedding vectors

arXiv.org Machine Learning

To date, a large number of experiments are performed to develop a biochemical process. The generated data are used only once, to make decisions for development. If we could exploit data from already developed processes to make predictions for a novel process, we could significantly reduce the number of experiments needed. Processes for different products exhibit differences in behaviour; typically only a subset behave similarly. Therefore, effective learning on process data spanning multiple products requires a sensible representation of the product identity. We propose to represent the product identity (a categorical feature) by embedding vectors that serve as input to a Gaussian Process regression model. We demonstrate how the embedding vectors can be learned from process data and show that they capture an interpretable notion of product similarity. The improvement in performance is compared to traditional one-hot encoding on a simulated cross-product learning task. All in all, the proposed method could enable significant reductions in wet-lab experiments.


On Generalization of Adaptive Methods for Over-parameterized Linear Regression

arXiv.org Machine Learning

Over-parameterization and adaptive methods have played a crucial role in the success of deep learning in the last decade. The widespread use of over-parameterization has forced us to rethink generalization by bringing forth new phenomena, such as implicit regularization of optimization algorithms and double descent with training progression. A series of recent works have started to shed light on these areas in the quest to understand -- why do neural networks generalize well? The setting of over-parameterized linear regression has provided key insights into understanding this mysterious behavior of neural networks. In this paper, we aim to characterize the performance of adaptive methods in the over-parameterized linear regression setting. First, we focus on two sub-classes of adaptive methods depending on their generalization performance. For the first class of adaptive methods, the parameter vector remains in the span of the data and converges to the minimum norm solution like gradient descent (GD). On the other hand, for the second class of adaptive methods, the gradient rotation caused by the pre-conditioner matrix results in an in-span component of the parameter vector that converges to the minimum norm solution and the out-of-span component that saturates. Our experiments on over-parameterized linear regression and deep neural networks support this theory.
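The statement about plain gradient descent can be checked numerically. The sketch below assumes an over-parameterized problem (more features than samples), a zero initialization, and a fixed step size; the function name and defaults are illustrative. Because every gradient is a combination of the data rows, the iterate never leaves their span and ends up at the minimum-norm interpolating solution.

```python
import numpy as np

def gradient_descent_from_zero(X, y, steps=2000):
    """Run gradient descent on 0.5 * ||X w - y||^2 starting from w = 0."""
    n, d = X.shape
    lr = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest eigenvalue of X^T X
    w = np.zeros(d)
    for _ in range(steps):
        # X.T @ (...) is a linear combination of the rows of X,
        # so w stays in the span of the data at every step.
        w -= lr * X.T @ (X @ w - y)
    return w
```

Comparing the result with `np.linalg.pinv(X) @ y`, the minimum-norm solution, shows the two agree. A pre-conditioned update that multiplies the gradient by a non-identity matrix can push the iterate outside the span, which is the second case the abstract describes.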


Learning from Incomplete Data by Simultaneous Training of Neural Networks and Sparse Coding

arXiv.org Machine Learning

Correctly handling incomplete datasets in machine learning is a fundamental and classical challenge. In this paper, the problem of training a classifier on a dataset with missing features, and its application to a complete or incomplete test dataset, is addressed. A supervised learning method is developed to train a general classifier, such as a logistic regression or a deep neural network, using only a limited number of features per sample, while assuming sparse representations of data vectors on an unknown dictionary. The pattern of missing features is allowed to be different for each input data instance and can be either random or structured. The proposed method simultaneously learns the classifier, the dictionary and the corresponding sparse representation of each input data sample. A theoretical analysis is provided, comparing this method with the standard imputation approach, which consists of performing data completion followed by training the classifier with those reconstructions. Sufficient conditions are identified such that, if it is possible to train a classifier on incomplete observations so that their reconstructions are well separated by a hyperplane, then the same classifier also correctly separates the original (unobserved) data samples. Extensive simulation results on synthetic and well-known reference datasets are presented that validate our theoretical findings and demonstrate the effectiveness of the proposed method compared to traditional data imputation approaches and one state-of-the-art algorithm.


Equivalence of Convergence Rates of Posterior Distributions and Bayes Estimators for Functions and Nonparametric Functionals

arXiv.org Machine Learning

We study the posterior contraction rates of a Bayesian method with Gaussian process priors in nonparametric regression and its plug-in property for differential operators. For a general class of kernels, we establish convergence rates of the posterior measure of the regression function and its derivatives, which are both minimax optimal up to a logarithmic factor for functions in certain classes. Our calculation shows that the rate-optimal estimation of the regression function and its derivatives share the same choice of hyperparameter, indicating that the Bayes procedure remarkably adapts to the order of derivatives and enjoys a generalized plug-in property that extends real-valued functionals to function-valued functionals. This leads to a practically simple method for estimating the regression function and its derivatives, whose finite sample performance is assessed using simulations. Our proof shows that, under certain conditions, to any convergence rate of Bayes estimators there corresponds the same convergence rate of the posterior distributions (i.e., posterior contraction rate), and vice versa. This equivalence holds for a general class of Gaussian processes and covers the regression function and its derivative functionals, under both the $L_2$ and $L_{\infty}$ norms. In addition to connecting these two fundamental large sample properties in Bayesian and non-Bayesian regimes, such equivalence enables a new routine to establish posterior contraction rates by calculating convergence rates of nonparametric point estimators. At the core of our argument is an operator-theoretic framework for kernel ridge regression and equivalent kernel techniques. We derive a range of sharp non-asymptotic bounds that are pivotal in establishing convergence rates of nonparametric point estimators and the equivalence theory, which may be of independent interest.


In vivo Perturb-Seq reveals neuronal and glial abnormalities associated with autism risk genes

Science

CRISPR targeting in vivo, especially in mammals, can be difficult and time consuming when attempting to determine the effects of a single gene. However, such studies may be required to identify pathological gene variants with effects in specific cells along a developmental trajectory. To study the function of genes implicated in autism spectrum disorders (ASDs), Jin et al. applied a gene-editing and single-cell–sequencing system, Perturb-Seq, to knock out 35 ASD candidate genes in multiple mouse embryos (see the Perspective by Treutlein and Camp). This method identified networks of gene expression in neuronal and glial cells that suggest new functions in ASD-related genes.

Science, this issue p. [eaaz6063][1]; see also p. [1038][2]

### INTRODUCTION

Human genetic studies have revealed long lists of genes and loci associated with risk for many diseases and disorders, but systematically evaluating their phenotypic effects remains challenging. Without any a priori knowledge, these risk genes could affect any cellular processes in any cell type or tissue, which creates an enormous search space for identifying possible downstream effects. New high-throughput approaches are needed to functionally dissect these large gene sets across a spectrum of cell types in vivo.

### RATIONALE

Analysis of trio-based whole-exome sequencing has implicated a large number of de novo loss-of-function variants that contribute to autism spectrum disorder and neurodevelopmental delay (ASD/ND) risk. Such de novo variants often have large effect sizes, thus providing a key entry point for mechanistic studies. We have developed in vivo Perturb-Seq to allow simultaneous assessment of the individual phenotypes of a panel of such risk genes in the context of the developing mouse brain.

### RESULTS

Using CRISPR-Cas9, we introduced frameshift mutations in 35 ASD/ND risk genes in pools, within the developing mouse neocortex in utero, followed by single-cell transcriptomic analysis of perturbed cells from the early postnatal brain. We analyzed five broad cell classes (cortical projection neurons, cortical inhibitory neurons, astrocytes, oligodendrocytes, and microglia) and selected cells that had received only single perturbations. Using weighted gene correlation network analysis, we identified 14 covarying gene modules that represent transcriptional programs expressed in different classes of cortical cells. These modules included both those affecting common biological processes across multiple cell subsets and others representing cell type–specific features restricted to certain subsets. We estimated the effect size of each perturbation on each of the 14 gene modules by fitting a joint linear regression model, estimating how module gene expression in cells from each perturbation group deviated from their expression level in internal control cells. Perturbations in nine ASD/ND genes had significant effects across five modules across four cell classes, including cortical projection neurons, cortical inhibitory neurons, astrocytes, and oligodendrocytes. Some of these results were validated by using a single-perturbation model as well as a germline-modified mutant mouse model. To establish whether the perturbation-associated gene modules identified in the mouse cerebral cortex are relevant to human biology and ASD/ND pathology, we performed co-analyses of data from ASD and control human brains and human cerebral organoids. Both gene expression and gene covariation ("modularity") of several of the gene modules identified in the mouse Perturb-Seq analysis are conserved in human brain tissue. Comparison with single-cell data from ASD patients showed overlap in both affected cell types and transcriptomic phenotypes.

### CONCLUSION

In vivo Perturb-Seq can serve as a scalable tool for systems genetic studies of large gene panels to reveal their cell-intrinsic functions at single-cell resolution in complex tissues. In this work, we demonstrated the application of in vivo Perturb-Seq to ASD/ND risk genes in the developing brain. This method can be applied across diverse diseases and tissues in the intact organism.

Figure: In vivo Perturb-Seq identified neuron and glia-associated effects by perturbations of risk genes implicated in ASD/ND. De novo risk genes in this study were chosen from Satterstrom et al. (2018), and co-analysis with ASD patient data at bottom right is from Velmeshev et al. (2019); full citations for both are included in the full article online.

The number of disease risk genes and loci identified through human genetic studies far outstrips the capacity to systematically study their functions. We applied a scalable genetic screening approach, in vivo Perturb-Seq, to functionally evaluate 35 autism spectrum disorder/neurodevelopmental delay (ASD/ND) de novo loss-of-function risk genes. Using CRISPR-Cas9, we introduced frameshift mutations in these risk genes in pools, within the developing mouse brain in utero, followed by single-cell RNA-sequencing of perturbed cells in the postnatal brain. We identified cell type–specific and evolutionarily conserved gene modules from both neuronal and glial cell classes. Recurrent gene modules and cell types are affected across this cohort of perturbations, representing key cellular effects across sets of ASD/ND risk genes. In vivo Perturb-Seq allows us to investigate how diverse mutations affect cell types and states in the developing organism.

[1]: /lookup/doi/10.1126/science.aaz6063
[2]: /lookup/doi/10.1126/science.abf3661


Viral epitope profiling of COVID-19 patients reveals cross-reactivity and correlates of severity

Science

Among the coronaviruses that infect humans, four cause mild common colds, whereas three others, including the currently circulating severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), result in severe infections. Shrock et al. used a technology known as VirScan to probe the antibody repertoires of hundreds of coronavirus disease 2019 (COVID-19) patients and pre–COVID-19 era controls. They identified hundreds of antibody targets, including several antibody epitopes shared by the mild and severe coronaviruses and many specific to SARS-CoV-2. A machine-learning model accurately classified patients infected with SARS-CoV-2 and guided the design of an assay for rapid SARS-CoV-2 antibody detection. The study also looked at how the antibody response and viral exposure history differ in patients with diverging outcomes, which could inform the production of improved vaccine and antibody therapies.

Science, this issue p. [eabd4250][1]

### INTRODUCTION

A systematic characterization of the humoral response to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) epitopes has yet to be performed. This analysis is important for understanding the immunogenicity of the viral proteome and the basis for cross-reactivity with the common-cold coronaviruses. Coronavirus disease 2019 (COVID-19), caused by SARS-CoV-2, is notable for its variable course, with some individuals remaining asymptomatic whereas others experience fever, respiratory distress, or even death. A comprehensive investigation of the antibody response in individuals with severe versus mild COVID-19, as well as an examination of past viral exposure history, is needed.

### RATIONALE

An understanding of humoral responses to SARS-CoV-2 is critical for improving diagnostics and vaccines and gaining insight into variable clinical outcomes. To this end, we used VirScan, a high-throughput method to analyze epitopes of antiviral antibodies in human sera. We supplemented the original VirScan library with additional libraries of peptides spanning the proteomes of SARS-CoV-2 and all other human coronaviruses. These libraries enabled us to precisely map epitope locations and investigate cross-reactivity between SARS-CoV-2 and other coronavirus strains. The original VirScan library allowed us to simultaneously investigate antibody responses to prior infections and viral exposure history.

### RESULTS

We screened sera from 232 COVID-19 patients and 190 pre–COVID-19 era controls against the original VirScan and supplemental coronavirus libraries, assaying more than 10^8 antibody repertoire–peptide interactions. We identified epitopes ranging from "private" (recognized by antibodies in only a small number of individuals) to "public" (recognized by antibodies in many individuals) and detected SARS-CoV-2–specific epitopes as well as those that cross-react with common-cold coronaviruses. Several of these cross-reacting antibodies are present in pre–COVID-19 era samples. We developed a machine learning model that predicted SARS-CoV-2 exposure history with 99% sensitivity and 98% specificity from VirScan data. We used the most discriminatory SARS-CoV-2 peptides to produce a Luminex-based serological assay, which performed similarly to gold-standard enzyme-linked immunosorbent assays. We stratified the COVID-19 patient samples by disease severity and found that patients who had required hospitalization exhibited stronger and broader antibody responses to SARS-CoV-2 but weaker overall responses to past infections compared with those who did not need hospitalization. Further, the hospitalized group had higher seroprevalence rates for cytomegalovirus and herpes simplex virus 1. These findings may be influenced by differences in demographic compositions between the two groups, but they raise hypotheses that may be tested in future studies. Using alanine scanning mutagenesis, we precisely mapped 823 distinct epitopes across the entire SARS-CoV-2 proteome, 10 of which are likely targets of neutralizing antibodies. One cross-reactive antibody epitope in S2 has been previously suggested to be neutralizing and, as it exists in pre–COVID-19 era samples, could affect the severity of COVID-19.

### CONCLUSION

We present a highly detailed view of the epitope landscape within the SARS-CoV-2 proteome. This knowledge may be used to produce diagnostics with improved specificity and can provide a stepping stone to the isolation and functional dissection of both neutralizing antibodies and antibodies that might exacerbate patient outcomes through antibody-dependent enhancement or immune distraction. Our study reveals notable correlations between COVID-19 severity and both viral exposure history and overall strength of the antibody response to past infections. These findings are likely influenced by demographic covariates, but they generate hypotheses that may be tested with larger patient cohorts matched for age, gender, race, and other demographic variables.

Figure: SARS-CoV-2 epitope mapping. VirScan detects antibodies against SARS-CoV-2 in COVID-19 patients with severe and mild disease. Heatmap color represents the strength of the antibody response in each sample (columns) to each protein (rows, left) or peptide (rows, right). VirScan reveals the precise positions of epitopes, which can be mapped onto the structure of the spike protein (S). Examination of SARS-CoV-2 and seasonal coronavirus sequence conservation explains epitope cross-reactivity. A, Ala; D, Asp; E, Glu; F, Phe; I, Ile; K, Lys; L, Leu; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; Y, Tyr.

Understanding humoral responses to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is critical for improving diagnostics, therapeutics, and vaccines. Deep serological profiling of 232 coronavirus disease 2019 (COVID-19) patients and 190 pre–COVID-19 era controls using VirScan revealed more than 800 epitopes in the SARS-CoV-2 proteome, including 10 epitopes likely recognized by neutralizing antibodies. Preexisting antibodies in controls recognized SARS-CoV-2 ORF1, whereas only COVID-19 patient antibodies primarily recognized spike protein and nucleoprotein. A machine learning model trained on VirScan data predicted SARS-CoV-2 exposure history with 99% sensitivity and 98% specificity; a rapid Luminex-based diagnostic was developed from the most discriminatory SARS-CoV-2 peptides. Individuals with more severe COVID-19 exhibited stronger and broader SARS-CoV-2 responses, weaker antibody responses to prior infections, and higher incidence of cytomegalovirus and herpes simplex virus 1, possibly influenced by demographic covariates. Among hospitalized patients, males produce stronger SARS-CoV-2 antibody responses than females.

[1]: /lookup/doi/10.1126/science.abd4250