Regression
Feature Engineering vs BERT on Twitter Data
Gani, Ryiaadh, Chalaguine, Lisa
In this paper, we compare the performance of traditional machine learning models using feature engineering and word vectors with that of the state-of-the-art language model BERT using word embeddings on three datasets. We also consider the time and cost efficiency of feature engineering compared to BERT. From our results we conclude that the BERT model was only worth the time and cost trade-off for one of the three datasets, where it significantly outperformed every traditional classifier that uses feature vectors instead of embeddings. For the other datasets, BERT increased accuracy and F1 score by only 0.03 and 0.05, respectively, a gain that arguably does not justify the time and cost of using a GPU.
Supervised Machine Learning: Regression
This course introduces you to one of the main types of modelling families of supervised Machine Learning: Regression. You will learn how to train regression models to predict continuous outcomes and how to use error metrics to compare across different models. This course also walks you through best practices, including train and test splits, and regularization techniques. By the end of this course you should be able to: differentiate uses and applications of classification and regression in the context of supervised machine learning; describe and use linear regression models; use a variety of error metrics to compare and select a linear regression model that best suits your data; articulate why regularization may help prevent overfitting; and use regularization regressions: Ridge, LASSO, and Elastic Net. Who should take this course? This course targets aspiring data scientists interested in acquiring hands-on experience with Supervised Machine Learning Regression techniques in a business setting.
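As a rough illustration of the regularization idea described above (not taken from the course materials), the sketch below fits a single-feature ridge regression in closed form and scores it with RMSE on a held-out split. The function names `fit_ridge_1d` and `rmse`, the toy data, and the alternating train/test split are all assumptions made for this example.

```python
import math

def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ intercept + slope * x, penalizing slope**2 by alpha (hypothetical helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha = 0 reduces to ordinary least squares
    intercept = y_mean - slope * x_mean  # the intercept is left unpenalized
    return intercept, slope

def rmse(xs, ys, intercept, slope):
    """Root mean squared error of the fitted line on (xs, ys)."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))

# Toy data (made up): roughly y = 2x + 1 with small perturbations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.1]

# Alternate points between train and test so both cover the same x range.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

for alpha in (0.0, 1.0, 10.0):
    b0, b1 = fit_ridge_1d(train_x, train_y, alpha)
    print(f"alpha={alpha}: slope={b1:.3f}, test RMSE={rmse(test_x, test_y, b0, b1):.3f}")
```

Larger values of `alpha` shrink the slope toward zero; comparing the held-out RMSE across values is one simple way to choose among them.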
Simultaneous off-the-grid learning of mixtures issued from a continuous dictionary
Butucea, Cristina, Delmas, Jean-François, Dutfoy, Anne, Hardy, Clément
In this paper we observe a set, possibly a continuum, of signals corrupted by noise. Each signal is a finite mixture of an unknown number of features belonging to a continuous dictionary. The continuous dictionary is parametrized by a real non-linear parameter. We shall assume that the signals share an underlying structure by saying that the union of active features in the whole dataset is finite. We formulate regularized optimization problems to estimate simultaneously the linear coefficients in the mixtures and the non-linear parameters of the features. The optimization problems are composed of a data fidelity term and a $(\ell_1, L^p)$-penalty. We prove high probability bounds on the prediction errors associated with our estimators. The proof is based on the existence of certificate functions. Following recent works on the geometry of off-the-grid methods, we show that such functions can be constructed provided the parameters of the active features are pairwise separated by a constant with respect to a Riemannian metric. When the number of signals is finite and the noise is assumed Gaussian, we give refinements of our results for $p = 1$ and $p = 2$ using tail bounds on suprema of Gaussian and $\chi^2$ random processes. When $p = 2$, our prediction error reaches the rates obtained by the Group-Lasso estimator in the multi-task linear regression model.
Confound-leakage: Confound Removal in Machine Learning Leads to Leakage
Hamdan, Sami, Love, Bradley C., von Polier, Georg G., Weis, Susanne, Schwender, Holger, Eickhoff, Simon B., Patil, Kaustubh R.
Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed as is commonly done by featurewise removal of their variance by linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that what are null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset where accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for implementation and deployment of ML workflows and beg caution against naïve use of standard confound removal approaches.
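The featurewise confound regression described above can be sketched for a single feature as follows. This is a minimal illustration of the common approach that the abstract cautions about, not the authors' code or their proposed mitigation; the helper name `residualize` and the toy numbers are assumptions.

```python
def residualize(feature, confound):
    """Subtract the least-squares linear fit of `feature` on `confound` (hypothetical helper)."""
    n = len(feature)
    c_mean = sum(confound) / n
    f_mean = sum(feature) / n
    scc = sum((c - c_mean) ** 2 for c in confound)
    scf = sum((c - c_mean) * (f - f_mean) for c, f in zip(confound, feature))
    slope = scf / scc if scc > 0 else 0.0  # a constant confound explains nothing
    return [f - (f_mean + slope * (c - c_mean)) for f, c in zip(feature, confound)]

# Toy values (made up): one feature measured alongside one confound.
confound = [0.0, 1.0, 2.0, 3.0, 4.0]
feature = [1.0, 2.5, 2.9, 4.2, 5.1]

residuals = residualize(feature, confound)
print([round(r, 3) for r in residuals])
```

By construction the residuals have zero mean and zero linear correlation with the confound, but any nonlinear dependence between the two is left untouched.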
Sample-Specific Root Causal Inference with Latent Variables
Strobl, Eric V., Lasko, Thomas A.
Root causal analysis seeks to identify the set of initial perturbations that induce an unwanted outcome. In prior work, we defined sample-specific root causes of disease using exogenous error terms that predict a diagnosis in a structural equation model. We rigorously quantified predictivity using Shapley values. However, the associated algorithms for inferring root causes assume no latent confounding. We relax this assumption by permitting confounding among the predictors. We then introduce a corresponding procedure called Extract Errors with Latents (EEL) for recovering the error terms up to contamination by vertices on certain paths under the linear non-Gaussian acyclic model. EEL also identifies the smallest sets of dependent errors for fast computation of the Shapley values. The algorithm bypasses the hard problem of estimating the underlying causal graph in both cases. Experiments highlight the superior accuracy and robustness of EEL relative to its predecessors.
Improved Prediction of Beta-Amyloid and Tau Burden Using Hippocampal Surface Multivariate Morphometry Statistics and Sparse Coding
Wu, Jianfeng, Su, Yi, Zhu, Wenhui, Mallak, Negar Jalili, Lepore, Natasha, Reiman, Eric M., Caselli, Richard J., Thompson, Paul M., Chen, Kewei, Wang, Yalin
To be resubmitted to the Journal of Alzheimer's Disease. Please address correspondence to: Dr. Yalin Wang, School of Computing and Augmented Intelligence, Arizona State University, P.O. As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. ABSTRACT (235 WORDS) Background: Beta-amyloid (Aβ) plaques and tau protein tangles in the brain are the defining 'A' and 'T' hallmarks of Alzheimer's disease (AD) and, together with structural atrophy detectable on brain magnetic resonance imaging (MRI) scans as one of the neurodegenerative ('N') biomarkers, comprise the "ATN framework" of AD. Current methods to detect Aβ/tau pathology include cerebrospinal fluid (CSF; invasive), positron emission tomography (PET; costly and not widely available), and blood-based biomarkers (BBBM; promising but mainly still in development). Objective: To develop a non-invasive and widely available structural MRI-based framework to quantitatively predict the amyloid and tau measurements. Methods: With MRI-based hippocampal multivariate morphometry statistics (MMS) features, we apply our Patch Analysis-based Surface Correntropy-induced Sparse coding and max-pooling (PASCS-MP) method combined with the ridge regression model to individual amyloid/tau measure prediction. Results: We evaluate our framework on amyloid PET/MRI and tau PET/MRI datasets from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Each subject has one pair consisting of a PET image and MRI scan, collected at about the same time. Experimental results suggest that amyloid/tau measurements predicted with our PASCS-MP representations are closer to the real values than the measures derived from other approaches, such as hippocampal surface area, volume, and shape morphometry features based on spherical harmonics (SPHARM).
Conclusion: The MMS-based PASCS-MP is an efficient tool that can bridge hippocampal atrophy with amyloid and tau pathology and thus help assess disease burden, progression, and treatment effects. INTRODUCTION Alzheimer's disease (AD) has a progressive preclinical phase that begins many years before the onset of clinical symptoms.
Ace your Machine Learning Interview - Part 1
These days I am doing several interviews in the field of Machine Learning, as I have moved abroad and need to look for a new job. Big companies and small startups always want to make sure you know the fundamentals of Machine Learning, so I'm using some of my time to go over the basics again. I decided to share a series of articles about what you need to know to handle Machine Learning interviews, hoping it will help some of you as well. When we talk about Linear Regression, we have a set of points which, for simplicity, you can think of as plotted on a 2-dimensional plane (x: feature, y: label), and we want to fit these points with a straight line. That is, we want to find the straight line that passes right 'between' the points, as in the figure above.
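To make the line-fitting idea concrete, here is a small sketch that fits a slope and intercept by gradient descent on the mean squared error. Gradient descent is just one possible way to do the fit and is my choice for this example, not something the article prescribes; the function name `fit_line_gd`, the learning rate, the step count, and the toy points are all assumptions.

```python
def fit_line_gd(xs, ys, lr=0.05, steps=5000):
    """Fit y ~ a * x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Signed error of the current line at every point.
        errors = [(a * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of mean(error**2) with respect to a and b.
        grad_a = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Toy points (made up) that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line_gd(xs, ys)
print(f"slope={a:.4f}, intercept={b:.4f}")
```

On these points the fitted slope and intercept should approach 2 and 1. With a learning rate that is too large for the scale of the data the updates can diverge, so the default here is only suitable for small, unscaled examples like this one.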
A copula-based boosting model for time-to-event prediction with dependent censoring
Midtfjord, Alise Danielle, De Bin, Riccardo, Huseby, Arne Bang
A characteristic feature of time-to-event data analysis is possible censoring of the event time. Most statistical learning methods for handling censored data are limited by the assumption of independent censoring, even though this can lead to biased predictions when the assumption does not hold. This paper introduces Clayton-boost, a boosting approach built upon the accelerated failure time model, which uses a Clayton copula to handle the dependency between the event and censoring distributions. By taking advantage of a copula, the independent censoring assumption is no longer needed. In comparisons with commonly used methods, Clayton-boost shows a strong ability to remove prediction bias in the presence of dependent censoring and outperforms the competing methods when either the dependency strength or the percentage of censoring is considerable. The encouraging performance of Clayton-boost shows that there are indeed reasons to be critical of the independent censoring assumption, and that real-world data could benefit greatly from modelling the potential dependency.
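For readers unfamiliar with the Clayton copula mentioned above, the sketch below evaluates the standard bivariate form $C_\theta(u, v) = (u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$ for $\theta > 0$, together with the Kendall's tau it implies, $\theta / (\theta + 2)$. It does not show how Clayton-boost embeds the copula in its boosting procedure; the function names and the restriction to $\theta > 0$ are assumptions of this example.

```python
def clayton_copula(u, v, theta):
    """Bivariate Clayton copula C(u, v) for theta > 0 and u, v in (0, 1]."""
    if theta <= 0:
        raise ValueError("this sketch only covers theta > 0")
    if not (0 < u <= 1 and 0 < v <= 1):
        raise ValueError("u and v must lie in (0, 1]")
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def kendall_tau(theta):
    """Kendall's tau implied by a Clayton copula with parameter theta > 0."""
    return theta / (theta + 2.0)

# Larger theta means stronger positive dependence between the two margins.
for theta in (0.5, 2.0, 8.0):
    value = clayton_copula(0.3, 0.7, theta)
    print(f"theta={theta}: C(0.3, 0.7)={value:.4f}, tau={kendall_tau(theta):.3f}")
```

As theta approaches zero the copula approaches the independence case `u * v`, and as theta grows it approaches `min(u, v)`.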
High-dimensional Measurement Error Models for Lipschitz Loss
Recently emerging large-scale biomedical data pose exciting opportunities for scientific discoveries. However, the ultrahigh dimensionality and non-negligible measurement errors in the data may create difficulties in estimation. Few methods exist for high-dimensional covariates with measurement error, and those that do usually require knowledge of the noise distribution and focus on linear or generalized linear models. In this work, we develop high-dimensional measurement error models for a class of Lipschitz loss functions that encompasses logistic regression, hinge loss and quantile regression, among others. Our estimator is designed to minimize the $L_1$ norm among all estimators belonging to suitable feasible sets, without requiring any knowledge of the noise distribution. Subsequently, we generalize these estimators to a Lasso analog version that is computationally scalable to higher dimensions. We derive theoretical guarantees in terms of finite sample statistical error bounds and sign consistency, even when the dimensionality increases exponentially with the sample size. Extensive simulation studies demonstrate superior performance compared to existing methods in classification and quantile regression problems. An application to a gender classification task based on brain functional connectivity in the Human Connectome Project data illustrates improved accuracy under our approach, and the ability to reliably identify significant brain connections that drive gender differences.
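The three loss functions named above can be written out in a few lines. The sketch below uses their textbook forms and is not the paper's estimator; the function names and the convention that class labels are coded as -1 or +1 are assumptions of this example.

```python
import math

def logistic_loss(y, score):
    """log(1 + exp(-y * score)) for a label y in {-1, +1}, computed stably."""
    margin = y * score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # Equivalent form that avoids overflow when the margin is very negative.
    return -margin + math.log1p(math.exp(margin))

def hinge_loss(y, score):
    """max(0, 1 - y * score) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def quantile_loss(y, prediction, tau):
    """Pinball loss for quantile level tau in (0, 1)."""
    residual = y - prediction
    return residual * (tau - (1.0 if residual < 0 else 0.0))

print(logistic_loss(1, 0.0))       # log(2) at a zero score
print(hinge_loss(1, 0.5))          # inside the margin, so the loss is positive
print(quantile_loss(3.0, 1.0, 0.9))  # under-prediction is penalized by tau
```

Each loss changes by at most a fixed multiple of the change in `score` or `prediction` (a factor of 1 for the logistic and hinge losses with labels of magnitude 1, and `max(tau, 1 - tau)` for the pinball loss), which is the Lipschitz property the abstract refers to.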
Graph-Regularized Tensor Regression: A Domain-Aware Framework for Interpretable Multi-Way Financial Modelling
Xu, Yao Lei, Konstantinidis, Kriton, Mandic, Danilo P.
Analytics of financial data is inherently a Big Data paradigm, as such data are collected over many assets, asset classes, countries, and time periods. This represents a challenge for modern machine learning models, as the number of model parameters needed to process such data grows exponentially with the data dimensions; an effect known as the Curse-of-Dimensionality. Recently, Tensor Decomposition (TD) techniques have shown promising results in reducing the computational costs associated with large-dimensional financial models while achieving comparable performance. However, tensor models are often unable to incorporate the underlying economic domain knowledge. To this end, we develop a novel Graph-Regularized Tensor Regression (GRTR) framework, whereby knowledge about cross-asset relations is incorporated into the model in the form of a graph Laplacian matrix. This is then used as a regularization tool to promote an economically meaningful structure within the model parameters. By virtue of tensor algebra, the proposed framework is shown to be fully interpretable, both coefficient-wise and dimension-wise. The GRTR model is validated in a multi-way financial forecasting setting and compared against competing models, and is shown to achieve improved performance at reduced computational costs. Detailed visualizations are provided to help the reader gain an intuitive understanding of the employed tensor operations.
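As background for the graph Laplacian regularizer mentioned above, the sketch below builds the Laplacian $L = D - A$ from a symmetric adjacency matrix and evaluates the quadratic penalty $w^\top L w$ on a plain coefficient vector. How GRTR applies this penalty to tensor factors is not shown here; the function names and the three-asset toy graph are assumptions of this example.

```python
def graph_laplacian(adjacency):
    """Return L = D - A for a symmetric adjacency matrix given as nested lists."""
    n = len(adjacency)
    laplacian = [[-adjacency[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        # Degree of node i: total weight of its edges, ignoring any self-loop.
        laplacian[i][i] = sum(adjacency[i][j] for j in range(n) if j != i)
    return laplacian

def laplacian_penalty(weights, laplacian):
    """Quadratic form w^T L w, which is small when linked nodes have similar weights."""
    n = len(weights)
    return sum(weights[i] * laplacian[i][j] * weights[j]
               for i in range(n) for j in range(n))

# Toy graph (made up): asset 0 is linked to asset 1, and asset 1 to asset 2.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
L = graph_laplacian(adjacency)

print(laplacian_penalty([1.0, 2.0, 4.0], L))  # coefficients differ across edges
print(laplacian_penalty([3.0, 3.0, 3.0], L))  # identical coefficients
```

For a symmetric adjacency matrix the penalty equals the sum over edges of `A[i][j] * (w[i] - w[j]) ** 2`, so it vanishes when connected nodes share the same coefficient.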