 Regression


Data Science Interview Questions

#artificialintelligence

A fresh scrape from Glassdoor gives us a good idea about what applicants are asked during a data scientist interview at some of the top companies. Unfortunately for us, almost every company has their interviewees sign NDAs. Since Glassdoor allows anonymity, a few brave souls have given us some fantastic examples of what they were asked during the interview process at top companies like Facebook, Google, and Microsoft. If you find yourself unable to answer some of the questions below, consider checking out a course or a book on the subject. If you'd like to share your answer(s) to any of the questions, leave a comment and I'll add the top ones to the post.


How To Interpret R-squared and Goodness-of-Fit in Regression Analysis

@machinelearnbot

This article was written by Jim Frost from Minitab. He came to Minitab with a background in a wide variety of academic research. His role was the "data/stat guy" on research projects that ranged from osteoporosis prevention to quantitative studies of online user behavior. Essentially, his job was to design the appropriate research conditions, accurately generate a vast sea of measurements, and then pull out patterns and meanings from it. After you have fit a linear model using regression analysis, ANOVA, or design of experiments (DOE), you need to determine how well the model fits the data. To help you out, Minitab statistical software presents a variety of goodness-of-fit statistics.
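
As a rough illustration of the kind of goodness-of-fit statistic discussed here, the sketch below computes R-squared for a simple least-squares line in plain Python. It is not Minitab output or code from the article; the function names `fit_line` and `r_squared` and the sample data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(ys, predictions):
    """R-squared = 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all responses are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical data: roughly linear with some scatter.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]
print("R-squared:", r_squared(ys, predictions))
```

A value near 1 means the fitted line accounts for most of the variation in the response around its mean. A model that only ever predicts the mean scores 0.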


Robust Regression via Mutivariate Regression Depth

arXiv.org Machine Learning

This paper studies robust regression in the settings of Huber's $\epsilon$-contamination models. We consider estimators that are maximizers of multivariate regression depth functions. These estimators are shown to achieve minimax rates in the settings of $\epsilon$-contamination models for various regression problems including nonparametric regression, sparse linear regression, reduced rank regression, etc. We also discuss a general notion of depth function for linear operators that has potential applications in robust functional linear regression.
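
For readers unfamiliar with the setting, Huber's $\epsilon$-contamination model is usually written as a mixture of the model distribution and an arbitrary contaminating distribution. The symbols $P_{\theta}$ and $Q$ below are notation assumed for this note and do not appear in the abstract.

```latex
\[
  P_{\epsilon} \;=\; (1 - \epsilon)\, P_{\theta} \;+\; \epsilon\, Q,
  \qquad 0 \le \epsilon < 1
\]
```

Here $P_{\theta}$ is the distribution the regression model describes, $Q$ is unconstrained, and $\epsilon$ is the fraction of observations that may come from $Q$.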


Random Forest regression for manifold-valued responses

arXiv.org Machine Learning

An increasing array of biomedical and computer vision applications requires the predictive modeling of complex data, for example images and shapes. The main challenge when predicting such objects lies in the fact that they do not comply with the assumptions of Euclidean geometry. Rather, they occupy non-linear spaces, a.k.a. manifolds, where it is difficult to define concepts such as coordinates, vectors and expected values. In this work, we construct a non-parametric predictive methodology for manifold-valued objects, based on a distance modification of the Random Forest algorithm. Our method is versatile and can be applied both when the response space is a well-defined manifold and when such knowledge is not available. Model fitting and prediction phases only require the definition of a suitable distance function for the observed responses. We validate our methodology using simulations and apply it to a series of illustrative image completion applications, showcasing superior predictive performance compared to various established regression methods.


23 types of regression

@machinelearnbot

This contribution is from David Corliss. David teaches a class on this subject, giving a (very brief) description of 23 regression methods in just an hour, with an example and the package and procedures used for each case. Here you can check the webcast done for Central Michigan University. The slide deck can be found here. Below is the presentation transcript.


Enumerate Lasso Solutions for Feature Selection

AAAI Conferences

We propose an algorithm for enumerating solutions to the Lasso regression problem. In ordinary Lasso regression, one global optimum is obtained and the resulting features are interpreted as task-relevant features. However, this can overlook possibly relevant features not selected by the Lasso. With the proposed method, we can enumerate many possible feature sets for human inspection, thus recording all the important features. We prove that by enumerating solutions, we can recover a true feature set exactly under less restrictive conditions compared with the ordinary Lasso. We confirm our theoretical results also in numerical simulations. Finally, in the gene expression and the text data, we demonstrate that the proposed method can enumerate a wide variety of meaningful feature sets, which are overlooked by the global optima.
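
The point that a single Lasso optimum can overlook relevant features can be seen in a toy case. The sketch below is not the paper's enumeration algorithm. It is a plain coordinate-descent Lasso, written under the assumption of the common objective (1/(2n))·||y − Xβ||² + λ·||β||₁, run on two identical feature columns. All function names and data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_objective(X, y, beta, lam):
    """(1 / (2n)) * squared error + lam * L1 norm of beta."""
    n = len(y)
    residuals = [yi - sum(x * b for x, b in zip(row, beta))
                 for row, yi in zip(X, y)]
    return sum(r * r for r in residuals) / (2 * n) + lam * sum(abs(b) for b in beta)


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent; assumes no feature column is all zeros."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for row, yi in zip(X, y):
                partial = yi - sum(row[k] * beta[k] for k in range(p) if k != j)
                rho += row[j] * partial
                z += row[j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Two identical columns: both features are equally "relevant".
base = [-2.0, -1.0, 1.0, 2.0]
X = [[x, x] for x in base]
y = [2.0 * x for x in base]
lam = 0.5

beta = lasso_coordinate_descent(X, y, lam)
print("coordinate descent solution:", beta)
print("objective at that solution: ", lasso_objective(X, y, beta, lam))
print("objective with weight moved:", lasso_objective(X, y, [0.0, beta[0]], lam))
```

Coordinate descent puts essentially all the weight on the first column, yet moving that weight to the second column gives the same objective value. Both feature sets are optimal, and a single solver run reports only one of them.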


ATSUM: Extracting Attractive Summaries for News Propagation on Microblogs

AAAI Conferences

In this paper, we investigate how to automatically extract attractive summaries for news propagation on microblogs and propose a novel system called ATSUM to achieve this goal via text attractiveness analysis. It first analyzes the sentences in a news article and automatically predicts the attractiveness score of each sentence using support vector regression. The predicted attractiveness scores are then incorporated into a summarization system. Experimental results on a manually labeled dataset verify the effectiveness of the proposed methods.


Android Malware Detection with Weak Ground Truth Data

AAAI Conferences

For Android malware detection, precise ground truth is a rare commodity. As security knowledge evolves, what may be considered ground truth at one moment in time may change, and apps once considered benign may turn out to be malicious. The inevitable noise in data labels poses a challenge to inferring effective machine learning classifiers. Our work is focused on approaches for learning classifiers for Android malware detection in a manner that is methodologically sound with regard to the uncertain and ever-changing ground truth in the problem space. We leverage the fact that although data labels are unavoidably noisy, a malware label is much more precise than a benign label. While you can be confident that an app is malicious, you can never be certain that a benign app is really benign, or just undetected malware. Based on this insight, we leverage a modified Logistic Regression classifier that allows us to learn from only positive and unlabeled data, without making any assumptions about benign labels. We find Label Regularized Logistic Regression to perform well for noisy app datasets, as well as datasets where there is a limited amount of positive labeled data, both of which are representative of real-world situations.


Extracting Urban Microclimates from Electricity Bills

AAAI Conferences

Sustainable energy policies are of growing importance in all urban centers. Climate, and climate change, will play increasingly important roles in these policies. Climate zones defined by the California Energy Commission have long been influential in energy management. For example, recently a two-zone division of Los Angeles (defined by historical temperature averages) was introduced for electricity rate restructuring. The importance of climate zones has been enormous, and climate change could make them still more important. AI can provide improvements on the ways climate zones are derived and managed. This paper reports on analysis of aggregate household electricity consumption (EC) data from local utilities in Los Angeles, seeking possible improvements in energy management. In this analysis we noticed that EC data permits identification of interesting geographical zones: regions having EC patterns that are characteristically different from surrounding regions. We believe these zones could be useful in a variety of urban models.


TaGiTeD: Predictive Task Guided Tensor Decomposition for Representation Learning from Electronic Health Records

AAAI Conferences

With the better availability of healthcare data, such as Electronic Health Records (EHR), more and more data analytics methodologies are being developed to extract insights from them and improve the quality of care delivery. There are many challenges in analyzing EHRs, such as high dimensionality and event sparsity. Moreover, unlike in other application domains, EHR analysis algorithms need to be highly interpretable to be clinically useful. This makes representation learning from EHRs of key importance. In this paper, we propose an algorithm called Predictive Task Guided Tensor Decomposition (TaGiTeD) to analyze EHRs. Specifically, TaGiTeD uses supervised tensor decomposition to learn event interaction patterns from EHRs that are highly predictive for certain tasks. Compared with unsupervised methods, TaGiTeD can learn effective EHR representations in a more focused way. This is crucial because most medical problems have very limited patient samples, which are not enough for unsupervised algorithms to learn meaningful representations from. We apply TaGiTeD to a real-world EHR data warehouse and demonstrate that it can learn representations that are both interpretable and predictive.