
 Regression


Exploring NYC Taxi Data with Microsoft R Server and HDInsight

#artificialintelligence

As I mentioned yesterday, Microsoft R Server is now available for HDInsight, which means that you can now run R code (including the big-data algorithms of Microsoft R Server) on a managed, cloud-based Hadoop instance. Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (a binary classification problem), and (2) the amount of the tip given (a regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R. To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows).
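To make the two prediction targets concrete, here is a minimal single-machine sketch in plain Python. It is not the distributed R code the authors ran: the data is synthetic, the single predictor (fare) and all variable names are assumptions chosen for illustration, and only the 75/25 split proportion is taken from the description above.

```python
import random


def fit_simple_ols(xs, ys):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic stand-in for the taxi data (NOT real values): fare and tip in dollars.
rng = random.Random(0)
fares = [rng.uniform(5, 60) for _ in range(1000)]
tips = [max(0.0, 0.15 * f + rng.gauss(0, 1.0)) for f in fares]

# 75% training / 25% test split, mirroring the proportion mentioned above.
split = int(0.75 * len(fares))
train_x, test_x = fares[:split], fares[split:]
train_y, test_y = tips[:split], tips[split:]

# Target (2): regression on the tip amount.
intercept, slope = fit_simple_ols(train_x, train_y)
squared_errors = [(intercept + slope * x - y) ** 2 for x, y in zip(test_x, test_y)]
rmse = (sum(squared_errors) / len(squared_errors)) ** 0.5

# Target (1): binary label for "was any tip given?", derived from the same column.
tipped = [1 if t > 0 else 0 for t in tips]
```

The same tip column yields both targets: the raw amount for regression and a thresholded 0/1 label for classification.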


[Question] Reduced Error Logistic Regression (RELR) • /r/MachineLearning

@machinelearnbot

I came across a book titled Calculus of Thought: Neuromorphic Logistic Regression in Cognitive Machines. It introduces a method called reduced error logistic regression (RELR). Does anyone know anything about it? Yes, but I thought it was better to ask before wasting one week to understand and reproduce the results like I did with Numenta's HTM.


Score Spark-built machine learning models

#artificialintelligence

This topic describes how to load machine learning (ML) models that have been built using Spark MLlib and stored in Azure Blob Storage (WASB), and how to score them with datasets that have also been stored in WASB. It shows how to pre-process the input data, how to transform features using the indexing and encoding functions in the MLlib toolkit, and how to create a labeled point data object that can be used as input for scoring with the ML models. The models used for scoring include Linear Regression, Logistic Regression, Random Forest Models, and Gradient Boosting Tree Models. You need an Azure account and an HDInsight Spark cluster to begin this walkthrough. See the Overview of Data Science using Spark on Azure HDInsight for these requirements, for a description of the NYC 2013 Taxi data used here, and for instructions on how to execute code from a Jupyter notebook on the Spark cluster.
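The idea of indexing a categorical feature, encoding it, and pairing the result with a label can be sketched in plain Python. This is an illustration of the concept only, not the MLlib API: the helper names, the record layout, and the sample rows are all hypothetical.

```python
from collections import Counter


def build_index(values):
    """Map each category to an integer, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: i for i, v in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given position."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Hypothetical records: (tipped label, payment type, trip distance).
rows = [(1, "CRD", 2.5), (0, "CSH", 1.0), (1, "CRD", 7.2)]
payment_index = build_index(r[1] for r in rows)


def to_labeled_point(row):
    """Build a (label, feature vector) pair: one-hot payment type plus the numeric distance."""
    label, payment, distance = row
    features = one_hot(payment_index[payment], len(payment_index)) + [distance]
    return float(label), features


labeled = [to_labeled_point(r) for r in rows]
```

A scoring step would then pass each feature vector to a loaded model; the index built from the training data has to be reused at scoring time so that categories map to the same positions.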


Jackknife logistic and linear regression for clustering and predictions

@machinelearnbot

This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables. Our goal is to produce a regression tool that can be used as a black box, is very robust and parameter-free, and is usable and easy to interpret by non-statisticians. It is part of a bigger project: automating many fundamental data science tasks, to make them easy, scalable and cheap for data consumers, not just for data experts. Readers are invited to further formalize the technology outlined here, and challenge my proposed methodology.


10 types of regressions. Which one to use?

@machinelearnbot

The CRAN task view: "Robust statistical methods" gives a long list of regression methods, including many that Vincent mentions. Here are some that are not mentioned there: Regression in unusual spaces. It is usually addressed under the title "Compositional data" (see Wikipedia entry). The late John Aitchison founded this area of statistics. Googling his name together with "compositional data" gives access to a number of his articles.


Collection of Machine Learning Interview Questions

#artificialintelligence

Here is the link to the Coursera course for NLP. Pick the software from the Stanford NLP (Natural Language Processing) Group and input some text to view its parse tree, named entities, part-of-speech tags, etc.


Predicting ICU Mortality Risk by Grouping Temporal Trends from a Multivariate Panel of Physiologic Measurements

AAAI Conferences

ICU mortality risk prediction may help clinicians take effective interventions to improve patient outcome. Existing machine learning approaches often face challenges in integrating a comprehensive panel of physiologic variables and presenting to clinicians interpretable models. We aim to improve both accuracy and interpretability of prediction models by introducing Subgraph Augmented Non-negative Matrix Factorization (SANMF) on ICU physiologic time series. SANMF converts time series into a graph representation and applies frequent subgraph mining to automatically extract temporal trends. We then apply non-negative matrix factorization to group trends in a way that approximates patient pathophysiologic states. Trend groups are then used as features in training a logistic regression model for mortality risk prediction, and are also ranked according to their contribution to mortality risk. We evaluated SANMF against four empirical models on the task of predicting mortality or survival 30 days after discharge from ICU using the observed physiologic measurements between 12 and 24 hours after admission. SANMF outperforms all comparison models, and in particular, demonstrates an improvement in AUC (0.848 vs. 0.827, p<0.002) compared to a state-of-the-art machine learning method that uses manual feature engineering. Feature analysis was performed to illuminate insights and benefits of subgraph groups in mortality risk prediction.
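For readers unfamiliar with the AUC metric quoted above, the following is a minimal sketch of how it can be computed from predicted risk scores. The labels and scores are made-up toy values, not data from the paper, and the pairwise definition used here is one standard formulation rather than the authors' evaluation code.

```python
def auc(labels, scores):
    """Probability that a random positive case is scored above a random negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = died, 0 = survived; scores are predicted mortality risks.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.3, 0.4, 0.6, 0.1]
toy_auc = auc(toy_labels, toy_scores)
```

Here five of the six positive-negative pairs are ordered correctly, so the toy AUC is 5/6. The quadratic loop is fine for illustration; a rank-based computation would be preferable for large cohorts.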


Adaptable Regression Method for Ensemble Consensus Forecasting

AAAI Conferences

Accurate weather forecasts enhance sustainability by facilitating decision making across a broad range of endeavors including public safety, transportation, energy generation and management, retail logistics, emergency preparedness, and many others. This paper presents a method for combining multiple scalar forecasts to obtain deterministic predictions that are generally more accurate than any of the constituents. Exponentially-weighted forecast bias estimates and error covariance matrices are formed at observation sites, aggregated spatially and temporally, and used to formulate a constrained, regularized least squares regression problem that may be solved using quadratic programming. The model is re-trained when new observations arrive, updating the forecast bias estimates and consensus combination weights to adapt to weather regime and input forecast model changes. The algorithm is illustrated for 0-72 hour temperature forecasts at over 1200 sites in the contiguous U.S. based on a 22-member forecast ensemble, and its performance over multiple seasons is compared to a state-of-the-art ensemble-based forecasting system. In addition to weather forecasts, this approach to consensus may be useful for ensemble predictions of climate, wind energy, solar power, energy demand, and numerous other quantities.
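The bias-correction and weighting ideas can be sketched as follows. This is a deliberately simplified illustration, not the paper's method: it assumes the forecast errors are uncorrelated (a diagonal error covariance), in which case minimizing the combined error variance subject to the weights summing to one has the closed-form inverse-variance solution, so no quadratic programming solver is needed. The smoothing factor alpha, the function names, and the numbers are assumptions.

```python
def update_bias(prev_bias, forecast, observation, alpha=0.1):
    """Exponentially weighted estimate of forecast bias (forecast minus observation)."""
    return (1 - alpha) * prev_bias + alpha * (forecast - observation)


def consensus_weights(error_vars):
    """Weights minimizing sum_i w_i**2 * var_i subject to sum_i w_i = 1.

    With uncorrelated errors the solution is w_i proportional to 1 / var_i,
    which is automatically non-negative.
    """
    inverse = [1.0 / v for v in error_vars]
    total = sum(inverse)
    return [x / total for x in inverse]


def consensus(forecasts, biases, weights):
    """Weighted combination of bias-corrected forecasts."""
    return sum(w * (f - b) for f, b, w in zip(forecasts, biases, weights))


# Two hypothetical temperature forecasts (degrees C) with assumed error variances.
weights = consensus_weights([1.0, 4.0])
combined = consensus([21.0, 19.0], [1.0, -1.0], weights)
```

Calling update_bias each time a new observation arrives mirrors the re-training step described above: recent errors count more, so the estimates can follow changes in weather regime or in the input models. With correlated errors, the full covariance matrix and a constrained solver would be needed instead.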


Structured Output Prediction for Semantic Perception in Autonomous Vehicles

AAAI Conferences

A key challenge in the realization of autonomous vehicles is the machine's ability to perceive its surrounding environment. This task is tackled through a model that partitions vehicle camera input into distinct semantic classes, by taking into account visual contextual cues. The use of structured machine learning models is investigated, which not only allow for complex input, but also arbitrarily structured output. Towards this goal, an outdoor road scene dataset is constructed with accompanying fine-grained image labelings. For coherent segmentation, a structured predictor is modeled to encode label distributions conditioned on the input images. After optimizing this model through max-margin learning, based on an ontological loss function, efficient classification is realized via graph cuts inference using alpha-expansion. Both quantitative and qualitative analyses demonstrate that by taking into account contextual relations between pixel segmentation regions within a second-degree neighborhood, spurious label assignments are filtered out, leading to highly accurate semantic segmentations for outdoor scenes.


Accelerated Sparse Linear Regression via Random Projection

AAAI Conferences

In this paper, we present an accelerated numerical method based on random projection for sparse linear regression. Previous studies have shown that under appropriate conditions, gradient-based methods enjoy a geometric convergence rate when applied to this problem. However, the time complexity of evaluating the gradient is as large as $\mathcal{O}(nd)$, where $n$ is the number of data points and $d$ is the dimensionality, making those methods inefficient for large-scale and high-dimensional datasets. To address this limitation, we first utilize random projection to find a rank-$k$ approximator for the data matrix, and reduce the cost of gradient evaluation to $\mathcal{O}(nk+dk)$, a significant improvement when $k$ is much smaller than $d$ and $n$. Then, we solve the sparse linear regression problem via a proximal gradient method with a homotopy strategy to generate sparse intermediate solutions. Theoretical analysis shows that our method also achieves a global geometric convergence rate, and moreover that the sparsity of all the intermediate solutions is well-bounded over the iterations. Finally, we conduct experiments to demonstrate the efficiency of the proposed method.
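The cost reduction can be seen from a short derivation. The following is a sketch under assumptions not stated in the abstract: the objective is taken to be the standard $\ell_1$-regularized least squares problem, and the rank-$k$ approximator is written as $\hat{X} = ZW^\top$ with $Z \in \mathbb{R}^{n \times k}$ and $W \in \mathbb{R}^{d \times k}$ (the symbols $Z$, $W$, $\lambda$, and $\eta$ are introduced here for illustration).

$$
\min_{\beta \in \mathbb{R}^d} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1,
\qquad
\nabla f(\beta) = \frac{1}{n} X^\top (X\beta - y) \quad \text{at cost } \mathcal{O}(nd).
$$

Replacing $X$ by $\hat{X} = ZW^\top$ and evaluating from right to left gives

$$
\hat{\nabla} f(\beta) = \frac{1}{n} \, W \Big( Z^\top \big( Z (W^\top \beta) - y \big) \Big),
$$

where $W^\top \beta$ costs $\mathcal{O}(dk)$, the two products with $Z$ and $Z^\top$ cost $\mathcal{O}(nk)$ each, and the final product with $W$ costs $\mathcal{O}(dk)$, for a total of $\mathcal{O}(nk + dk)$. A proximal gradient step with step size $\eta$ then applies soft-thresholding componentwise:

$$
\beta^{+} = S_{\eta\lambda}\big(\beta - \eta \, \hat{\nabla} f(\beta)\big),
\qquad
\big[S_{\tau}(u)\big]_j = \operatorname{sign}(u_j) \max\big(\lvert u_j \rvert - \tau,\, 0\big).
$$

The thresholding sets small coordinates exactly to zero, which is what keeps the intermediate solutions sparse.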