 Regression


Artificial Intelligence vs Business Intelligence - Learn 6 Useful Comparison

#artificialintelligence

Business Intelligence is a technology used to gather, store, access, and analyze data to help business users make better decisions. Artificial Intelligence, on the other hand, is a way to make a computer, a computer-controlled robot, or a piece of software think intelligently like humans. Artificial Intelligence is based on studying how humans think, learn, decide, and work to resolve an issue, and then using the outcome of that study as the basis for developing intelligent software and systems. Breadth-first search starts from the root node, explores its neighbor nodes first, and then moves to the next-level neighbor nodes. It provides the shortest path to the solution and can be implemented using a FIFO (first in, first out) data structure. Depth-first search is implemented using a LIFO (last in, first out) data structure. It creates the same nodes as breadth-first search and differs only in the order. In each iteration, it stores the nodes from root to leaf, and it cannot check for duplicate nodes. One algorithm makes predictions using the Bayes algorithm, which derives the probability of a prediction from the underlying evidence, as observed in the data. In uniform-cost search, nodes are sorted by increasing path cost. It always expands the least-cost node, so it explores paths in increasing order of cost. This search is identical to breadth-first search if each transition has the same cost. Another algorithm implements logistic regression for classification of binary targets and linear regression for continuous targets. It supports confidence bounds for prediction probabilities and also for predictions. Iterative deepening performs a depth-first search to level 1 and starts over, then executes a complete depth-first search to level 2, and continues until it finds the solution.
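To make the difference between the FIFO-based breadth-first search and the least-cost-first search described above concrete, here is a minimal sketch in Python. The graph, the node names, and the function names `bfs_path` and `uniform_cost_path` are hypothetical and chosen only for illustration; they do not come from the article.

```python
from collections import deque
import heapq


def bfs_path(graph, start, goal):
    """Breadth-first search: expand nodes level by level using a FIFO queue."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()  # FIFO: the oldest path is expanded first
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, {}):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def uniform_cost_path(graph, start, goal):
    """Uniform-cost search: always expand the cheapest path found so far."""
    frontier = [(0, [start])]  # min-heap ordered by accumulated path cost
    best_cost = {}
    while frontier:
        cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:
            return cost, path
        if node in best_cost and best_cost[node] <= cost:
            continue  # already expanded this node via a path at least as cheap
        best_cost[node] = cost
        for neighbor, step_cost in graph.get(node, {}).items():
            heapq.heappush(frontier, (cost + step_cost, path + [neighbor]))
    return None


# Hypothetical graph: the direct edge A -> D is expensive, the detour via B is cheap.
graph = {"A": {"B": 1, "D": 10}, "B": {"D": 1}, "D": {}}

fewest_edges = bfs_path(graph, "A", "D")
cheapest = uniform_cost_path(graph, "A", "D")
```

On this graph, breadth-first search returns the path with the fewest edges, while uniform-cost search returns the path with the lowest total cost. If every edge had the same cost, the two would agree on the path length, which matches the remark that the searches coincide when each transition costs the same.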


Linear Regression -- Detailed View - Towards Data Science

#artificialintelligence

Linear regression is used to find a linear relationship between a target and one or more predictors. There are two types of linear regression: simple and multiple. Simple linear regression is useful for finding the relationship between two continuous variables. One is the predictor, or independent variable, and the other is the response, or dependent variable. It looks for a statistical relationship, not a deterministic one.
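As a minimal sketch of simple linear regression, the ordinary least-squares slope and intercept for one predictor can be computed directly from the data. The function name `fit_simple_linear_regression` and the sample values are hypothetical; the noise in `y` illustrates that the fitted relationship is statistical rather than deterministic.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of the predictor.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: roughly y = 2x, with small deviations from the exact line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_linear_regression(x, y)
predicted = intercept + slope * 6  # prediction for a new predictor value
```

The fitted line does not pass through every point; it only minimizes the sum of squared residuals.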


Detecting non-causal artifacts in multivariate linear regression models

arXiv.org Machine Learning

We consider linear models where $d$ potential causes $X_1,...,X_d$ are correlated with one target quantity $Y$ and propose a method to infer whether the association is causal or whether it is an artifact caused by overfitting or hidden common causes. We employ the idea that in the former case the vector of regression coefficients has 'generic' orientation relative to the covariance matrix $\Sigma_{XX}$ of $X$. Using an ICA-based model for confounding, we show that both confounding and overfitting yield regression vectors that concentrate mainly in the space of low eigenvalues of $\Sigma_{XX}$.


A brain signature highly predictive of future progression to Alzheimer's dementia

arXiv.org Machine Learning

Early prognosis of Alzheimer's dementia is hard. Mild cognitive impairment (MCI) typically precedes Alzheimer's dementia, yet only a fraction of MCI individuals will progress to dementia, even when screened using biomarkers. We propose here to identify a subset of individuals who share a common brain signature highly predictive of oncoming dementia. This signature was composed of brain atrophy and functional dysconnectivity and discovered using a machine learning model in patients suffering from dementia. The model recognized the same brain signature in MCI individuals, 90% of whom progressed to dementia within three years. This result is a marked improvement on the state of the art in prognostic precision, while the brain signature still identified 47% of all MCI progressors. We thus discovered a sizable MCI subpopulation which represents an excellent recruitment target for clinical trials at the prodromal stage of Alzheimer's disease. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. Preprint submitted to March 5, 2018. 1. Introduction. Alzheimer's disease (AD) is the most common age-related neurodegenerative disorder. The typical progression of late-onset, sporadic AD comprises a lengthy preclinical stage, a prodromal stage of mild cognitive impairment (MCI), and a final stage of dementia. Usually, by the time patients suffer from dementia, severe and irreversible neurodegeneration has already occurred.


Predicting Film Ratings With Simple Linear Regression

#artificialintelligence

First of all, imputing film data is simply not very effective. Picking mean budget or runtime values is also questionable, since budget values, for example, increase over time with inflation and other factors (note: I did not account for inflation, at least in this iteration of the project). I also avoided engineering new features. Although I could have created genre or keyword clusters, keywords on IMDb were bizarre, unfitting, and even inappropriate.


Distributed multivariable modeling for signature development under data protection constraints

arXiv.org Machine Learning

Data protection constraints frequently require distributed analysis of data, i.e. individual-level data remains at many different sites, but analysis nevertheless has to be performed jointly. The data exchange is often handled manually, requiring explicit permission before transfer, i.e. the number of data calls and the amount of data should be limited. Thus, only simple summary statistics are typically transferred and aggregated with just a single call, but this does not allow for complex statistical techniques, e.g., automatic variable selection for prognostic signature development. We propose a multivariable regression approach for building a prognostic signature by automatic variable selection that is based on aggregated data from different locations in iterative calls. To minimize the amount of transferred data and the number of calls, we also provide a heuristic variant of the approach. To further strengthen data protection, the approach can also be combined with a trusted third party architecture. We evaluate our proposed method in a simulation study comparing our results to the results obtained with the pooled individual data. The proposed method is seen to be able to detect covariates with true effect to a comparable extent as a method based on individual data, although the performance is moderately decreased if the number of sites is large. In a typical scenario, the heuristic decreases the number of data calls from more than 10 to 3. To make our approach widely available for application, we provide an implementation on top of the DataSHIELD framework.
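The idea of transferring only summary statistics can be illustrated with a much simpler case than the variable-selection approach the abstract proposes. The sketch below is not the authors' method and does not use the DataSHIELD API; it only shows, for a single-predictor least-squares fit, that per-site sums aggregated in one call reproduce the fit on the pooled individual data. The function names, dictionary keys, and site data are hypothetical.

```python
def site_summary(x, y):
    """Compute the aggregate sums a site would share instead of individual-level data."""
    return {
        "n": len(x),
        "sx": sum(x),
        "sy": sum(y),
        "sxy": sum(a * b for a, b in zip(x, y)),
        "sxx": sum(a * a for a in x),
    }


def fit_from_summaries(summaries):
    """Fit y = intercept + slope * x using only the summed statistics from all sites."""
    n = sum(s["n"] for s in summaries)
    sx = sum(s["sx"] for s in summaries)
    sy = sum(s["sy"] for s in summaries)
    sxy = sum(s["sxy"] for s in summaries)
    sxx = sum(s["sxx"] for s in summaries)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data held at two separate sites; raw values never leave the site.
site_a = site_summary([1, 2, 3], [2.1, 3.9, 6.2])
site_b = site_summary([4, 5], [7.8, 10.1])

intercept, slope = fit_from_summaries([site_a, site_b])
```

Each site transmits five numbers regardless of how many individuals it holds. More complex techniques, such as the automatic variable selection discussed in the abstract, generally need more than one such exchange, which is why the number of calls becomes a concern.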


Interval-based Prediction Uncertainty Bound Computation in Learning with Missing Values

arXiv.org Machine Learning

The problem of machine learning with missing values is common in many areas. A simple approach is to first construct a dataset without missing values simply by discarding instances with missing entries or by imputing a fixed value for each missing entry, and then train a prediction model with the new dataset. A drawback of this naive approach is that the uncertainty in the missing entries is not properly incorporated in the prediction. In order to evaluate prediction uncertainty, the multiple imputation (MI) approach has been studied, but the performance of MI is sensitive to the choice of the probabilistic model of the true values in the missing entries, and the computational cost of MI is high because multiple models must be trained. In this paper, we propose an alternative approach called the Interval-based Prediction Uncertainty Bounding (IPUB) method. The IPUB method represents the uncertainties due to missing entries as intervals, and efficiently computes the lower and upper bounds of the prediction results when all possible training sets constructed by imputing arbitrary values in the intervals are considered. The IPUB method can be applied to a wide class of convex learning algorithms including penalized least-squares regression, support vector machine (SVM), and logistic regression. We demonstrate the advantages of the IPUB method by comparing it with an existing method in numerical experiments with benchmark datasets.


Subspace-Induced Gaussian Processes

arXiv.org Machine Learning

We present a new Gaussian process (GP) regression model where the covariance kernel is indexed or parameterized by a sufficient dimension reduction subspace of a reproducing kernel Hilbert space. The covariance kernel will be low-rank while capturing the statistical dependency of the response on the covariates; this affords significant improvement in computational efficiency as well as a potential reduction in the variance of predictions. We develop a fast Expectation-Maximization algorithm for estimating the parameters of the subspace-induced Gaussian process (SIGP). Extensive results on real data show that SIGP can outperform the standard full GP even with a low-rank inducing subspace of rank $m$, $m\leq 3$.


Regression Analysis

@machinelearnbot

I am doing some regression analysis. Some of the independent variables are continuous, while some are categorical. The dependent variable is continuous. Can you please help me decide which regression model I should pick?


Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream

arXiv.org Machine Learning

In marketing we are often confronted with a continuous stream of responses to marketing messages. Such streaming data provide invaluable information regarding message effectiveness and segmentation. However, streaming data are hard to analyze using conventional methods: their high volume and the fact that they are continuously augmented means that it takes considerable time to analyze them. We propose a method for estimating a finite mixture of logistic regression models which can be used to cluster customers based on a continuous stream of responses. This method, which we coin oFMLR, allows segments to be identified in data streams or extremely large static datasets. Contrary to black box algorithms, oFMLR provides model estimates that are directly interpretable. We first introduce oFMLR, explaining in passing general topics such as online estimation and the EM algorithm, making this paper a high level overview of possible methods of dealing with large data streams in marketing practice. Next, we discuss model convergence, identifiability, and relations to alternative, Bayesian, methods; we also identify more general issues that arise from dealing with continuously augmented data sets. Finally, we introduce the oFMLR [R] package and evaluate the method by numerical simulation and by analyzing a large customer clickstream dataset.
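Online estimation, which the abstract mentions as background, can be sketched for the much simpler case of a single logistic regression with one predictor. This is not oFMLR and does not model a mixture; it only illustrates updating parameters one observation at a time, so the stream never has to be stored. The class name, learning rate, and synthetic stream are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Single-predictor logistic regression updated by stochastic gradient steps."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0  # coefficient of the predictor
        self.b = 0.0  # intercept
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        return sigmoid(self.w * x + self.b)

    def update(self, x, y):
        # Gradient of the log-likelihood for one observation is (y - p) * x.
        error = y - self.predict_proba(x)
        self.w += self.learning_rate * error * x
        self.b += self.learning_rate * error


# Hypothetical stream: a response (y = 1) occurs only for positive x.
stream = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)] * 200

model = OnlineLogisticRegression()
for x, y in stream:
    model.update(x, y)  # each observation is processed once and then discarded
```

After the stream has been consumed, `model.predict_proba` can score new observations, and the two fitted parameters remain directly interpretable.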