Goto

Collaborating Authors

 Regression


Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis

arXiv.org Machine Learning

This paper learns multi-modal embeddings from text, audio, and video views/modes of data in order to improve upon down-stream sentiment classification. The experimental framework also allows investigation of the relative contributions of the individual views in the final multi-modal embedding. Individual features derived from the three views are combined into a multi-modal embedding using Deep Canonical Correlation Analysis (DCCA) in two ways i) One-Step DCCA and ii) Two-Step DCCA. This paper learns text embeddings using BERT, the current state-of-the-art in text encoders. We posit that this highly optimized algorithm dominates over the contribution of other views, though each view does contribute to the final result. Classification tasks are carried out on two benchmark datasets and on a new Debate Emotion data set, and together these demonstrate that the one-Step DCCA outperforms the current state-of-the-art in learning multi-modal embeddings.


A Stratification Approach to Partial Dependence for Codependent Variables

arXiv.org Machine Learning

Model interpretability is important to machine learning practitioners, and a key component of interpretation is the characterization of partial dependence of the response variable on any subset of features used in the model. The two most common strategies for assessing partial dependence suffer from a number of critical weaknesses. In the first strategy, linear regression model coefficients describe how a unit change in an explanatory variable changes the response, while holding other variables constant. But, linear regression is inapplicable for high dimensional (p>n) data sets and is often insufficient to capture the relationship between explanatory variables and the response. In the second strategy, Partial Dependence (PD) plots and Individual Conditional Expectation (ICE) plots give biased results for the common situation of codependent variables and they rely on fitted models provided by the user. When the supplied model is a poor choice due to systematic bias or overfitting, PD/ICE plots provide little (if any) useful information. To address these issues, we introduce a new strategy, called StratPD, that does not depend on a user's fitted model, provides accurate results in the presence codependent variables, and is applicable to high dimensional settings. The strategy works by stratifying a data set into groups of observations that are similar, except in the variable of interest, through the use of a decision tree. Any fluctuations of the response variable within a group is likely due to the variable of interest. We apply StratPD to a collection of simulations and case studies to show that StratPD is a fast, reliable, and robust method for assessing partial dependence with clear advantages over state-of-the-art methods.


Best Split Nodes for Regression Trees

arXiv.org Machine Learning

Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogenous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. This paper aims to study the bias and adaptive properties of regression trees constructed with CART. In doing so, we derive an interesting connection between the bias and the mean decrease in impurity (MDI) measure of variable importance---a tool widely used for model interpretability---defined as the sum of impurity reductions over all non-terminal nodes in the tree. In particular, we show that the size of a terminal subnode for a variable is small when the MDI for that variable is large and that this relationship is exponential---confirming theoretically that decision trees with CART have small bias and are adaptive to signal strength and direction. Finally, we apply these individual tree bounds to tree ensembles and show consistency of Breiman's random forests. The context is surprisingly general and applies to a wide variety of multivariable data generating distributions and regression functions. The main technical tool is an exact characterization of the conditional probability content of the daughter nodes arising from an optimal split, in terms of the partial dependence function and reduction in impurity.


Ten Machine Learning Algorithms You Should Know to Become a Data Scientist

#artificialintelligence

Let's say I am given an Excel sheet with data about various fruits and I have to tell which look like Apples. What I will do is ask a question "Which fruits are red and round?" and divide all fruits which answer yes and no to the question. Now, All Red and Round fruits might not be apples and all apples won't be red and round. So I will ask a question "Which fruits have red or yellow colour hints on them? " on red and round fruits and will ask "Which fruits are green and round?" on not red and round fruits. Based on these questions I can tell with considerable accuracy which are apples. This cascade of questions is what a decision tree is. However, this is a decision tree based on my intuition.


Estimation and Feature Selection in Mixtures of Generalized Linear Experts Models

arXiv.org Machine Learning

Mixtures-of-Experts (MoE) are conditional mixture models that have shown their performance in modeling heterogeneity in data in many statistical learning approaches for prediction, including regression and classification, as well as for clustering. Their estimation in high-dimensional problems is still however challenging. We consider the problem of parameter estimation and feature selection in MoE models with different generalized linear experts models, and propose a regularized maximum likelihood estimation that efficiently encourages sparse solutions for heterogeneous data with high-dimensional predictors. The developed proximal-Newton EM algorithm includes proximal Newton-type procedures to update the model parameter by monotonically maximizing the objective function and allows to perform efficient estimation and feature selection. An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data, compared to the main state-of-the art competitors.


Heart of Darkness: Logistic Regression vs. Random Forest

#artificialintelligence

The'functional needs repair' category of the target variable only makes up about 7% of the whole set. The implication is that whatever algorithm you end up using it's probably going to learn the other two balanced classes a lot better than this one. Such is data science: the struggle is real. The first thing we're going to do is create an'age' variable for the waterpoints as that seems highly relevant. The'population' variable also has a highly right-skewed distribution so we're going to change that as well: The zeros inside of the'amount_tsh' are also probably NaNs so we're going to do something drastic and simplify it into 0s and 1s: One of the most important points we learned from the week before and something that will stay with me is the idea of coming up with a baseline model as fast as one can.


Mid-price Prediction Based on Machine Learning Methods with Technical and Quantitative Indicators

arXiv.org Machine Learning

Stock price prediction is a challenging task, but machine learning methods have recently been used successfully for this purpose. In this paper, we extract over 270 hand-crafted features (factors) inspired by technical and quantitative analysis and tested their validity on short-term mid-price movement prediction. We focus on a wrapper feature selection method using entropy, least-mean squares, and linear discriminant analysis. We also build a new quantitative feature based on adaptive logistic regression for online learning, which is constantly selected first among the majority of the proposed feature selection methods. This study examines the best combination of features using high frequency limit order book data from Nasdaq Nordic. Our results suggest that sorting methods and classifiers can be used in such a way that one can reach the best performance with a combination of only very few advanced hand-crafted features.


Predicting Cancer with Logistic Regression in Python

#artificialintelligence

Let's jump into the analysis by pulling in the data and importing necessary modules. Each row is a patient and each column contains a descriptive attribute. Class (Y) describes if the patient has no cancer (0) or has cancer (1). The next 4 columns are the protein levels found in that patient's bloodstream. We can retrieve some basic information about the sample from the describe method.


Essential Machine Learning with Linear Models in RAPIDS: Part 1 of a Series

#artificialintelligence

I want to take a moment to tell the origin story of regression analysis, which will explain why it has that name. I believe that of all the common machine learning techniques (K-means, kNN, PCA), "regression analysis" has the most opaque name. OLS regression was first invented to analyze exceptional genetic traits and their heritability. These early studies seemed to show the offspring of exceptional individuals "regressed to the mean". The inventor was Sir Francis Galton (half-cousin of Charles Darwin²), who had previously invented the standard deviation and first observed the "wisdom of the crowds" in certain estimation tasks. I am trying to predict daily demand for short-term bike rentals made in 2012, and I have data from 2011 to build the model.


XGBoostLSS -- An extension of XGBoost to probabilistic forecasting

arXiv.org Artificial Intelligence

We propose a new framework of XGBoost that predicts the entire conditional distribution of a univariate response variable. In particular, XGBoostLSS models all moments of a parametric distribution, i.e., mean, location, scale and shape (LSS), instead of the conditional mean only. Choosing from a wide range of continuous, discrete and mixed discrete-continuous distribution, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows to gain additional insight into the data generating process, as well as to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. We present both a simulation study and real world examples that demonstrate the benefits of our approach.