
 Regression


Introducing the HCmodelSets Package

#artificialintelligence

I am happy to introduce the package HCmodelSets, which is now available on CRAN. This package implements the methods proposed by Cox, D.R. and Battey, H.S. (2017). In particular, it performs the reduction, exploratory and model selection phases given in that reference. The software supports linear regression, likelihood-based fitting of generalized linear regression models and the proportional hazards model fitted by partial likelihood. The standard method described in the literature for sparse regression is the LASSO proposed by Tibshirani (1996), which assumes sparsity of the effects.


Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models)

#artificialintelligence

Given the number of people interested in my first post on visualizing classification model results, I've decided to create and share some new functions to visualize and compare whole linear regression models with one line of code. These plots will help us with the time we invest in model selection and with a general understanding of our results. Where are we going with this post? Let's take a quick look at the final output: a quick, nice dashboard with everything you'd need to evaluate whether your regression model is looking good, compare it with others, or get working on further improvements. Interestingly, the exact same function mplot_full used before in the Part 1 – Classification Models post will work on regressions too. lares::updateLares().


The impact of imbalanced training data on machine learning for author name disambiguation

arXiv.org Machine Learning

In supervised machine learning for author name disambiguation, negative training data are often far more numerous than positive training data. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-negative training data ratios. Results show that increasing negative training data can improve disambiguation performance, but only by a few percent, and can sometimes degrade it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive to negative training data. Also, the performance improvement by Random Forest tends to saturate quickly, roughly after 1:10 ~ 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained using part of the negative training data without much loss of disambiguation performance while increasing computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
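As a rough illustration of training on only part of the negative data, the sketch below keeps every positive example and randomly subsamples negatives down to a chosen negative-to-positive ratio. The function name `subsample_negatives`, the toy record pairs, and the 1:10 ratio are assumptions made for this example; this is not the paper's code or experimental setup.

```python
import random


def subsample_negatives(items, labels, neg_per_pos, seed=0):
    """Keep all positives and at most `neg_per_pos` negatives per positive.

    Hypothetical helper for illustration only: labels are 1 (positive)
    or 0 (negative), and negatives are dropped uniformly at random.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than are available.
    n_keep = min(len(neg), neg_per_pos * len(pos))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [items[i] for i in keep], [labels[i] for i in keep]


# Toy data: 5 positive and 100 negative record pairs (made up).
labels = [1] * 5 + [0] * 100
pairs = [f"pair_{i}" for i in range(len(labels))]

# Reduce the imbalance from 1:20 to 1:10 before training a classifier.
sub_pairs, sub_labels = subsample_negatives(pairs, labels, neg_per_pos=10)
```

A fixed seed keeps the subsample reproducible, which matters when comparing classifiers across several ratios.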


An Introduction to Applied Machine Learning with Multiple Linear Regression and Python

#artificialintelligence

The purpose of this post is to unpack for the layman the basic concepts of applied machine learning and to document how data scientists or data analysts would generally answer a question or solve a problem with data and machine learning algorithms. Hopefully, by the end, you will have a more solid understanding of the steps your data scientists or business intelligence officers should be going through when attempting to apply the power of machine learning to data. Machine learning is a method of data analysis that automates analytical model building. The steps illustrated here are written as a 'practical guide' to that method. It covers the broad strokes of the process one would go through when implementing any other similar machine learning algorithms or ideas.


Coding Deep Learning for Beginners -- Linear Regression (Part 1): Initialization and Prediction

#artificialintelligence

This is the 3rd article in the series "Coding Deep Learning for Beginners". Links to all articles, the agenda, and general information about the estimated release dates of upcoming articles can be found at the bottom of the 1st article. They are also available in my open source portfolio -- MyRoadToAI, along with some mini-projects, presentations, tutorials and links. You can also read the article on my personal website, hosted with Jekyll in order to improve readability (supporting code syntax highlighting, LaTeX equations and more). Some of you may wonder why an article series about explaining and coding neural networks starts with a basic machine learning algorithm such as linear regression.


Different methods of feature selection

#artificialintelligence

In our previous post, we discussed what feature selection is and why we need it. In this post, we're going to look at the different methods used in feature selection. There are three main classes of feature selection methods: filter methods, wrapper methods, and embedded methods. We'll look at all of them individually. Filter methods are learning-algorithm-agnostic, which means they can be employed no matter which learning algorithm you're using.
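To make the idea of a filter method concrete, here is a minimal sketch that scores each feature by its absolute Pearson correlation with the target and keeps the top k, without involving any learning algorithm. The scoring choice, the helper names `pearson` and `filter_select`, and the toy columns are assumptions for illustration; the post itself may use different criteria.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def filter_select(columns, target, k):
    """Rank features by |correlation with target| and return the top-k names."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (made up): one informative feature, one noisy, one constant.
columns = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 2.0, -1.0, 2.0],
    "constant": [7.0] * 5,
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

selected = filter_select(columns, target, k=1)
```

Because the score depends only on the data, the same selected features can be passed to any downstream model.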


Using deep learning to predict emergency room visits

#artificialintelligence

At IBM Research, we are exploring new solutions for a range of health care challenges. One such challenge is emergency room (ER) overcrowding, which can lead to long wait times for treatment. Patients who use the ER for non-emergency situations are more likely to return to the ER multiple times (Poole et al. 2016), further contributing to overcrowding. Identifying those patients who are likely to return to the ER may enable hospitals to intervene to ensure access to necessary care outside the ER and potentially alleviate overcrowding. My team at IBM Research-China took on this challenge.


Scalable Multi-Task Gaussian Process Tensor Regression for Normative Modeling of Structured Variation in Neuroimaging Data

arXiv.org Machine Learning

Most brain disorders are very heterogeneous in terms of their underlying biology and developing analysis methods to model such heterogeneity is a major challenge. A promising approach is to use probabilistic regression methods to estimate normative models of brain function using (f)MRI data then use these to map variation across individuals in clinical populations (e.g., via anomaly detection). To fully capture individual differences, it is crucial to statistically model the patterns of correlation across different brain regions and individuals. However, this is very challenging for neuroimaging data because of high-dimensionality and highly structured patterns of correlation across multiple axes. Here, we propose a general and flexible multi-task learning framework to address this problem. Our model uses a tensor-variate Gaussian process in a Bayesian mixed-effects model and makes use of Kronecker algebra and a low-rank approximation to scale efficiently to multi-way neuroimaging data at the whole brain level. On a publicly available clinical fMRI dataset, we show that our computationally affordable approach substantially improves detection sensitivity over both a mass-univariate normative model and a classifier that --unlike our approach-- has full access to the clinical labels.


Local Linear Forests

arXiv.org Machine Learning

Random forests are a powerful method for non-parametric regression, but are limited in their ability to fit smooth signals, and can show poor predictive performance in the presence of strong, smooth effects. Taking the perspective of random forests as an adaptive kernel method, we pair the forest kernel with a local linear regression adjustment to better capture smoothness. The resulting procedure, local linear forests, enables us to improve on asymptotic rates of convergence for random forests with smooth signals, and provides substantial gains in accuracy on both real and simulated data.
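The local linear adjustment can be sketched in one dimension as a kernel-weighted least-squares fit around the query point. In the sketch below a Gaussian kernel stands in for the forest kernel, which is an assumption made to keep the example self-contained; the function names and the bandwidth are hypothetical, and this is not the authors' estimator.

```python
import math


def gaussian_weights(xs, x0, bandwidth):
    """Stand-in kernel weights (assumption: Gaussian, not forest-derived)."""
    return [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]


def local_constant_predict(xs, ys, x0, bandwidth=1.0):
    """Kernel-weighted average of the responses around x0."""
    w = gaussian_weights(xs, x0, bandwidth)
    return sum(wi * y for wi, y in zip(w, ys)) / sum(w)


def local_linear_predict(xs, ys, x0, bandwidth=1.0):
    """Kernel-weighted least-squares line around x0, evaluated at x0."""
    w = gaussian_weights(xs, x0, bandwidth)
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


# Smooth toy signal (made up): y = 2x + 1 on a grid from 0 to 10.
xs = [float(i) for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]

# At the boundary x0 = 0 the weighted average is pulled toward larger y,
# while the local linear fit follows the trend.
flat = local_constant_predict(xs, ys, 0.0)
adjusted = local_linear_predict(xs, ys, 0.0)
```

The boundary point is used because that is where a purely averaging estimator shows its bias on a smooth trend most clearly.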


Structured Point Cloud Data Analysis via Regularized Tensor Regression for Process Modeling and Optimization

arXiv.org Machine Learning

Modern measurement technologies provide the means to measure high-density spatial and geometric data in three-dimensional (3D) coordinate systems, referred to as point clouds. Point cloud data analysis has broad applications in advanced manufacturing and metrology for measuring dimensional accuracy and shape analysis, in geographic information systems (GIS) for digital elevation modeling and analysis of terrains, in computer graphics for shape reconstruction, and in medical imaging for volumetric measurement, to name a few. The role of point cloud data in manufacturing is now more important than ever, particularly in the field of smart and additive manufacturing processes, where products with complex shape and geometry are manufactured with the help of advanced technologies (Gibson et al., 2010). In these processes, the dimensional and geometric accuracy of manufactured parts is measured in the form of point clouds using modern sensing devices, including touch-probe coordinate measuring machines (CMM) and optical systems, such as laser scanners. Modeling the relationship of the dimensional accuracy, encapsulated in point clouds, with process parameters and machine settings is vital for variation reduction and process optimization.