 Regression


One-step regression and classification with crosspoint resistive memory arrays

arXiv.org Machine Learning

Machine learning has attracted considerable attention in recent years as a tool to process the big data generated by ubiquitous sensors in daily life. High-speed, low-energy computing machines are in demand to enable real-time artificial intelligence at the edge, i.e., without the support of a remote frame server in the cloud. Such requirements challenge complementary metal-oxide-semiconductor (CMOS) technology, which is limited by Moore's law approaching its end and by the communication bottleneck in conventional computing architectures. Novel computing concepts, architectures, and devices are thus strongly needed to accelerate data-intensive applications. Here we show that a crosspoint resistive memory circuit with a feedback configuration can execute linear regression and logistic regression in just one step by computing the pseudoinverse matrix of the data within the memory. The most elementary learning operations, namely the regression of a sequence of data and the classification of a set of data, can thus be executed in a single computational step by the novel technology. One-step learning is further supported by simulations of the prediction of the cost of a house in Boston and the training of a 2-layer neural network for MNIST digit recognition. The results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
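As a software analogue of the "one-step" idea, the sketch below fits a line by applying the pseudoinverse in closed form, w = (X^T X)^{-1} X^T y, rather than iterating. It is a minimal illustration in plain Python, assuming a single input feature plus an intercept; the function name `fit_line_pinv` is hypothetical and nothing here models the analog crosspoint circuit itself.

```python
def fit_line_pinv(xs, ys):
    """Least-squares fit y ~ w0 + w1 * x via w = (X^T X)^{-1} X^T y.

    Each row of the design matrix X is [1, x_i], so X^T X is 2x2 and its
    inverse can be written out by hand.
    """
    n = len(xs)
    # Entries of the Gram matrix X^T X = [[n, sx], [sx, sxx]].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    # Entries of the vector X^T y = [sy, sxy].
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; need at least two distinct x values")

    # Apply the closed-form inverse of the 2x2 Gram matrix to X^T y.
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1


intercept, slope = fit_line_pinv([0, 1, 2, 3], [1, 3, 5, 7])
```

The formula (X^T X)^{-1} X^T equals the pseudoinverse only when the columns of X are linearly independent, which is why the sketch rejects the singular case instead of guessing.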


Tree-based Machine Learning Models for Handling Imbalanced Datasets

#artificialintelligence

Recently, I have been working on a binary classification problem with an imbalanced dataset, where the ratio of the positive class to the negative class is around 1:4. Imbalanced classification problems are so commonplace that data enthusiasts will encounter them sooner or later. In this post, I will share three tree-based machine learning models that can help handle imbalanced datasets. The dataset I use to illustrate the effectiveness of the algorithms is the credit card fraud dataset from Kaggle. This is an extremely imbalanced dataset: out of 284,807 transactions, only 492 are frauds. Following convention, we label the fraud samples as the positive class and normal transactions as the negative class.
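To make the imbalance concrete, the sketch below works through the arithmetic for the counts quoted above. It also computes "balanced" class weights, n_total / (n_classes * count_c), a common reweighting heuristic. Whether the post's three models use class weighting is not stated in the snippet, so treat that part as an assumption; `imbalance_summary` is a hypothetical helper name.

```python
def imbalance_summary(n_total, n_positive):
    """Summarise class imbalance for a binary problem (labels 0 and 1)."""
    n_negative = n_total - n_positive
    counts = {0: n_negative, 1: n_positive}
    # "Balanced" heuristic: rarer classes receive proportionally larger weights.
    weights = {label: n_total / (len(counts) * count)
               for label, count in counts.items()}
    return {
        "positive_rate": n_positive / n_total,
        # Accuracy of a model that always predicts the negative class.
        "majority_baseline_accuracy": n_negative / n_total,
        "class_weights": weights,
    }


summary = imbalance_summary(284_807, 492)
```

With these counts, a classifier that never flags fraud is still correct on more than 99.8% of transactions, which is why plain accuracy is a poor metric for this dataset.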


Generalization Error for Linear Regression under Distributed Learning

arXiv.org Machine Learning

Distributed learning facilitates the scaling-up of data processing by distributing the computational burden over several nodes. Despite the vast interest in distributed learning, generalization performance of such approaches is not well understood. We address this gap by focusing on a linear regression setting. We consider the setting where the unknowns are distributed over a network of nodes. We present an analytical characterization of the dependence of the generalization error on the partitioning of the unknowns over nodes. In particular, for the overparameterized case, our results show that while the error on training data remains in the same range as that of the centralized solution, the generalization error of the distributed solution increases dramatically compared to that of the centralized solution when the number of unknowns estimated at any node is close to the number of observations. We further provide numerical examples to verify our analytical expressions.


Feature Selection Methods for Uplift Modeling

arXiv.org Machine Learning

Uplift modeling is a predictive modeling technique that estimates the user-level incremental effect of a treatment using machine learning models. It is often used for targeting promotions and advertisements, as well as for the personalization of product offerings. In these applications, there are often hundreds of features available to build such models. Keeping all the features in a model can be costly and inefficient. Feature selection is an essential step in the modeling process for multiple reasons: improving the estimation accuracy by eliminating irrelevant features, accelerating model training and prediction speed, reducing the monitoring and maintenance workload for the feature data pipeline, and providing better model interpretation and diagnostics capability. However, feature selection methods for uplift modeling have rarely been discussed in the literature. Although there are various feature selection methods for standard machine learning models, we will demonstrate that those methods are sub-optimal for solving the feature selection problem for uplift modeling. To address this problem, we introduce a set of feature selection methods designed specifically for uplift modeling, including both filter methods and embedded methods. To evaluate the effectiveness of the proposed feature selection methods, we use different uplift models and measure the accuracy of each model with a different number of selected features. We use both synthetic and real data to conduct these experiments. We also implemented the proposed filter methods in an open source Python package (CausalML).
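The notion of "incremental effect" can be illustrated with a simple difference in conversion rates between treated and control users, computed per segment of one feature. This is only a sketch under stated assumptions: it is not the paper's filter method and not the CausalML API, and `uplift_by_segment` and the record layout are hypothetical.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment as treated rate minus control rate.

    records: iterable of (segment, treated, converted) tuples.
    """
    # counts[segment][treated] = [number of users, number of conversions]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, converted in records:
        cell = counts[segment][bool(treated)]
        cell[0] += 1
        cell[1] += int(bool(converted))

    uplift = {}
    for segment, arms in counts.items():
        n_treated, conv_treated = arms[True]
        n_control, conv_control = arms[False]
        if n_treated == 0 or n_control == 0:
            continue  # uplift is undefined unless both arms are observed
        uplift[segment] = conv_treated / n_treated - conv_control / n_control
    return uplift
```

Intuitively, a feature whose segments show very different uplift values carries information about who responds to the treatment, while a feature that merely predicts conversion in both arms does not.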


A Solution for Large Scale Nonlinear Regression with High Rank and Degree at Constant Memory Complexity via Latent Tensor Reconstruction

arXiv.org Machine Learning

This paper proposes a novel method for learning highly nonlinear, multivariate functions from examples. Our method takes advantage of the property that continuous functions can be approximated by polynomials, which in turn are representable by tensors. Hence the function learning problem is transformed into a tensor reconstruction problem, an inverse problem of the tensor decomposition. Our method incrementally builds up the unknown tensor from rank-one terms, which lets us control the complexity of the learned model and reduce the chance of overfitting. For learning the models, we present an efficient gradient-based algorithm that can be implemented in linear time in the sample size, order, rank of the tensor and the dimension of the input. In addition to regression, we present extensions to classification, multi-view learning and vector-valued output as well as a multi-layered formulation. The method can work in an online fashion via processing mini-batches of the data with constant memory complexity. Consequently, it can fit into systems equipped only with limited resources such as embedded systems or mobile phones. Our experiments demonstrate a favorable accuracy and running time compared to competing methods.
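To illustrate how a polynomial can be represented by rank-one tensor terms, the sketch below evaluates a model of the form f(x) = sum_t lambda_t * prod_k <w_{t,k}, x>. Each term is a product of `degree` inner products, so one prediction costs time proportional to rank × degree × input dimension. The exact parameterisation used in the paper is not given in the abstract, so this form and the names `predict` and `terms` are assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def predict(x, terms):
    """Evaluate a sum of rank-one polynomial terms at the input x.

    terms: list of (weight, factors), where factors is a list of vectors
    w_1, ..., w_d and the term contributes weight * prod_k <w_k, x>.
    """
    return sum(
        weight * math.prod(dot(w, x) for w in factors)
        for weight, factors in terms
    )


# Two terms: 2 * x1 * x2 (degree 2) and 1 * (x1 + x2) (degree 1).
example_terms = [
    (2.0, [[1.0, 0.0], [0.0, 1.0]]),
    (1.0, [[1.0, 1.0]]),
]
value = predict([1.0, 2.0], example_terms)
```

Adding terms one at a time, as the abstract describes, corresponds to appending entries to `terms`, which is how the rank bounds the model's complexity.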


Predicting Boston House prices using Linear Regression

#artificialintelligence

Regression is a predictive modeling technique that finds a relationship between independent variable(s) and dependent variable(s) (which are continuous). Independent variables (IVs) can be categorical (e.g. US, UK, 0/1) or continuous (1729, 3.141, etc.), while dependent variables (DVs) are continuous. The underlying function mapping IVs to DVs can be linear, quadratic, polynomial, or another non-linear function (like the sigmoid function in logistic regression), but this article covers the linear technique. Regression techniques are heavily used in real estate price prediction, financial forecasting, and predicting traffic arrival time (ETA). Continuous: Can take infinite values, e.g.


Large-scale Uncertainty Estimation and Its Application in Revenue Forecast of SMEs

arXiv.org Machine Learning

The economic and banking importance of the small and medium enterprise (SME) sector is well recognized in contemporary society. Business credit loans are very important for the operation of SMEs, and revenue is a key indicator for credit limit management. Therefore, it is very beneficial to construct a reliable revenue forecasting model. If the uncertainty of an enterprise's revenue forecast can be estimated, a more appropriate credit limit can be granted. The natural gradient boosting approach estimates the uncertainty of a prediction with a multi-parameter boosting algorithm based on the natural gradient. However, its original implementation is not easy to scale to big data scenarios and is computationally expensive compared to state-of-the-art tree-based models (such as XGBoost). In this paper, we propose Scalable Natural Gradient Boosting Machines, a method that is simple to implement, readily parallelizable, and interpretable, and that yields high-quality predictive uncertainty estimates. According to the characteristics of the revenue distribution, we derive an uncertainty quantification function. We demonstrate that our method can distinguish between accurate and inaccurate samples in the revenue forecasting of SMEs. Moreover, interpretability can be naturally obtained from the model, satisfying financial needs.


Imputation of missing sub-hourly precipitation data in a large sensor network: a machine learning approach

arXiv.org Machine Learning

Precipitation data collected at sub-hourly resolution presents specific challenges for missing data recovery by being largely stochastic in nature and highly unbalanced in the duration of rain vs non-rain. Here we present a two-step analysis utilising current machine learning techniques for imputing precipitation data sampled at 30-minute intervals by devolving the task into (a) the classification of rain or non-rain samples, and (b) regressing the absolute values of predicted rain samples. Investigating 37 weather stations in the UK, this machine learning process produces more accurate predictions for recovering precipitation data than an established surface fitting technique utilising neighbouring rain gauges. Increasing the features available for training the machine learning algorithms improves performance, with the integration of weather data at the target site and externally sourced rain gauges providing the highest performance. This method informs machine learning models by utilising information in concurrently collected environmental data to make accurate predictions of missing rain data. Capturing complex nonlinear relationships from weakly correlated variables is critical for data recovery at sub-hourly resolutions. Such pipelines for data recovery can be developed and deployed for highly automated and near instantaneous imputation of missing values in ongoing datasets at high temporal resolutions.

Keywords: machine learning, data imputation, gradient boosted trees, environmental sensor networks, precipitation, soil moisture

1. Introduction

Precipitation data is of critical importance across multiple lines of enquiry, informing statistical models and analysis relating to weather forecasting, extreme weather events, climate change, water-resource management, droughts, flooding, agricultural impact, and hydroelectric power. Historical rainfall data can reveal long-term trends in environmental hydrological issues, with real-time data input allowing for immediate forecasting of future conditions. Distributed networks of rain gauges are typically used to provide precipitation data at the earth's surface at varying temporal resolutions and can cover large geographical areas (Kidd, 2001). As is the case in many databases, particularly those utilising physical sensors, the problem of missing data arises. Missing data can be a result of sensor failure, data storage/transmission failure, or post-collection quality control procedures resulting in removal of identified problem data (Blenkinsop et al., 2017). Missing data in precipitation databases represents a serious limitation for the effective use of the data. Given the global scale and importance of precipitation and meteorological data (Sun et al., 2018), developing solutions to missing data is of paramount importance for maximising information gain.
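The two-step scheme from the abstract, classify rain vs non-rain and then regress the amount only for predicted rain, can be sketched as a small pipeline. The two model callables here are hypothetical stand-ins (the paper's keywords mention gradient boosted trees, but no specific API is given), and clamping negative amounts to zero is an added assumption.

```python
def impute_precipitation(rows, is_rain, rain_amount):
    """Impute one precipitation value per feature row in two stages.

    is_rain(features) -> bool      : stage (a), rain / non-rain classifier
    rain_amount(features) -> float : stage (b), regressor for rain samples
    """
    imputed = []
    for features in rows:
        if is_rain(features):
            # Stage (b) runs only on samples classed as rain; negative
            # regression outputs are clamped to zero (an assumption here).
            imputed.append(max(0.0, rain_amount(features)))
        else:
            imputed.append(0.0)
    return imputed


# Hypothetical stand-in models and feature names, for illustration only.
rows = [
    {"humidity": 0.95, "neighbour_mm": 1.5},
    {"humidity": 0.50, "neighbour_mm": 3.0},
]
filled = impute_precipitation(
    rows,
    is_rain=lambda f: f["humidity"] > 0.9,
    rain_amount=lambda f: 2.0 * f["neighbour_mm"],
)
```

Splitting the task this way keeps the regressor from being dominated by the many zero-rain intervals, which is the imbalance the abstract highlights.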


Ensemble Learning: Data Science.

#artificialintelligence

Ensemble learning is a technique in which multiple models are generated and combined to solve a particular machine learning problem. Ensemble methods are meta-algorithms that combine multiple models to solve the same problem. They are primarily used to improve the performance of a model and to reduce the variance of the outcome. Choosing which model to use is extremely important in any regression or classification problem, and the choice depends on many variables such as the quantity of data, the distribution of the data, and its types. In supervised machine learning, an algorithm creates a model from training data with the goal of best estimating the output variable (y) given the data (X).
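Two of the simplest ways to combine models are averaging for regression and majority voting for classification. The sketch below shows both, assuming each model is just a callable that maps an input to a prediction; the function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def ensemble_regress(models, x):
    """Average the numeric predictions of several regression models."""
    return mean(model(x) for model in models)


def ensemble_classify(models, x):
    """Return the label predicted by the most models (majority vote)."""
    votes = Counter(model(x) for model in models)
    # most_common(1) gives [(label, count)] for the top-voted label.
    return votes.most_common(1)[0][0]


# Three toy regressors whose individual errors cancel out on average.
regressors = [lambda x: x + 1, lambda x: x - 1, lambda x: x]
averaged = ensemble_regress(regressors, 2)
```

Averaging reduces variance when the individual models make errors that are not perfectly correlated. In a tied vote, `Counter.most_common` returns the label that was seen first.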


Generalization Error of Generalized Linear Models in High Dimensions

arXiv.org Machine Learning

At the heart of machine learning lies the question of generalizability of learned rules over previously unseen data. While over-parameterized models based on neural networks are now ubiquitous in machine learning applications, our understanding of their generalization capabilities is incomplete. This task is made harder by the non-convexity of the underlying learning problems. We provide a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; and (ii) choices of loss function, initialization, and regularizer during learning. Our model also captures mismatch between training and test distributions. As examples, we analyze a few special cases, namely linear regression and logistic regression. We are also able to rigorously and analytically explain the double descent phenomenon in generalized linear models.