 Regression


A Computational Exploration of Emerging Methods of Variable Importance Estimation

arXiv.org Artificial Intelligence

Estimating the importance of variables is an essential task in modern machine learning. It helps to evaluate how useful a feature is in a given model. Several techniques for estimating the importance of variables have been developed during the last decade. In this paper, we propose a computational and theoretical exploration of the emerging methods of variable importance estimation, namely: Least Absolute Shrinkage and Selection Operator (LASSO), Support Vector Machine (SVM), the Predictive Error Function (PERF), Random Forest (RF), and Extreme Gradient Boosting (XGBOOST), which were tested on different kinds of real-life and simulated data. All these methods can handle both regression and classification tasks seamlessly, but all fail when dealing with data containing missing values. The implementation has shown that PERF has the best performance in the case of highly correlated data, closely followed by RF. PERF and XGBOOST are "data-hungry" methods: they had the worst performance on small data sizes, but they are the fastest in terms of execution time. SVM is the most appropriate when many redundant features are in the dataset. An additional advantage of PERF is its natural cut-off at zero, which separates positive and negative scores: positive scores indicate essential and significant features, while negative scores indicate useless features. RF and LASSO are very versatile in that they can be used in almost all situations, although they do not give the best results.
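As a rough illustration of what a variable importance score measures, the sketch below uses permutation importance. This is a generic technique and is not one of the five methods compared in the abstract above. The toy data, the `toy_model` function, and all names are assumptions made for this example only.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of the model's predictions over the dataset."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, seed=0):
    """Increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    scores = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j permuted.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(mse(model, shuffled, targets) - baseline)
    return scores

# Toy data (assumption): the target depends on feature 0 only; feature 1 is noise.
data_rng = random.Random(1)
rows = [[float(i), data_rng.random()] for i in range(50)]
targets = [3.0 * r[0] for r in rows]

def toy_model(r):
    # Hypothetical "fitted" model that ignores feature 1 entirely.
    return 3.0 * r[0]

scores = permutation_importance(toy_model, rows, targets)
```

Here the score for feature 0 is positive because shuffling it degrades the predictions, while the score for feature 1 is exactly zero because the model never reads it.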


The white-box model approach aims for interpretable AI

#artificialintelligence

When building machine learning models or algorithms, developers should adhere to the principle of interpretability so that they and their intended users know exactly how the inputs and inner workings achieve outputs. Interpretable AI is a book written by Ajay Thampi, a machine learning engineer at Meta, and its second chapter explains the white-box model approach to machine learning as well as examples of white-box models. These models are interpretable, as they feature easy-to-understand algorithms that show how data inputs achieve outputs or target variables. Thampi walks readers through three types of white-box models in this chapter and how they are applied: linear regression, generalized additive models (GAMs) and decision trees. Given that the term regression in machine learning refers to models and algorithms that take data and learn relationships within that data to make predictions, the premise of a linear regression model is that a target prediction variable can be determined as a linear combination of every input variable.
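The "linear combination" premise can be made concrete with a minimal sketch. The coefficients and inputs below are hypothetical values chosen for illustration; they do not come from the book.

```python
def predict(intercept, weights, features):
    """Linear regression prediction: intercept plus the weighted sum of inputs."""
    return intercept + sum(w * x for w, x in zip(weights, features))

def contributions(weights, features):
    """Per-feature contribution to the prediction (weight times input value)."""
    return [w * x for w, x in zip(weights, features)]

# Hypothetical coefficients and inputs, for illustration only.
intercept = 1.0
weights = [2.0, -0.5]
features = [3.0, 4.0]

prediction = predict(intercept, weights, features)  # 1.0 + 6.0 - 2.0
parts = contributions(weights, features)
```

Because each term `w * x` can be read off directly, one can see how much every input pushes the prediction up or down, which is the sense in which such a model is white-box.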


MIT study: Selective regression method improves AI accuracy

#artificialintelligence

Knowing when to trust a model's predictions is not always easy for professionals who use machine-learning models to aid in decision-making, especially since these models are frequently so complicated that their inner workings remain a mystery. Selective regression is a method in which the model calculates its confidence level for each prediction and rejects predictions if its confidence is too low. After that, a person can review those cases, gather further data, and decide on each one manually. While researchers are working on new models, regulators are trying to set a standard for the use of artificial intelligence. Two months ago we discussed the EU AI Act, and now the UK is preparing its AI rulebook.
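The selective regression idea described above (predict, estimate confidence, abstain when confidence is too low) can be sketched as follows. This is not the method from the MIT study: using the spread of a bootstrap ensemble of straight-line fits as the confidence signal is an assumption made for this example, as are the toy data and every function name.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def bootstrap_ensemble(xs, ys, n_models=30, seed=0):
    """Fit one line per bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    while len(models) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: a line cannot be fitted
        models.append(fit_line(bx, [ys[i] for i in idx]))
    return models

def selective_predict(models, x, max_spread):
    """Return (prediction, spread); prediction is None when the model abstains."""
    preds = [a + b * x for a, b in models]
    spread = statistics.pstdev(preds)  # disagreement between ensemble members
    if spread > max_spread:
        return None, spread  # too uncertain: defer to a human reviewer
    return statistics.fmean(preds), spread

# Toy training data (assumption): a noisy line observed on x in [0, 19].
noise = random.Random(42)
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + noise.gauss(0.0, 1.0) for x in xs]
models = bootstrap_ensemble(xs, ys)
```

Calling `selective_predict(models, x, max_spread)` for an input far outside the training range yields a larger spread than for an input near the middle of it, so a suitably chosen `max_spread` makes the model abstain on the former and answer the latter. The threshold itself is a tuning choice that trades coverage against error.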


The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

arXiv.org Artificial Intelligence

In transfer learning (Pan and Yang, 2009; Sugiyama and Kawanabe, 2012), an algorithm is provided with abundant data from a source domain and scarce or no data from a target domain, and aims to train a model that generalizes well on the target domain. A simple yet effective approach is to pretrain a model with the rich source data and then finetune the model with the available target data via, e.g., stochastic gradient descent (SGD) (see, e.g., Yosinski et al. (2014)). Despite its wide applicability in practice, the power and limitation of the pretraining-finetuning based transfer learning framework is not fully understood in theory. The focus of this work is to consider this issue in a specific transfer learning setup known as covariate shift (Pan and Yang, 2009; Sugiyama and Kawanabe, 2012), where the source and target distributions differ in their marginal distributions over the input, but coincide in their conditional distribution of the output given the input. Regarding the theory of learning with covariate shift, there exists a rich set of results (Ben-David et al., 2010; Germain et al., 2013; Mansour et al., 2009; Mohri and Muñoz Medina, 2012; Cortes and


Machine Learning Training on a Real Processing-in-Memory System

arXiv.org Artificial Intelligence

Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., computing systems with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate machine learning training. To do so, we (1) implement several representative classic machine learning algorithms (namely, linear regression, logistic regression, decision tree, K-means clustering) on a real-world general-purpose PIM architecture, (2) characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our experimental evaluation on a memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound machine learning workloads, when the necessary operations and datatypes are natively supported by PIM hardware. To our knowledge, our work is the first one to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture.


How to Verify the Assumptions of Linear Regression

#artificialintelligence

Originally published on Towards AI, the World's Leading AI and Technology News and Media Company. Linear regression is a model that estimates the relationship between independent variables and a dependent variable using a straight line.


Automatic Classification of Bug Reports Based on Multiple Text Information and Reports' Intention

arXiv.org Artificial Intelligence

With the rapid growth of software scale and complexity, a large number of bug reports are submitted to the bug tracking system. In order to speed up defect repair, these reports need to be accurately classified so that they can be sent to the appropriate developers. However, the existing classification methods only use the text information of the bug report, which leads to their low performance. To solve this problem, this paper proposes a new automatic classification method for bug reports. The innovation is that when categorizing bug reports, in addition to using the text information of the report, the intention of the report (i.e. suggestion or explanation) is also considered, thereby improving the performance of the classification. First, we collect bug reports from four ecosystems (Apache, Eclipse, Gentoo, Mozilla) and manually annotate them to construct an experimental data set. Then, we use Natural Language Processing technology to preprocess the data. On this basis, BERT and TF-IDF are used to extract the features of the intention and the multiple text information. Finally, the features are used to train the classifiers. The experimental results on five classifiers (K-Nearest Neighbor, Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest) show that our proposed method achieves better performance, with an F-Measure ranging from 87.3% to 95.5%.
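The TF-IDF step mentioned above can be sketched in a few lines. This shows one common variant (term frequency times the logarithm of inverse document frequency) and is not necessarily the exact weighting used in the paper. The sample reports are invented, and the BERT features are not covered here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document; docs are lists of tokens."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Invented bug-report texts, for illustration only.
reports = [
    "crash when opening file".split(),
    "please add dark mode".split(),
    "crash when saving file".split(),
]
vectors = tf_idf(reports)
```

Terms that occur in fewer reports (such as "opening") receive a higher weight than terms shared across reports (such as "crash"), which is what makes the resulting vectors useful as classifier inputs.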


What algorithm curate machine learning

#artificialintelligence

In order to address a specific problem, practitioners must select an acceptable learning algorithm. A general rule of thumb is that for classification issues, we should use algorithms with high accuracy, whereas for regression problems, we should choose algorithms with lower accuracy but higher robustness because the absolute error rate is unimportant. Here are a few examples: Linear Regression: Linear regression uses the linearity principle to predict continuous values from a set of input variables. It achieves this by minimizing the total of squared errors. This method is fast and scalable for huge data sets since it avoids iterating over all possible replies; nonetheless, it is unstable.
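For the single-input case, the statement that linear regression minimizes the total of squared errors can be written out explicitly. The symbols below (intercept a, slope b, sample means x̄ and ȳ) are notation introduced here, not taken from the article.

```latex
% Sum of squared errors for the line y = a + b x over n observations
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

% Setting both partial derivatives to zero
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0,
\qquad
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0

% Solving the two equations gives the closed-form estimates
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b \, \bar{x}
```

Because the minimizer has this closed form, the fit needs no iterative search over candidate lines.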


Flood Prediction Using Machine Learning Models

arXiv.org Artificial Intelligence

Floods are among nature's most catastrophic calamities, causing immense and irreversible damage to human life, agriculture, infrastructure and socio-economic systems. Several studies on flood catastrophe management and flood forecasting systems have been conducted. The accurate prediction of the onset and progression of floods in real time is challenging. To estimate water levels and velocities across a large area, it is necessary to combine data with computationally demanding flood propagation models. This paper aims to reduce the extreme risks of this natural disaster and also contributes to policy suggestions by providing a prediction for floods using different machine learning models. This research will use Binary Logistic Regression, K-Nearest Neighbor (KNN), Support Vector Classifier (SVC) and Decision Tree Classifier to provide an accurate prediction. With the outcome, a comparative analysis will be conducted to understand which model delivers better accuracy.
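Of the classifiers listed above, K-Nearest Neighbor is the simplest to sketch. The features (rainfall and river level), the data points, and the labels below are all made up for illustration and do not come from the paper.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        # Squared Euclidean distance is enough for ranking neighbours.
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features: (rainfall in mm, river level in m); labels are invented.
train_points = [
    (10.0, 1.0), (20.0, 1.2), (15.0, 0.9),
    (120.0, 4.5), (150.0, 5.0), (130.0, 4.8),
]
train_labels = ["no_flood", "no_flood", "no_flood", "flood", "flood", "flood"]

wet = knn_predict(train_points, train_labels, (140.0, 4.7))
dry = knn_predict(train_points, train_labels, (12.0, 1.1))
```

In practice the features would usually be scaled first, since an unscaled distance lets the feature with the largest numeric range (here rainfall) dominate the vote.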


A penalized two-pass regression to predict stock returns with time-varying risk premia

arXiv.org Machine Learning

Under the arbitrage pricing theory (Ross, 1976; Chamberlain and Rothschild, 1983), we know that risk premia are drivers of expected excess returns. Hence, estimating them should be useful for prediction of future equity excess returns. The workhorse to estimate equity risk premia in a linear multi-factor setting is the two-pass cross-sectional regression method developed by Black et al. (1972) and Fama and MacBeth (1973). A series of papers address its large and finite sample properties for linear factor models with time-invariant coefficients; see, for example, Shanken (1985, 1992), Jagannathan and Wang (1998), Shanken and Zhou (2007), Kan et al. (2013), and the review paper of Jagannathan et al. (2010) (see Bryzgalova et al. (2019) for a recent Bayesian approach). In a time-varying setting, Gagliardini et al. (2016) (henceforth referred to as GOS) study how we can infer the dynamics of equity risk premia from large stock return data sets under conditional linear factor models (see also Gagliardini