Regression
When Your Regression Model's Errors Contain Two Peaks
It's good to see that all model coefficients are statistically significant at a p-value of 0.001 i.e. at 99.999% confidence level. As against linear regression models, models in which the dependent variable is a count, rarely produce normally distributed residual error distributions. So we have to normalize the raw-residuals using other means.
How To Code Linear Regression From Scratch -- Quick & Easy!
Here, we load the chocolate data into our program using pandas; we also drop two of the columns we won't be using in our calculation: competitorname and winpercent. Our y becomes the first column in the dataset which indicates if our specific sweet is chocolate (1) or not (0). The remaining columns are used as variables/features to predict our y and, thus, become our X. If you're confused about why we're doing with …[:, 0][:,np.newaxis] on line 5, this is to turn y into a column. We simply add a new dimension to convert the horizontal vector into a vertical column!
On lower bounds for the bias-variance trade-off
Derumigny, Alexis, Schmidt-Hieber, Johannes
It is a common phenomenon that for high-dimensional and nonparametric statistical models, rate-optimal estimators balance squared bias and variance. Although this balancing is widely observed, little is known whether methods exist that could avoid the trade-off between bias and variance. We propose a general strategy to obtain lower bounds on the variance of any estimator with bias smaller than a prespecified bound. This shows to which extent the bias-variance trade-off is unavoidable and allows to quantify the loss of performance for methods that do not obey it. The approach is based on a number of abstract lower bounds for the variance involving the change of expectation with respect to different probability measures as well as information measures such as the Kullback-Leibler or chi-square divergence. Some of these inequalities rely on a new concept of information matrices. In a second part of the article, the abstract lower bounds are applied to several statistical models including the Gaussian white noise model, a boundary estimation problem, the Gaussian sequence model and the high-dimensional linear regression model. For these specific statistical applications, different types of bias-variance trade-offs occur that vary considerably in their strength. For the trade-off between integrated squared bias and integrated variance in the Gaussian white noise model, we propose to combine the general strategy for lower bounds with a reduction technique. This allows us to reduce the original problem to a lower bound on the bias-variance trade-off for estimators with additional symmetry properties in a simpler statistical model. To highlight possible extensions of the proposed framework, we moreover briefly discuss the trade-off between bias and mean absolute deviation.
Detecting Problem Statements in Peer Assessments
Xiao, Yunkai, Zingle, Gabriel, Jia, Qinjin, Shah, Harsh R., Zhang, Yi, Li, Tianyi, Karovaliya, Mohsin, Zhao, Weixiang, Song, Yang, Ji, Jie, Balasubramaniam, Ashwin, Patel, Harshit, Bhalasubbramanian, Priyankha, Patel, Vikram, Gehringer, Edward F.
Effective peer assessment requires students to be attentive to the deficiencies in the work they rate. Thus, their reviews should identify problems. But what ways are there to check that they do? We attempt to automate the process of deciding whether a review comment detects a problem. We use over 18,000 review comments that were labeled by the reviewees as either detecting or not detecting a problem with the work. We deploy several traditional machine-learning models, as well as neural-network models using GloVe and BERT embeddings. We find that the best performer is the Hierarchical Attention Network classifier, followed by the Bidirectional Gated Recurrent Units (GRU) Attention and Capsule model with scores of 93.1% and 90.5% respectively. The best non-neural network model was the support vector machine with a score of 89.71%. This is followed by the Stochastic Gradient Descent model and the Logistic Regression model with 89.70% and 88.98%.
Meta Clustering for Collaborative Learning
Ye, Chenglong, Ding, Jie, Ghanadan, Reza
An emerging number of learning scenarios involve a set of learners/analysts each equipped with a unique dataset and algorithm, who may collaborate with each other to enhance their learning performance. From the perspective of a particular learner, a careless collaboration with task-irrelevant other learners is likely to incur modeling error. A crucial problem is to search for the most appropriate collaborators so that their data and modeling resources can be effectively leveraged. Motivated by this, we propose to study the problem of'meta clustering', where the goal is to identify subsets of relevant learners whose collaboration will improve the performance of each individual learner. In particular, we study the scenario where each learner is performing a supervised regression, and the meta clustering aims to categorize the underlying supervised relations (between responses and predictors) instead of the raw data. We propose a general method named as Select-Exchange-Cluster (SEC) for performing such a clustering. Our method is computationally efficient as it does not require each learner to exchange their raw data. We prove that the SEC method can accurately cluster the learners into appropriate collaboration sets according to their underlying regression functions. Synthetic and real data examples show the desired performance and wide applicability of SEC to a variety of learning tasks. Index Terms Distributed computing; Fairness; Meta clustering; Regression.
Machine learning time series regressions with an application to nowcasting
Babii, Andrii, Ghysels, Eric, Striaukas, Jonas
The statistical imprecision of quarterly gross domestic product (GDP) estimates, along with the fact that the first estimate is available with a delay of nearly a month, pose a significant challenge to policy makers, market participants, and other observers with an interest in monitoring the state of the economy in real time; see, e.g., Ghysels, Horan, and Moench (2018) for a recent discussion of macroeconomic data revision and publication delays. A term originated in meteorology, nowcasting pertains to the prediction of the present and very near future. Nowcasting is intrinsically a mixed frequency data problem as the object of interest is a low-frequency data series (e.g., quarterly GDP), whereas the real-time information (e.g., daily, weekly, or monthly) can be used to update the state, or to put it differently, to nowcast the low-frequency series of interest. Traditional methods used for nowcasting rely on dynamic factor models that treat the underlying low frequency series of interest as a latent process with high frequency data noisy observations. These models are naturally cast in a state-space form and inference can be performed using likelihood-based methods and Kalman filtering techniques; see Bańbura, Giannone, Modugno, and Reichlin (2013) for a recent survey.
Breiman's "Two Cultures" Revisited and Reconciled
Subhadeep, null, Mukhopadhyay, null, Wang, Kaijun
In a landmark paper published in 2001, Leo Breiman described the tense standoff between two cultures of data modeling: parametric statistical and algorithmic machine learning. The cultural division between these two statistical learning frameworks has been growing at a steady pace in recent years. What is the way forward? It has become blatantly obvious that this widening gap between "the two cultures" cannot be averted unless we find a way to blend them into a coherent whole. This article presents a solution by establishing a link between the two cultures. Through examples, we describe the challenges and potential gains of this new integrated statistical thinking.
Using Machine Learning to Forecast Future Earnings
Cui, Xinyue, Xu, Zhaoyu, Zhou, Yue
In this essay, we have comprehensively evaluated the feasibility and suitability of adopting the Machine Learning Models on the forecast of corporation fundamentals (i.e. the earnings), where the prediction results of our method have been thoroughly compared with both analysts' consensus estimation and traditional statistical models. As a result, our model has already been proved to be capable of serving as a favorable auxiliary tool for analysts to conduct better predictions on company fundamentals. Compared with previous traditional statistical models being widely adopted in the industry like Logistic Regression, our method has already achieved satisfactory advancement on both the prediction accuracy and speed. Meanwhile, we are also confident enough that there are still vast potentialities for this model to evolve, where we do hope that in the near future, the machine learning model could generate even better performances compared with professional analysts.
Review of Mathematical frameworks for Fairness in Machine Learning
del Barrio, Eustasio, Gordaliza, Paula, Loubes, Jean-Michel
With both the introduction of new ways of storing, sharing and streaming data and the drastic development of the capacity of computers to handle large computations, the conception of models have changed. Mathematical models were first designed following prior ideas or conjectures from physical or biological models, then tested by designing experiments to test the validity of the ideas of their inventors. The model holds until new observations enable to reject its assumptions. The so-called Big Data's area introduced a new paradigm. The observed data convey enough information to understand the complexity of real life and the more the data, the better the description of the reality. Hence building models optimised to fit the data has become an efficient way to obtain generalizable models able to describe and forecast the real world. In this framework, the principle of supervised machine learning is to build a decision rule from a set of labeled examples called the learning sample, that fits the data.