Regression
Forecasting in Big Data Environments: an Adaptable and Automated Shrinkage Estimation of Neural Networks (AAShNet)
Habibnia, Ali, Maasoumi, Esfandiar
This paper considers improved forecasting in possibly nonlinear dynamic settings, with high-dimensional predictors ("big data" environments). To overcome the curse of dimensionality and manage data and model complexity, we examine shrinkage estimation of a back-propagation algorithm of a deep neural net with skip-layer connections. We expressly include both linear and nonlinear components. This is a high-dimensional learning approach including both sparsity (L1) and smoothness (L2) penalties, allowing high-dimensionality and nonlinearity to be accommodated in one step. This approach selects significant predictors as well as the topology of the neural network. We estimate optimal values of shrinkage hyperparameters by incorporating a gradient-based optimization technique, resulting in robust predictions with improved reproducibility. The latter has been an issue in some approaches. The approach is statistically interpretable and unravels some network structure, commonly left to a black box. An additional advantage is that the nonlinear part tends to get pruned if the underlying process is linear. In an application to forecasting equity returns, the proposed approach captures nonlinear dynamics between equities to enhance forecast performance. It offers an appreciable improvement over current univariate and multivariate models by RMSE and actual portfolio performance.
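To make the combination of skip-layer connections and L1/L2 shrinkage concrete, here is a minimal sketch of a penalized loss for a one-hidden-layer network with a direct linear path. It is not the authors' AAShNet implementation: the tanh activation, the single hidden layer, the parameter names (`skip`, `W`, `c`, `v`, `b`), and the choice to leave biases unpenalized are all assumptions made for illustration.

```python
import math

def forward(x, params):
    """Skip-layer network: a linear part plus one tanh hidden layer (assumed)."""
    linear = sum(w * xi for w, xi in zip(params["skip"], x))
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + c)
        for row, c in zip(params["W"], params["c"])
    ]
    nonlinear = sum(v * h for v, h in zip(params["v"], hidden))
    return params["b"] + linear + nonlinear

def penalty(params, lam1, lam2):
    """L1 (sparsity) plus L2 (smoothness) penalty on all weights, not on biases."""
    weights = list(params["skip"]) + list(params["v"])
    weights += [w for row in params["W"] for w in row]
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam1 * l1 + lam2 * l2

def penalized_loss(data, params, lam1, lam2):
    """Mean squared error plus the shrinkage penalty."""
    mse = sum((y - forward(x, params)) ** 2 for x, y in data) / len(data)
    return mse + penalty(params, lam1, lam2)

# Hypothetical parameters: the output weights v are zero, so the nonlinear
# part is "pruned" and the model reduces to b + skip . x
params = {
    "b": 0.5,
    "skip": [1.0, -0.5],
    "W": [[0.3, 0.0], [0.0, -0.2]],
    "c": [0.0, 0.0],
    "v": [0.0, 0.0],
}
data = [([1.0, 2.0], 0.5), ([2.0, 0.0], 2.5)]  # made-up (x, y) pairs
loss = penalized_loss(data, params, lam1=0.1, lam2=0.01)
```

With `v` set to zero the network is exactly linear, which mirrors the abstract's remark that the nonlinear part tends to be pruned when the underlying process is linear. The values of `lam1` and `lam2` are fixed by hand here, whereas the paper estimates the shrinkage hyperparameters with a gradient-based technique.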
Crop yield probability density forecasting via quantile random forest and Epanechnikov Kernel function
Gyamerah, Samuel Asante, Ngare, Philip, Ikpe, Dennis
A reliable and accurate forecasting method for crop yields is very important for the farmer, the economy of a country, and the agricultural stakeholders. However, due to weather extremes and uncertainties as a result of increasing climate change, most crop yield forecasting models are not reliable and accurate. In this paper, a hybrid crop yield probability density forecasting method via quantile regression forest and Epanechnikov kernel function (QRF-SJ) is proposed to capture the uncertainties and extremes of weather in crop yield forecasting. By assigning probability to possible crop yield values, a probability density forecast gives a complete description of the yield of crops. A case study using the annual crop yield of groundnut and millet in Ghana is presented to illustrate the efficiency and robustness of the proposed technique. The proposed model is able to capture the nonlinearity between crop yield and the weather variables via random forest. The values of prediction interval coverage probability and prediction interval normalized average width for the two crops show that the constructed prediction intervals cover the target values with perfect probability. The probability density curves show that the QRF-SJ method has a very high ability to forecast quality prediction intervals with a higher coverage probability. The feature importance gives a score of the importance of each weather variable in building the quantile regression forest model. Through feature importance, the farmer and other stakeholders are able to identify the specific weather variables that affect the yield of a selected crop. The proposed method and its application to a crop yield dataset are the first of their kind in the literature.
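The two interval metrics and the kernel named in the abstract can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the yields, interval bounds, and sample quantiles are made up, the bandwidth `h` is fixed by hand, and the metrics use the commonly stated definitions (coverage as the share of targets inside their interval, normalized width as the mean interval width divided by the range of the targets).

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(x, samples, h):
    """Kernel density estimate at x from samples, with bandwidth h."""
    return sum(epanechnikov((x - s) / h) for s in samples) / (len(samples) * h)

def picp(y, lower, upper):
    """Prediction interval coverage probability."""
    hits = sum(1 for yi, lo, hi in zip(y, lower, upper) if lo <= yi <= hi)
    return hits / len(y)

def pinaw(y, lower, upper):
    """Prediction interval normalized average width."""
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(y)
    return mean_width / (max(y) - min(y))

# Made-up observed yields and prediction intervals
y = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.2, 3.0]
upper = [1.5, 2.5, 4.0, 5.0]
coverage = picp(y, lower, upper)   # three of the four targets are covered
width = pinaw(y, lower, upper)

# Made-up predicted quantiles for one year, smoothed into a density
samples = [1.0, 2.0, 4.0]
density_at_2 = kde(2.0, samples, h=1.0)
```

A high coverage value is only informative together with a small normalized width, since arbitrarily wide intervals cover every target.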
Regression-Enhanced Random Forests
Zhang, Haozhe, Nettleton, Dan, Zhu, Zhengyuan
In the last few years, there have been many methodological and theoretical advances in the random forests approach. Some methodological developments and extensions include case-specific random forests [19], multivariate random forests [16], quantile regression forests [13], random survival forests [11], enriched random forests for microarray data [1] and predictor augmentation in random forests [18], among others. For theoretical developments, the statistical and asymptotic properties of random forests have been intensively investigated. Advances have been made in areas such as consistency [2][15], variable selection [8] and the construction of confidence intervals [17]. Although RF methodology has proven itself to be a reliable predictive approach in many application areas [3][10], there are some cases where random forests may suffer. First, as a fully nonparametric predictive algorithm, random forests may not efficiently incorporate known relationships between the response and the predictors. Second, random forests may fail in extrapolation problems where predictions are required at points outside the domain of the training dataset. For regression problems, a random forest prediction is an average of the predictions produced by the trees in the forest.
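The extrapolation problem follows directly from the averaging: each tree predicts a mean of training responses, so the forest can never predict beyond the largest response it has seen. The sketch below illustrates this with a small hand-written forest on one predictor. It is a simplified stand-in, not the paper's method: with a single predictor the per-split feature subsampling of random forests reduces to plain bagging, and the data (`y = 2x` on `[0, 10]`), tree count, and leaf size are made-up choices.

```python
import random

def sse(vals):
    """Sum of squared deviations from the mean."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def fit_tree(xs, ys, min_leaf=2):
    """Grow a 1-D regression tree; a leaf is the mean response (a float)."""
    pairs = sorted(zip(xs, ys))
    sx = [p[0] for p in pairs]
    sy = [p[1] for p in pairs]
    best = None  # (total squared error, split index)
    for k in range(min_leaf, len(sx) - min_leaf + 1):
        if sx[k - 1] == sx[k]:
            continue  # cannot split between identical x values
        cost = sse(sy[:k]) + sse(sy[k:])
        if best is None or cost < best[0]:
            best = (cost, k)
    if best is None:
        return sum(sy) / len(sy)
    k = best[1]
    threshold = (sx[k - 1] + sx[k]) / 2.0
    return (threshold,
            fit_tree(sx[:k], sy[:k], min_leaf),
            fit_tree(sx[k:], sy[k:], min_leaf))

def predict_tree(node, x):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(node, tuple):
        node = node[1] if x <= node[0] else node[2]
    return node

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each tree on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Forest prediction: the average of the tree predictions."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)

xs = [i * 0.2 for i in range(51)]      # training domain [0, 10]
ys = [2.0 * x for x in xs]             # true relationship y = 2x
forest = fit_forest(xs, ys)
inside = predict_forest(forest, 5.0)   # close to the true value 10
outside = predict_forest(forest, 20.0) # true value is 40, but capped by max(ys)
```

Inside the training domain the prediction tracks the linear trend, while at `x = 20` it stays at or below the largest training response, far from the true value of 40.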
Sparse Neural Attentive Knowledge-based Models for Grade Prediction
Grade prediction for future courses not yet taken by students is important as it can help them and their advisers during the process of course selection as well as for designing personalized degree plans and modifying them based on their performance. One of the successful approaches for accurately predicting a student's grades in future courses is Cumulative Knowledge-based Regression Models (CKRM). CKRM learns shallow linear models that predict a student's grades as the similarity between his/her knowledge state and the target course. A student's knowledge state is built by linearly accumulating the learned provided knowledge components of the courses he/she has taken in the past, weighted by his/her grades in them. However, not all the prior courses contribute equally to the target course. In this paper, we propose a novel Neural Attentive Knowledge-based model (NAK) that learns the importance of each historical course in predicting the grade of a target course. Compared to CKRM and other competing approaches, our experiments on a large real-world dataset consisting of $\sim$1.5 grades show the effectiveness of the proposed NAK model in accurately predicting the students' grades. Moreover, the attention weights learned by the model can be helpful in better designing their degree plans.
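The difference between a cumulative knowledge state and an attention-weighted one can be sketched as follows. This is an illustration only, not the CKRM or NAK implementation: the course names, the two-dimensional knowledge vectors, the grades, and the use of a softmax over dot-product scores as the attention mechanism are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def knowledge_state(history, provided):
    """CKRM-style state: provided-knowledge vectors summed, weighted by grade."""
    dim = len(next(iter(provided.values())))
    state = [0.0] * dim
    for course, grade in history:
        for j, v in enumerate(provided[course]):
            state[j] += grade * v
    return state

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(history, provided, target_vec):
    """Assumed attention: softmax over each prior course's match to the target."""
    return softmax([dot(provided[c], target_vec) for c, _ in history])

def attentive_state(history, provided, target_vec):
    """Knowledge state in which each prior course is scaled by its weight."""
    weights = attention_weights(history, provided, target_vec)
    dim = len(target_vec)
    state = [0.0] * dim
    for w, (course, grade) in zip(weights, history):
        for j, v in enumerate(provided[course]):
            state[j] += w * grade * v
    return state

# Hypothetical courses, knowledge vectors, and grades
provided = {"calc1": [1.0, 0.0], "prog1": [0.0, 1.0]}
target_vec = [0.8, 0.2]                      # knowledge required by the target
history = [("calc1", 4.0), ("prog1", 3.0)]   # (course, grade) pairs

ckrm_grade = dot(knowledge_state(history, provided), target_vec)
weights = attention_weights(history, provided, target_vec)
nak_grade = dot(attentive_state(history, provided, target_vec), target_vec)
```

In the cumulative version every prior course contributes in proportion to its grade, whereas the attention weights let the course that is more relevant to the target (`calc1` here) count for more.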
Can Machine Learning Model with Static Features be Fooled: an Adversarial Machine Learning Approach
Taheri, Rahim, Javidan, Reza, Shojafar, Mohammad, P, Vinod, Conti, Mauro
Applied Intelligence manuscript. Abstract: The widespread adoption of smartphones dramatically increases the risk of attacks and the spread of mobile malware, especially on the Android platform. Machine learning based solutions have already been used as a tool to supersede signature based anti-malware systems. However, malware authors leverage attributes [...]. Hence, to evaluate the vulnerability of machine learning algorithms in malware detection, we propose five different attack scenarios to perturb malicious applications (apps). [...] inappropriately fits discriminant function on the set of data points, eventually yielding a higher misclassification [...]. Further, to distinguish the adversarial examples from benign samples, we propose two defense [...]. To validate our attacks and solutions, we test our model on three different [...]. We also test our methods using various classifier algorithms and compare them [...]. Promising results show that generated [...]. Additionally, evasive variants generated by our attack models when used to harden the developed anti-malware system improves the detection [...]. Keywords: Adversarial machine learning · malware detection · poison attacks · adversarial example · Jacobian algorithm.
1 Introduction. Nowadays using the Android application is very popular [...]. Every Android application has a Jar-like APK format and is an archive file which contains Android manifest and Classes.dex [...]. [...] manifest file holds information about the application structure and each part responsible for certain actions. For instance, the requested permissions must be accepted by the users for successful installation of applications. The manifest file contains a list of hardware components and permissions required by each application. [...] the manifest file that are useful for running applications. [...] is saved as the classes.dex [...]. In a nutshell, the generated malware sample is statistically identical to a benign sample. To do so, adversaries adopt adversarial machine learning algorithms (AML) to design an example set called poison data which is used to fool machine learning models. [...] by presenting some adversary-aware approaches? Do we require retraining of the current ML model to design adversary-aware learning algorithms? How to properly test and validate the countermeasure solutions in a real-world network? The goal of this paper is to shed [...]
The Why's and how's of Machine Learning
Knowledge is the output of learning through the inseparable combination of theory and practice. It is what remains in one's experience from all the data that has been shaped into what we call information. This process can be observed throughout the different stages of our lives and is never limited to the academic journey. What I'm aiming to express is that machine learning is nothing but human logic tailored to more complex problems that require more computational capability. The last quote represents the natural knowledge acquisition process which, as you may notice, is similar to the CRISP-DM methodology, which I detailed in a previous article and which is essential to the success of your data mining project. To define it, machine learning is a set of algorithms used in many operations, such as the data mining process, that help you transform your raw data into knowledge, the layer that hides under the obvious information.
Top 10 Machine Learning Algorithms for Data Science
For the majority of newcomers, machine learning algorithms may seem too boring and complicated a subject to master. Well, to some extent, this is true. In most cases, you stumble upon a few-page description for each algorithm and yes, it's hard to find the time and energy to deal with each and every detail. However, if you truly, madly, deeply want to be an ML expert, you have to brush up your knowledge of them; there is no other way. But relax, today I will try to simplify this task and explain the core principles of the 10 most common algorithms in simple words (each includes a brief description, guides, and useful links).
Causal Discovery with General Non-Linear Relationships Using Non-Linear ICA
Monti, Ricardo Pio, Zhang, Kun, Hyvarinen, Aapo
We consider the problem of inferring causal relationships between two or more passively observed variables. While the problem of such causal discovery has been extensively studied especially in the bivariate setting, the majority of current methods assume a linear causal relationship, and the few methods which consider non-linear dependencies usually make the assumption of additive noise. Here, we propose a framework through which we can perform causal discovery in the presence of general non-linear relationships. The proposed method is based on recent progress in non-linear independent component analysis and exploits the non-stationarity of observations in order to recover the underlying sources or latent disturbances. We show rigorously that in the case of bivariate causal discovery, such non-linear ICA can be used to infer the causal direction via a series of independence tests. We further propose an alternative measure of causal direction based on asymptotic approximations to the likelihood ratio, as well as an extension to multivariate causal discovery. We demonstrate the capabilities of the proposed method via a series of simulation studies and conclude with an application to neuroimaging data.
Memory-Sample Tradeoffs for Linear Regression with Small Error
Sharan, Vatsal, Sidford, Aaron, Valiant, Gregory
We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples $(a_1,b_1), (a_2,b_2)\ldots,$ with $a_i$ drawn independently from a $d$-dimensional isotropic Gaussian, and where $b_i = \langle a_i, x\rangle + \eta_i,$ for a fixed $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$ and with independent noise $\eta_i$ drawn uniformly from the interval $[-2^{-d/5},2^{-d/5}].$ We show that any algorithm with at most $d^2/4$ bits of memory requires at least $\Omega(d \log \log \frac{1}{\epsilon})$ samples to approximate $x$ to $\ell_2$ error $\epsilon$ with probability of success at least $2/3$, for $\epsilon$ sufficiently small as a function of $d$. In contrast, for such $\epsilon$, $x$ can be recovered to error $\epsilon$ with probability $1-o(1)$ with memory $O\left(d^2 \log(1/\epsilon)\right)$ using $d$ examples. This represents the first nontrivial lower bounds for regression with super-linear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization.
On Low-rank Trace Regression under General Sampling Distribution
A growing number of modern statistical learning problems involve estimating a large number of parameters from a (smaller) number of observations. In a subset of these problems (matrix completion, matrix compressed sensing, and multi-task learning) the unknown parameters form a high-dimensional matrix, and two popular approaches for the estimation are trace-norm regularized linear regression and alternating minimization. It is also known that these estimators satisfy certain optimal tail bounds under assumptions on rank, coherence, or spikiness of the unknown matrix. We study a general family of estimators and sampling distributions that includes the above two estimators, and introduce a general notion of spikiness and rank for the unknown matrix. Next, we extend the existing literature on the analysis of these estimators and provide a unifying technique to prove tail bounds for the estimation error. We demonstrate the benefit of this generalization by studying its application to four problems: (1) matrix completion, (2) multi-task learning, (3) compressed sensing with Gaussian ensembles, and (4) compressed sensing with factored measurements. For (1) and (3), we recover tail bounds matching those found in the literature, and for (2) and (4) we obtain (to the best of our knowledge) the first tail bounds. Our approach relies on a generic recipe to prove restricted strong convexity for the sampling operator of the trace regression, which only requires finding upper bounds on certain norms of the parameter matrix.