Robust Variable Selection for High-dimensional Regression with Missing Data and Measurement Errors

Zhang, Zhenhao

arXiv.org Machine Learning 

The linear relationship between response variables and covariates has long been a topic of interest. With the classical squared loss function, it is usually assumed that the data obey a normal distribution. However, the data discussed in this paper contain a large number of missing values and measurement errors, so the data usually do not conform to any common distributional form. We propose a method based on an exponential squared loss function with a tuning parameter h. For data with different distributions, a better linear regression fit can be achieved by changing the value of h. The loss function is therefore strongly robust for any data distribution and any h ∈ (0, +∞). Previous studies that use the traditional squared loss function place strict requirements on the data distribution, which makes the traditional squared loss function very sensitive to outliers. This reduces the estimation efficiency of the model, and the drawback becomes more obvious in data that contain both missing values and measurement errors. In contrast, the exponential squared loss function can improve the estimation efficiency of the model: varying the tuning parameter h lets it adapt to a wider range of data distributions and produce more reliable estimates.
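As a rough illustration of how the tuning parameter h controls robustness, the sketch below compares the squared loss with an exponential squared loss on a few residuals. The abstract does not state the formula, so the form 1 - exp(-r^2 / h), which is common in the robust regression literature, is an assumption here, and the function names are hypothetical.

```python
import math


def squared_loss(r):
    """Classical squared loss; grows without bound as |r| grows."""
    return r * r


def exp_squared_loss(r, h):
    """Assumed exponential squared loss 1 - exp(-r^2 / h), with h > 0.

    The value is bounded by 1, so a single very large residual cannot
    dominate the objective. This form is an assumption, not taken from
    the source text.
    """
    if h <= 0:
        raise ValueError("tuning parameter h must be positive")
    return 1.0 - math.exp(-(r * r) / h)


if __name__ == "__main__":
    residuals = [0.1, 1.0, 10.0, 100.0]
    print("squared:", [squared_loss(r) for r in residuals])
    for h in (0.5, 5.0, 50.0):
        # Smaller h saturates sooner, which down-weights outliers more.
        losses = [round(exp_squared_loss(r, h), 4) for r in residuals]
        print("h =", h, losses)
```

Under this assumed form, a small h caps the influence of large residuals quickly, while for large h the loss behaves like r^2 / h near zero, so it approximates a rescaled squared loss.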
With the traditional squared loss function, the covariates are assumed by default to be free of missing data and measurement errors. Even when missing data and measurement errors exist, they are assumed to be absent or the affected data are removed. However, this assumption is often violated in disciplines such as health and epidemiology. As an illustration, Zhang and Zhou (1) looked at a collection of breast cancer patients to identify the gene expression associated with long-term disease-free survival. The data collection consists of 24481 gene probes collected from 78 breast cancer patients. In particular, the log-value of the ratio, log10(Ratio), which can be denoted as Y, can be used to forecast disease-free survival. In practice, gene sensors inevitably introduce measurement errors, and in this breast cancer data set the log10(Ratio) values also have missing data. When a dataset contains a large number of missing data and measurement errors, ignoring them and estimating with the traditional squared loss function greatly reduces the estimation accuracy of the model because of the irregular data distribution, resulting in significant estimation bias. In the above dataset, we discover that employing the traditional squared loss function, which handles data with measurement errors and ...
