AITopics | Regression

Regression

Distributed Bootstrap for Simultaneous Inference Under High Dimensionality

arXiv.org Machine Learning | Feb-19-2021

We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and intrinsic dimensionality, while being nearly invariant to the nominal dimensionality. We test our theory with extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-time Performance dataset. The code to reproduce the numerical results is available on GitHub: https://github.com/skchao74/Distributed-bootstrap.

assumption, linear model, log 2, (15 more...)

arXiv.org Machine Learning

2102.1008

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Missouri (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry:

Transportation > Passenger (1.00)
Transportation > Air (1.00)
Consumer Products & Services > Travel (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Machine learning on distributed Dask using Amazon SageMaker and AWS Fargate

#artificialintelligence | Feb-18-2021, 17:44:33 GMT

As businesses around the world embark on building innovative solutions, we're seeing a growing trend of adopting data science workloads across various industries. Recently, we've seen a greater push towards reducing the friction between data engineers and data scientists. Data scientists can now run their experiments on their local machine and port them to powerful clusters that can scale without rewriting the code. You have many options for running data science workloads, such as running them on your own managed Spark cluster. Alternatively, there are cloud options such as Amazon SageMaker, Amazon EMR and Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

dask dataframe, dataframe, fargate, (12 more...)

#artificialintelligence

Country:

North America > United States > New York (0.05)
North America > United States > Colorado > El Paso County > Colorado Springs (0.05)

Industry:

Transportation > Passenger (0.70)
Transportation > Ground > Road (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.33)

A Basic Guide to Logistic Regression

#artificialintelligence | Feb-18-2021, 01:53:26 GMT

I love working with human centered data to solve real-world problems using Tableau as my canvas. I am passionate about Data Science, Statistics, and Math. When I'm not working with data, you'll find me jamming with my ukulele or stroking a paintbrush to make some art. If you see me out there, come say hi!

basic guide, logistic regression

#artificialintelligence

Genre:

Research Report > New Finding (0.40)
Research Report > Experimental Study (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.40)

Robust non-parametric mortality and fertility modelling and forecasting: Gaussian process regression approaches

Lam, Ka Kin, Wang, Bo

arXiv.org Machine Learning | Feb-18-2021

There has been an increasing demand for demographic modelling and forecasting over the last few decades, driven by the fact that many developed countries are now experiencing a rapid decline in mortality and fertility, leading to a significant increase in expenditures on health services for an ageing population and a shortage of future labour. A better understanding of mortality and fertility patterns and trends is important for all stakeholders in a society; mortality forecasts, for example, play a vital role for the insurance and pensions industries in pricing their insurance products. Fertility predictions are also of great interest to the government and education sectors in planning children's welfare and educational services. Unlike biological and medical methods, statisticians have developed very different and purely mathematical methods to model demographic patterns and trends, which are well documented by Preston et al. (2000). The history of demographic modelling with mathematical approaches can be traced back to some deterministic models proposed in the mid-nineteenth century; see, for example, Gompertz (1825) and Makeham (1860). The deterministic models are, however, restricted to a few fixed factors and do not consider any stochastic process, owing to the lack of computing capability in that early period.

covariance function, gpr model, mortality rate, (15 more...)

arXiv.org Machine Learning

2102.09676

Country:

Asia > Japan (0.06)
North America > Canada (0.05)
Europe > Switzerland (0.05)
(16 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Education (0.54)
Health & Medicine (0.50)
Government (0.48)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.83)

Transfer Learning for Linear Regression: a Statistical Test of Gain

Obst, David, Ghattas, Badih, Cugliari, Jairo, Oppenheim, Georges, Claudel, Sandra, Goude, Yannig

arXiv.org Machine Learning | Feb-18-2021

Transfer learning, also referred to as knowledge transfer, aims at reusing knowledge from a source dataset on a similar target one. While many empirical studies illustrate the benefits of transfer learning, few theoretical results have been established, especially for regression problems. In this paper a theoretical framework for the problem of parameter transfer for the linear model is proposed. It is shown that the quality of transfer for a new input vector $x$ depends on its representation in an eigenbasis involving the parameters of the problem. Furthermore, a statistical test is constructed to predict whether a fine-tuned model has a lower quadratic prediction risk than the base target model for an unobserved sample. The efficiency of the test is illustrated on synthetic data as well as real electricity consumption data.

estimator, linear model, negative transfer, (15 more...)

arXiv.org Machine Learning

2102.09504

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
North America > United States (0.04)

Genre: Research Report (0.83)

Industry: Energy > Power Industry (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)

StatEcoNet: Statistical Ecology Neural Networks for Species Distribution Modeling

Seo, Eugene, Hutchinson, Rebecca A., Fu, Xiao, Li, Chelsea, Hallman, Tyler A., Kilbride, John, Robinson, W. Douglas

arXiv.org Machine Learning | Feb-17-2021

This paper focuses on a core task in computational sustainability and statistical ecology: species distribution modeling (SDM). In SDM, the occurrence pattern of a species on a landscape is predicted by environmental features based on observations at a set of locations. At first, SDM may appear to be a binary classification problem, and one might be inclined to employ classic tools (e.g., logistic regression, support vector machines, neural networks) to tackle it. However, wildlife surveys introduce structured noise (especially under-counting) in the species observations. If unaccounted for, these observation errors systematically bias SDMs. To address the unique challenges of SDM, this paper proposes a framework called StatEcoNet. Specifically, this work employs a graphical generative model in statistical ecology to serve as the skeleton of the proposed computational framework and carefully integrates neural networks under the framework. The advantages of StatEcoNet over related approaches are demonstrated on simulated datasets as well as bird species data. Since SDMs are critical tools for ecological science and natural resource management, StatEcoNet may offer boosted computational and analytical powers to a wide range of applications that have significant social impacts, e.g., the study and conservation of threatened species.
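The bias caused by under-counting can be illustrated with a minimal simulation. The sketch below is not StatEcoNet; it assumes a basic two-level occupancy model (a latent "occupied" state per site and an independent detection probability per visit), and the function name, parameter values and seed are hypothetical choices made only for illustration.

```python
import random


def simulate_survey(n_sites, n_visits, occupancy_prob, detection_prob, seed=0):
    """Simulate detection histories under a basic occupancy model (illustrative only)."""
    rng = random.Random(seed)
    occupied, histories = [], []
    for _ in range(n_sites):
        # Latent truth: is the species present at this site?
        z = rng.random() < occupancy_prob
        # A species can only be detected at an occupied site, and even there
        # each visit misses it with probability 1 - detection_prob.
        visits = [z and (rng.random() < detection_prob) for _ in range(n_visits)]
        occupied.append(z)
        histories.append(visits)
    return occupied, histories


# Hypothetical settings: 60% of sites occupied, 30% detection chance per visit.
occupied, histories = simulate_survey(5000, 3, 0.6, 0.3, seed=0)

# Fraction of sites that are truly occupied.
true_rate = sum(occupied) / len(occupied)
# Naive estimate: fraction of sites with at least one detection.
naive_rate = sum(any(h) for h in histories) / len(histories)

print(f"true occupancy rate:  {true_rate:.3f}")
print(f"naive occupancy rate: {naive_rate:.3f}")
```

Because a site without detections is not necessarily unoccupied, the naive rate can never exceed the true rate in this simulation, which is the systematic bias that treating the raw observations as binary labels would inherit.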

detection prob, occupancy prob, prob, (16 more...)

arXiv.org Machine Learning

2102.08534

Country:

North America > United States > Ohio > Lucas County > Oregon (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(9 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Education (0.46)
Social Sector (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)

Split Modeling for High-Dimensional Logistic Regression

Christidis, Anthony-Alexander, Van Aelst, Stefan, Zamar, Ruben

arXiv.org Machine Learning | Feb-17-2021

A novel method is proposed to learn an ensemble of logistic classification models in the context of high-dimensional binary classification. The models in the ensemble are built simultaneously by optimizing a multi-convex objective function. To enforce diversity between the models the objective function penalizes overlap between the models in the ensemble. We study the bias and variance of the individual models as well as their correlation and discuss how our method learns the ensemble by exploiting the accuracy-diversity trade-off for ensemble models. In contrast to other ensembling approaches, the resulting ensemble model is fully interpretable as a logistic regression model and at the same time yields excellent prediction accuracy as demonstrated in an extensive simulation study and gene expression data applications. An open-source compiled software library implementing the proposed method is briefly discussed.

adaptive 0, split-en 0, split-lasso 0, (13 more...)

arXiv.org Machine Learning

2102.08591

Country:

North America > United States > New York (0.04)
North America > Canada > British Columbia (0.04)
Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)

Genre:

Research Report > Experimental Study (0.48)
Research Report > New Finding (0.34)

Industry:

Health & Medicine > Therapeutic Area (0.67)
Health & Medicine > Pharmaceuticals & Biotechnology (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Muddling Labels for Regularization, a novel approach to generalization

Lounici, Karim, Meziani, Katia, Riu, Benjamin

arXiv.org Artificial Intelligence | Feb-17-2021

Generalization is a central problem in Machine Learning. Indeed, most prediction methods require careful calibration of hyperparameters, usually carried out on a hold-out validation dataset, to achieve generalization. The main goal of this paper is to introduce a novel approach to achieve generalization without any data splitting, based on a new risk measure which directly quantifies a model's tendency to overfit. To fully understand the intuition and advantages of this new approach, we illustrate it in the simple linear regression model ($Y=X\beta+\xi$), where we develop a new criterion. We highlight how this criterion is a good proxy for the true generalization risk. Next, we derive different procedures which tackle several structures simultaneously (correlation, sparsity, ...). Notably, these procedures concomitantly train the model and calibrate the hyperparameters. In addition, these procedures can be implemented via classical gradient descent methods when the criterion is differentiable w.r.t. the hyperparameters. Our numerical experiments reveal that our procedures are computationally feasible and compare favorably to the popular approach (Ridge, LASSO and Elastic-Net combined with grid-search cross-validation) in terms of generalization. They also outperform the baseline on two additional tasks: estimation and support recovery of $\beta$. Moreover, our procedures do not require any expertise for the calibration of the initial parameters, which remain the same for all the datasets we experimented on.
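For context, the data-splitting baseline that the abstract contrasts itself with can be sketched as follows. This is not the paper's procedure: it is a minimal illustration of ridge regression tuned on a hold-out validation set, restricted to a single coefficient with no intercept so that the closed form fits in one line. The synthetic data, the penalty grid and all names are assumptions made for the example.

```python
import random


def ridge_1d(xs, ys, lam):
    """Closed-form ridge estimate for one coefficient: sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(xs, ys, beta):
    """Mean squared prediction error of the model y = beta * x."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data from Y = X * beta + noise (hypothetical values).
rng = random.Random(0)
beta_true = 2.0
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [beta_true * x + rng.gauss(0.0, 1.0) for x in xs]

# Data splitting: the first 150 points train the model, the last 50 tune lambda.
x_tr, y_tr = xs[:150], ys[:150]
x_val, y_val = xs[150:], ys[150:]

# Grid search: fit on the training part, score each lambda on the hold-out part.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: mse(x_val, y_val, ridge_1d(x_tr, y_tr, lam)) for lam in grid}
best_lam = min(scores, key=scores.get)
beta_hat = ridge_1d(x_tr, y_tr, best_lam)

print(f"selected lambda: {best_lam}, estimated beta: {beta_hat:.3f}")
```

Note that the 50 validation points are used only to pick the penalty and never to fit the coefficient; avoiding this loss of training data is the motivation the abstract gives for calibrating hyperparameters without a split.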

generalization, procedure, regularization, (17 more...)

arXiv.org Artificial Intelligence

2102.08769

Country:

South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report > Promising Solution (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Trees-Based Models for Correlated Data

Rabinowicz, Assaf, Rosset, Saharon

arXiv.org Machine Learning | Feb-16-2021

This paper presents a new approach for trees-based regression, such as simple regression tree, random forest and gradient boosting, in settings involving correlated data. We show the problems that arise when implementing standard trees-based regression models, which ignore the correlation structure. Our new approach explicitly takes the correlation structure into account in the splitting criterion, stopping rules and fitted values.

In this paper we develop a method which combines the concepts of random effects and random fields, which are convenient platforms for analyzing correlated data, with trees-based models such as regression tree, random forest and gradient boosting. The desired result is that the trees-based part yields high prediction accuracy and model selection capabilities, while the random effects aspect boosts model performance by correctly utilizing the correlation structure and even allows statistical inference.

algorithm, covariate, regression tree, (15 more...)

arXiv.org Machine Learning

2102.08114

Country:

North America > United States > California (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.36)

Outside the Echo Chamber: Optimizing the Performative Risk

Miller, John, Perdomo, Juan C., Zrnic, Tijana

arXiv.org Machine Learning | Feb-16-2021

In performative prediction, predictions guide decision-making and hence can influence the distribution of future data. To date, work on performative prediction has focused on finding performatively stable models, which are the fixed points of repeated retraining. However, stable solutions can be far from optimal when evaluated in terms of the performative risk, the loss experienced by the decision maker when deploying a model. In this paper, we shift attention beyond performative stability and focus on optimizing the performative risk directly. We identify a natural set of properties of the loss function and model-induced distribution shift under which the performative risk is convex, a property which does not follow from convexity of the loss alone. Furthermore, we develop algorithms that leverage our structural assumptions to optimize the performative risk with better sample efficiency than generic methods for derivative-free convex optimization.
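The gap between a performatively stable model and a performatively optimal one can be seen in a one-dimensional toy problem. The sketch below is not taken from the paper: it assumes a loss of theta^2/2 - theta*z and a model-induced distribution whose mean is MU + EPS*theta, with MU, EPS and the function names chosen purely for illustration. Under these assumptions the performative risk is (1/2 - EPS)*theta^2 - MU*theta, which is convex when EPS < 1/2.

```python
MU = 1.0   # baseline mean of the outcome z (hypothetical value)
EPS = 0.4  # strength of the model-induced distribution shift (hypothetical value)


def mean_outcome(theta):
    """Mean of z under the distribution induced by deploying theta."""
    return MU + EPS * theta


def performative_risk(theta):
    """E[theta^2/2 - theta*z] with z drawn from the distribution induced by theta."""
    # The loss is linear in z, so only the mean of z matters.
    return 0.5 * theta ** 2 - theta * mean_outcome(theta)


def repeated_retraining(theta0=0.0, steps=200):
    """Refit on the data induced by the current model until a fixed point is reached."""
    theta = theta0
    for _ in range(steps):
        # For a fixed data distribution, argmin_t (t^2/2 - t*E[z]) equals E[z].
        theta = mean_outcome(theta)
    return theta


theta_stable = repeated_retraining()    # converges to MU / (1 - EPS)
theta_optimal = MU / (1.0 - 2.0 * EPS)  # minimizer of the performative risk

print(f"stable:  theta={theta_stable:.4f}, risk={performative_risk(theta_stable):.4f}")
print(f"optimal: theta={theta_optimal:.4f}, risk={performative_risk(theta_optimal):.4f}")
```

In this toy setting the fixed point of repeated retraining sits at MU/(1 - EPS) while the minimizer of the performative risk sits at MU/(1 - 2*EPS), so the stable model incurs a strictly higher performative risk than the optimum.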

algorithm, convex, performative risk, (16 more...)

arXiv.org Machine Learning

2102.0857

Country:

North America > United States > New York (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)

Genre: Research Report (0.82)

Industry: Government > Voting & Elections (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
