Regression
Deconfounding and Causal Regularization for Stability and External Validity
Bühlmann, Peter, Ćevid, Domagoj
Brad Efron, in his lecture on the occasion of receiving the International Prize in Statistics, brought up some fascinating thoughts on "prediction, estimation and attribution", with particular attention to the new "wide data era" which has entered statistics and data science more generally (Efron, 2019, 2020). Looking back almost 20 years, there has been a huge development in statistics since Leo Breiman's article "Statistical Modeling: The Two Cultures" (Breiman, 2001). Even more broadly, data science has emerged as a new field and profession. It deals with information extraction from data, often in close proximity with other sciences. Its historical roots are in statistics, and statistical "critical" thinking plays an ever more important role in inference from data to models and prediction. There are many interesting facets of this broad topic; see for example David Donoho's "50 years of Data Science" (Donoho, 2017) or Bin Yu's "Veridical Data Science" (Yu and Kumbier, 2020). We present here a few additional considerations on Efron's ideas, as outlined in the following Sections 1.1 and 1.2.
Privacy-Preserving Asynchronous Federated Learning Algorithms for Multi-Party Vertically Collaborative Learning
Gu, Bin, Xu, An, Huo, Zhouyuan, Deng, Cheng, Huang, Heng
Privacy-preserving federated learning for vertically partitioned data has shown promising results as a solution to the emerging multi-party joint modeling application, in which the data holders (such as government branches, private finance and e-business companies) collaborate throughout the learning process rather than relying on a trusted third party to hold data. However, existing federated learning algorithms for vertically partitioned data are limited to synchronous computation. Because unbalanced computation and communication resources are common among the parties in a federated learning system, improving efficiency requires asynchronous training algorithms for vertically partitioned data that still preserve data privacy. In this paper, we propose an asynchronous federated SGD (AFSGD-VP) algorithm and its SVRG and SAGA variants on vertically partitioned data. Moreover, we provide convergence analyses of AFSGD-VP and its SVRG and SAGA variants under the condition of strong convexity. We also discuss their model privacy, data privacy, computational complexities and communication costs. To the best of our knowledge, AFSGD-VP and its SVRG and SAGA variants are the first asynchronous federated learning algorithms for vertically partitioned data. Extensive experimental results on a variety of vertically partitioned datasets not only verify the theoretical results of AFSGD-VP and its SVRG and SAGA variants, but also show that our algorithms have much higher efficiency than the corresponding synchronous algorithms.
Federated Doubly Stochastic Kernel Learning for Vertically Partitioned Data
Gu, Bin, Dang, Zhiyuan, Li, Xiang, Huang, Heng
In many real-world data mining and machine learning applications, data are provided by multiple providers, each of which maintains private records of different feature sets about common entities. It is challenging for traditional data mining and machine learning algorithms to train on such vertically partitioned data effectively and efficiently while keeping the data private. In this paper, we focus on nonlinear learning with kernels, and propose a federated doubly stochastic kernel learning (FDSKL) algorithm for vertically partitioned data. Specifically, we use random features to approximate the kernel mapping function and use doubly stochastic gradients to update the solutions, which are all computed federatedly without the disclosure of data. Importantly, we prove that FDSKL has a sublinear convergence rate, and can guarantee data security under the semi-honest assumption. Extensive experimental results on a variety of benchmark datasets show that FDSKL is significantly faster than state-of-the-art federated learning methods when dealing with kernels, while retaining similar generalization performance.
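To make the phrase "random features to approximate the kernel mapping function" concrete, here is a minimal single-machine sketch of random Fourier features for an RBF kernel. It is not the FDSKL algorithm: it has no federated computation and no doubly stochastic gradient updates, and the function names, the choice of RBF kernel, and all parameter values are assumptions made for illustration.

```python
import math
import random


def make_random_features(dim, num_features, gamma, seed=0):
    """Build a random Fourier feature map for k(x, y) = exp(-gamma * ||x - y||^2).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma * I)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        # z_j(x) = sqrt(2 / D) * cos(w_j . x + b_j)
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, used here only as a reference."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


# Made-up inputs: the inner product of the feature vectors approximates the kernel.
feature_map = make_random_features(dim=3, num_features=4000, gamma=0.5)
x, y = [0.1, 0.2, 0.3], [0.4, 0.0, -0.1]
approx = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
exact = rbf_kernel(x, y, gamma=0.5)
```

The approximation error shrinks roughly like one over the square root of the number of random features, so `approx` and `exact` should agree more closely as `num_features` grows.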
An information criterion for automatic gradient tree boosting
Lunde, Berent Ånund Strømnes, Kleppe, Tore Selland, Skaug, Hans Julius
This article is motivated by the problem of selecting the functional form of trees and ensemble size in gradient tree boosting (Friedman, 2001; Mason et al., 2000). Gradient tree boosting (GTB) has become extremely popular in recent years, both in academia and industry. Datasets are currently growing both in the number of observations and in the richness of the data, that is, the number of features. This, coupled with an exponential increase in computational power and a growing awareness and acceptance of data-driven decisions in industry, drives an increasing interest in statistical learning (Hastie et al., 2001). For these new datasets, standard statistical methods such as generalized linear models (McCullagh and Nelder, 1989), which have a fixed learning rate due to their constrained functional form with bounded complexity, struggle in terms of predictive power, as they stop learning at certain information thresholds. The interest is therefore geared towards more flexible approaches such as ensembles of learners.
(Almost) All of Entity Resolution
Binette, Olivier, Steorts, Rebecca C.
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a task commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940s and 1950s and that has led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.
Robust Validation: Confident Predictions Even When Distributions Shift
Cauchois, Maxime, Gupta, Suyash, Ali, Alnur, Duchi, John C.
While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy---coming from robust statistics and optimization---is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an $f$-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.'s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
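For readers unfamiliar with conformal inference, the following sketch shows standard split conformal prediction for a regression model. It is not the robust method of the abstract: it assumes the calibration and test data are exchangeable and does not handle an $f$-divergence ball of shifted distributions. The function names and the calibration numbers are made up for illustration.

```python
import math


def conformal_radius(calib_preds, calib_targets, alpha):
    """Half-width of a split-conformal interval at miscoverage level alpha.

    Hypothetical helper: uses absolute residuals on a held-out calibration
    set as nonconformity scores.
    """
    scores = sorted(abs(y - p) for p, y in zip(calib_preds, calib_targets))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def prediction_interval(pred, radius):
    """Symmetric interval around a point prediction."""
    return pred - radius, pred + radius


# Made-up calibration data: model predictions and the observed targets.
calib_preds = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
calib_targets = [11.0, 18.0, 33.0, 36.0, 55.0, 54.0, 77.0]

radius = conformal_radius(calib_preds, calib_targets, alpha=0.25)
low, high = prediction_interval(100.0, radius)
```

Under exchangeability, an interval built this way covers a new target with probability at least 1 - alpha; the abstract's contribution is to keep coverage (nearly) valid when the test distribution drifts away from the training population.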
What are Classification and Regression in ML?
ML is extracting knowledge from data. Machine learning is the study of algorithms that give computers the ability to learn from data and predict outcomes accurately without being explicitly programmed. Machine learning is divided into three categories: supervised learning, unsupervised learning, and reinforcement learning. As the name "supervised learning" suggests, learning here is based on examples. We have a known set of inputs (called features, x) and outputs (called labels, y).
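As a rough illustration of the two supervised tasks in the title, the sketch below fits a regression model (continuous label y) and a classifier (categorical label y) on toy data. The data, the function names, and the choice of a least-squares line and a nearest-centroid rule are assumptions for illustration; the article does not prescribe any particular algorithm.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_centroids(xs, labels):
    """Classification: store the mean feature value of each class."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(values) / len(values) for label, values in groups.items()}


def predict_class(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Regression: the label is a number.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Classification: the label is a category.
centroids = fit_centroids([1.0, 2.0, 8.0, 9.0],
                          ["small", "small", "large", "large"])
guess = predict_class(centroids, 3.0)
```

In both cases the model is learned from known (x, y) pairs; the only difference is whether the predicted y is a continuous value or one of a fixed set of classes.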
15 Machine Learning and Data Science Project Ideas with Datasets
In this article, we'll discuss 15 machine learning and data science projects for beginners as well as for the intermediate level. Projects are some of the best investments of your time. You'll enjoy learning, stay motivated, and make faster progress. For machine learning or data science projects, finding a dataset can be quite difficult. And, to build accurate models, you need a huge amount of data.
Individualized Prediction of COVID-19 Adverse outcomes with MLHO
Estiri, Hossein, Strasser, Zachary H., Murphy, Shawn N.
The COVID-19 pandemic has devastated the world with health and economic wreckage. Precise estimates of the COVID-19 adverse outcomes on individual patients could have led to better allocation of healthcare resources and more efficient targeted preventive measures. We developed MLHO (pronounced as melo) for predicting patient-level risk of hospitalization, ICU admission, need for mechanical ventilation, and death from patients' past (before COVID-19 infection) medical records. MLHO is an end-to-end Machine Learning pipeline that implements iterative sequential representation mining and feature and model selection to predict health outcomes. MLHO's architecture enables a parallel and outcome-oriented calibration, in which different statistical learning algorithms and vectors of features are simultaneously tested and leveraged to improve prediction of health outcomes. Using clinical data from a large cohort of over 14,000 patients, we modeled the four adverse outcomes utilizing about 600 features representing patients' before-COVID health records. Overall, the best predictions were obtained from extreme and gradient boosting models. The median AUC ROC for mortality prediction was 0.91, while the prediction performance ranged between 0.79 and 0.83 for ICU, hospitalization, and ventilation. We broadly describe the clusters of features that were utilized in modeling and their relative influence on predicting each outcome. As COVID-19 cases are re-surging in the U.S. and around the world, a Machine Learning pipeline like MLHO is crucial to improve our readiness for confronting the potential future waves of COVID-19, as well as other novel infectious diseases that may emerge in the near future.
What is AI - specifically what is machine learning?
This entry is part 2 of 3 in the series What is AI once and for all? Artificial intelligence is science fiction. Artificial intelligence is already part of our everyday lives. All those statements are true, it just depends on what flavor of AI you are referring to. Most of us are familiar with the term "Artificial Intelligence." After all, it's been a popular focus in movies such as The Terminator, The Matrix, and Ex Machina, but you may have recently been hearing about other terms like "Machine Learning" and "Deep Learning," sometimes used interchangeably with artificial intelligence.