Regression
Histogram Transform Ensembles for Large-scale Regression
Hang, Hanyuan, Lin, Zhouchen, Liu, Xiaoyu, Wen, Hongwei
We propose a novel algorithm for large-scale regression problems named histogram transform ensembles (HTE), composed of random rotations, stretchings, and translations. First, we investigate the theoretical properties of HTE when the regression function lies in the Hölder space $C^{k,\alpha}$, $k \in \mathbb{N}_0$, $\alpha \in (0,1]$. When $k \in \{0, 1\}$, we adopt constant regressors and develop the naïve histogram transforms (NHT). Within the space $C^{0,\alpha}$, although almost optimal convergence rates can be derived for both single and ensemble NHT, we fail to show the benefits of ensembles over single estimators theoretically. In contrast, in the subspace $C^{1,\alpha}$, we prove that if $d \geq 2(1+\alpha)/\alpha$, the lower bound of the convergence rates for single NHT turns out to be worse than the upper bound of the convergence rates for ensemble NHT. When $k \geq 2$, the NHT may no longer be appropriate for predicting smoother regression functions. Instead, we apply kernel histogram transforms (KHT) equipped with smoother regressors such as support vector machines (SVMs), and it turns out that both single and ensemble KHT enjoy almost optimal convergence rates. We then validate the above theoretical results by numerical experiments. On the one hand, simulations are conducted to show that ensemble NHT outperforms single NHT. On the other hand, the effects of bin sizes on the accuracy of both NHT and KHT also accord with the theoretical analysis. Finally, in the real-data experiments, comparisons between the ensemble KHT, equipped with adaptive histogram transforms, and other state-of-the-art large-scale regression estimators verify the effectiveness and accuracy of our algorithm.
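The NHT idea described above (random rotation, stretching, and translation, followed by a piecewise-constant fit on the resulting bins) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the composition order (rotate, then stretch, then shift), the stretching range `s_min`/`s_max`, the unit-cube binning, and the fallback value for empty cells are all assumptions, and every function name is hypothetical.

```python
import math
import random
from collections import defaultdict


def random_rotation(d, rng):
    """Random orthonormal d x d matrix via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip (nearly) linearly dependent draws
            rows.append([a / norm for a in v])
    return rows


def make_transform(d, rng, s_min=2.0, s_max=4.0):
    """One random transform x -> S R x + b; returns a function giving the bin index.

    Assumption: bins are the unit cubes of the transformed space.
    """
    R = random_rotation(d, rng)
    s = [rng.uniform(s_min, s_max) for _ in range(d)]  # stretching factors
    b = [rng.random() for _ in range(d)]               # translation in [0, 1)

    def cell(x):
        return tuple(
            math.floor(s[i] * sum(R[i][j] * x[j] for j in range(d)) + b[i])
            for i in range(d)
        )

    return cell


def fit_nht(X, y, cell, default=0.0):
    """Constant regressor per bin: the mean response of the samples in that bin."""
    acc = defaultdict(lambda: [0.0, 0])
    for x, t in zip(X, y):
        entry = acc[cell(x)]
        entry[0] += t
        entry[1] += 1
    means = {k: total / count for k, (total, count) in acc.items()}
    # Assumption: empty bins fall back to `default`.
    return lambda x: means.get(cell(x), default)


def fit_ensemble(X, y, n_estimators=20, seed=0):
    """Average the predictions of independently transformed NHT estimators."""
    rng = random.Random(seed)
    d = len(X[0])
    members = [fit_nht(X, y, make_transform(d, rng)) for _ in range(n_estimators)]
    return lambda x: sum(m(x) for m in members) / len(members)
```

Larger stretching factors give smaller bins, which is the bin-size trade-off the abstract refers to; averaging over independent transforms smooths the bin boundaries of any single histogram.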
Logistic regression models for aggregated data
Whitaker, Tom, Beranger, Boris, Sisson, Scott A.
Logistic regression models are a popular and effective method to predict the probability of categorical response data. However, inference for these models can become computationally prohibitive for large datasets. Here we adapt ideas from symbolic data analysis to summarise the collection of predictor variables into histogram form, and perform inference on this summary dataset. We develop ideas based on composite likelihoods to derive an efficient one-versus-rest approximate composite likelihood model for histogram-based random variables, constructed from low-dimensional marginal histograms obtained from the full histogram. We demonstrate that this procedure can achieve classification rates comparable to those of the standard full-data multinomial analysis and of state-of-the-art subsampling algorithms for logistic regression, but at a substantially lower computational cost. Performance is explored through simulated examples, and analyses of large supersymmetry and satellite crop classification datasets.
Global Big Data Conference
Industry has been leveraging the benefits of machine learning for many years. As adoption has grown, ML tools have evolved as well. Today, people can easily work with machine learning thanks to easy-to-use, user-friendly tools. Because gathering data and turning it into actionable insights has been largely automated, people with some knowledge of technology and enough motivation can work with ML. These tools can handle the mundane work of collecting data, adding structure and consistency where possible, and then starting the calculation.
Explainability: Cracking open the black box, Part 1 - KDnuggets
Explainable AI (XAI) is a sub-field of AI which has been gaining ground in the recent past. And as a machine learning practitioner dealing with customers day in and day out, I can see why. I've been an analytics practitioner for more than 5 years, and I swear, the hardest part of a machine learning project is not creating the perfect model which beats all the benchmarks. It's the part where you convince the customer why and how it works. Humans have always had a dichotomy when faced with the unknown.
Robust Deep Ordinal Regression Under Label Noise
State-of-the-art ordinal regression methods rely on the correctness of the labels in the data. Real-world data may be susceptible to label noise, and existing state-of-the-art algorithms do not take label noise into account. So far, none of the approaches for ordinal regression address the label noise issue. We propose two novel noise models for ordinal regression. Further, we propose a general framework for robust ordinal regression learning. The proposed method is based on the unbiased-estimator approach and assumes knowledge of the noise model. We then give a deep learning implementation for two commonly used loss functions for ordinal regression. We prove that this approach gives a rank-consistent model, which is needed for a good ranking rule. We verify the proposed approach empirically and show that it is indeed robust to label noise. To the best of our knowledge, this is the first approach for learning robust deep ordinal regression models in the presence of label noise.
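The unbiased-estimator idea can be illustrated with a generic loss correction for a known noise matrix. The sketch below is an assumption-laden illustration, not the paper's method: it uses the standard construction in which, for a noise matrix `T` with `T[i][j] = P(noisy label = j | true label = i)`, the corrected per-label losses solve `T @ corrected = clean`, so that the expected corrected loss under noise equals the clean loss. The function names and the example noise matrix are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("noise matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def corrected_losses(T, clean_losses):
    """Per-label losses to use on noisy data so the estimator is unbiased.

    clean_losses[k] is the loss of the current prediction if the true label were k.
    """
    return solve(T, clean_losses)


def expected_noisy_loss(T, corrected, true_label):
    """Expectation of the corrected loss over the noisy label, given the true one."""
    return sum(T[true_label][j] * corrected[j] for j in range(len(corrected)))
```

For example, with three ordered classes, a prediction of class 0, and absolute-error losses `[0, 1, 2]`, calling `corrected_losses(T, [0, 1, 2])` gives values whose noise-averaged loss recovers `0`, `1`, and `2` for true labels 0, 1, and 2 respectively.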
Is Netflix Original Content getting worse?
Using the data available, I will build a simple Logistic Regression model to predict the status of a show. For this analysis the training set is small, but the model may still provide some insight into the features that matter in Netflix's decision to Renew or End a show. Since the mean rating of renewed versus ended shows seems to be a major difference, a very simple and intuitive model would be to predict shows with a higher IMDB rating as renewed and shows with a lower rating as ended. My model will take into account more features than just rating and will hopefully provide some insight into why shows are renewed or ended by Netflix management. Considering how small the dataset is and how simple the model is, these accuracy scores are pretty good!
Differentially Private Mixed-Type Data Generation For Unsupervised Learning
Tantipongpipat, Uthaipon, Waites, Chris, Boob, Digvijay, Siva, Amaresh Ankit, Cummings, Rachel
In this work we introduce the DP-auto-GAN framework for synthetic data generation, which combines the low-dimensional representation of autoencoders with the flexibility of Generative Adversarial Networks (GANs). This framework can be used to take in raw sensitive data and privately train a model for generating synthetic data that will satisfy the same statistical properties as the original data. This learned model can be used to generate arbitrary amounts of publicly available synthetic data, which can then be freely shared due to the post-processing guarantees of differential privacy. Our framework is applicable to unlabeled mixed-type data that may include binary, categorical, and real-valued data. We implement this framework on both unlabeled binary data (MIMIC-III) and unlabeled mixed-type data (ADULT). We also introduce new metrics for evaluating the quality of synthetic mixed-type data, particularly in unsupervised settings.
Influenza Modeling Based on Massive Feature Engineering and International Flow Deconvolution
Liu, Ziming, Wang, Yixuan, Han, Zizhao, Wu, Dian
In this article, we focus on the analysis of the potential factors driving the spread of influenza, and possible policies to mitigate the adverse effects of the disease. To be precise, we first apply the discrete Fourier transform (DFT) to establish a yearly periodic regional structure in the influenza activity, which lets us safely restrict ourselves to the analysis of the yearly influenza behavior. Then we collect a massive number of possible region-wise indicators contributing to influenza mortality, such as consumption, immunization, sanitation, water quality, and other indicators from external data, with $1170$ dimensions in total. We extract significant features from the high-dimensional indicators using a combination of data analysis techniques, including matrix completion, support vector machines (SVM), autoencoders, and principal component analysis (PCA). Furthermore, we model the international flow of migration and trade as a convolution on regional influenza activity, and solve the deconvolution problem as higher-order perturbations to the linear regression, thus separating regional and international factors related to influenza mortality. Finally, both the original model and the perturbed model are tested on regional examples, as validations of our models. Regarding policy, we make a proposal based on the connectivity data along with the previously extracted significant features to alleviate the impact of influenza, as well as to efficiently propagate and carry out the policies. We conclude that environmental and economic features are significant for influenza mortality. The model can be easily adapted to other types of infectious diseases.
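The first step, using the DFT to detect a yearly period, can be sketched on a synthetic series. This is only an illustration under stated assumptions: the weekly sampling, the 52-week year, and the function names are hypothetical, and the direct O(n^2) transform is chosen for clarity rather than speed.

```python
import cmath
import math


def dft_magnitudes(series):
    """Magnitudes of the DFT coefficients for frequencies 0 .. n//2.

    The mean is removed first so the constant offset does not dominate.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


def dominant_period(series):
    """Period (in samples) of the strongest non-constant frequency component."""
    magnitudes = dft_magnitudes(series)
    k = max(range(1, len(magnitudes)), key=magnitudes.__getitem__)
    return len(series) / k


# Synthetic weekly activity over four 52-week years (assumed data, not the paper's):
# a strong yearly cycle plus a weaker quarterly one.
weekly_activity = [
    10.0 + 5.0 * math.cos(2 * math.pi * t / 52) + 0.5 * math.sin(2 * math.pi * t / 13)
    for t in range(208)
]
yearly = dominant_period(weekly_activity)
```

A dominant period close to one year in such a spectrum is the kind of evidence that would justify analysing the series one year at a time.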
A Quasi-Newton Method Based Vertical Federated Learning Framework for Logistic Regression
Data privacy and security have become a major concern in building machine learning models from different data providers. Federated learning shows promise by leaving data with the providers locally and exchanging encrypted information. This paper studies the vertical federated learning structure for logistic regression, where the data sets at two parties have the same sample IDs but own disjoint subsets of features. Existing frameworks adopt the first-order stochastic gradient descent algorithm, which requires a large number of communication rounds. To address the communication challenge, we propose a quasi-Newton method based vertical federated learning framework for logistic regression under the additively homomorphic encryption scheme.
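To show why a quasi-Newton update can need fewer rounds than first-order descent, here is a plain, single-machine BFGS sketch for L2-regularised logistic regression. It deliberately omits everything federated: there is no feature split between parties, no encryption, and no communication, and it is not the framework proposed in the paper. The regularisation strength `lam`, the backtracking line search, and all names are assumptions.

```python
import math


def loss_and_grad(w, X, y, lam=0.1):
    """Mean logistic loss plus (lam/2)*||w||^2, and its gradient; y in {-1, +1}."""
    n = len(X)
    loss = 0.5 * lam * sum(v * v for v in w)
    grad = [lam * v for v in w]
    for x, t in zip(X, y):
        margin = t * sum(a * b for a, b in zip(w, x))
        # log(1 + exp(-margin)), computed without overflow
        loss += (max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))) / n
        # p = 1 / (1 + exp(margin)), computed without overflow
        if margin >= 0:
            e = math.exp(-margin)
            p = e / (1.0 + e)
        else:
            p = 1.0 / (1.0 + math.exp(margin))
        for j, xj in enumerate(x):
            grad[j] -= t * p * xj / n
    return loss, grad


def bfgs_logistic(X, y, lam=0.1, max_iter=200, tol=1e-8):
    d = len(X[0])
    w = [0.0] * d
    H = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # inverse Hessian estimate
    f, g = loss_and_grad(w, X, y, lam)
    for _ in range(max_iter):
        if math.sqrt(sum(v * v for v in g)) < tol:
            break
        p = [-sum(H[i][j] * g[j] for j in range(d)) for i in range(d)]
        slope = sum(a * b for a, b in zip(g, p))
        step = 1.0
        while True:  # backtracking (Armijo) line search
            w_new = [a + step * b for a, b in zip(w, p)]
            f_new, g_new = loss_and_grad(w_new, X, y, lam)
            if f_new <= f + 1e-4 * step * slope or step < 1e-10:
                break
            step *= 0.5
        s = [a - b for a, b in zip(w_new, w)]
        yk = [a - b for a, b in zip(g_new, g)]
        sy = sum(a * b for a, b in zip(s, yk))
        if sy > 1e-12:  # curvature condition; otherwise keep the old estimate
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * yk[j] for j in range(d)) for i in range(d)]
            yHy = sum(a * b for a, b in zip(yk, Hy))
            H = [
                [
                    H[i][j]
                    - rho * (s[i] * Hy[j] + Hy[i] * s[j])
                    + (rho * rho * yHy + rho) * s[i] * s[j]
                    for j in range(d)
                ]
                for i in range(d)
            ]
        w, f, g = w_new, f_new, g_new
    return w
```

Each iteration uses only gradients, yet the accumulated curvature estimate `H` rescales the step, which is what typically reduces the number of iterations (and, in a federated setting, rounds) relative to plain gradient descent.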
Asymptotic Unbiasedness of the Permutation Importance Measure in Random Forest Models
Variable selection in sparse regression models is an important task, as applications ranging from biomedical research to econometrics have shown. Especially for higher-dimensional regression problems, in which the link function between response and covariates cannot be directly detected, the selection of informative variables is challenging. Under these circumstances, the Random Forest method is a helpful tool to predict new outcomes while delivering measures for variable selection. One common approach is the use of the permutation importance. Because of its intuitive idea and flexible usage, it is important to explore the circumstances under which the permutation importance based on Random Forest correctly indicates informative covariates. Regarding the latter, we deliver theoretical guarantees for the validity of the permutation importance measure under specific assumptions and prove its (asymptotic) unbiasedness. An extensive simulation study verifies our findings.
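The permutation importance itself is simple to state in code: permute one covariate, and record how much the prediction error grows. The sketch below is a model-agnostic version evaluated on a given sample with an arbitrary fitted predictor; it is an assumption-based illustration and not the out-of-bag Random Forest variant analysed in the paper. The names `mse` and `permutation_importance` and the toy predictor are hypothetical.

```python
import random


def mse(model, X, y):
    """Mean squared error of `model` (a callable on one sample) on (X, y)."""
    return sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling each covariate column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [x[j] for x in X]
            rng.shuffle(column)  # breaks the link between covariate j and y
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy check with a predictor that only uses the first covariate (assumed setup).
data_rng = random.Random(42)
X_demo = [[data_rng.random() for _ in range(3)] for _ in range(50)]
y_demo = [2.0 * x[0] for x in X_demo]
scores = permutation_importance(lambda x: 2.0 * x[0], X_demo, y_demo)
```

In this toy setup only the first score is positive, since the predictor ignores the other two covariates and shuffling them leaves every prediction unchanged.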