Regression
Covariance regression with random forests
Alakus, Cansu, Larocque, Denis, Labbe, Aurelie
Capturing the conditional covariances or correlations among the elements of a multivariate response vector based on covariates is important to various fields including neuroscience, epidemiology and biomedicine. We propose a new method called Covariance Regression with Random Forests (CovRegRF) to estimate the covariance matrix of a multivariate response given a set of covariates, using a random forest framework. Random forest trees are built with a splitting rule specially designed to maximize the difference between the sample covariance matrix estimates of the child nodes. We also propose a significance test for the partial effect of a subset of covariates. We evaluate the performance of the proposed method and significance test through a simulation study which shows that the proposed method provides accurate covariance matrix estimates and that the Type-1 error is well controlled. An application of the proposed method to thyroid disease data is also presented. CovRegRF is implemented in a freely available R package on CRAN.
Computationally Efficient and Statistically Optimal Robust High-Dimensional Linear Regression
Shen, Yinan, Li, Jingyang, Cai, Jian-Feng, Xia, Dong
High-dimensional linear regression under heavy-tailed noise or outlier corruption is challenging, both computationally and statistically. Convex approaches have been proven statistically optimal but suffer from high computational costs, especially since the robust loss functions are usually non-smooth. More recently, computationally fast non-convex approaches via sub-gradient descent are proposed, which, unfortunately, fail to deliver a statistically consistent estimator even under sub-Gaussian noise. In this paper, we introduce a projected sub-gradient descent algorithm for both the sparse linear regression and low-rank linear regression problems. The algorithm is not only computationally efficient with linear convergence but also statistically optimal, be the noise Gaussian or heavy-tailed with a finite 1 + epsilon moment. The convergence theory is established for a general framework and its specific applications to absolute loss, Huber loss and quantile loss are investigated. Compared with existing non-convex methods, ours reveals a surprising phenomenon of two-phase convergence. In phase one, the algorithm behaves as in typical non-smooth optimization that requires gradually decaying stepsizes. However, phase one only delivers a statistically sub-optimal estimator, which is already observed in the existing literature. Interestingly, during phase two, the algorithm converges linearly as if minimizing a smooth and strongly convex objective function, and thus a constant stepsize suffices. Underlying the phase-two convergence is the smoothing effect of random noise to the non-smooth robust losses in an area close but not too close to the truth. Numerical simulations confirm our theoretical discovery and showcase the superiority of our algorithm over prior methods.
Best-Effort Adaptation
Awasthi, Pranjal, Cortes, Corinna, Mohri, Mehryar
We study a problem of best-effort adaptation motivated by several applications and considerations, which consists of determining an accurate predictor for a target domain, for which a moderate amount of labeled samples are available, while leveraging information from another domain for which substantially more labeled samples are at one's disposal. We present a new and general discrepancy-based theoretical analysis of sample reweighting methods, including bounds holding uniformly over the weights. We show how these bounds can guide the design of learning algorithms that we discuss in detail. We further show that our learning guarantees and algorithms provide improved solutions for standard domain adaptation problems, for which few labeled data or none are available from the target domain. We finally report the results of a series of experiments demonstrating the effectiveness of our best-effort adaptation and domain adaptation algorithms, as well as comparisons with several baselines. We also discuss how our analysis can benefit the design of principled solutions for fine-tuning.
Enhancing Clinical Predictive Modeling through Model Complexity-Driven Class Proportion Tuning for Class Imbalanced Data: An Empirical Study on Opioid Overdose Prediction
Liu, Yinan, Dong, Xinyu, Lyu, Weimin, Rosenthal, Richard N., Wong, Rachel, Ma, Tengfei, Wang, Fusheng
Class imbalance problems widely exist in the medical field and heavily deteriorates performance of clinical predictive models. Most techniques to alleviate the problem rebalance class proportions and they predominantly assume the rebalanced proportions should be a function of the original data and oblivious to the model one uses. This work challenges this prevailing assumption and proposes that links the optimal class proportions to the model complexity, thereby tuning the class proportions per model. Our experiments on the opioid overdose prediction problem highlights the performance gain of tuning class proportions. Rigorous regression analysis also confirms the advantages of the theoretical framework proposed and the statistically significant correlation between the hyperparameters controlling the model complexity and the optimal class proportions. Introduction The class imbalance problem often occurs in medical machine learning because "positive" patients generally comprise a small fraction of the total population.
Coresets for Wasserstein Distributionally Robust Optimization Problems
Huang, Ruomin, Huang, Jiawei, Liu, Wenjie, Ding, Hu
Wasserstein distributionally robust optimization (\textsf{WDRO}) is a popular model to enhance the robustness of machine learning with ambiguous data. However, the complexity of \textsf{WDRO} can be prohibitive in practice since solving its ``minimax'' formulation requires a great amount of computation. Recently, several fast \textsf{WDRO} training algorithms for some specific machine learning tasks (e.g., logistic regression) have been developed. However, the research on designing efficient algorithms for general large-scale \textsf{WDRO}s is still quite limited, to the best of our knowledge. \textit{Coreset} is an important tool for compressing large dataset, and thus it has been widely applied to reduce the computational complexities for many optimization problems. In this paper, we introduce a unified framework to construct the $\epsilon$-coreset for the general \textsf{WDRO} problems. Though it is challenging to obtain a conventional coreset for \textsf{WDRO} due to the uncertainty issue of ambiguous data, we show that we can compute a ``dual coreset'' by using the strong duality property of \textsf{WDRO}. Also, the error introduced by the dual coreset can be theoretically guaranteed for the original \textsf{WDRO} objective. To construct the dual coreset, we propose a novel grid sampling approach that is particularly suitable for the dual formulation of \textsf{WDRO}. Finally, we implement our coreset approach and illustrate its effectiveness for several \textsf{WDRO} problems in the experiments.
When a CBR in Hand is Better than Twins in the Bush
Ahmed, Mobyen Uddin, Barua, Shaibal, Begum, Shahina, Islam, Mir Riyanul, Weber, Rosina O
AI methods referred to as interpretable are often discredited as inaccurate by supporters of the existence of a trade-off between interpretability and accuracy. In many problem contexts however this trade-off does not hold. This paper discusses a regression problem context to predict flight take-off delays where the most accurate data regression model was trained via the XGBoost implementation of gradient boosted decision trees. While building an XGB-CBR Twin and converting the XGBoost feature importance into global weights in the CBR model, the resultant CBR model alone provides the most accurate local prediction, maintains the global importance to provide a global explanation of the model, and offers the most interpretable representation for local explanations. This resultant CBR model becomes a benchmark of accuracy and interpretability for this problem context, and hence it is used to evaluate the two additive feature attribute methods SHAP and LIME to explain the XGBoost regression model.
Learning Group Importance using the Differentiable Hypergeometric Distribution
Sutter, Thomas M., Manduchi, Laura, Ryser, Alain, Vogt, Julia E.
Partitioning a set of elements into subsets of a priori unknown sizes is essential in many applications. These subset sizes are rarely explicitly learned - be it the cluster sizes in clustering applications or the number of shared versus independent generative latent factors in weakly-supervised learning. Probability distributions over correct combinations of subset sizes are non-differentiable due to hard constraints, which prohibit gradient-based optimization. In this work, we propose the differentiable hypergeometric distribution. The hypergeometric distribution models the probability of different group sizes based on their relative importance. We introduce reparameterizable gradients to learn the importance between groups and highlight the advantage of explicitly learning the size of subsets in two typical applications: weakly-supervised learning and clustering. In both applications, we outperform previous approaches, which rely on suboptimal heuristics to model the unknown size of groups. Many machine learning approaches rely on differentiable sampling procedures, from which the reparameterization trick for Gaussian distributions is best known (Kingma & Welling, 2014; Rezende et al., 2014). The non-differentiable nature of discrete distributions has long hindered their use in machine learning pipelines with end-to-end gradient-based optimization. Only the concrete distribution (Maddison et al., 2017) or Gumbel-Softmax trick (Jang et al., 2016) boosted the use of categorical distributions in stochastic networks. Unlike the high-variance gradients of score-based methods such as REINFORCE (Williams, 1992), these works enable reparameterized and lowvariance gradients with respect to the categorical weights. Despite enormous progress in recent years, the extension to more complex probability distributions is still missing or comes with a trade-off regarding differentiability or computational speed (Huijben et al., 2021). The hypergeometric distribution plays a vital role in various areas of science, such as social and computer science and biology. The range of applications goes from modeling gene mutations and recommender systems to analyzing social networks (Becchetti et al., 2011; Casiraghi et al., 2016; Lodato et al., 2015). The hypergeometric distribution describes sampling without replacement and, therefore, models the number of samples per group given a limited number of total samples. Hence, it is essential wherever the choice of a single group element influences the probability of the remaining elements being drawn. Previous work mainly uses the hypergeometric distribution implicitly to model assumptions or as a tool to prove theorems.
Provable Identifiability of Two-Layer ReLU Neural Networks via LASSO Regularization
Li, Gen, Wang, Ganghua, Ding, Jie
LASSO regularization is a popular regression tool to enhance the prediction accuracy of statistical models by performing variable selection through the $\ell_1$ penalty, initially formulated for the linear model and its variants. In this paper, the territory of LASSO is extended to two-layer ReLU neural networks, a fashionable and powerful nonlinear regression model. Specifically, given a neural network whose output $y$ depends only on a small subset of input $\boldsymbol{x}$, denoted by $\mathcal{S}^{\star}$, we prove that the LASSO estimator can stably reconstruct the neural network and identify $\mathcal{S}^{\star}$ when the number of samples scales logarithmically with the input dimension. This challenging regime has been well understood for linear models while barely studied for neural networks. Our theory lies in an extended Restricted Isometry Property (RIP)-based analysis framework for two-layer ReLU neural networks, which may be of independent interest to other LASSO or neural network settings. Based on the result, we advocate a neural network-based variable selection method. Experiments on simulated and real-world datasets show promising performance of the variable selection approach compared with existing techniques.
AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against White-Box Models
Oksuz, Abdullah Caglar, Halimi, Anisa, Ayday, Erman
Explainable Artificial Intelligence (XAI) encompasses a range of techniques and procedures aimed at elucidating the decision-making processes of AI models. While XAI is valuable in understanding the reasoning behind AI models, the data used for such revelations poses potential security and privacy vulnerabilities. Existing literature has identified privacy risks targeting machine learning models, including membership inference, model inversion, and model extraction attacks. Depending on the settings and parties involved, such attacks may target either the model itself or the training data used to create the model. We have identified that tools providing XAI can particularly increase the vulnerability of model extraction attacks, which can be a significant issue when the owner of an AI model prefers to provide only black-box access rather than sharing the model parameters and architecture with other parties. To explore this privacy risk, we propose AUTOLYCUS, a model extraction attack that leverages the explanations provided by popular explainable AI tools. We particularly focus on white-box machine learning (ML) models such as decision trees and logistic regression models. We have evaluated the performance of AUTOLYCUS on 5 machine learning datasets, in terms of the surrogate model's accuracy and its similarity to the target model. We observe that the proposed attack is highly effective; it requires up to 60x fewer queries to the target model compared to the state-of-the-art attack, while providing comparable accuracy and similarity. We first validate the performance of the proposed algorithm on decision trees, and then show its performance on logistic regression models as an indicator that the proposed algorithm performs well on white-box ML models in general. Finally, we show that the existing countermeasures remain ineffective for the proposed attack.
A Comprehensive Survey on Enterprise Financial Risk Analysis from Big Data Perspective
Zhao, Yu, Du, Huaming, Li, Qing, Zhuang, Fuzhen, Liu, Ji, Kou, Gang
Enterprise financial risk analysis aims at predicting the future financial risk of enterprises. Due to its wide and significant application, enterprise financial risk analysis has always been the core research topic in the fields of Finance and Management. Based on advanced computer science and artificial intelligence technologies, enterprise risk analysis research is experiencing rapid developments and making significant progress. Therefore, it is both necessary and challenging to comprehensively review the relevant studies. Although there are already some valuable and impressive surveys on enterprise risk analysis from the perspective of Finance and Management, these surveys introduce approaches in a relatively isolated way and lack recent advances in enterprise financial risk analysis. In contrast, this paper attempts to provide a systematic literature survey of enterprise risk analysis approaches from Big Data perspective, which reviews more than 250 representative articles in the past almost 50 years (from 1968 to 2023). To the best of our knowledge, this is the first and only survey work on enterprise financial risk from Big Data perspective. Specifically, this survey connects and systematizes the existing enterprise financial risk studies, i.e. to summarize and interpret the problems, methods, and spotlights in a comprehensive way. In particular, we first introduce the issues of enterprise financial risks in terms of their types,granularity, intelligence, and evaluation metrics, and summarize the corresponding representative works. Then, we compare the analysis methods used to learn enterprise financial risk, and finally summarize the spotlights of the most representative works. Our goal is to clarify current cutting-edge research and its possible future directions to model enterprise risk, aiming to fully understand the mechanisms of enterprise risk generation and contagion.