Regression
Risk-Adaptive Approaches to Learning and Decision Making: A Survey
Uncertainty is prevalent in engineering design, statistical learning, and decision making broadly. Due to inherent risk-averseness and ambiguity about assumptions, it is common to address uncertainty by formulating and solving conservative optimization models expressed using measures of risk and related concepts. We survey the rapid development of risk measures over the last quarter century. From their beginning in financial engineering, we recount the spread to nearly all areas of engineering and applied mathematics. Solidly rooted in convex analysis, risk measures furnish a general framework for handling uncertainty with significant computational and theoretical advantages. We describe the key facts, list several concrete algorithms, and provide an extensive list of references for further reading. The survey recalls connections with utility theory and distributionally robust optimization, points to emerging application areas such as fair machine learning, and defines measures of reliability.
Vertical Federated Learning: Concepts, Advances and Challenges
Liu, Yang, Kang, Yan, Zou, Tianyuan, Pu, Yanhong, He, Yuanqin, Ye, Xiaozhou, Ouyang, Ye, Zhang, Ya-Qin, Yang, Qiang
Federated Learning (FL) [1] is a novel machine learning paradigm where multiple parties collaboratively build machine learning models without centralizing their data. The concept of FL was first proposed by Google in 2016 [2] to describe a cross-device scenario where millions of mobile devices are coordinated by a central server while local data are not transferred. This concept was soon extended to a cross-silo collaboration scenario among organizations [3], where a small number of reliable organizations join a federation to train a machine learning model. In [3], FL is, for the first time, divided into three categories based on how data is partitioned in the sample and feature space: Horizontal Federated Learning (HFL), Vertical Federated Learning (VFL) and Federated Transfer Learning (FTL) (see Figure 1). HFL refers to the FL setting where participants share the same feature space while holding different samples. For example, Google uses HFL to allow mobile phone users to use their datasets to collaboratively train a next-word prediction model [2]. VFL refers to the FL setting where datasets share the same samples/users while holding different features. For example, Webank uses VFL to collaborate with an invoice agency to build financial risk models for their enterprise customers [4].
Asymptotic Characterisation of Robust Empirical Risk Minimisation Performance in the Presence of Outliers
Vilucchio, Matteo, Troiani, Emanuele, Erba, Vittorio, Krzakala, Florent
We study robust linear regression in high dimension, when both the dimension $d$ and the number of data points $n$ diverge with a fixed ratio $\alpha=n/d$, and study a data model that includes outliers. We provide exact asymptotics for the performance of empirical risk minimisation (ERM) using $\ell_2$-regularised $\ell_2$, $\ell_1$, and Huber losses, which are the standard approach to such problems. We focus on two metrics for the performance: the generalisation error to similar datasets with outliers, and the estimation error of the original, unpolluted function. Our results are compared with the information-theoretic Bayes-optimal estimation bound. For the generalisation error, we find that optimally-regularised ERM is asymptotically consistent in the large sample complexity limit if one performs a simple calibration, and compute the rates of convergence. For the estimation error, however, we show that due to a norm calibration mismatch, the consistency of the estimator requires an oracle estimate of the optimal norm, or the presence of a cross-validation set not corrupted by the outliers. We examine in detail how performance depends on the loss function and on the degree of outlier corruption in the training set and identify a region of parameters where the optimal performance of the Huber loss is identical to that of the $\ell_2$ loss, offering insights into the use cases of different loss functions.
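To make the comparison between losses concrete, the following is a minimal sketch of $\ell_2$-regularised ERM with the squared and Huber losses, fitted by plain gradient descent. It is an illustration under stated assumptions, not the data model or the analysis of the abstract above: the dimension, sample size, outlier mechanism (every tenth label replaced by a large value), regularisation strength `lam`, and Huber threshold `delta` are all hypothetical choices.

```python
import random

def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss in the residual r: linear near 0, clipped beyond delta."""
    return max(-delta, min(delta, r))

def l2_grad(r):
    """Derivative of the squared loss r**2 / 2."""
    return r

def fit_erm(X, y, loss_grad, lam=0.01, lr=0.1, steps=500):
    """Gradient descent on (1/n) * sum_i loss(w.x_i - y_i) + (lam / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]  # gradient of the ridge penalty
        for xi, yi in zip(X, y):
            g = loss_grad(sum(wj * xj for wj, xj in zip(w, xi)) - yi)
            for j in range(d):
                grad[j] += g * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def distance(w, w_ref):
    """Euclidean distance between two weight vectors (estimation error)."""
    return sum((a - b) ** 2 for a, b in zip(w, w_ref)) ** 0.5

rng = random.Random(0)
w_true = [1.0, -2.0, 0.5]  # hypothetical ground-truth weights
X, y = [], []
for i in range(200):
    xi = [rng.gauss(0, 1) for _ in w_true]
    clean = sum(a * b for a, b in zip(w_true, xi)) + rng.gauss(0, 0.1)
    # Assumed outlier model: every tenth label is a large value unrelated to x.
    y.append(20.0 + rng.gauss(0, 1) if i % 10 == 0 else clean)
    X.append(xi)

w_l2 = fit_erm(X, y, l2_grad)
w_huber = fit_erm(X, y, huber_grad)
err_l2 = distance(w_l2, w_true)
err_huber = distance(w_huber, w_true)
print("estimation error, squared loss:", err_l2)
print("estimation error, Huber loss:  ", err_huber)
```

Because the Huber gradient is clipped at `delta`, each outlier can pull the estimate by only a bounded amount, whereas under the squared loss its pull grows with the size of the corruption.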
Hebbian learning inspired estimation of the linear regression parameters from queries
Schmidt-Hieber, Johannes, Koolen, Wouter M
Local learning rules in biological neural networks (BNNs) are commonly referred to as Hebbian learning. [26] links a biologically motivated Hebbian learning rule to a specific zeroth-order optimization method. In this work, we study a variation of this Hebbian learning rule to recover the regression vector in the linear regression model. Zeroth-order optimization methods are known to converge with suboptimal rate for large parameter dimension compared to first-order methods like gradient descent, and are therefore thought to be in general inferior. By establishing upper and lower bounds, we show, however, that such methods achieve near-optimal rates if only queries of the linear regression loss are available. Moreover, we prove that this Hebbian learning rule can achieve considerably faster rates than any non-adaptive method that selects the queries independently of the data.
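As an illustration of what estimating the regression vector "from queries" of the loss can look like, here is a minimal sketch of a generic two-point zeroth-order method for linear regression. It is not the Hebbian learning rule studied in the abstract above; the step size `lr`, smoothing radius `mu`, noise level, and the choice to query the same observation twice per step are assumptions made for this example.

```python
import random

def squared_loss(theta, x, y):
    """Loss query for one observation: (y - x.theta)^2."""
    return (y - sum(t * xj for t, xj in zip(theta, x))) ** 2

def zeroth_order_step(theta, x, y, rng, lr=0.01, mu=1e-3):
    """One update that uses two loss queries and never evaluates a gradient."""
    u = [rng.gauss(0, 1) for _ in theta]              # random probe direction
    probe = [t + mu * uj for t, uj in zip(theta, u)]  # perturbed parameter
    # Finite-difference estimate of the directional derivative along u.
    slope = (squared_loss(probe, x, y) - squared_loss(theta, x, y)) / mu
    return [t - lr * slope * uj for t, uj in zip(theta, u)]

def distance(a, b):
    """Euclidean distance between two parameter vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

rng = random.Random(0)
theta_true = [1.0, -2.0, 0.5]  # hypothetical regression vector
theta = [0.0, 0.0, 0.0]
initial_err = distance(theta, theta_true)

for _ in range(3000):
    # Each iteration sees one fresh observation (x, y) from the linear model.
    x = [rng.gauss(0, 1) for _ in theta_true]
    y = sum(a * b for a, b in zip(theta_true, x)) + rng.gauss(0, 0.1)
    theta = zeroth_order_step(theta, x, y, rng)

final_err = distance(theta, theta_true)
print("initial error:", initial_err)
print("final error:  ", final_err)
```

The update only needs the two scalar values returned by `squared_loss`, which is the sense in which the method is zeroth-order; a first-order method would instead use the gradient `2 * (x.theta - y) * x` directly.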
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Pacchiardi, Lorenzo, Chan, Alex J., Mindermann, Sören, Moscovitz, Ilan, Pan, Alexa Y., Gal, Yarin, Evans, Owain, Brauner, Jan
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
Beyond Log-Concavity: Theory and Algorithm for Sum-Log-Concave Optimization
This paper extends the classic theory of convex optimization to the minimization of functions that are equal to the negated logarithm of what we term a sum-log-concave function, i.e., a sum of log-concave functions. In particular, we show that such functions are in general not convex but still satisfy generalized convexity inequalities. These inequalities unveil the key importance of a certain vector that we call the cross-gradient and that is, in general, distinct from the usual gradient. Thus, we propose the Cross Gradient Descent (XGD) algorithm, which moves in the opposite direction of the cross-gradient, and derive a convergence analysis. As an application of our sum-log-concave framework, we introduce the so-called checkered regression method relying on a sum-log-concave function. This classifier extends (multiclass) logistic regression to non-linearly separable problems since it is capable of tessellating the feature space by using any given number of hyperplanes, creating a checkerboard-like pattern of decision regions.
Measurement Models For Sailboats Price vs. Features And Regional Areas
Weng, Jiaqi, Feng, Chunlin, Shao, Yihan
In this study, we investigated the relationship between sailboat technical specifications and their prices, as well as regional pricing influences. Utilizing a dataset encompassing characteristics like length, beam, draft, displacement, sail area, and waterline, we applied multiple machine learning models to predict sailboat prices. The gradient descent model demonstrated superior performance, producing the lowest MSE and MAE. Our analysis revealed that monohulled boats are generally more affordable than catamarans, and that certain specifications such as length, beam, displacement, and sail area directly correlate with higher prices. Interestingly, lower draft was associated with higher listing prices. We also explored regional price determinants and found that the United States tops the list in average sailboat prices, followed by Europe, Hong Kong, and the Caribbean. Contrary to our initial hypothesis, a country's GDP showed no direct correlation with sailboat prices. Utilizing a 50% cross-validation method, our models yielded consistent results across test groups. Our research offers a machine learning-enhanced perspective on sailboat pricing, aiding prospective buyers in making informed decisions.
Predicting the cardinality and maximum degree of a reduced Gröbner basis
Jamshidi, Shahrzad, Kang, Eric, Petrović, Sonja
We construct neural network regression models to predict key metrics of complexity for Gröbner bases of binomial ideals. This work illustrates why predictions with neural networks from Gröbner computations are not a straightforward process. Using two probabilistic models for random binomial ideals, we generate and make available a large data set that is able to capture sufficient variability in Gröbner complexity. We use this data to train neural networks and predict the cardinality of a reduced Gröbner basis and the maximum total degree of its elements. While the cardinality prediction problem is unlike classical problems tackled by machine learning, our simulations show that neural networks, providing performance statistics such as $r^2 = 0.401$, outperform a naive guess or multiple regression models with $r^2 = 0.180$.
A Weighted Prognostic Covariate Adjustment Method for Efficient and Powerful Treatment Effect Inferences in Randomized Controlled Trials
Vanderbeek, Alyssa M., Vidovszky, Anna A., Ross, Jessica L., Sabbaghi, Arman, Walsh, Jonathan R., Fisher, Charles K., the Critical Path for Alzheimer's Disease, the Alzheimer's Disease Neuroimaging Initiative, the European Prevention of Alzheimer's Disease Consortium, the Alzheimer's Disease Cooperative Study
A crucial task for a randomized controlled trial (RCT) is to specify a statistical method that can yield an efficient estimator and powerful test for the treatment effect. A novel and effective strategy to obtain efficient and powerful treatment effect inferences is to incorporate predictions from generative artificial intelligence (AI) algorithms into covariate adjustment for the regression analysis of an RCT. Training a generative AI algorithm on historical control data enables one to construct a digital twin generator (DTG) for RCT participants, which utilizes a participant's baseline covariates to generate a probability distribution for their potential control outcome. Summaries of the probability distribution from the DTG are highly predictive of the trial outcome, and adjusting for these features via regression can thus improve the quality of treatment effect inferences for an RCT, while satisfying regulatory guidelines on statistical analyses. However, a critical assumption in this strategy is homoskedasticity, or constant variance of the outcome conditional on the covariates. In the case of heteroskedasticity, existing covariate adjustment methods yield inefficient estimators and underpowered tests. We propose to address heteroskedasticity via a weighted prognostic covariate adjustment methodology (Weighted PROCOVA) that adjusts for both the mean and variance of the regression model using information obtained from the DTG. We prove that our method yields unbiased treatment effect estimators, and demonstrate via comprehensive simulation studies and case studies from Alzheimer's disease that it can reduce the variance of the treatment effect estimator, maintain the Type I error rate, and increase the power of the test for the treatment effect from 80% to between 85% and 90% when the variances from the DTG can explain 5% to 10% of the variation in the RCT participants' outcomes.
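The idea of adjusting for both a predicted mean and a predicted variance can be sketched with ordinary weighted least squares. The example below is a generic illustration, not the Weighted PROCOVA estimator itself: the simulated trial, the true treatment effect of 1.0, and the stand-ins for DTG output (`m` for the predicted control mean, `s` for the predicted standard deviation) are all hypothetical, and the predicted variance is assumed to equal the true outcome variance.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def weighted_least_squares(Z, y, w):
    """Minimise sum_i w_i * (y_i - z_i.beta)^2 via the weighted normal equations."""
    d = len(Z[0])
    A = [[sum(wi * zi[a] * zi[b] for zi, wi in zip(Z, w)) for b in range(d)]
         for a in range(d)]
    rhs = [sum(wi * zi[a] * yi for zi, yi, wi in zip(Z, y, w)) for a in range(d)]
    return solve(A, rhs)

rng = random.Random(0)
true_effect = 1.0  # hypothetical treatment effect used to simulate outcomes
Z, y, w = [], [], []
for _ in range(400):
    t = rng.randint(0, 1)      # randomised treatment assignment
    m = rng.gauss(0, 1)        # stand-in for the predicted control-outcome mean
    s = rng.uniform(0.5, 2.0)  # stand-in for the predicted outcome standard deviation
    y.append(true_effect * t + m + rng.gauss(0, s))
    Z.append([1.0, float(t), m])  # intercept, treatment indicator, prognostic score
    w.append(1.0 / s ** 2)        # inverse-variance weights

beta = weighted_least_squares(Z, y, w)
treatment_effect_estimate = beta[1]
print("estimated treatment effect:", treatment_effect_estimate)
```

Inverse-variance weights down-weight participants whose outcomes are predicted to be noisy, which is the standard way weighted least squares recovers efficiency under heteroskedasticity.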
Linked shrinkage to improve estimation of interaction effects in regression models
van de Wiel, Mark A., Amestoy, Matteo, Hoogland, Jeroen
We address a classical problem in statistics: adding two-way interaction terms to a regression model. As the covariate dimension increases quadratically, we develop an estimator that adapts well to this increase, while providing accurate estimates and appropriate inference. Existing strategies overcome the dimensionality problem by only allowing interactions between relevant main effects. Building on this philosophy, we implement a softer link between the two types of effects using a local shrinkage model. We empirically show that borrowing strength between the amount of shrinkage for main effects and their interactions can strongly improve estimation of the regression coefficients. Moreover, we evaluate the potential of the model for inference, which is notoriously hard for selection strategies. Large-scale cohort data are used to provide realistic illustrations and evaluations. Comparisons with other methods are provided. The evaluation of variable importance is not trivial in regression models with many interaction terms. Therefore, we derive a new analytical formula for the Shapley value, which enables rapid assessment of individual-specific variable importance scores and their uncertainties. Finally, while not targeting prediction, we do show that our models can be very competitive with a more advanced machine learner, such as a random forest, even for fairly large sample sizes. The implementation of our method in RStan is fairly straightforward, allowing for adjustments to specific needs.