Regression
PackVFL: Efficient HE Packing for Vertical Federated Learning
Yang, Liu, Cai, Shuowei, Chai, Di, Zhang, Junxue, Tian, Han, Jin, Yilun, Guo, Kun, Chen, Kai, Yang, Qiang
As an essential tool of secure distributed machine learning, vertical federated learning (VFL) based on homomorphic encryption (HE) suffers from severe efficiency problems due to data inflation and time-consuming operations. To this core, we propose PackVFL, an efficient VFL framework based on packed HE (PackedHE), to accelerate the existing HE-based VFL algorithms. PackVFL packs multiple cleartexts into one ciphertext and supports single-instruction-multiple-data (SIMD)-style parallelism. We focus on designing a high-performant matrix multiplication (MatMult) method since it takes up most of the ciphertext computation time in HE-based VFL. Besides, devising the MatMult method is also challenging for PackedHE because a slight difference in the packing way could predominantly affect its computation and communication costs. Without domain-specific design, directly applying SOTA MatMult methods is hard to achieve optimal. Therefore, we make a three-fold design: 1) we systematically explore the current design space of MatMult and quantify the complexity of existing approaches to provide guidance; 2) we propose a hybrid MatMult method according to the unique characteristics of VFL; 3) we adaptively apply our hybrid method in representative VFL algorithms, leveraging distinctive algorithmic properties to further improve efficiency. As the batch size, feature dimension and model size of VFL scale up to large sizes, PackVFL consistently delivers enhanced performance. Empirically, PackVFL propels existing VFL algorithms to new heights, achieving up to a 51.52X end-to-end speedup. This represents a substantial 34.51X greater speedup compared to the direct application of SOTA MatMult methods.
Optimal Bias-Correction and Valid Inference in High-Dimensional Ridge Regression: A Closed-Form Solution
It was first introduced to data analysis by Hoerl (1959) and later formulated in Hoerl and Kennard (1970b,a) for providing a robust solution to some of the persistent challenges encountered in traditional linear regression techniques; see Hoerl (1985) for a nice review. Emerging as a fundamental technique in predictive modeling, ridge regression addresses issues such as multicollinearity and overfitting, which commonly afflict predictive models dealing with high-dimensional data. Since its inception, ridge regression's practical adoption persists due to its superior performance over the least-squares estimator in various scenarios, evident in applications across neuroscience, chemistry, biology, and economics; see Leonard et al. (2023), Zahrt et al. (2019), Otwinowski and Plotkin (2014), Giannone et al. (2021), and Abadie and Kasy (2019), among others, underscoring its empirical effectiveness. From a shrinkage perspective, the ridge estimator also dominates the least-squares solutions in the sense that its mean-squared errors (MSEs) can be smaller, which provides a reasonable explanation on the empirical effectiveness of ridge estimators. See Theobald (1974), Athey and Imbens (2019), Hastie (2020), Hansen (2022a), and a comprehensive introduction to ridge regression in van Wieringen (2023). The ridge estimator offers a closed-form expression that simplifies both theoretical and empirical analyses. It aligns with the dense modeling techniques of Giannone et al. (2021), which acknowledge the potential significance of all explanatory variables for prediction. Empirical studies, such as those in Giannone et al. (2021), indicate that dense models generally tend to outperform the sparse ones in out-of-sample economic prediction performance. Similarly, Abadie and Kasy (2019) find that the ridge estimators dominate the lasso and the pre-testing estimators in terms of the risks when the effects of different predictors on the dependent variable are "smoothly distributed".
M-DEW: Extending Dynamic Ensemble Weighting to Handle Missing Values
Catto, Adam, Jia, Nan, Salleb-Aouissi, Ansaf, Raja, Anita
Missing value imputation is a crucial preprocessing step for many machine learning problems. However, it is often considered as a separate subtask from downstream applications such as classification, regression, or clustering, and thus is not optimized together with them. We hypothesize that treating the imputation model and downstream task model together and optimizing over full pipelines will yield better results than treating them separately. Our work describes a novel AutoML technique for making downstream predictions with missing data that automatically handles preprocessing, model weighting, and selection during inference time, with minimal compute overhead. Specifically we develop M-DEW, a Dynamic missingness-aware Ensemble Weighting (DEW) approach, that constructs a set of two-stage imputation-prediction pipelines, trains each component separately, and dynamically calculates a set of pipeline weights for each sample during inference time. We thus extend previous work on dynamic ensemble weighting to handle missing data at the level of full imputation-prediction pipelines, improving performance and calibration on downstream machine learning tasks over standard model averaging techniques. M-DEW is shown to outperform the state-of-the-art in that it produces statistically significant reductions in model perplexity in 17 out of 18 experiments, while improving average precision in 13 out of 18 experiments.
Linguacodus: A Synergistic Framework for Transformative Code Generation in Machine Learning Pipelines
Trofimova, Ekaterina, Sataev, Emil, Ustyuzhanin, Andrey E.
In the ever-evolving landscape of machine learning, seamless translation of natural language descriptions into executable code remains a formidable challenge. This paper introduces Linguacodus, an innovative framework designed to tackle this challenge by deploying a dynamic pipeline that iteratively transforms natural language task descriptions into code through high-level data-shaping instructions. The core of Linguacodus is a fine-tuned large language model (LLM), empowered to evaluate diverse solutions for various problems and select the most fitting one for a given task. This paper details the fine-tuning process, and sheds light on how natural language descriptions can be translated into functional code. Linguacodus represents a substantial leap towards automated code generation, effectively bridging the gap between task descriptions and executable code. It holds great promise for advancing machine learning applications across diverse domains. Additionally, we propose an algorithm capable of transforming a natural description of an ML task into code with minimal human interaction. In extensive experiments on a vast machine learning code dataset originating from Kaggle, we showcase the effectiveness of Linguacodus. The investigations highlight its potential applications across diverse domains, emphasizing its impact on applied machine learning in various scientific fields.
Orthogonal Bootstrap: Efficient Simulation of Input Uncertainty
Liu, Kaizhao, Blanchet, Jose, Ying, Lexing, Lu, Yiping
Bootstrap is a popular methodology for simulating input uncertainty. However, it can be computationally expensive when the number of samples is large. We propose a new approach called \textbf{Orthogonal Bootstrap} that reduces the number of required Monte Carlo replications. We decomposes the target being simulated into two parts: the \textit{non-orthogonal part} which has a closed-form result known as Infinitesimal Jackknife and the \textit{orthogonal part} which is easier to be simulated. We theoretically and numerically show that Orthogonal Bootstrap significantly reduces the computational cost of Bootstrap while improving empirical accuracy and maintaining the same width of the constructed interval.
Online and Offline Robust Multivariate Linear Regression
Godichon-Baggioni, Antoine, Robin, Stephane S., Sansonnet, Laure
We consider the robust estimation of the parameters of multivariate Gaussian linear regression models. To this aim we consider robust version of the usual (Mahalanobis) least-square criterion, with or without Ridge regularization. We introduce two methods each considered contrast: (i) online stochastic gradient descent algorithms and their averaged versions and (ii) offline fix-point algorithms. Under weak assumptions, we prove the asymptotic normality of the resulting estimates. Because the variance matrix of the noise is usually unknown, we propose to plug a robust estimate of it in the Mahalanobis-based stochastic gradient descent algorithms. We show, on synthetic data, the dramatic gain in terms of robustness of the proposed estimates as compared to the classical least-square ones. Well also show the computational efficiency of the online versions of the proposed algorithms. All the proposed algorithms are implemented in the R package RobRegression available on CRAN.
A model-free subdata selection method for classification
Subdata selection is a study of methods that select a small representative sample of the big data, the analysis of which is fast and statistically efficient. The existing subdata selection methods assume that the big data can be reasonably modeled using an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the response and for both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini than a uniform subdata. Further, we demonstrate that the PED subdata has higher classification accuracy than other competing methods through extensive simulated and real datasets.
Application of Deep Learning for Factor Timing in Asset Management
Panda, Prabhu Prasad, Gharanchaei, Maysam Khodayari, Chen, Xilin, Lyu, Haoshu
The paper examines the performance of regression models (OLS linear regression, Ridge regression, Random Forest, and Fully-connected Neural Network) on the prediction of CMA (Conservative Minus Aggressive) factor premium and the performance of factor timing investment with them. Out-of-sample R-squared shows that more flexible models have better performance in explaining the variance in factor premium of the unseen period, and the back testing affirms that the factor timing based on more flexible models tends to over perform the ones with linear models. However, for flexible models like neural networks, the optimal weights based on their prediction tend to be unstable, which can lead to high transaction costs and market impacts. We verify that tilting down the rebalance frequency according to the historical optimal rebalancing scheme can help reduce the transaction costs.
Uncertainty quantification for iterative algorithms in linear models with application to early stopping
This paper investigates the iterates $\hbb^1,\dots,\hbb^T$ obtained from iterative algorithms in high-dimensional linear regression problems, in the regime where the feature dimension $p$ is comparable with the sample size $n$, i.e., $p \asymp n$. The analysis and proposed estimators are applicable to Gradient Descent (GD), proximal GD and their accelerated variants such as Fast Iterative Soft-Thresholding (FISTA). The paper proposes novel estimators for the generalization error of the iterate $\hbb^t$ for any fixed iteration $t$ along the trajectory. These estimators are proved to be $\sqrt n$-consistent under Gaussian designs. Applications to early-stopping are provided: when the generalization error of the iterates is a U-shape function of the iteration $t$, the estimates allow to select from the data an iteration $\hat t$ that achieves the smallest generalization error along the trajectory. Additionally, we provide a technique for developing debiasing corrections and valid confidence intervals for the components of the true coefficient vector from the iterate $\hbb^t$ at any finite iteration $t$. Extensive simulations on synthetic data illustrate the theoretical results.
Machine Learning based prediction of Vanadium Redox Flow Battery temperature rise under different charge-discharge conditions
D, Anirudh Narayan, Johar, Akshat, Kalra, Divye, Ardeshna, Bhavya, Bhattacharjee, Ankur
Accurate prediction of battery temperature rise is very essential for designing an efficient thermal management scheme. In this paper, machine learning (ML) based prediction of Vanadium Redox Flow Battery (VRFB) thermal behavior during charge-discharge operation has been demonstrated for the first time. Considering different currents with a specified electrolyte flow rate, the temperature of a kW scale VRFB system is studied through experiments. Three different ML algorithms; Linear Regression (LR), Support Vector Regression (SVR) and Extreme Gradient Boost (XGBoost) have been used for the prediction work. The training and validation of ML algorithms have been done by the practical dataset of a 1kW 6kWh VRFB storage under 40A, 45A, 50A and 60A charge-discharge currents and 10 L min-1 of flow rate. A comparative analysis among the ML algorithms is done in terms of performance metrics such as correlation coefficient (R2), mean absolute error (MAE) and root mean square error (RMSE). It is observed that XGBoost shows the highest accuracy in prediction of around 99%. The ML based prediction results obtained in this work can be very useful for controlling the VRFB temperature rise during operation and act as indicator for further development of an optimized thermal management system.