Regression
VREM-FL: Mobility-Aware Computation-Scheduling Co-Design for Vehicular Federated Learning
Ballotta, Luca, Fabbro, Nicolò Dal, Perin, Giovanni, Schenato, Luca, Rossi, Michele, Piro, Giuseppe
Assisted and autonomous driving are rapidly gaining momentum, and will soon become a reality. Among their key enablers, artificial intelligence and machine learning are expected to play a prominent role, also thanks to the massive amount of data that smart vehicles will collect from their onboard sensors. In this domain, federated learning is one of the most effective and promising techniques for training global machine learning models, while preserving data privacy at the vehicles and optimizing communications resource usage. In this work, we propose VREM-FL, a computation-scheduling co-design for vehicular federated learning that leverages mobility of vehicles in conjunction with estimated 5G radio environment maps. VREM-FL jointly optimizes the global model learned at the server while wisely allocating communication resources. This is achieved by orchestrating local computations at the vehicles in conjunction with the transmission of their local model updates in an adaptive and predictive fashion, by exploiting radio channel maps. The proposed algorithm can be tuned to trade model training time for radio resource usage. Experimental results demonstrate the efficacy of utilizing radio maps. VREM-FL outperforms literature benchmarks for both a linear regression model (learning time reduced by 28%) and a deep neural network for a semantic image segmentation task (doubling the number of model updates within the same time window).
Global Convergence of Online Identification for Mixed Linear Regression
Liu, Yujing, Liu, Zhixin, Guo, Lei
Mixed linear regression (MLR) is a powerful model for characterizing nonlinear relationships by utilizing a mixture of linear regression sub-models. The identification of MLR is a fundamental problem, where most of the existing results focus on offline algorithms, rely on independent and identically distributed (i.i.d) data assumptions, and provide local convergence results only. This paper investigates the online identification and data clustering problems for two basic classes of MLRs, by introducing two corresponding new online identification algorithms based on the expectation-maximization (EM) principle. It is shown that both algorithms will converge globally without resorting to the traditional i.i.d data assumptions. The main challenge in our investigation lies in the fact that the gradient of the maximum likelihood function does not have a unique zero, and a key step in our analysis is to establish the stability of the corresponding differential equation in order to apply the celebrated Ljung's ODE method. It is also shown that the within-cluster error and the probability that the new data is categorized into the correct cluster are asymptotically the same as those in the case of known parameters. Finally, numerical simulations are provided to verify the effectiveness of our online algorithms.
Perturbation-based Analysis of Compositional Data
Lundborg, Anton Rask, Pfister, Niklas
Existing statistical methods for compositional data analysis are inadequate for many modern applications for two reasons. First, modern compositional datasets, for example in microbiome research, display traits such as high-dimensionality and sparsity that are poorly modelled with traditional approaches. Second, assessing -- in an unbiased way -- how summary statistics of a composition (e.g., racial diversity) affect a response variable is not straightforward. In this work, we propose a framework based on hypothetical data perturbations that addresses both issues. Unlike existing methods for compositional data, we do not transform the data and instead use perturbations to define interpretable statistical functionals on the compositions themselves, which we call average perturbation effects. These average perturbation effects, which can be employed in many applications, naturally account for confounding that biases frequently used marginal dependence analyses. We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization and applying semiparametric estimation techniques. We analyze the proposed estimators empirically on simulated data and demonstrate advantages over existing techniques on US census and microbiome data. For all proposed estimators, we provide confidence intervals with uniform asymptotic coverage guarantees.
Learning to Simulate: Generative Metamodeling via Quantile Regression
Hong, L. Jeff, Hou, Yanxi, Zhang, Qingkai, Zhang, Xiaowei
Stochastic simulation models, while effective in capturing the dynamics of complex systems, are often too slow to run for real-time decision-making. Metamodeling techniques are widely used to learn the relationship between a summary statistic of the outputs (e.g., the mean or quantile) and the inputs of the simulator, so that it can be used in real time. However, this methodology requires the knowledge of an appropriate summary statistic in advance, making it inflexible for many practical situations. In this paper, we propose a new metamodeling concept, called generative metamodeling, which aims to construct a "fast simulator of the simulator". This technique can generate random outputs substantially faster than the original simulation model, while retaining an approximately equal conditional distribution given the same inputs. Once constructed, a generative metamodel can instantaneously generate a large amount of random outputs as soon as the inputs are specified, thereby facilitating the immediate computation of any summary statistic for real-time decision-making. Furthermore, we propose a new algorithm -- quantile-regression-based generative metamodeling (QRGMM) -- and study its convergence and rate of convergence. Extensive numerical experiments are conducted to investigate the empirical performance of QRGMM, compare it with other state-of-the-art generative algorithms, and demonstrate its usefulness in practical real-time decision-making.
Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels
Wang, Yikai, Fu, Yanwei, Sun, Xinwei
A noisy training set usually leads to the degradation of the generalization and robustness of neural networks. In this paper, we propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method, to model the linear relation between network features and one-hot labels. In SPR, the clean data are identified by the zero mean-shift parameters solved in the regression model. We theoretically show that SPR can recover clean data under some conditions. Under general scenarios, the conditions may be no longer satisfied; and some noisy data are falsely selected as clean data. To solve this problem, we propose a data-adaptive method for Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which is provable to control the False-Selection-Rate (FSR) in the selected clean data. To improve the efficiency, we further present a split algorithm that divides the whole training set into small pieces that can be solved in parallel to make the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR. Our code and pre-trained models are available at https://github.com/Yikai-Wang/Knockoffs-SPR.
A personalized Uncertainty Quantification framework for patient survival models: estimating individual uncertainty of patients with metastatic brain tumors in the absence of ground truth
Wang, Yuqi, Gupta, Aarzu, Carpenter, David, Mullikin, Trey, Reitman, Zachary J., Floyd, Scott, Kirkpatrick, John, Salama, Joseph K., Sperduto, Paul W., Liu, Jian-Guo, Bashir, Mustafa R., Lafata, Kyle J.
TodevelopanovelUncertaintyQuantification (UQ) framework to estimate the uncertainty of patient survival models in the absence of ground truth, we developed and evaluated our approach based on a dataset of 1383 patients treated with stereotactic radiosurgery (SRS) for brain metastases between January 2015 and December 2020. Our motivating hypothesis is that a time-to-event prediction of a test patient on inference is more certain given a higher feature-space-similarity to patients in the training set. Therefore, the uncertainty for a particular patient-of-interest is represented by the concordance index between a patient similarity rank and a prediction similarity rank. Model uncertainty was defined as the increased percentage of the max uncertainty-constrained-AUC compared to the model AUC. We evaluated our method on multiple clinically-relevant endpoints, including time to intracranial progression (ICP), progression-free survival (PFS) after SRS, overall survival (OS), and time to ICP and/or death (ICPD), on a variety of both statistical and non-statistical models, including CoxPH, conditional survival forest (CSF), and neural multi-task linear regression (NMTLR). Our results show that all models had the lowest uncertainty on ICP (2.21%) and the highest uncertainty (17.28%) on ICPD. OS models demonstrated high variation in uncertainty performance, where NMTLR had the lowest uncertainty(1.96%)and CSF had the highest uncertainty (14.29%). In conclusion, our method can estimate the uncertainty of individual patient survival modeling results. As expected, our data empirically demonstrate that as model uncertainty measured via our technique increases, the similarity between a feature-space and its predicted outcome decreases.
Learning sources of variability from high-dimensional observational studies
Bridgeford, Eric W., Chung, Jaewon, Gilbert, Brian, Panda, Sambit, Li, Adam, Shen, Cencheng, Badea, Alexandra, Caffo, Brian, Vogelstein, Joshua T.
Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estimands to outcomes with any number of dimensions or any measurable space, and formulates traditional causal estimands for nominal variables as causal discrepancy tests. We propose a simple technique for adjusting universally consistent conditional independence tests and prove that these tests are universally consistent causal discrepancy tests. Numerical experiments illustrate that our method, Causal CDcorr, leads to improvements in both finite sample validity and power when compared to existing strategies. Our methods are all open source and available at github.com/ebridge2/cdcorr.
The geometry of flow: Advancing predictions of river geometry with multi-model machine learning
Chang, Shuyu Y, Ghahremani, Zahra, Manuel, Laura, Erfani, Mohammad, Shen, Chaopeng, Cohen, Sagy, Van Meter, Kimberly, Pierce, Jennifer L, Meselhe, Ehab A, Goharian, Erfan
Hydraulic geometry parameters describing river hydrogeomorphic is important for flood forecasting. Although well-established, power-law hydraulic geometry curves have been widely used to understand riverine systems and mapping flooding inundation worldwide for the past 70 years, we have become increasingly aware of the limitations of these approaches. In the present study, we have moved beyond these traditional power-law relationships for river geometry, testing the ability of machine-learning models to provide improved predictions of river width and depth. For this work, we have used an unprecedentedly large river measurement dataset (HYDRoSWOT) as well as a suite of watershed predictor data to develop novel data-driven approaches to better estimate river geometries over the contiguous United States (CONUS). Our Random Forest, XGBoost, and neural network models out-performed the traditional, regionalized power law-based hydraulic geometry equations for both width and depth, providing R-squared values of as high as 0.75 for width and as high as 0.67 for depth, compared with R-squared values of 0.57 for width and 0.18 for depth from the regional hydraulic geometry equations. Our results also show diverse performance outcomes across stream orders and geographical regions for the different machine-learning models, demonstrating the value of using multi-model approaches to maximize the predictability of river geometry. The developed models have been used to create the newly publicly available STREAM-geo dataset, which provides river width, depth, width/depth ratio, and river and stream surface area (%RSSA) for nearly 2.7 million NHDPlus stream reaches across the rivers and streams across the contiguous US.
Target-Free Compound Activity Prediction via Few-Shot Learning
Eckmann, Peter, Anderson, Jake, Gilson, Michael K., Yu, Rose
Predicting the activities of compounds against protein-based or phenotypic assays using only a few known compounds and their activities is a common task in target-free drug discovery. Existing few-shot learning approaches are limited to predicting binary labels (active/inactive). However, in real-world drug discovery, degrees of compound activity are highly relevant. We study Few-Shot Compound Activity Prediction (FS-CAP) and design a novel neural architecture to meta-learn continuous compound activities across large bioactivity datasets. Our model aggregates encodings generated from the known compounds and their activities to capture assay information. We also introduce a separate encoder for the unknown compound. We show that FS-CAP surpasses traditional similarity-based techniques as well as other state of the art few-shot learning methods on a variety of target-free drug discovery settings and datasets.
Forecasting Auxiliary Energy Consumption for Electric Heavy-Duty Vehicles
Fan, Yuantao, Wang, Zhenkan, Pashami, Sepideh, Nowaczyk, Slawomir, Ydreskog, Henrik
Accurate energy consumption prediction is crucial for optimizing the operation of electric commercial heavy-duty vehicles, e.g., route planning for charging. Moreover, understanding why certain predictions are cast is paramount for such a predictive model to gain user trust and be deployed in practice. Since commercial vehicles operate differently as transportation tasks, ambient, and drivers vary, a heterogeneous population is expected when building an AI system for forecasting energy consumption. The dependencies between the input features and the target values are expected to also differ across sub-populations. One well-known example of such a statistical phenomenon is the Simpson paradox. In this paper, we illustrate that such a setting poses a challenge for existing XAI methods that produce global feature statistics, e.g. LIME or SHAP, causing them to yield misleading results. We demonstrate a potential solution by training multiple regression models on subsets of data. It not only leads to superior regression performance but also more relevant and consistent LIME explanations. Given that the employed groupings correspond to relevant sub-populations, the associations between the input features and the target values are consistent within each cluster but different across clusters. Experiments on both synthetic and real-world datasets show that such splitting of a complex problem into simpler ones yields better regression performance and interpretability.