 Regression


Connecting Federated ADMM to Bayes

arXiv.org Machine Learning

We provide new connections between two distinct federated learning approaches based on (i) ADMM and (ii) Variational Bayes (VB), and propose new variants by combining their complementary strengths. Specifically, we show that the dual variables in ADMM naturally emerge through the "site" parameters used in VB with isotropic Gaussian covariances. Using this, we derive two versions of ADMM from VB that use flexible covariances and functional regularisation, respectively. Through numerical experiments, we validate the improvements obtained in performance. The work shows a connection between two fields that are believed to be fundamentally different and combines them to improve federated learning. The goal of federated learning is to train a global model in the central server by using the data distributed over many local clients (McMahan et al., 2016). Such distributed learning improves privacy, security, and robustness, but is challenging due to the frequent communication needed to synchronise training among nodes. This is especially true when the data quality differs drastically from client to client and needs to be appropriately weighted. Designing new methods to deal with such challenges is an active area of research in federated learning. We focus on two distinct federated-learning approaches based on the Alternating Direction Method of Multipliers (ADMM) and Variational Bayes (VB), respectively. The ADMM approach synchronises the global and local models by using constrained optimisation and updates both primal and dual variables simultaneously.
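To make the roles of the primal and dual variables concrete, here is a minimal sketch of textbook consensus ADMM in scaled form. It is not the paper's federated variant: the least-squares client losses, the function name `consensus_admm`, the penalty `rho`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def consensus_admm(client_data, rho=1.0, n_iters=200):
    """Scaled-form consensus ADMM for a sum of local least-squares losses.

    client_data is a list of (A_k, b_k) pairs, one per client (hypothetical setup).
    """
    d = client_data[0][0].shape[1]
    z = np.zeros(d)                          # global (server) model
    w = [np.zeros(d) for _ in client_data]   # local primal variables
    u = [np.zeros(d) for _ in client_data]   # scaled dual variables
    for _ in range(n_iters):
        # Client step: minimise 0.5*||A w - b||^2 + (rho/2)*||w - z + u_k||^2.
        for k, (A, b) in enumerate(client_data):
            lhs = A.T @ A + rho * np.eye(d)
            rhs = A.T @ b + rho * (z - u[k])
            w[k] = np.linalg.solve(lhs, rhs)
        # Server step: average the dual-corrected local models.
        z = np.mean([w_k + u_k for w_k, u_k in zip(w, u)], axis=0)
        # Dual step: accumulate each client's disagreement with the server.
        u = [u_k + w_k - z for w_k, u_k in zip(w, u)]
    return z

# Synthetic example: four clients sharing one underlying linear model.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(4):
    A = rng.normal(size=(30, 3))
    b = A @ w_true + 0.1 * rng.normal(size=30)
    clients.append((A, b))

z_admm = consensus_admm(clients, rho=30.0, n_iters=300)

# Reference: least squares on the pooled data, which no client could compute alone.
A_all = np.vstack([A for A, _ in clients])
b_all = np.concatenate([b for _, b in clients])
w_pooled, *_ = np.linalg.lstsq(A_all, b_all, rcond=None)
```

In this sketch the server only ever sees `w_k + u_k` from each client, yet `z_admm` should approach the pooled solution `w_pooled`. The dual variables `u_k` are what carry the per-client correction that plain averaging would lose.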


Learning Curves for Decision Making in Supervised Machine Learning: A Survey

arXiv.org Artificial Intelligence

Learning curves are a concept from social sciences that has been adopted in the context of machine learning to assess the performance of a learning algorithm with respect to a certain resource, e.g., the number of training examples or the number of training iterations. Learning curves have important applications in several machine learning contexts, most notably in data acquisition, early stopping of model training, and model selection. For instance, learning curves can be used to model the performance of the combination of an algorithm and its hyperparameter configuration, providing insights into their potential suitability at an early stage and often expediting the algorithm selection process. Various learning curve models have been proposed to use learning curves for decision making. Some of these models answer the binary decision question of whether a given algorithm at a certain budget will outperform a certain reference performance, whereas more complex models predict the entire learning curve of an algorithm. We contribute a framework that categorises learning curve approaches using three criteria: the decision-making situation they address, the intrinsic learning curve question they answer and the type of resources they use. We survey papers from the literature and classify them into this framework.
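As a rough illustration of the binary decision question mentioned above, the following sketch extrapolates a partial learning curve and compares the prediction with a reference performance. The two-parameter power law `error = a * size**(-b)`, the helper names, and the numbers are assumptions made for this example; the survey covers many other curve models.

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Fit error ~ a * size**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return np.exp(intercept), -slope

def will_beat_reference(sizes, errors, budget, reference_error):
    """Extrapolate the fitted curve to `budget` and compare with the reference."""
    a, b = fit_power_law(sizes, errors)
    predicted = a * budget ** (-b)
    return predicted < reference_error, predicted

# Hypothetical observations: error rates measured at four small training-set sizes.
sizes = np.array([100.0, 200.0, 400.0, 800.0])
errors = 2.0 * sizes ** -0.5

decision, predicted = will_beat_reference(sizes, errors, 10_000, 0.05)
```

Here the observations follow the assumed power law exactly, so the extrapolation is trustworthy. With real curves the fit is noisy, and the decision should be treated as an estimate rather than a guarantee.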


Reviews: On the number of variables to use in principal component regression

Neural Information Processing Systems

In the paper, the authors discuss PCR, a well-known variant of regression models, and show the existence of a "double descent" phenomenon. The paper is technically sound and relatively well-written. I checked most of the math, and it is correct and reasonable to follow. I do have some concern that too much of the space is taken up by the algebra, which could make it difficult for readers to grasp the high-level intuition, especially if they do not have enough time to plough through the equations. Considering the space limit for a NeurIPS submission, I think it would be better to move some of the proofs to the appendix and add a discussion/conclusion section to highlight the intuitions.
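For readers unfamiliar with PCR, a minimal sketch of the textbook procedure follows: centre the data, keep the top `k` principal directions, and regress the response on the resulting scores. The function name `pcr_fit` and the synthetic data are assumptions for illustration, and the reviewed paper's exact setup may differ.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression using the top-k principal directions of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T
    scores = Xc @ V_k
    # The score columns are orthogonal with squared norms s**2, so the
    # least-squares coefficients decouple into one division per component.
    gamma = (scores.T @ (y - y_mean)) / (s[:k] ** 2)
    beta = V_k @ gamma                     # coefficients in the original feature space
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Synthetic example data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.1 * rng.normal(size=50)

beta_full, b0_full = pcr_fit(X, y, k=4)    # all components: same fit as OLS
beta_2, b0_2 = pcr_fit(X, y, k=2)          # truncated: a regularised fit
```

The number of retained components `k` is the quantity whose choice the paper studies. With `k` equal to the number of features, the sketch reduces to ordinary least squares.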


Review for NeurIPS paper: Spike and slab variational Bayes for high dimensional logistic regression

Neural Information Processing Systems

Additional Feedback: Restricted to the studied problem, I would love to see more comments on the advantage of VB over frequentist approaches using, say, penalized MLE. It is my understanding that the main advantage of VB is not in estimation/prediction but in inference (e.g., establishing confidence intervals)? If so, would establishing the validity of the confidence intervals derived by VB (i.e., Bernstein-von Mises type results) be more interesting? They are exceedingly clear to me and, combined with the other referees' comments on novelty, led me to raise my score further. Speaking of Bernstein-von Mises type results, in case the authors missed it, V. Spokoiny has made some very exciting progress in extending them to high dimensions in a general M-estimation framework; cf.


Review for NeurIPS paper: Spike and slab variational Bayes for high dimensional logistic regression

Neural Information Processing Systems

This paper seems to be a solid theoretical contribution to the area of Variational Bayes, and most of the reviewers' concerns were addressed satisfactorily in the rebuttal, provided the mentioned simulations (particularly vs Skinny Gibbs) and comparisons are included in the final version. We hope that the authors incorporate their rebuttal into the final version and expand the related work section.


Reviews: Manifold-regression to predict from MEG/EEG brain signals without source modeling

Neural Information Processing Systems

The theoretical sections of the paper appear sound, with the Riemannian approaches and their respective invariance properties being properly established. The authors also discuss multiple possible functions that could be applied to the signal powers to obtain the target variable, and prove that using a linear regression model with the Riemannian feature vectors would be optimal for the identity, log and square root of the signal power. However, they fail to discuss how often these types of scenarios occur in actual MEG/EEG datasets, and also how the performance would deteriorate in cases where a different function of the source signal powers is used. The construction of the toy dataset is well thought out to exploit the invariances provided by the Riemannian metrics and demonstrate their performance in the ideal scenario. But as mentioned previously, some additional toy examples that examine the performance of the different models in sub-optimal conditions would also be useful. In addition, it would be interesting to see how the performance of the log-diag model on the toy dataset is affected by the use of supervised spatial filters, or how the geometric distance changes when supervised or unsupervised spatial filters are used.


Reviews: Iterative Least Trimmed Squares for Mixed Linear Regression

Neural Information Processing Systems

The paper considers the problem of mixed linear regression: in this problem, an algorithm is given access to n data samples (x_i, y_i) with possibly corrupted labels, where each y_i is one of the m possible linear functions of x_i, i.e., y_i = x_i^T theta_j for some j in {1,...,m} (but the algorithm does not know which one). The goal of the algorithm is to determine the vectors theta_1,..., theta_m. A straightforward but computationally inefficient (the complexity is exponential in d) approach to solving this problem is to use Least Trimmed Squares (LTS), which tries to identify the best-fit vector (in terms of least squares) over all possible subsets of the data points of a particular, predefined size. To address this issue, the paper proposes an alternative, simple algorithm called Iterative Least Trimmed Squares (ILTS), which is similar to other algorithms that have been used for related problems in the literature, as acknowledged in the paper. The algorithm is essentially alternating minimization: it alternates between (1) finding the best set of a given size tau * n, given the least squares solution from the previous iteration and (2) solving least squares over the set determined in (1).
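The two alternating steps described in this review can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name `ilts`, the stopping rule, the warm start `theta0`, and the two-component synthetic data are assumptions.

```python
import numpy as np

def ilts(X, y, tau, theta0, n_iters=50):
    """Alternate between trimming to the tau*n best-fit points and refitting."""
    n = X.shape[0]
    k = int(round(tau * n))                 # size of the retained subset
    theta = np.asarray(theta0, dtype=float)
    subset = None
    for _ in range(n_iters):
        # Step (1): keep the k samples with the smallest squared residuals.
        residuals = (y - X @ theta) ** 2
        new_subset = np.sort(np.argsort(residuals)[:k])
        if subset is not None and np.array_equal(new_subset, subset):
            break                           # the retained set stopped changing
        subset = new_subset
        # Step (2): ordinary least squares on the retained samples only.
        theta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    return theta, subset

# Synthetic noise-free mixture: 70% of labels come from theta_1, 30% from theta_2.
rng = np.random.default_rng(0)
n, d = 200, 3
theta_1 = np.array([1.0, -2.0, 0.5])
theta_2 = np.array([-3.0, 1.0, 2.0])
X = rng.normal(size=(n, d))
component = (np.arange(n) >= 140).astype(int)
y = np.where(component == 0, X @ theta_1, X @ theta_2)

# Warm start close to theta_1 (an assumption of this example).
theta_hat, kept = ilts(X, y, tau=0.6, theta0=theta_1 + 0.01)
```

Each iteration costs one sort and one least-squares solve, in contrast to the subset search of exact LTS. Recovering the remaining components would require further runs, for example on the samples that were trimmed away.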


Reviews: Iterative Least Trimmed Squares for Mixed Linear Regression

Neural Information Processing Systems

This paper studies mixed linear regression and gives a number of results. Under various deterministic conditions, the authors show that, given a sufficiently warm start, iterative trimmed least squares converges quickly to the true directions. Their algorithm continues to work in the presence of adversarial corruptions. However, the warm start is required to be quite close to the true solution. They give an SVD-based initialization procedure that works in the noiseless setting when the examples come from a Gaussian distribution.


Reconciling Predictive Multiplicity in Practice

arXiv.org Artificial Intelligence

Many machine learning applications predict individual probabilities, such as the likelihood that a person develops a particular illness. Since these probabilities are unknown, a key question is how to address situations in which different models trained on the same dataset produce varying predictions for certain individuals. This issue is exemplified by the model multiplicity (MM) phenomenon, where a set of comparable models yield inconsistent predictions. Roth, Tolbert, and Weinstein recently introduced a reconciliation procedure, the Reconcile algorithm, to address this problem. Given two disagreeing models, the algorithm leverages their disagreement to falsify and improve at least one of the models. In this paper, we empirically analyze the Reconcile algorithm using five widely-used fairness datasets: COMPAS, Communities and Crime, Adult, Statlog (German Credit Data), and the ACS Dataset. We examine how Reconcile fits within the model multiplicity literature and compare it to existing MM solutions, demonstrating its effectiveness. We also discuss potential improvements to the Reconcile algorithm theoretically and practically. Finally, we extend the Reconcile algorithm to the setting of causal inference, given that different competing estimators can again disagree on specific conditional average treatment effect (CATE) values. We present the first extension of the Reconcile algorithm in causal inference, analyze its theoretical properties, and conduct empirical tests. Our results confirm the practical effectiveness of Reconcile and its applicability across various domains.


Statistical Inference for Low-Rank Tensor Models

arXiv.org Machine Learning

Statistical inference for tensors has emerged as a critical challenge in analyzing high-dimensional data in modern data science. This paper introduces a unified framework for inferring general and low-Tucker-rank linear functionals of low-Tucker-rank signal tensors for several low-rank tensor models. Our methodology tackles two primary goals: achieving asymptotic normality and constructing minimax-optimal confidence intervals. By leveraging a debiasing strategy and projecting onto the tangent space of the low-Tucker-rank manifold, we enable inference for general and structured linear functionals, extending far beyond the scope of traditional entrywise inference. Specifically, in the low-Tucker-rank tensor regression or PCA model, we establish the computational and statistical efficiency of our approach, achieving near-optimal sample size requirements (in regression model) and signal-to-noise ratio (SNR) conditions (in PCA model) for general linear functionals without requiring sparsity in the loading tensor. Our framework also attains both computationally and statistically optimal sample size and SNR thresholds for low-Tucker-rank linear functionals. Numerical experiments validate our theoretical results, showcasing the framework's utility in diverse applications. This work addresses significant methodological gaps in statistical inference, advancing tensor analysis for complex and high-dimensional data environments.