Goto

Collaborating Authors

 Regression


Dependence Fidelity and Downstream Inference Stability in Generative Models

Riasat, Nazia

arXiv.org Machine Learning

Recent advances in generative AI have led to increasingly realistic synthetic data, yet evaluation criteria remain focused on marginal distribution matching. While these diagnostics assess local realism, they provide limited insight into whether a generative model preserves the multivariate dependence structures governing downstream inference. We introduce covariance-level dependence fidelity as a practical criterion for evaluating whether a generative distribution preserves joint structure beyond univariate marginals. We establish three core results. First, distributions can match all univariate marginals exactly while exhibiting substantially different dependence structures, demonstrating marginal fidelity alone is insufficient. Second, dependence divergence induces quantitative instability in downstream inference, including sign reversals in regression coefficients despite identical marginal behavior. Third, explicit control of covariance-level dependence divergence ensures stable behavior for dependence-sensitive tasks such as principal component analysis. Synthetic constructions illustrate how dependence preservation failures lead to incorrect conclusions despite identical marginal distributions. These results highlight dependence fidelity as a useful diagnostic for evaluating generative models in dependence-sensitive downstream tasks, with implications for diffusion models and variational autoencoders. These guarantees apply specifically to procedures governed by covariance structure; tasks requiring higher-order dependence such as tail-event estimation require richer criteria.


High-dimensional estimation with missing data: Statistical and computational limits

Verchand, Kabir Aladin, Pensia, Ankit, Haque, Saminul, Kuditipudi, Rohith

arXiv.org Machine Learning

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $ρ$, for any constant contamination $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/ρ^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.


Multi-way Interacting Regression via Factorization Machines

Neural Information Processing Systems

We propose a Bayesian regression method that accounts for multi-way interactions of arbitrary orders among the predictor variables. Our model makes use of a factorization mechanism for representing the regression coefficients of interactions among the predictors, while the interaction selection is guided by a prior distribution on random hypergraphs, a construction which generalizes the Finite Feature Model. We present a posterior inference algorithm based on Gibbs sampling, and establish posterior consistency of our regression model. Our method is evaluated with extensive experiments on simulated data and demonstrated to be able to identify meaningful interactions in applications in genetics and retail demand forecasting.


Linear regression without correspondence

Neural Information Processing Systems

This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time approximation scheme is given for the natural least squares optimization problem in any constant dimension. Next, in an average-case and noise-free setting where the responses exactly correspond to a linear function of i.i.d.


Regularized Modal Regression with Applications in Cognitive Impairment Prediction

Neural Information Processing Systems

Linear regression models have been successfully used to function estimation and model selection in high-dimensional data analysis. However, most existing methods are built on least squares with the mean square error (MSE) criterion, which are sensitive to outliers and their performance may be degraded for heavy-tailed noise. In this paper, we go beyond this criterion by investigating the regularized modal regression from a statistical learning viewpoint. A new regularized modal regression model is proposed for estimation and variable selection, which is robust to outliers, heavy-tailed noise, and skewed noise. On the theoretical side, we establish the approximation estimate for learning the conditional mode function, the sparsity analysis for variable selection, and the robustness characterization. On the application side, we applied our model to successfully improve the cognitive impairment prediction using the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort data.


Geometric Descent Method for Convex Composite Minimization

Neural Information Processing Systems

In this paper, we extend the geometric descent method recently proposed by Bubeck, Lee and Singh to tackle nonsmooth and strongly convex composite problems. We prove that our proposed algorithm, dubbed geometric proximal gradient method (GeoPG), converges with a linear rate $(1-1/\sqrt{\kappa})$ and thus achieves the optimal rate among first-order methods, where $\kappa$ is the condition number of the problem. Numerical results on linear regression and logistic regression with elastic net regularization show that GeoPG compares favorably with Nesterov's accelerated proximal gradient method, especially when the problem is ill-conditioned.


Efficient Sublinear-Regret Algorithms for Online Sparse Linear Regression with Limited Observation

Neural Information Processing Systems

Online sparse linear regression is the task of applying linear regression analysis to examples arriving sequentially subject to a resource constraint that a limited number of features of examples can be observed. Despite its importance in many practical applications, it has been recently shown that there is no polynomial-time sublinear-regret algorithm unless NP$\subseteq$BPP, and only an exponential-time sublinear-regret algorithm has been found. In this paper, we introduce mild assumptions to solve the problem.


Best Response Regression

Neural Information Processing Systems

In a regression task, a predictor is given a set of instances, along with a real value for each point. Subsequently, she has to identify the value of a new instance as accurately as possible. In this work, we initiate the study of strategic predictions in machine learning. We consider a regression task tackled by two players, where the payoff of each player is the proportion of the points she predicts more accurately than the other player. We first revise the probably approximately correct learning framework to deal with the case of a duel between two predictors. We then devise an algorithm which finds a linear regression predictor that is a best response to any (not necessarily linear) regression algorithm. We show that it has linearithmic sample complexity, and polynomial time complexity when the dimension of the instances domain is fixed. We also test our approach in a high-dimensional setting, and show it significantly defeats classical regression algorithms in the prediction duel. Together, our work introduces a novel machine learning task that lends itself well to current competitive online settings, provides its theoretical foundations, and illustrates its applicability.


A Bandit Framework for Strategic Regression

Neural Information Processing Systems

We consider a learner's problem of acquiring data dynamically for training a regression model, where the training data are collected from strategic data sources. A fundamental challenge is to incentivize data holders to exert effort to improve the quality of their reported data, despite that the quality is not directly verifiable by the learner. In this work, we study a dynamic data acquisition process where data holders can contribute multiple times.


A Minimax Approach to Supervised Learning

Neural Information Processing Systems

Given a task of predicting Y from X, a loss function L, and a set of probability distributions Gamma on (X,Y), what is the optimal decision rule minimizing the worst-case expected loss over Gamma? In this paper, we address this question by introducing a generalization of the maximum entropy principle. Applying this principle to sets of distributions with marginal on X constrained to be the empirical marginal, we provide a minimax interpretation of the maximum likelihood problem over generalized linear models as well as some popular regularization schemes. For quadratic and logarithmic loss functions we revisit well-known linear and logistic regression models. Moreover, for the 0-1 loss we derive a classifier which we call the minimax SVM. The minimax SVM minimizes the worst-case expected 0-1 loss over the proposed Gamma by solving a tractable optimization problem. We perform several numerical experiments to show the power of the minimax SVM in outperforming the SVM.