High-dimensional estimation with missing data: Statistical and computational limits

Verchand, Kabir Aladin, Pensia, Ankit, Haque, Saminul, Kuditipudi, Rohith

Mar-18-2026–arXiv.org Machine Learning

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $ρ$, for any constant contamination $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/ρ^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.

artificial intelligence, data quality, machine learning, (18 more...)

arXiv.org Machine Learning

Mar-18-2026

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - California (0.14)
- Europe > United Kingdom
  - England
    - Cambridgeshire > Cambridge (0.04)
    - Oxfordshire > Oxford (0.04)

Genre:
- Workflow (0.92)
- Research Report > New Finding (0.45)

Technology:
- Information Technology
  - Data Science > Data Quality (1.00)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Machine Learning
      - Computational Learning Theory (0.46)
      - Learning in High Dimensional Spaces (0.40)
      - Statistical Learning > Regression (0.34)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found