log 1
Order-Optimal Sequential 1-Bit Mean Estimation in General Tail Regimes
In this paper, we study the problem of mean estimation under strict 1-bit communication constraints. We propose a novel adaptive mean estimator based solely on randomized threshold queries, where each 1-bit outcome indicates whether a given sample exceeds a sequentially chosen threshold. Our estimator is $(ε, δ)$-PAC for any distribution with a bounded mean $μ\in [-λ, λ]$ and a bounded $k$-th central moment $\mathbb{E}[|X-μ|^k] \le σ^k$ for any fixed $k > 1$. Crucially, our sample complexity is order-optimal in all such tail regimes, i.e., for every such $k$ value. For $k \neq 2$, our estimator's sample complexity matches the unquantized minimax lower bounds plus an unavoidable $O(\log(λ/σ))$ localization cost. For the finite-variance case ($k=2$), our estimator's sample complexity has an extra multiplicative $O(\log(σ/ε))$ penalty, and we establish a novel information-theoretic lower bound showing that this penalty is a fundamental limit of 1-bit quantization. We also establish a significant adaptivity gap: for both threshold queries and more general interval queries, the sample complexity of any non-adaptive estimator must scale linearly with the search space parameter $λ/σ$, rendering it vastly less sample efficient than our adaptive approach. Finally, we present algorithmic variants that (i) handle an unknown sampling budget, (ii) adapt to an unknown scale parameter~$σ$ given (possibly loose) bounds, and (iii) require only two stages of adaptivity at the expense of more complicated general 1-bit queries.
Conformal Prediction with Time-Series Data via Sequential Conformalized Density Regions
We propose a new conformal prediction method for time-series data with a guaranteed asymptotic conditional coverage rate, Sequential Conformalized Density Regions (SCDR), which is flexible enough to produce both prediction intervals and disconnected prediction sets, signifying the emergence of bifurcations. Our approach uses existing estimated conditional highest density predictive regions to form initial predictive regions. We then use a quantile random forest conformal adjustment to provide guaranteed coverage while adaptively changing to take the non-exchangeable nature of time-series data into account. We show that the proposed method achieves the guaranteed coverage rate asymptotically under certain regularity conditions. In particular, the method is doubly robust -- it works if the predictive density model is correctly specified and/or if the scores follow a nonlinear autoregressive model with the correct order specified. Simulations reveal that the proposed method outperforms existing methods in terms of empirical coverage rates and set sizes. We illustrate the method using two real datasets, the Old Faithful geyser dataset and the Australian electricity usage dataset. Prediction sets formed using SCDR for the geyser eruption durations include both single intervals and unions of two intervals, whereas existing methods produce wider, less informative, single-interval prediction sets.
- North America > United States > Iowa (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
- (3 more...)
Overfitting and Generalizing with (PAC) Bayesian Prediction in Noisy Binary Classification
Zhu, Xiaohan, Ohannessian, Mesrob I., Srebro, Nathan
We consider a PAC-Bayes type learning rule for binary classification, balancing the training error of a randomized ''posterior'' predictor with its KL divergence to a pre-specified ''prior''. This can be seen as an extension of a modified two-part-code Minimum Description Length (MDL) learning rule, to continuous priors and randomized predictions. With a balancing parameter of $λ=1$ this learning rule recovers an (empirical) Bayes posterior and a modified variant recovers the profile posterior, linking with standard Bayesian prediction (up to the treatment of the single-parameter noise level). However, from a risk-minimization prediction perspective, this Bayesian predictor overfits and can lead to non-vanishing excess loss in the agnostic case. Instead a choice of $λ\gg 1$, which can be seen as using a sample-size-dependent-prior, ensures uniformly vanishing excess loss even in the agnostic case. We precisely characterize the effect of under-regularizing (and over-regularizing) as a function of the balance parameter $λ$, understanding the regimes in which this under-regularization is tempered or catastrophic. This work extends previous work by Zhu and Srebro [2025] that considered only discrete priors to PAC Bayes type learning rules and, through their rigorous Bayesian interpretation, to Bayesian prediction more generally.
Stepwise Variational Inference with Vine Copulas
Griesbauer, Elisabeth, Rønneberg, Leiv, Frigessi, Arnoldo, Czado, Claudia, Haff, Ingrid Hobæk
We propose stepwise variational inference (VI) with vine copulas: a universal VI procedure that combines vine copulas with a novel stepwise estimation procedure of the variational parameters. Vine copulas consist of a nested sequence of trees built from copulas, where more complex latent dependence can be modeled with increasing number of trees. We propose to estimate the vine copula approximate posterior in a stepwise fashion, tree by tree along the vine structure. Further, we show that the usual backward Kullback-Leibler divergence cannot recover the correct parameters in the vine copula model, thus the evidence lower bound is defined based on the Rényi divergence. Finally, an intuitive stopping criterion for adding further trees to the vine eliminates the need to pre-define a complexity parameter of the variational distribution, as required for most other approaches. Thus, our method interpolates between mean-field VI (MFVI) and full latent dependence. In many applications, in particular sparse Gaussian processes, our method is parsimonious with parameters, while outperforming MFVI.
- Asia > Middle East > Jordan (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > Germany (0.04)
- (3 more...)
Computation-Utility-Privacy Tradeoffs in Bayesian Estimation
Chen, Sitan, Ding, Jingqiu, Majid, Mahbod, McKelvie, Walter
Bayesian methods lie at the heart of modern data science and provide a powerful scaffolding for estimation in data-constrained settings and principled quantification and propagation of uncertainty. Yet in many real-world use cases where these methods are deployed, there is a natural need to preserve the privacy of the individuals whose data is being scrutinized. While a number of works have attempted to approach the problem of differentially private Bayesian estimation through either reasoning about the inherent privacy of the posterior distribution or privatizing off-the-shelf Bayesian methods, these works generally do not come with rigorous utility guarantees beyond low-dimensional settings. In fact, even for the prototypical tasks of Gaussian mean estimation and linear regression, it was unknown how close one could get to the Bayes-optimal error with a private algorithm, even in the simplest case where the unknown parameter comes from a Gaussian prior. In this work, we give the first efficient algorithms for both of these problems that achieve mean-squared error $(1+o(1))\mathrm{OPT}$ and additionally show that both tasks exhibit an intriguing computational-statistical gap. For Bayesian mean estimation, we prove that the excess risk achieved by our method is optimal among all efficient algorithms within the low-degree framework, yet is provably worse than what is achievable by an exponential-time algorithm. For linear regression, we prove a qualitatively similar lower bound. Our algorithms draw upon the privacy-to-robustness framework of arXiv:2212.05015, but with the curious twist that to achieve private Bayes-optimal estimation, we need to design sum-of-squares-based robust estimators for inherently non-robust objects like the empirical mean and OLS estimator. Along the way we also add to the sum-of-squares toolkit a new kind of constraint based on short-flat decompositions.
- Europe (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Middle East > Israel (0.04)
- Asia > Middle East > Jordan (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.05)
- North America > United States > District of Columbia > Washington (0.05)
- Asia > Middle East > Israel > Haifa District > Haifa (0.05)
- (6 more...)