unbiased estimator
On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization
Zeroth-order optimization (ZOO) is an important framework for stochastic optimization when gradients are unavailable or expensive to compute. A potential limitation of existing ZOO methods is the bias inherent in most gradient estimators unless the perturbation stepsize vanishes. In this paper, we overcome this bias by proposing a novel family of unbiased gradient estimators based solely on function evaluations. By reformulating directional derivatives as a telescoping series and sampling from carefully designed distributions, we construct estimators that eliminate bias while maintaining favorable variance. We analyze their theoretical properties, derive optimal scaling distributions and perturbation stepsizes for four specific constructions, and prove that SGD using the proposed estimators achieves optimal complexity for smooth non-convex objectives. Experiments on synthetic tasks and language model fine-tuning confirm the superior accuracy and convergence of our approach compared to standard methods.
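The telescoping idea can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: a one-dimensional objective, the stepsize schedule `mu_n = mu0 * 2**-n`, the geometric level distribution `p_n = 2**-(n+1)`, and all function names. With forward differences $D_n = (f(x + \mu_n) - f(x))/\mu_n$, the sum $D_0 + \sum_{n \ge 0} (D_{n+1} - D_n)$ telescopes to $f'(x)$; drawing a single level $n$ and reweighting its term by $1/p_n$ gives an estimator whose expectation equals that sum whenever the series converges absolutely.

```python
import random


def forward_difference(f, x, mu):
    """Forward-difference quotient D = (f(x + mu) - f(x)) / mu."""
    return (f(x + mu) - f(x)) / mu


def level_probability(n):
    """Assumed geometric level distribution: p_n = 2^-(n+1), n = 0, 1, 2, ..."""
    return 2.0 ** -(n + 1)


def sample_level(rng):
    """Draw a level n with probability 2^-(n+1) by repeated fair coin flips."""
    n = 0
    while rng.random() >= 0.5:
        n += 1
    return n


def telescoping_estimate(f, x, mu0, n):
    """Single-term randomized-truncation estimate of f'(x) at level n.

    Returns D_0 + (D_{n+1} - D_n) / p_n, where mu_n = mu0 * 2^-n.
    Averaged over n ~ p_n, this equals D_0 + sum_n (D_{n+1} - D_n).
    """
    mu_n = mu0 * 2.0 ** -n
    mu_next = mu_n / 2.0
    d0 = forward_difference(f, x, mu0)
    correction = forward_difference(f, x, mu_next) - forward_difference(f, x, mu_n)
    return d0 + correction / level_probability(n)


if __name__ == "__main__":
    rng = random.Random(0)
    level = sample_level(rng)
    print(telescoping_estimate(lambda t: t * t, 1.5, 0.5, level))
```

For the quadratic $f(t) = t^2$ we have $D_n = 2x + \mu_n$, so under this particular schedule every level returns $2x$; for general smooth objectives the estimate is random, and its variance depends on how the stepsizes and level probabilities are matched.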
Covariances for Free: Exploiting Mean Distributions for Training-free Federated Learning
Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have explored training-free methods using first- and second-order statistics to aggregate local client data distributions at the server and achieve high performance without any training. In this work, we propose a training-free method based on an unbiased estimator of class covariance matrices which only uses first-order statistics in the form of class means communicated by clients to the server. We show how these estimated class covariances can be used to initialize the global classifier, thus exploiting the covariances without actually sharing them. We also show that using only within-class covariances results in a better classifier initialization. Our approach improves performance in the range of 4-26% with exactly the same communication cost when compared to methods sharing only class means and achieves performance competitive or superior to methods sharing second-order statistics with dramatically less communication overhead. The proposed method is much more communication-efficient than federated prompt-tuning methods and still outperforms them. Finally, using our method to initialize classifiers and then performing federated fine-tuning or linear probing again yields better performance.
Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning
Mohri, Mehryar, Schneider, Jon, Zhong, Yutao
The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences, showing that for a family of geometries inducing polynomial rewards, we can construct provably exact and unbiased estimators using U-statistics. Second, for the canonical KL divergence game where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal, achieving the fundamental statistical error limit of $\Theta(1/K^2)$, which we establish via the Ditzian-Totik theorem. Finally, we synthesize these two approaches to propose a novel Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator, proving that by systematically reducing variance, our method achieves not only optimal bias but also provably accelerated game convergence, leading to more efficient and stable training with zero online computational overhead.
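A toy example may help show why polynomial targets admit exactly unbiased estimators while plug-in estimates of nonlinear functions do not. The sketch below is not the paper's estimator: it assumes i.i.d. Bernoulli samples, takes the squared mean $p^2$ as the target, and uses hypothetical function names. Expectations are computed exactly by enumerating all $2^K$ outcomes, so no Monte Carlo noise is involved.

```python
import itertools


def plug_in_square(xs):
    """Biased estimate of (E[x])^2: square of the sample mean."""
    mean = sum(xs) / len(xs)
    return mean * mean


def u_statistic_square(xs):
    """Unbiased U-statistic for (E[x])^2: average of x_i * x_j over i != j."""
    k = len(xs)
    s = sum(xs)
    s2 = sum(x * x for x in xs)
    # sum over i != j of x_i * x_j equals s^2 - sum of squares
    return (s * s - s2) / (k * (k - 1))


def exact_expectation(estimator, p, k):
    """Exact expectation of `estimator` over k i.i.d. Bernoulli(p) samples."""
    total = 0.0
    for xs in itertools.product([0, 1], repeat=k):
        prob = 1.0
        for x in xs:
            prob *= p if x == 1 else 1.0 - p
        total += prob * estimator(xs)
    return total


if __name__ == "__main__":
    p, k = 0.3, 4
    print("target      :", p * p)
    print("plug-in mean:", exact_expectation(plug_in_square, p, k))
    print("U-statistic :", exact_expectation(u_statistic_square, p, k))
```

The plug-in estimate overshoots the target by exactly $p(1-p)/K$, the variance of the sample mean, whereas the U-statistic has expectation $p^2$ for every $K \ge 2$.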
Finite Population Regression Adjustment and Non-asymptotic Guarantees for Treatment Effect Estimation
The design and analysis of randomized experiments is fundamental to many areas, from the physical and social sciences to industrial settings. Regression adjustment is a popular technique to reduce the variance of estimates obtained from experiments, by utilizing information contained in auxiliary covariates. While there is a large literature within the statistics community studying various approaches to regression adjustment and their asymptotic properties, little focus has been given to approaches in the finite population setting with non-asymptotic accuracy bounds. Further, prior work typically assumes that an entire population is exposed to an experiment, whereas practitioners often seek to minimize the number of subjects exposed to an experiment, for ethical and pragmatic reasons. In this work, we study the problems of estimating the sample mean, individual treatment effects, and average treatment effect with regression adjustment. We propose approaches that use techniques from randomized numerical linear algebra to sample a subset of the population on which to perform an experiment. We give non-asymptotic accuracy bounds for our methods and demonstrate that they compare favorably with prior approaches.
Coupled Gradient Estimators for Discrete Latent Variables
Training models with discrete latent variables is challenging due to the high variance of unbiased gradient estimators. While low-variance reparameterization gradients of a continuous relaxation can provide an effective solution, a continuous relaxation is not always available or tractable. Dong et al. (2020) and Yin et al. (2020) introduced a performant estimator that does not rely on continuous relaxations; however, it is limited to binary random variables. We introduce a novel derivation of their estimator based on importance sampling and statistical couplings, which we extend to the categorical setting. Motivated by the construction of a stick-breaking coupling, we introduce gradient estimators based on reparameterizing categorical variables as sequences of binary variables and Rao-Blackwellization. In systematic experiments, we show that our proposed categorical gradient estimators provide state-of-the-art performance, whereas even with additional Rao-Blackwellization, previous estimators (Yin et al., 2019) underperform a simpler REINFORCE with a leave-one-out-baseline estimator (Kool et al., 2019).
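For reference, the REINFORCE estimator with a leave-one-out baseline mentioned above can be sketched for a single binary latent variable. This is a minimal illustration under stated assumptions, not the coupled estimators the abstract proposes: the variable is Bernoulli with probability `sigmoid(theta)`, the objective `f` is arbitrary, and all names are hypothetical. Each sample's baseline is the average of `f` over the other samples, which is independent of that sample and therefore leaves the estimator unbiased.

```python
import itertools
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def loo_reinforce(samples, f, theta):
    """Leave-one-out REINFORCE estimate of d/dtheta E_{b ~ Bern(sigmoid(theta))}[f(b)].

    For a Bernoulli variable with logit theta, the score function is
    d/dtheta log p(b) = b - sigmoid(theta).
    """
    k = len(samples)
    p = sigmoid(theta)
    values = [f(b) for b in samples]
    total = sum(values)
    grad = 0.0
    for b, v in zip(samples, values):
        baseline = (total - v) / (k - 1)  # mean of f over the other k - 1 samples
        grad += (v - baseline) * (b - p)
    return grad / k


def exact_expected_estimate(f, theta, k):
    """Exact expectation of the estimator, enumerating all 2^k sample tuples."""
    p = sigmoid(theta)
    expectation = 0.0
    for samples in itertools.product([0, 1], repeat=k):
        prob = math.prod(p if b == 1 else 1.0 - p for b in samples)
        expectation += prob * loo_reinforce(samples, f, theta)
    return expectation


def true_gradient(f, theta):
    """Closed form: sigmoid'(theta) * (f(1) - f(0))."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (f(1) - f(0))


if __name__ == "__main__":
    f = lambda b: (b - 0.45) ** 2
    print(exact_expected_estimate(f, 0.2, 3), true_gradient(f, 0.2))
```

Enumerating all sample tuples confirms that the estimator's expectation matches the closed-form gradient; the baseline changes only the variance.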
ROI Maximization in Stochastic Online Decision-Making: Supplementary Material
A Decision-Making Policies
In this section, we give a formal functional definition of the decision-making policies introduced in Section 3. During each task, the agent sequentially observes samples $x_i \in [-1,1]$ representing realizations of stochastic observations of the current innovation value. A map $\tau\colon [-1,1]^{\mathbb{N}} \to \mathbb{N}$ is a duration (of a decision task) if for all $x \in [-1,1]^{\mathbb{N}}$, its value $d = \tau(x) \in \mathbb{N}$ at $x$ depends only on the first $d$ components $x_1, x_2, \dots, x_d$ of $x = (x_1, x_2, \dots)$; mathematically speaking, if $X$ is a discrete stochastic process (i.e., a random sequence), then $\tau(X)$ is a stopping time with respect to the filtration generated by $X$. This definition reflects the fact that the components $x_1, x_2, \dots$ of the sequence $x = (x_1, x_2, \dots)$ are generated sequentially, and the decision to stop testing an innovation depends only on what occurred so far. A concrete example of a duration function is the one, mentioned in the introduction and formalized in (4), that keeps drawing samples until the empirical average of the observed values $x_i$ surpasses/falls below a certain threshold, or a maximum number of samples have been drawn.
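The threshold-based duration described above can be sketched as follows. The exact rule in (4) is not reproduced in this section, so the separate upper and lower thresholds, the parameter names, and the function name are assumptions for illustration. The function consumes the sequence one component at a time, so its return value depends only on the components it has read, which is the defining property of a duration.

```python
from typing import Iterable


def threshold_duration(xs: Iterable[float], lower: float, upper: float,
                       max_samples: int) -> int:
    """Number of samples drawn before stopping.

    Stops as soon as the running average is >= upper or <= lower,
    or once max_samples samples have been drawn.
    """
    total = 0.0
    d = 0
    for x in xs:
        d += 1
        total += x
        mean = total / d
        if mean >= upper or mean <= lower or d >= max_samples:
            return d
    raise ValueError("sequence ended before the stopping rule triggered")


if __name__ == "__main__":
    # The two sequences share their first two components, so the duration agrees.
    print(threshold_duration([0.2, 0.9, 1.0, -1.0], -0.5, 0.5, 10))
    print(threshold_duration([0.2, 0.9, -1.0, -1.0], -0.5, 0.5, 10))
```

Because the loop returns before reading later components, two sequences that agree on their first $d$ entries always yield the same value $d$.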
8 Supplementary Material
Calculation of T. Given data $D$, disaggregate $Y$ into $M$ equal-size bins, and denote the $m$-th bin by $B_m$, with $|B_m|$ the number of samples in $B_m$. For a distribution $p$ over $(V, A, Y)$ conditioned on $y$ in $B_m$, let $p_{V,A|y_m}$, $p_{V|y_m}$, and $p_{A|y_m}$ denote the joint distribution of $(V, A)$ and the marginal distributions of $V$ and $A$, respectively. As detailed in Section 5.1 of [33] and Algorithm 4 of [32], $U_m$ can be computed as a U-statistic. Specifically, [33] considers a kernel, indexed by the sample pair $(i, j)$ and evaluated at $(a, v)$, of the form $\mathbb{I}(A_i = a, V_i = v) - \mathbb{I}(A_i = a)\,\mathbb{I}(V_i = v)$, for the $i$-th and $j$-th samples in $D_t$.