When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding
Speculative decoding accelerates language model inference by using a fast drafter to propose candidate tokens that are then verified by a larger target model. Existing theory largely studies the stochastic, distribution-preserving setting, where the goal is to sample exactly from the target distribution. In contrast, many practical systems use greedy decoding, relaxed acceptance rules, or tree-based candidate sets, where success is governed by local ranking and threshold events rather than exact distributional equality. We develop a theory for these regimes. We show that the rejection regions of many common acceptance criteria can be characterized as lower level sets of the target distribution. For these criteria, we characterize the exact KL divergence required for rejection, yielding exact certificates and sharp margin-based bounds for strict greedy decoding, additive and multiplicative relaxed acceptance, top-$m$ relaxed criteria, and entropy-thresholded acceptance. We then extend the framework to greedy tree decoding, deriving exact and margin-only certificates for when the target greedy token remains covered by the drafter's top-$m$ candidates. Finally, we evaluate the resulting certificates on Qwen3 models, showing that relaxed and tree-based criteria substantially enlarge the region of certified acceptance, especially on decoding steps where the target distribution has a low margin. These results complement existing distribution-preserving analyses of speculative decoding by characterizing the deterministic local acceptance events common in practical inference systems.
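The acceptance events named in this abstract can be illustrated with a small sketch. The exact definitions below are assumptions, since the abstract does not state them: strict greedy acceptance is taken to mean the draft token equals the target argmax, additive and multiplicative relaxation are taken to mean the draft token's target probability is within `eps` of, or at least a fraction `rho` of, the target maximum, and tree coverage is taken to mean the target argmax lies in the drafter's top-`m` set. All function and parameter names are hypothetical.

```python
def greedy_token(probs):
    """Index of the largest probability (ties go to the lowest index)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def strict_greedy_accept(target, token):
    """Accept only if the draft token is the target model's greedy token."""
    return token == greedy_token(target)


def additive_accept(target, token, eps):
    """Assumed additive rule: reject iff target[token] < max(target) - eps,
    so the rejection region is a lower level set of the target distribution."""
    return target[token] >= max(target) - eps


def multiplicative_accept(target, token, rho):
    """Assumed multiplicative rule: reject iff target[token] < rho * max(target)."""
    return target[token] >= rho * max(target)


def tree_covers_greedy(target, drafter, m):
    """Assumed tree rule: the target greedy token is among the drafter's top-m."""
    top_m = sorted(range(len(drafter)), key=lambda i: drafter[i], reverse=True)[:m]
    return greedy_token(target) in top_m


# A low-margin step: the target prefers token 0, the drafter prefers token 1.
target = [0.5, 0.4, 0.1]
drafter = [0.3, 0.6, 0.1]
draft = greedy_token(drafter)

print(strict_greedy_accept(target, draft))         # strict rule
print(additive_accept(target, draft, eps=0.15))    # additive relaxation
print(multiplicative_accept(target, draft, 0.7))   # multiplicative relaxation
print(tree_covers_greedy(target, drafter, m=2))    # top-2 tree coverage
```

In this toy step the strict rule rejects the draft while both relaxed rules and the top-2 tree accept it, which is the kind of low-margin situation the abstract highlights.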
Angular Constraint Embedding via SpherePair Loss for Constrained Clustering
However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle to learn discriminative Euclidean embeddings, which restricts their scalability and real-world applicability. To avoid these pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available at our repository.
Unveiling Extraneous Sampling Bias with Data Missing-Not-At-Random
Selection bias poses a widely recognized challenge for unbiased evaluation and learning in many industrial scenarios. For example, in recommender systems, it arises from the users' selective interactions with items. Recently, the doubly robust (DR) estimator and its variants have been widely studied to achieve debiased learning of prediction models. However, all of them consider a simple exact-matching scenario, i.e., the units (such as user-item pairs in a recommender system) are the same in the training and test sets. In practice, there may be limited or even no overlap in units between the training and test sets. In this paper, we consider a more practical scenario: the joint distribution of the feature and rating is the same in the training and test sets. Theoretical analysis shows that the previous DR estimator is biased in this scenario even if the imputed errors and learned propensities are correct. In addition, we propose a novel super-population doubly robust estimator (SuperDR), which achieves a more accurate estimation and a more desirable generalization error bound than the existing DR estimators, and we extend the joint learning algorithm for training the prediction and imputation models. We conduct extensive experiments on three real-world datasets, including a large-scale industrial dataset, to show the effectiveness of our method.
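For orientation, the standard doubly robust estimator that this abstract takes as its baseline can be sketched as follows. This is the usual textbook form (imputed error plus an inverse-propensity-weighted correction on observed units), not the SuperDR estimator, whose definition the abstract does not give. The function names and the use of `None` for unobserved errors are illustrative assumptions.

```python
def naive_estimate(errors, observed):
    """Average prediction error over observed units only (ignores selection bias)."""
    seen = [e for e, o in zip(errors, observed) if o]
    return sum(seen) / len(seen)


def doubly_robust_estimate(errors, imputed, observed, propensities):
    """Standard DR estimate over all units:
    mean of imputed_i + observed_i * (error_i - imputed_i) / propensity_i.
    `errors[i]` may be None when unit i is unobserved (it is then never read)."""
    total = 0.0
    for e, e_hat, o, p in zip(errors, imputed, observed, propensities):
        correction = (e - e_hat) / p if o else 0.0
        total += e_hat + correction
    return total / len(imputed)


# Four user-item pairs; only the first and third are observed.
errors = [1.0, None, 3.0, None]
observed = [1, 0, 1, 0]
propensities = [0.5, 0.5, 0.5, 0.5]

# With accurate imputations the correction terms vanish.
print(doubly_robust_estimate(errors, [1.0, 2.0, 3.0, 4.0], observed, propensities))
# With all-zero imputations the estimate reduces to inverse propensity scoring.
print(doubly_robust_estimate(errors, [0.0] * 4, observed, propensities))
```

Note that the average here runs over a fixed set of units, which is the exact-matching setting the abstract describes as too restrictive.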
Kernel-based potential mean-field games with unbiased random Fourier $U$-statistics
We study the subclass of potential mean-field games in which the running interaction cost and the terminal target cost are both expressed through reproducing-kernel maximum mean discrepancy (MMD) penalties, and develop a computational framework that exploits this kernel structure. Both costs are estimated from finite-sample empirical distributions using a random Fourier U-statistic representation that is unbiased and has linear cost in the batch size. The drift of the controlled diffusion is parametrized by a neural network and trained via stochastic gradient descent. For this subclass we prove a sample-level almost-sure convergence theorem and an explicit almost-sure rate of convergence, under coupled rate conditions on the penalty parameter, the random-feature count, the sample size, and the optimization tolerance. The framework includes the kernel-MMD-penalty Schrödinger bridge problem as the special case of a vanishing interaction cost. Numerical experiments illustrate the method on the Schrödinger bridge problem in dimensions up to one hundred, and on an electric vehicle charging coordination problem with per-vehicle physical heterogeneity, where an aggregate-demand congestion cost represents price-feedback competition at the population level and the terminal MMD penalty shapes the state-of-charge distribution at the deadline.
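One way to obtain an MMD estimate that is both a U-statistic and linear in the batch size is sketched below. It assumes a Gaussian kernel with random Fourier features of the form [cos(w·x), sin(w·x)]/sqrt(D); the paper's actual kernel, feature map and estimator may differ. Because each such feature vector has unit norm, the sum over pairs i ≠ j equals the squared norm of the feature sum minus the batch size, so no pairwise loop is needed. All names are hypothetical.

```python
import math
import random


def sample_frequencies(num_features, dim, bandwidth, rng):
    """Frequencies w ~ N(0, I / bandwidth^2), matching a Gaussian kernel."""
    return [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
            for _ in range(num_features)]


def feature_sum(points, freqs):
    """Sum over points of [cos(w.x), sin(w.x)] / sqrt(D): cost O(n * D * dim)."""
    scale = 1.0 / math.sqrt(len(freqs))
    total = [0.0] * (2 * len(freqs))
    for x in points:
        for j, w in enumerate(freqs):
            proj = sum(wi * xi for wi, xi in zip(w, x))
            total[2 * j] += scale * math.cos(proj)
            total[2 * j + 1] += scale * math.sin(proj)
    return total


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def mmd2_u_rff(xs, ys, freqs):
    """U-statistic estimate of MMD^2 under the random-feature kernel.
    Each feature vector has squared norm 1, so the i == j terms sum to n."""
    n, m = len(xs), len(ys)
    sx, sy = feature_sum(xs, freqs), feature_sum(ys, freqs)
    u_xx = (dot(sx, sx) - n) / (n * (n - 1))   # mean over pairs i != j in xs
    u_yy = (dot(sy, sy) - m) / (m * (m - 1))   # mean over pairs i != j in ys
    cross = dot(sx, sy) / (n * m)              # mean over all (x, y) pairs
    return u_xx + u_yy - 2.0 * cross


rng = random.Random(0)
freqs = sample_frequencies(256, 1, 1.0, rng)
same_a = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
same_b = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
shifted = [[rng.gauss(3.0, 1.0)] for _ in range(200)]
print(mmd2_u_rff(same_a, same_b, freqs))   # close to zero
print(mmd2_u_rff(same_a, shifted, freqs))  # clearly positive
```

Removing the diagonal terms is what distinguishes the U-statistic from the biased plug-in (V-statistic) estimate, and it allows the estimate to be slightly negative when the two samples come from the same distribution.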
Non-asymptotic quantisation of spherically symmetric distributions
Pronzato, Luc, Zhigljavsky, Anatoly
Zador's celebrated theorem is a cornerstone of optimal quantisation, establishing both the weak limit of the empirical distribution of an $n$-point optimal quantiser in $R^d$ and the decay rate of the associated $L_s$-mean quantisation error. However, for large dimensions $d$, observing this asymptotic behaviour demands an astronomically large sample size $n$, which grows super-exponentially with $d$. Through a detailed analysis of the quantisation problem for spherically symmetric distributions, we demonstrate that for moderate $n$ random quantisers uniformly distributed on a sphere of suitable radius $r$ achieve exceptional performance. The expected distortion, expressed as a triple integral, can be computed with arbitrary precision, and the optimal radius $r$ can be efficiently determined numerically. Leveraging results from extreme-value theory, we derive approximations for $r$, particularly in scenarios where $n$ scales with $d$. Depending on the growth rate of $n$, $r$ may either converge to zero or approach a limiting value that is independent of $s$.
A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
Hayase, Tomohiro, Karakida, Ryo
Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
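The two quantities this abstract relates can be computed directly for a single attention row. The sketch below assumes a simple reading of the gap-counting function, namely the number of competitors whose score lies within a given gap of the row maximum, and uses the Shannon entropy of the softmax at inverse temperature `beta`; the paper's precise definitions and its upper-tail accumulation scale are not reproduced here. Function names are hypothetical.

```python
import math


def gap_count(scores, gap):
    """Number of competitors whose score is within `gap` of the maximum.
    One zero gap (the maximum itself) is dropped before counting."""
    top = max(scores)
    gaps = sorted(top - s for s in scores)[1:]
    return sum(1 for g in gaps if g <= gap)


def attention_entropy(scores, beta):
    """Shannon entropy of softmax(beta * scores), computed stably."""
    top = max(scores)
    weights = [math.exp(beta * (s - top)) for s in scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


row = [2.0, 1.0, 1.0, -1.0]
print([gap_count(row, g) for g in (0.5, 1.0, 3.0)])
for beta in (0.0, 1.0, 10.0):
    print(beta, attention_entropy(row, beta))
```

For a fixed row the entropy is nonincreasing in `beta`, so a length-dependent choice of `beta` trades off how many near-maximal competitors keep non-negligible weight against full collapse onto the maximum.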
Optimal sequential tests yield log-optimal e-processes
It has been recently shown that e-processes are sufficient for sequential testing in the following sense: every level-$\alpha$ sequential test can be obtained by thresholding an e-process at $1/\alpha$. However, in the above result, neither does the test have to be asymptotically optimal (in terms of stopping times) nor does the e-process have to be asymptotically log-optimal. It has separately been shown that asymptotically log-optimal e-processes yield asymptotically optimal sequential tests. In this paper, we prove the converse, arguably completing the story: it is possible to aggregate asymptotically optimal sequential tests into asymptotically log-optimal e-processes. This is accomplished by using a new class of WAIT e-processes: those that are Weighted Aggregates of Indicators of stopping Times that begin at zero, are nondecreasing and increase to infinity under the alternative at the optimal rate. Importantly, the paper discusses several nuances in the varied definitions of asymptotic (log-)optimality.
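Both directions described in this abstract can be illustrated on a toy Bernoulli problem. The first two functions build a likelihood-ratio e-process for a simple null against a simple alternative and threshold it at 1/alpha, which gives a level-alpha test by Ville's inequality. The third function is only one plausible reading of the WAIT construction, assumed here to be a weighted sum of indicators scaled by 1/alpha_k with weights summing to at most one; the paper's actual definition may differ. All names are hypothetical.

```python
def likelihood_ratio_e_process(observations, p0, p1):
    """Running product of Bernoulli likelihood ratios (p1 vs. null p0)."""
    e, path = 1.0, []
    for x in observations:
        e *= (p1 if x else 1.0 - p1) / (p0 if x else 1.0 - p0)
        path.append(e)
    return path


def rejection_time(e_values, alpha):
    """First time (1-indexed) the e-process reaches 1/alpha, else None."""
    for t, e in enumerate(e_values, start=1):
        if e >= 1.0 / alpha:
            return t
    return None


def wait_e_process(stopping_times, alphas, weights, horizon):
    """Assumed WAIT form: E_t = sum_k w_k * 1{tau_k <= t} / alpha_k.
    It starts at zero and is nondecreasing in t by construction."""
    path = []
    for t in range(1, horizon + 1):
        value = sum(w / a
                    for tau, a, w in zip(stopping_times, alphas, weights)
                    if tau is not None and tau <= t)
        path.append(value)
    return path


ones = [1] * 10
e_path = likelihood_ratio_e_process(ones, p0=0.5, p1=0.75)
print(rejection_time(e_path, alpha=0.05))

# Aggregate two level-alpha_k tests that stopped at time 3 and never.
print(wait_e_process([3, None], alphas=[0.1, 0.01], weights=[0.5, 0.5], horizon=5))
```

Each summand in the assumed aggregate has expectation at most one under the null because a level-alpha_k test stops with probability at most alpha_k, so the weighted sum is again an e-process when the weights sum to at most one.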
Kernel-based guarantees for nonlinear parametric models in Bayesian optimization
Modern Bayesian optimization and adaptive sampling methods increasingly rely on nonlinear parametric models, yet theoretical guarantees for such models under adaptive data collection remain limited. Existing analyses largely focus on Gaussian processes, kernel machines, linear models, or linearized neural approximations, leaving a gap between theory and the nonlinear models used in practice. We develop a kernel-based framework for analyzing regularized nonlinear parametric models trained on adaptively collected data. Our approach uses kernels over the parameter space to induce reproducing-kernel Hilbert space structures over the corresponding model class, yielding confidence bounds for models trained with broad classes of regularized convex losses. We show how these bounds can support convergence guarantees for nonlinear acquisition and surrogate models, including randomized regularized policies that select points by maximizing a trained random model. These results provide a unified route to analyzing nonlinear parametric models in Bayesian optimization and related adaptive optimization settings.
fd8872fcba4ba87312cdfe5ebba91ca9-Supplemental-Conference.pdf
The appendix includes the missing proofs, detailed discussions of some arguments in the main body, and more numerical experiments. We organize the appendix as follows:
The proof of the infeasibility condition (Theorem 3.2) is provided in Section B.
Explanations of the conditions derived in Theorem 3.2 are included in Section C.
The proof of the properties of the proposed model (r)LogSpecT (Propositions 3.4 and 3.6) is given in Section D, where some additional properties are also discussed.
The proof details of Theorem 4.1 and Corollary 4.4, based on the truncated Hausdorff distance, are given in Section E.
Details of L-ADMM and its convergence analysis are in Section F.
Additional experiments and discussions on synthetic data are included in Section G.
Since the linear system (4) has no solution, we know from Farkas' lemma that the following system [...] Hence, S is also a solution to (13). However, (13) does not have a solution. We can conclude that rSpecT is infeasible in this case.
When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold
The central limit theorem (CLT) is a foundation of statistical inference: it provides the asymptotic distribution needed for confidence intervals, hypothesis tests, and efficiency comparisons [24, 42]. For iterate-averaged stochastic gradient methods, it specifies both a Gaussian limit and its sandwich covariance in a single theorem statement. This foundation now underpins inference in streaming and online settings (for example, online A/B testing, continual monitoring of treatment effects, and streaming M-estimation), where the estimator is updated one observation at a time and inference must be performed in real time. A line of recent work develops online inference procedures for averaged SGD [10, 23, 46]. In practice, one-pass stochastic optimization is routinely combined with adaptive preconditioning, which improves computational efficiency and is believed to sharpen the resulting Gaussian approximation in finite samples. If the CLT fails or the asymptotic variance is altered by the adaptive preconditioning, all downstream inference (coverage of confidence intervals, size of hypothesis tests, consistency of plug-in covariance estimators) is compromised. A rigorous understanding of when adaptive preconditioning preserves the CLT is, therefore, a prerequisite for reliable inference in these settings.