lemma 4
Anytime-Valid Federated Conformal RAG for LLM Swarms
Dubey, Prasanjit, Huo, Xiaoming
Federated Conformal RAG (FC-RAG) provides distribution-free coverage for a bandwidth-limited swarm of weak language models, but only at a fixed horizon. We extend it to anytime-valid sequential coverage: validity at every stopping time, preserved under predictable adaptive control (recalibration, per-node bandwidth escalation, distilled-student refresh), at no extra cost in assumptions over fixed-horizon FC-RAG. Naive composition fails because FC-RAG's marginal coverage bound makes the betting e-process a non-supermartingale on adverse calibration draws, and Ville's inequality cannot be invoked. We give Anytime-FC-RAG, a sequential extension built on a summable per-step calibration-deviation budget that converts the marginal bound into a strict conditional bound on a calibration-good event, paired with a truncated betting e-process that is a nonnegative supermartingale on the entire probability space. From these two ingredients, we obtain four guarantees: time-uniform alarm validity $\mathbb{P}(\sup_t E_t \ge 1/ฮด_e) \le ฮด_e + ฮด_{\mathrm{cal}}$, a Hoeffding-stitched cumulative-miscoverage envelope at the same total budget, safety under any predictable controller (recalibration, bandwidth escalation, student refresh), and training-side error propagation across an unbounded sequence of Federated Probe-Logit Distillation (FPLD) refreshes via a summable training budget. As a practical consequence, an adaptive controller that escalates retrieval bandwidth only when the e-process crosses a warning threshold matches the alarm rate of a fixed-high-bandwidth schedule at substantially lower communication cost. Experiments on a GPT-2-small + MiniLM swarm across MMLU, DBpedia, and AG News verify the predicted alarm rate, detection delay, envelope coverage, and $14$-$57\%$ bandwidth savings; the alarm fires when and only when coverage genuinely breaks.
Kernel-based potential mean-field games with unbiased random Fourier $U$-statistics
We study the subclass of potential mean-field games in which the running interaction cost and the terminal target cost are both expressed through reproducing-kernel maximum mean discrepancy (MMD) penalties, and develop a computational framework that exploits this kernel structure. Both costs are estimated from finite-sample empirical distributions using a random Fourier U-statistic representation that is unbiased and has linear cost in the batch size. The drift of the controlled diffusion is parametrized by a neural network and trained via stochastic gradient descent. For this subclass we prove a sample-level almost-sure convergence theorem and an explicit almost-sure rate of convergence, under coupled rate conditions on the penalty parameter, the random-feature count, the sample size, and the optimization tolerance. The framework includes the kernel-MMD-penalty Schrรถdinger bridge problem as the special case of a vanishing interaction cost. Numerical experiments illustrate the method on the Schrรถdinger bridge problem in dimensions up to one hundred, and on an electric vehicle charging coordination problem with per-vehicle physical heterogeneity, where an aggregate-demand congestion cost represents price-feedback competition at the population level and the terminal MMD penalty shapes the state-of-charge distribution at the deadline.
On Language Generation in the Limit with Bounded Memory
Kleinberg, Jon, Mehrotra, Anay, Saberi, Amin, Velegkas, Grigoris
We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an ``approximate'' version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.
Proper Agnostic Learning of Functions of Halfspaces under Gaussian Marginals
Tikhonov, Sergei, Vasilyan, Arsen
We study the problem of computationally efficient proper agnostic learning of multidimensional concept classes under the Gaussian distribution. In this setting, given i.i.d. labeled samples from an unknown distribution over $\mathbb{R}^d \times \{\pm 1\}$ whose marginal on $\mathbb{R}^d$ is Gaussian, the goal is to output a hypothesis from a target class $\mathcal{F}$ whose 0-1 loss is within $ฮต$ of that of the best classifier in $\mathcal{F}$. We give the first efficient proper agnostic learning algorithm for arbitrary Boolean functions of $K$ halfspaces under Gaussian marginals. Our algorithm runs in time $d^{O(K^2 \log(1/ฮต)/ฮต^2)} + (K/ฮต)^{O(K^3/ฮต^{2.5})}$. Prior to our work, the only known algorithm for $K \geq 2$ was brute-force search, with run-time exponential in $d$. Moreover, the dependence of our run-time on the dimension $d$ matches that of the best known improper learning algorithm, namely $d^{\widetilde{O}(K^2/ฮต^2)}$. For the special case of a single halfspace ($K=1$), the best previous run-time was $d^{O(1/ฮต^4)} + (1/ฮต)^{O(1/ฮต^6)}$. Our algorithm improves this to $d^{O(1/ฮต^2)} + (1/ฮต)^{O(1/ฮต^{2.5})}$. Once again, the dependence on $d$ matches that of the best known improper algorithm, namely $d^{O(1/ฮต^2)}$. Furthermore, the dependence of our run-time on the dimension $d$ is essentially optimal in the statistical query model.
Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy
Liu, Huikang, Selvi, Aras, Wiesemann, Wolfram
We design a class of additive noise mechanisms that satisfy \((\varepsilon, ฮด)\)-differential privacy (DP) for scalar, real-valued query functions with known sensitivities, with a particular focus on moderate and low-privacy regimes. These mechanisms, which we call \textit{mixture mechanisms}, are constructed by mixing multiple Gaussian distributions that share the same variance but differ in their means and mixture weights. The resulting distributions can be interpreted as convex combinations of a zero-mean Gaussian (as used in the analytic Gaussian mechanism) and additional Gaussians whose means depend on the sensitivity of the query function. We derive tight conditions on the variances required for \((\varepsilon, ฮด)\)-DP and provide efficient algorithms to compute them. Compared to the analytic Gaussian mechanism, our mechanisms yield substantially lower expected noise amplitudes (\(l_1\)-loss) and variances (\(l_2\)-loss for zero-mean distributions). In the low-privacy regime that motivates our design, our mechanisms approach optimality, mitigating nearly all of the optimality gap of the analytic Gaussian mechanism.
Convergence of empirical subgradients for optimal transport-based objectives
Optimal transport is widely used to learn distributions, enforce distributional constraints, and model uncertainty. In applications, transport losses are often computed from samples through tractable representations, such as one-dimensional sorting formulas or sliced Wasserstein costs, making them practical components in training pipelines. We study parameterized objectives defined by sampled transport costs and prove graphical convergence of their subdifferentials to the subdifferential of the population objective. In particular, this ensures that standard subgradient methods consistently approach stationary points of the population-level problem. We illustrate the results in several settings, including risk-averse optimization, fairness-constrained learning, and sliced Wasserstein problems. Our analysis highlights that smooth parameterizations provide a favorable interface between statistical consistency and optimization. By contrast, transport objectives with nonsmooth costs and models may exhibit unstable derivatives in the large-sample limit.
PAC Learning with Bandit Feedback: Sharp Sample Complexity in the Realizable Setting
Hanneke, Steve, Meng, Qinglin, Moran, Shay, Shaeiri, Amirreza
We study the problem of multiclass PAC learning with bandit feedback in the realizable setting. In this framework, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$, as in classical multiclass PAC learning, but the learner does not observe the labels of the i.i.d. training examples. Instead, in each round, it receives an unlabeled instance, predicts its label, and receives bandit feedback indicating only whether the prediction is correct. Despite this restriction, the goal remains the same as in classical PAC learning. We provide a general characterization of the optimal sample complexity of this problem, sharp for every concept class up to logarithmic factors. Our characterization is based on a new combinatorial dimension, termed the bandit $\mathrm{DS}$ dimension, defined via generalized combinatorial structures we call pseudo-boxes. These extend the pseudo-cubes underlying the $\mathrm{DS}$ dimension by allowing a different number of neighbors in each coordinate. In contrast to the $\mathrm{DS}$ dimension, which governs the full-information setting by counting the number of coordinates in the pseudo-cube, the bandit $\mathrm{DS}$ dimension aggregates the number of neighbors across coordinates, leading to a characterization in which the sample complexity scales with the total number of neighbors. We also propose a general learning algorithm achieving the upper bound, based on an algorithmic principle called ListCascade, which connects bandit learning to list learning and may be of independent interest.
Sample Complexity of Policy Gradient for Log-Growth Control
Pan, Qiuhua, Shen, Yukai, Zhang, Liwei, Chen, Cailian, Guan, Xinping
We study the sample complexity of policy gradient for log-growth control -- the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective $J(K) = \mathbb{E}[\log|1+BK|]$ is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain $K^*$ always places the noise singularity $b_{\rm sing}(K) = -1/K$ in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity $\tilde{O}(1/ฮท)$ when the noise density is known and $\tilde{O}(ฮท^{-(2s+1)/(2s)})$ when it must be estimated, for $C^s$ noise densities with $s \geq 2$.
Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias
Town, James, Boursier, Etienne, Lewis, Ben, Englert, Matthias, Lazic, Ranko
The successful training of neural networks hinges on the use of first order optimization methods, yet the theoretical characterization of these methods remains incomplete. This is especially true in settings with mild overparameterization. In this work, we study the gradient flow dynamics of two-layer ReLU networks from small initialization with orthogonal training data. We prove the limiting flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, revealing an incremental learning phenomenon in which a new neuron activates at each saddle. This analysis recovers the known result of Dana et al. (2025, arXiv:2502.16977) that the network interpolates the training data with high probability as soon as $m \gtrsim \log(n)$, where $m$ is the network width and $n$ is the number of training samples. This incremental process characterization also allows us to derive a novel implicit bias result: the learned interpolator has a squared $\ell_2$-norm scaling as $\sqrt{n}$, which is within a constant factor of the minimal $\ell_2$-norm interpolator. More broadly, our work provides the first rigorous proof of an incremental learning process for ReLU networks, whilst suggesting mildly overparameterized networks can converge to interpolating solutions whose complexity is of the same order as that of the optimal interpolator.
Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction
Sarkar, Dhruv, Sinha, Abhishek
We consider Constrained Online Convex Optimization (COCO) with adversarially chosen constraints. At each round, the learner chooses an action before observing the loss and constraint function for that round. The goal is to achieve small static regret against the best point satisfying all constraints while also controlling cumulative constraint violation ($\mathsf{CCV}$). For strongly convex losses, state-of-the-art algorithms achieve $O(\log T)$ regret and $O(\sqrt{T \log T})$ $\mathsf{CCV}.$ The corresponding best-known bounds for convex losses is $O(\sqrt{T})$ regret and $O(\sqrt{T} \log T)$ $\mathsf{CCV}$. In this paper, we give a simple projection-based algorithm that simultaneously achieves $O(\log T)$ regret and $O(\log T)$ $\mathsf{CCV}$ for strongly-convex losses, yielding an exponential improvement in the $\mathsf{CCV}$. For the convex losses, our algorithm improves the $\mathsf{CCV}$ to $O(\sqrt{T})$ while maintaining the optimal $O(\sqrt{T})$ regret. The key to our improvement is a recent geometric result for self-contracted curves, which may be of independent interest.