randomness
Agnostic Learning under Targeted Poisoning: Optimal Rates and the Role of Randomness
We study the problem of learning in the presence of an adversary that can corrupt an ฮท fraction of the training examples with the goal of causing failure on a specific test point. In the realizable setting, prior work established that the optimal error under such instance-targeted poisoning attacks scales as ฮ(dฮท), where d is the VC dimension of the hypothesis class [Hanneke, Karbasi, Mahmoody, Mehalel, and Moran (NeurIPS 2022)]. In this work, we resolve the corresponding question in the agnostic setting. We show that the optimal excess error is eฮ( dฮท), answering one of the main open problems left by Hanneke et al. To achieve this rate, it is necessary to use randomized learners: Hanneke et al. showed that deterministic learners can be forced to suffer error close to 1 even under small amounts of poisoning.
Generalization Bounds for Model-based Algorithm Configuration
Algorithm configuration, which involves selecting algorithm parameters based on sampled problem instances, is a crucial step in applying modern algorithms such as SAT solvers. Although prior work has attempted to understand the theoretical foundations of algorithm configuration, we still lack a comprehensive understanding of why practical algorithm configurators exhibit strong generalization performances in real-world scenarios. In this paper, through the lens of machine learning theory, we provide an algorithm-dependent generalization bound for the widely used model-based algorithm configurators under mild assumptions. Our approach is based on the algorithmic stability framework for generalization bounds. To the best of our knowledge, this is the first generalization bound that applies to a model closely approximating practical model-based algorithm configurators.
Combining Discrete Adversarial Training for LLMs
Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MIXAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MIXAT across a wide spectrum of state-of-theart attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MIXAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MIXAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MIXAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs.
Sinusoidal Initialization, Time for a New Start
Initialization plays a critical role in Deep Neural Network training, directly influencing convergence, stability, and generalization. Common approaches such as Glorot and He initializations rely on randomness, which can produce uneven weight distributions across layer connections. In this paper, we introduce the Sinusoidal initialization, a novel deterministic method that employs sinusoidal functions to construct structured weight matrices expressly to improve the spread and balance of weights throughout the network while simultaneously fostering a more uniform, well-conditioned distribution of neuron activation states from the very first forward pass. Because Sinusoidal initialization begins with weights and activations that are already evenly and efficiently utilized, it delivers consistently faster convergence, greater training stability, and higher final accuracy across a wide range of models, including convolutional neural networks, vision transformers, and large language models. On average, our experiments show an increase of 4.9% in final validation accuracy and 20.9% in convergence speed. By replacing randomness with structure, this initialization provides a stronger and more reliable foundation for Deep Learning systems.
Replicable Online pricing
We explore the concept of replicability, which ensures algorithmic consistency despite input data variations, for online pricing problems, specifically prophet inequalities and delegation. Given the crucial role of replicability in enhancing transparency in economic decision-making, we present a replicable and nearly optimal pricing strategy for prophet inequalities, achieving a sample complexity of poly(log |X|), where X is the ground set of distributions. Furthermore, we extend these findings to the delegation problem and establish lower bound that proves the necessity of the log |X| dependence. En route to obtaining these results, we develop a number of technical contributions which are of independent interest. Most notably, we propose a new algorithm for a variant of the heavy hitter problem, which has a nearly linear dependence on the inverse of the heavy hitter parameter, significantly improving upon existing results which have a cubic dependence.
Bi-Directional Communication-Efficient Stochastic FL via Remote Source Generation
The literature largely focuses on lossy compression of model updates in deterministic FL. In contrast, stochastic (Bayesian) FL considers distributions over parameters, enabling uncertainty quantification, better generalization, and, crucially, inherent communication-regularized training through a mirror-descent structure. In this paper, we consider both uplink and downlink communication in stochastic FL, and propose a communication framework based on remote source generation. Employing Minimal Random Coding (MRC) for remote generation, we allow the server and the clients to sample from local and global posteriors (sources), respectively, rather than transmitting locally sampled updates. The framework encompasses communication-regularized local optimization and principled compression of model updates, leveraging gradually updated prior distributions as side information. Through extensive simulations, we show that our method achieves 5 32 reduction in total communication cost while preserving accuracy. We further analyze the communication cost, refining existing MRC bounds and enabling precise quantification of uplink and downlink trade-offs. We also extend our method to conventional FL via stochastic quantization and prove a contraction property for the biased MRC compressor to facilitate convergence analysis.
Top-HDecoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in openended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-p (nucleus) sampling, and min-p sampling, aim to manage this trade-off.
Optimal Neural Compressors for the Rate-Distortion-Perception Tradeoff
Recent efforts in neural compression have focused on the rate-distortion-perception (RDP) tradeoff, where the perception constraint ensures the source and reconstruction distributions are close in terms of a statistical divergence. Theoretical work on RDP describes properties of RDP-optimal compressors without providing constructive and low complexity solutions. While classical rate-distortion theory shows that optimal compressors should efficiently pack space, RDP theory additionally shows that infinite randomness shared between the encoder and decoder may be necessary for RDP optimality. In this paper, we propose neural compressors that are low complexity and benefit from high packing efficiency through lattice coding and shared randomness through shared dithering over the lattice cells. For two important settings, namely infinite shared and zero shared randomness, we analyze the RDP tradeoff achieved by our proposed neural compressors and show optimality in both cases. Experimentally, we investigate the roles that these two components of our design, lattice coding and randomness, play in the performance of neural compressors on synthetic and real-world data. We observe that performance improves with more shared randomness and better lattice packing.
Replicable Distribution Testing
We initiate a systematic investigation of distribution testing in the framework of algorithmic replicability. Specifically, given independent samples from a collection of probability distributions, the goal is to characterize the sample complexity of replicably testing natural properties of the underlying distributions. On the algorithmic front, we develop new replicable algorithms for testing closeness and independence of discrete distributions. On the lower bound front, we develop a new methodology for proving sample complexity lower bounds for replicable testing that may be of broader interest. As an application of our technique, we establish near-optimal sample complexity lower bounds for replicable uniformity testing--answering an open question from prior work--and closeness testing.
Generalization Bound of Gradient Flow through Training Trajectory and Data-dependent Kernel
Gradient-based optimization methods have shown remarkable empirical success, yet their theoretical generalization properties remain only partially understood. In this paper, we establish a generalization bound for gradient flow that aligns with the classical Rademacher complexity bounds for kernel methods-specifically those based on the RKHS norm and kernel trace-through a data-dependent kernel called the loss path kernel (LPK).