theoretical result
Capturing Individual Human Preferences with Reward Features
Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users.
The Adaptive Complexity of Minimizing Relative Fisher Information
Non-log-concave sampling from an unnormalized density is fundamental in machine learning and statistics. As datasets grow larger, computational efficiency becomes increasingly important, particularly in reducing adaptive complexity, namely the number of sequential rounds required for sampling algorithms. In this work, we initiate the study of the adaptive complexity of non-log-concave sampling within the framework of relative Fisher information introduced by Balasubramanian et al. in 2022. To obtain a relative Fisher information of at most $\varepsilon^2$ from the target distribution, we propose a novel algorithm that reduces the adaptive complexity from $O(d^2/\varepsilon^4)$ to $O(d/\varepsilon^2)$ by leveraging parallelism. Furthermore, we show our algorithm is optimal for a specific regime of large $\varepsilon$. Our algorithm builds on a diagonally parallelized Picard iteration, while the lower bound is based on a reduction from the problem of finding stationary points.
On the Robustness of Transformers against Context Hijacking for Linear Classification
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical and real-world experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.
A Derivation of D1
Denote the logit vector as $x$; we have $p_j = e^{x_j} / \sum_k e^{x_k}$
Without the zero-mean constraint, training becomes unstable. Following the training setting of [23], the classifier network is trained with SGD with a weight decay of 5e-4, an initial learning rate of 1e-1, and a mini-batch size of 100 for all methods. We use the cosine learning rate decay schedule [49] for a total of 80 epochs. We set the outer level learning rate $\eta_\omega$ as …
Figure 7: Training curve without zero-mean constraint on CIFAR10 under 40% uniform noise.
The MLP weighting network is trained with Adam [51] with a fixed learning rate of 1e-3 and a weight decay of 1e-4.
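The cosine decay described above can be sketched as follows. This is a minimal illustration, not the authors' training code: it assumes the rate is annealed once per epoch, from the stated initial value of 1e-1 down to a hypothetical floor `min_lr = 0`, with no warm-up or restarts. The function name `cosine_lr` and its parameters are illustrative.

```python
import math


def cosine_lr(epoch, total_epochs=80, base_lr=1e-1, min_lr=0.0):
    """Cosine-annealed learning rate at the start of `epoch` (0-indexed).

    Assumptions (not from the source): per-epoch annealing, no warm-up,
    no restarts, and a floor of `min_lr`.
    """
    progress = epoch / total_epochs  # fraction of training completed, in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate used in each of the 80 epochs.
schedule = [cosine_lr(e) for e in range(80)]
```

The schedule starts at the initial rate, reaches half of it at the midpoint (epoch 40), and decays smoothly towards the floor at the end of training.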
Approximations for the computation of m
Providing a very low critical probability $p_c$ means that certification occurs when the simulation ends after a large number of iterations $m$. We introduce $\ell_c$, the threshold associated with $p_c$ s.t. … Table 5 shows that this approximation is excellent even for large $p_c$. This shows that $m$ is a little larger than $m_c = \log(p_c)/\log(1 - 1/N)$. This section assumes that $X = x_o + \sigma \tilde{X}$ with $\tilde{X} \sim \mathcal{N}(0_n, I_n)$ and that $h(x) = x^\top g - \tau$ with $g \in \mathbb{R}^n$ and $\|g\| = 1$ (w.l.o.g.).
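As a quick numerical illustration of the quantity $m_c = \log(p_c)/\log(1 - 1/N)$, the sketch below evaluates it for given $p_c$ and $N$. The function name `critical_iterations` and the argument name `n` are hypothetical; this excerpt does not define $N$, so the code simply treats it as an integer of at least 2.

```python
import math


def critical_iterations(p_c, n):
    """Evaluate m_c = log(p_c) / log(1 - 1/n).

    `p_c` must lie strictly between 0 and 1, and `n` must be at least 2
    so that log(1 - 1/n) is finite and negative.
    """
    if not 0.0 < p_c < 1.0:
        raise ValueError("p_c must be in (0, 1)")
    if n < 2:
        raise ValueError("n must be at least 2")
    # log1p(-1/n) computes log(1 - 1/n) accurately when n is large.
    return math.log(p_c) / math.log1p(-1.0 / n)


m_c = critical_iterations(1e-10, 100)
```

Both logarithms are negative, so $m_c$ is positive. For large $N$ it is close to, and slightly below, $-N \log p_c$, which makes explicit why a very small $p_c$ forces a large number of iterations.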
Supplement to "Uniform Concentration Bounds toward a Unified Framework for Robust Clustering"
For the theoretical exposition, we first establish the following lemmas. Lemma A.1 proves that the derivative of the function $\phi$ is bounded in the $\ell_2$-norm when the domain is restricted to the support of $P$.
Lemma A.1. …
Lemma A.3 proves that the function $f_\Theta$, as a function of $\Theta$, is Lipschitz with respect to the $\|\cdot\|$ norm.
Joint first authors contributed equally. Corresponding author. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).
Thus, from equation (1),
$$\langle \nabla\phi(P_C(\theta)) - \nabla\phi(\theta),\; x - P_C(\theta) \rangle \ge 0. \tag{2}$$
We now observe that
$$d_\phi(x,\theta) - d_\phi(x,P_C(\theta)) - d_\phi(P_C(\theta),\theta) = \langle \nabla\phi(P_C(\theta)) - \nabla\phi(\theta),\; x - P_C(\theta) \rangle \ge 0.$$
Hence the result.
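The three-point identity and the resulting inequality can be checked numerically in a simple special case. The sketch below is only an illustration under stated assumptions, not the setting of the supplement: it takes $\phi(x) = \tfrac{1}{2}\|x\|_2^2$, so that $d_\phi(x, y) = \tfrac{1}{2}\|x - y\|_2^2$ and $\nabla\phi(x) = x$, and it takes $C$ to be the Euclidean unit ball, so that $P_C$ is the usual Euclidean projection. All function names are hypothetical.

```python
import math


def bregman_sq(a, b):
    """Bregman divergence of phi(x) = 0.5 * ||x||^2, i.e. 0.5 * ||a - b||^2."""
    return 0.5 * sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def project_unit_ball(theta):
    """Euclidean (here also Bregman) projection onto the unit ball C."""
    norm = math.sqrt(sum(t * t for t in theta))
    return list(theta) if norm <= 1.0 else [t / norm for t in theta]


def three_point_terms(x, theta):
    """Return both sides of the identity for a point x in C.

    lhs = d(x, theta) - d(x, P) - d(P, theta)
    rhs = <grad phi(P) - grad phi(theta), x - P> = <P - theta, x - P>
    """
    p = project_unit_ball(theta)
    lhs = bregman_sq(x, theta) - bregman_sq(x, p) - bregman_sq(p, theta)
    rhs = sum((pi - ti) * (xi - pi) for pi, ti, xi in zip(p, theta, x))
    return lhs, rhs


lhs, rhs = three_point_terms([0.5, 0.0], [2.0, 0.0])
```

For `x = [0.5, 0.0]` inside the ball and `theta = [2.0, 0.0]` outside it, the projection is `[1.0, 0.0]`, both sides agree, and the common value is non-negative, as the inequality requires.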
lower bound
While there remains a small gap between our main lower bound of Theorem 3 and the deterministic quantised gradient descent of Section 6, we can show that the gap cannot be closed by improved deterministic algorithms where the coordinator learns the value of the objective function $F(x)$ in addition to the minimiser $x$. That is, our quantised gradient descent is the communication-optimal deterministic algorithm for variant (1) for objectives with constant condition number.
Recall that in the $N$-player equality over a universe of size $d$, denoted by $\mathrm{EQ}_{d,N}$, each player $i$ is given an input $b_i \in \{0,1\}^d$, and the task is to decide if all players have the same input. It is known [33] that the deterministic communication complexity of $\mathrm{EQ}_{d,N}$ is $\mathrm{CC}(\mathrm{EQ}_{d,N}) = \Omega(Nd)$.
Theorem 8. Given parameters $N$, $d$, $\varepsilon$, $\beta_0$ and $\beta = \beta_0 N$ satisfying $d\beta/\varepsilon = \Omega(1)$, any deterministic protocol solving (1) for quadratic input functions $x \mapsto \beta_0 \|x - x_0\|_2^2$ has communication complexity $\Omega(Nd \log(\beta d/\varepsilon))$, if the coordinator is also required to output an estimate $r \in \mathbb{R}$ for the minimum function value such that …
Assume $\Pi$ is a deterministic protocol solving (1) with communication complexity $C$. We show that $\Pi$ can then solve $N$-party equality over a universe of size $D = \Omega(d \log(\beta d/\varepsilon))$, implying $C = \Omega(ND) = \Omega(Nd \log(\beta d/\varepsilon))$. More specifically, let $S$ be the set given by Lemma 2 with $\delta = (2\varepsilon/\beta)^{1/2}$, and let $D = \lceil \log|S| \rceil = \Omega(d \log(\beta d/\varepsilon))$. Note that since we assume $d\beta/\varepsilon = \Omega(1)$, the set $S$ has at least two elements and $D \ge 1$.
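To make the $N$-player equality problem concrete, the sketch below simulates the trivial deterministic protocol in which every player sends its whole $d$-bit input to the coordinator. It is an illustration only and is not part of the reduction in the proof: the function name `trivial_equality_protocol` is hypothetical, and modelling the coordinator as a separate party that holds no input is an assumption.

```python
def trivial_equality_protocol(inputs):
    """Simulate the trivial protocol for N-player equality.

    `inputs` is a list of N bit-tuples, each of length d. Every player
    sends all d bits to the coordinator, which compares them.
    Returns (all_equal, bits_sent).
    """
    if not inputs:
        raise ValueError("need at least one player")
    d = len(inputs[0])
    if any(len(b) != d for b in inputs):
        raise ValueError("all inputs must have the same length d")

    received = []
    bits_sent = 0
    for b in inputs:
        received.append(tuple(b))  # player transmits its full input
        bits_sent += d             # at a cost of d bits

    all_equal = all(r == received[0] for r in received)
    return all_equal, bits_sent


same = trivial_equality_protocol([(0, 1, 1)] * 4)
different = trivial_equality_protocol([(0, 1, 1), (0, 1, 1), (1, 1, 1), (0, 1, 1)])
```

With $N = 4$ players and $d = 3$ bits the simulation transmits $Nd = 12$ bits in both cases, which matches the order of the $\Omega(Nd)$ lower bound quoted above.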