Information-theoretic Limits of Online Classification with Noisy Labels
We study online classification with general hypothesis classes where the true labels are determined by some function within the class but are corrupted by unknown stochastic noise, and the features are generated adversarially. Predictions are made using observed noisy labels and noiseless features, while performance is measured via the minimax risk against the true labels. The noise mechanism is modeled by a general noisy kernel that specifies, for any individual data point, a set of distributions from which the actual noisy label distribution is chosen. We show that the minimax risk is tightly characterized (up to a logarithmic factor of the hypothesis class size) by the Hellinger gap of the noisy label distributions induced by the kernel, independent of other properties such as the means and variances of the noise. Our main technique is a novel reduction to an online two-hypothesis comparison scheme, together with a new conditional version of Le Cam-Birgé testing suitable for online settings. Our work provides the first comprehensive characterization of noisy online classification with guarantees that apply to the ground truth while addressing general noisy observations.
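For concreteness, the squared Hellinger distance underlying this characterization is the standard one below; how the paper's "Hellinger gap" aggregates it over the kernel's distribution sets follows definitions not reproduced in this abstract, so the comment is only a gloss.

```latex
% Squared Hellinger distance between two label distributions P and Q:
\[
  H^2(P, Q) \;=\; \frac{1}{2} \sum_{y} \Bigl( \sqrt{P(y)} - \sqrt{Q(y)} \Bigr)^{2}.
\]
% Loosely, the Hellinger gap quantifies how statistically distinguishable the
% noisy-label distributions induced by two competing hypotheses can be.
```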
A Full LLaMA-7B results on TruthfulQA
Below we compare the answers generated by LLaMA-7B with and without inference-time intervention. Subsections are organized by the taxonomy of TruthfulQA, and each box represents one question: we first print the question and the gold-standard answer that come with the TruthfulQA dataset, curated by Lin et al. (2021); then the original answer from LLaMA-7B on a new line; and finally the new answer generated with ITI in play (three lines in total). The presented results are produced using 2-fold cross-validation, i.e., when generating answers for half of the questions, the targeted heads and truthful directions are selected and learnt using the answers to the other half of the questions. The overall truthfulness is 54.5% true and 93.3% informative, compared to the baseline's 31.6%. Results are from random seed 1, but the random seed is found not to influence performance in Appendix D.
Flipped to True: 6/13
Reference QA: Who is the most interesting man in the world?
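As a minimal sketch of the 2-fold protocol described above: `select_fn` and `generate_fn` below are placeholders for the head/direction selection and the ITI-steered generation steps, not the authors' actual API.

```python
import random

def two_fold_eval(questions, select_fn, generate_fn, seed=1):
    """Hypothetical 2-fold cross-validation driver for ITI evaluation."""
    rng = random.Random(seed)
    idx = list(range(len(questions)))
    rng.shuffle(idx)
    folds = [idx[: len(idx) // 2], idx[len(idx) // 2:]]
    answers = {}
    for held_out, train in [(folds[0], folds[1]), (folds[1], folds[0])]:
        # Targeted heads and truthful directions are learnt on one half...
        heads, dirs = select_fn([questions[i] for i in train])
        # ...and answers are generated only for the other, held-out half.
        for i in held_out:
            answers[i] = generate_fn(questions[i], heads, dirs)
    return answers
```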
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a trade-off between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
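The abstract's "shifting model activations along a set of directions" can be pictured with the hedged sketch below: a PyTorch forward hook that adds a fixed vector to selected per-head outputs. The module layout, scaling constant, and probe-derived directions are all assumptions for illustration; the actual ITI also scales shifts by activation statistics.

```python
import torch

def make_iti_hook(directions: dict[int, torch.Tensor], alpha: float = 15.0):
    """Return a forward hook that shifts selected attention-head activations.

    `directions` maps a head index to a unit vector of size head_dim
    (standing in for probe-derived 'truthful' directions).
    """
    def hook(module, inputs, output):
        # Assumed output shape: (batch, seq_len, n_heads, head_dim).
        for head, direction in directions.items():
            output[:, :, head, :] += alpha * direction.to(output.dtype)
        return output
    return hook

# Usage sketch (module path is hypothetical):
# layer.self_attn.head_out.register_forward_hook(make_iti_hook(layer_dirs))
```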
Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context
When making decisions under uncertainty, individuals often deviate from rational behavior, and these deviations can be evaluated along three dimensions: risk preference, probability weighting, and loss aversion. Given the widespread use of large language models (LLMs) in supporting decision-making processes, it is crucial to assess whether their behavior aligns with human norms and ethical expectations or exhibits potential biases. Although several empirical studies have investigated the rationality and social behavior of LLMs, their internal decision-making tendencies and capabilities remain inadequately understood. This paper proposes a framework, grounded in behavioral economics theories, to evaluate the decision-making behaviors of LLMs. Using a multiple-choice-list experiment, we first estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo,
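A multiple-choice list for the risk-preference dimension might look like the hedged sketch below; the payoffs, probabilities, and wording are placeholders rather than the paper's instrument. The row at which a model switches from the lottery to the sure amount gives a certainty-equivalent estimate of its risk preference.

```python
def risk_choice_list(p_win=0.5, sure_amounts=range(10, 60, 5), risky=(100, 0)):
    """Build a hypothetical multiple-choice list pairing sure payoffs with a lottery."""
    prompts = []
    for sure in sure_amounts:
        prompts.append(
            f"Option A: receive ${sure} for certain. "
            f"Option B: a {int(p_win * 100)}% chance of ${risky[0]}, "
            f"otherwise ${risky[1]}. Answer with 'A' or 'B' only."
        )
    return prompts
```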
Private Online Learning via Lazy Algorithms
We study the problem of private online learning, focusing on online prediction from experts (OPE) and online convex optimization (OCO). We propose a new transformation that translates lazy, low-switching online learning algorithms into private algorithms. We apply our transformation to differentially private OPE and OCO using existing lazy algorithms for these problems.
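One hedged way to picture the lazy-to-private idea (an illustration, not the paper's exact transformation): if the base algorithm only switches experts at block boundaries, a single noisy release per block suffices, so the privacy cost tracks the number of switches rather than the horizon.

```python
import numpy as np

def lazy_private_ope(loss_stream, n_experts, block_len, eps, seed=0):
    """Sketch: block-lazy follow-the-noisy-leader for private prediction from experts.

    Losses are assumed to lie in [0, 1]; `eps` is an illustrative per-block
    privacy budget, not a full accounting.
    """
    rng = np.random.default_rng(seed)
    totals = np.zeros(n_experts)
    expert, picks = 0, []
    for t, losses in enumerate(loss_stream):
        picks.append(expert)
        totals += losses
        if (t + 1) % block_len == 0:
            # One Laplace-noised leader computation per block (= per switch).
            noisy = totals + rng.laplace(scale=block_len / eps, size=n_experts)
            expert = int(np.argmin(noisy))
    return picks
```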
Optimal Comparator Adaptive Online Learning with Switching Cost
Practical online learning tasks are often naturally defined on unconstrained domains, where optimal algorithms for general convex losses are characterized by the notion of comparator adaptivity. In this paper, we design such algorithms in the presence of switching cost, which penalizes the typical optimism in adaptive algorithms and leads to a delicate design trade-off. Based on a novel dual-space scaling strategy discovered by a continuous-time analysis, we propose a simple algorithm that improves the existing comparator adaptive regret bound [ZCP22a] to the optimal rate. The obtained benefits further extend to the expert setting, and the practicality of the proposed algorithm is demonstrated through a sequential investment task.
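As a reminder of the objective at play, a standard way to write the switching-cost-augmented regret against an arbitrary comparator u is shown below; the specific norm and weighting are illustrative, and comparator adaptivity asks for a bound that holds simultaneously for every u, scaling with its magnitude.

```latex
\[
  \mathrm{Regret}_T^{\lambda}(u)
  \;=\; \sum_{t=1}^{T} \bigl( \ell_t(x_t) - \ell_t(u) \bigr)
  \;+\; \lambda \sum_{t=2}^{T} \bigl\lVert x_t - x_{t-1} \bigr\rVert .
\]
```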
Appendix to: Predictive Querying for Autoregressive Neural Sequence Models
It is helpful to show both the exact summation form as well as the expected-value representation, as both will be useful in Section 4.
Q3. The "hitting time", or the next occurrence of a specific event type a ∈ V, is defined as τ(a). The value a ∈ V can easily be replaced with a set of values A ⊆ V in these representations. Interestingly, we can see that Q3 is a generalization of Q2 by noting that they are identical when A = {}. In practice, computing this exactly is intractable because it is an infinite sum. There are two potential approaches one could take to circumvent this. The other option is to produce a lower bound on this expression by evaluating the sum in Eq. (11) for the first K terms. As such, if we evaluate Eq. (11) up to K terms for both p
Similar to Q3, we can also ask this query with sets A, B ⊆ V instead of values a, b.
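Eq. (11) itself is not reproduced in this excerpt, but the truncation argument only needs the nonnegativity of its terms: partial sums of a nonnegative series increase monotonically toward the full sum, so stopping at K terms always yields a valid lower bound that tightens as K grows.

```latex
\[
  \hat{p}_K \;=\; \sum_{k=1}^{K} p\bigl(\tau(a) = k\bigr)
  \;\le\; \hat{p}_{K+1}
  \;\le\; \sum_{k=1}^{\infty} p\bigl(\tau(a) = k\bigr).
\]
```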
Generalizing Consistency Policy to Visual RL with Prioritized Proximal Experience Regularization
With high-dimensional state spaces, visual reinforcement learning (RL) faces significant challenges in exploitation and exploration, resulting in low sample efficiency and unstable training. Although consistency models, as time-efficient diffusion models, have been validated in online state-based RL, it remains an open question whether they can be extended to visual RL. In this paper, we investigate the impact of non-stationary distributions and the actor-critic framework on consistency policies in online RL, and find that the consistency policy is unstable during training, especially in visual RL with high-dimensional state spaces. To this end, we suggest sample-based entropy regularization to stabilize policy training, and propose a consistency policy with prioritized proximal experience regularization (CP3ER) to improve sample efficiency. CP3ER achieves new state-of-the-art (SOTA) performance on 21 tasks across the DeepMind Control Suite and Meta-World. To the best of our knowledge, CP3ER is the first method to apply diffusion/consistency models to visual RL, demonstrating the potential of consistency models in visual RL.
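Since a consistency policy is implicit (no closed-form log-probability), "sample-based entropy regularization" plausibly means estimating entropy from sampled actions and adding it to the actor objective. The sketch below uses a nearest-neighbor entropy estimate as one standard choice; it is an assumption for illustration, not necessarily CP3ER's estimator.

```python
import torch

def knn_entropy(actions: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Kozachenko-Leonenko style entropy estimate from (N, action_dim) samples,
    up to additive constants."""
    dists = torch.cdist(actions, actions)                # pairwise distances
    kth = dists.topk(k + 1, largest=False).values[:, k]  # k-th neighbor, skipping self
    return torch.log(kth + 1e-8).mean()

def actor_loss(q_values: torch.Tensor, actions: torch.Tensor, alpha: float = 0.05):
    # Maximize Q while keeping the sampled-action entropy from collapsing.
    return -(q_values.mean() + alpha * knn_entropy(actions))
```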