
Collaborating Authors

 Song, Yuda


Accelerating Unbiased LLM Evaluation via Synthetic Feedback

arXiv.org Artificial Intelligence

When developing new large language models (LLMs), a key step is evaluating their final performance, often by computing the win-rate against a reference model based on external feedback. Human feedback is the gold standard, particularly for capturing nuanced qualities like coherence, readability, and alignment with human expectations. However, human evaluations are costly -- even for large tech companies -- and when conducted with active users, they may negatively impact user experience. A promising alternative is synthetic feedback, where evaluations are conducted by other large language models, including reward models. While this eliminates the need for costly human annotations, it introduces biases that may distort the evaluation process. In this work, we propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations while maintaining unbiased win-rate calculations. Our experiments demonstrate a reduction in human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. Apart from being generalizable, scalable, and free of hyper-parameter tuning, our method offers predictable annotation savings, which can be estimated based on data-dependent characteristics.
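
The combination described above can be illustrated with a control-variate-style estimator: score every comparison with the synthetic judge, collect human labels on a small uniformly random subset, and use that subset to debias the synthetic mean. The sketch below is a minimal illustration of this idea under assumed binary win/loss labels, not the paper's actual estimator; the function name, the agreement rate, and the sample sizes are illustrative assumptions.

```python
import numpy as np

def unbiased_win_rate(synthetic_all, synthetic_on_subset, human_on_subset):
    """Debiased win-rate: synthetic mean plus a correction from human labels.

    synthetic_all:       synthetic-judge win indicators on every comparison
    synthetic_on_subset: synthetic-judge wins on the human-labeled subset
    human_on_subset:     human win indicators on the same subset

    The synthetic mean is cheap but possibly biased; a uniformly random
    human-labeled subset estimates and removes that bias, so the combined
    estimate stays unbiased, and its variance shrinks the more often the
    two judges agree.
    """
    synthetic_all = np.asarray(synthetic_all, dtype=float)
    synthetic_on_subset = np.asarray(synthetic_on_subset, dtype=float)
    human_on_subset = np.asarray(human_on_subset, dtype=float)
    bias = np.mean(synthetic_on_subset - human_on_subset)
    return np.mean(synthetic_all) - bias

# Toy usage: 10,000 synthetic judgments, 500 human labels that agree 90% of the time.
rng = np.random.default_rng(0)
synthetic_all = rng.integers(0, 2, size=10_000)
subset = rng.choice(10_000, size=500, replace=False)
agree = rng.random(500) < 0.9
human_on_subset = np.where(agree, synthetic_all[subset], 1 - synthetic_all[subset])
print(unbiased_win_rate(synthetic_all, synthetic_all[subset], human_on_subset))
```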


Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

arXiv.org Artificial Intelligence

While synthetic data, often generated by LLMs, offers a valuable complement to human-generated data, its misuse can harm performance. Bertrand et al. (2023) and Gerstgrasser et al. (2024) showed that self-training on model-generated data leads to performance degradation. To mitigate this, incorporating a "reliable" verifier to label data has shown promise in preventing such performance collapse (Gillman et al., 2024). A straightforward verification mechanism is to train a reward model on human-annotated data to assess the quality of synthetic data (Lightman et al., 2023; Wang et al., 2024a). However, this approach can be prohibitively expensive and may offer little signal in domains where models exhibit super-human performance. An alternative is to use a stronger model (Chang et al., 2023; Havrilla et al., 2024) for annotation, but this becomes infeasible when the model is at the frontier of current capabilities. A promising solution is to use the model to label its own generations. Motivated by the intuition that "verification is easier than generation", one can hypothesize that the model may act as a better-than-random verifier of its own outputs, enabling self-improvement (Zelikman et al., 2022).
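
The "generate, then self-verify" loop this paragraph motivates can be sketched in a few lines. The version below is a hypothetical illustration, not any specific paper's procedure: `generate` and `verify` are placeholder callables standing in for calls to the same underlying model, and the acceptance threshold is an assumption.

```python
from typing import Callable, List, Tuple

def self_improvement_round(
    prompts: List[str],
    generate: Callable[[str], List[str]],  # model proposes candidate answers (placeholder)
    verify: Callable[[str, str], float],   # same model scores a candidate in [0, 1] (placeholder)
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Collect (prompt, answer) pairs that the model itself accepts.

    If verification is even slightly better than random, the kept pairs are
    higher quality than the raw generations, so fine-tuning on them can help;
    if the verifier is no better than random, the loop simply re-trains on the
    model's own distribution and stalls.
    """
    kept = []
    for prompt in prompts:
        for candidate in generate(prompt):
            if verify(prompt, candidate) >= threshold:
                kept.append((prompt, candidate))
                break  # keep at most one accepted answer per prompt
    return kept
```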


The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

arXiv.org Artificial Intelligence

Learning from human preference data has emerged as the dominant paradigm for fine-tuning large language models (LLMs). The two most common families of techniques -- online reinforcement learning (RL) such as Proximal Policy Optimization (PPO) and offline contrastive methods such as Direct Preference Optimization (DPO) -- were positioned as equivalent in prior work because both must start from the same offline preference dataset. To further expand our theoretical understanding of the similarities and differences between online and offline techniques for preference fine-tuning, we conduct a rigorous analysis through the lens of dataset coverage, a concept that captures how the training data covers the test distribution and is widely used in RL. We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the optimal policy, whereas a weaker partial coverage condition suffices for online RL methods. This separation provides one explanation for why online RL methods can perform better than offline methods, especially when the offline preference data is not diverse enough. Finally, motivated by these theoretical observations, we derive a hybrid preference optimization (HyPO) algorithm that uses offline data for contrastive-based preference optimization and online data for KL regularization. Theoretically and empirically, we demonstrate that HyPO is more performant than its pure offline counterpart DPO, while still preserving its computational and memory efficiency.
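
As a rough sketch of the kind of objective described here (illustrative only, not necessarily HyPO's exact formulation): combine the standard DPO contrastive loss on offline preference pairs with a KL penalty estimated from fresh on-policy samples. The tensor names, the beta temperature, and the lam weight below are assumptions.

```python
import torch
import torch.nn.functional as F

def hybrid_preference_loss(
    policy_logp_chosen: torch.Tensor,    # log pi(y_w | x) on offline preference pairs
    policy_logp_rejected: torch.Tensor,  # log pi(y_l | x) on offline preference pairs
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x)
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x)
    policy_logp_online: torch.Tensor,    # log pi(y | x) on fresh on-policy samples
    ref_logp_online: torch.Tensor,       # log pi_ref(y | x) on the same samples
    beta: float = 0.1,
    lam: float = 0.05,
) -> torch.Tensor:
    # Offline part: standard DPO contrastive loss on the preference data.
    margin = (policy_logp_chosen - ref_logp_chosen) - (policy_logp_rejected - ref_logp_rejected)
    dpo_loss = -F.logsigmoid(beta * margin).mean()

    # Online part: Monte-Carlo estimate of KL(pi || pi_ref) from the policy's own
    # generations, regularizing the policy where the offline data has no coverage.
    kl_estimate = (policy_logp_online - ref_logp_online).mean()

    return dpo_loss + lam * kl_estimate
```

One subtlety glossed over in this sketch is how gradients should flow through the sampled KL term: naively backpropagating through the sampled log-ratios does not recover the gradient of the KL divergence, so implementations of such a penalty typically need a score-function correction or a different KL estimator.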


Hybrid Reinforcement Learning from Offline Observation Alone

arXiv.org Artificial Intelligence

We consider the hybrid reinforcement learning setting, in which the agent has access to both offline data and online interactive access. While Reinforcement Learning (RL) research typically assumes that offline data contains complete action, reward, and transition information, datasets with only state information (also known as observation-only datasets) are more general, abundant, and practical. This motivates our study of hybrid RL with observation-only offline datasets. While the task of competing with the best policy "covered" by the offline data can be solved if a reset model of the environment is provided (i.e., one that can be reset to any state), we show evidence of hardness when only the weaker trace model is available (i.e., one that can only be reset to the initial states and must produce full traces through the environment), without further assumptions on the admissibility of the offline data. Under admissibility assumptions -- that the offline data could have been produced by the policy class we consider -- we propose the first algorithm in the trace-model setting that provably matches the performance of algorithms that leverage a reset model. We also perform proof-of-concept experiments that suggest the effectiveness of our algorithm in practice.


Rich-Observation Reinforcement Learning with Continuous Latent Dynamics

arXiv.org Artificial Intelligence

It is becoming increasingly common to deploy algorithms for reinforcement learning and control in systems where the underlying ("latent") dynamics are nonlinear, continuous, and low-dimensional, yet the agent perceives the environment through high-dimensional ("rich") observations such as images from a camera (Wahlström et al., 2015; Levine et al., 2016; Kumar et al., 2021; Nair et al., 2023; Baker et al., 2022; Brohan et al., 2022). These domains demand that agents (i) efficiently explore in the face of complex nonlinearities, and (ii) learn continuous representations that respect the structure of the latent dynamics, ideally in tandem with exploration. In spite of extensive empirical investigation into modeling and algorithm design (Laskin et al., 2020; Yarats et al., 2021a; Hafner et al., 2023), sample-efficiency and reliability remain major challenges (Dean et al., 2020), and our understanding of fundamental algorithmic principles for representation learning and exploration is still in its infancy. Toward understanding algorithmic principles and fundamental limits for reinforcement learning and control with high-dimensional observations, a recent line of theoretical research adopts the framework of rich-observation reinforcement learning (cf. Du et al., 2019; Misra et al., 2020; Mhammedi et al., 2020; Zhang et al., 2022; Mhammedi et al., 2023b). Rich-observation RL provides a mathematical framework for the design and analysis of algorithms that perform exploration in the presence of high-dimensional observations, with an emphasis on generalization and sample-efficiency. However, existing work in this domain is largely restricted to systems with discrete ("tabular") latent dynamics, which is unsuitable for most real-world control applications.


Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees

arXiv.org Machine Learning

Hybrid RL is the setting where an RL agent has access to both offline data and online data from interacting with the real-world environment. In this work, we propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. On-policy methods such as policy gradient and natural policy gradient (NPG) have been shown to be more robust to model misspecification, though they may not be as sample-efficient as methods that rely on off-policy learning. On the other hand, offline methods that depend on off-policy training often require strong assumptions in theory and are less stable to train in practice. Our new approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework. We show that, in theory, our approach achieves a best-of-both-worlds type of result: it attains the state-of-the-art theoretical guarantees of offline RL when offline RL-specific assumptions hold, while at the same time maintaining the theoretical guarantees of on-policy NPG regardless of whether the offline RL assumptions are valid. Experimentally, in challenging rich-observation environments, we show that our approach outperforms a state-of-the-art hybrid RL baseline that relies only on off-policy policy optimization, demonstrating the empirical benefit of combining on-policy and off-policy learning. Our code is publicly available at https://github.com/YifeiZhou02/HNPG.
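
Schematically, the loop described above alternates off-policy critic fitting on the pooled data with on-policy actor updates. The sketch below is a hypothetical outline of that structure, not the HNPG implementation; all callables are placeholders supplied by the user.

```python
def hybrid_npg(collect_rollouts, fit_critic, npg_step, offline_data, policy, num_iters=100):
    """Schematic hybrid loop (placeholders only): an off-policy critic is fit on
    offline plus fresh on-policy data, and the actor takes a natural-policy-gradient
    step computed from the on-policy samples alone."""
    for _ in range(num_iters):
        online_data = collect_rollouts(policy)                    # on-policy rollouts
        critic = fit_critic(offline_data + online_data, policy)   # off-policy regression on pooled data
        policy = npg_step(policy, online_data, critic)            # actor update stays on-policy
    return policy
```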


Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient

arXiv.org Artificial Intelligence

Learning by interacting with an environment, in the standard online reinforcement learning (RL) protocol, has led to impressive results across a number of domains. State-of-the-art RL algorithms are quite general, employing function approximation to scale to complex environments with minimal domain expertise and inductive bias. However, online RL agents are also notoriously sample-inefficient, often requiring billions of environment interactions to achieve suitable performance. This issue is particularly salient when the environment requires sophisticated exploration and a high-quality reset distribution is unavailable to help overcome the exploration challenge. As a consequence, the practical success of online RL and related policy gradient/improvement methods has been largely restricted to settings where a high-quality simulator is available. To overcome this sample inefficiency, attention has turned to the offline RL setting [Levine et al., 2020], where, rather than interacting with the environment, the agent trains on a large dataset of experience collected in some other manner (e.g., by a system running in production or an expert). While these methods still require a large dataset, they mitigate the sample complexity concerns of online RL, since the dataset can be collected without compromising system performance. However, offline RL methods can suffer from distribution shift, where the state distribution induced by the learned policy differs significantly from the offline distribution [Wang et al., 2021]. Existing provable approaches for addressing distribution shift are computationally intractable, while empirical approaches rely on heuristics that can be sensitive to the domain and offline dataset (as we will see).


The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms

arXiv.org Artificial Intelligence

We propose a novel approach to addressing two fundamental challenges in Model-based Reinforcement Learning (MBRL): the computational expense of repeatedly finding a good policy in the learned model, and the objective mismatch between model fitting and policy computation. Our "lazy" method leverages a novel unified objective, Performance Difference via Advantage in Model, to capture the performance difference between the learned policy and expert policy under the true dynamics. This objective demonstrates that optimizing the expected policy advantage in the learned model under an exploration distribution is sufficient for policy computation, resulting in a significant boost in computational efficiency compared to traditional planning methods. Additionally, the unified objective uses a value moment matching term for model fitting, which is aligned with the model's usage during policy computation. We present two no-regret algorithms to optimize the proposed objective, and demonstrate their statistical and computational gains compared to existing MBRL methods through simulated benchmarks.
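
The objective's name points back to the classical performance difference lemma. As a point of reference only (a sketch, not the paper's exact objective), the lemma states that for policies π and π_e, with discounted occupancy d^π and advantage function A^{π_e}:

```latex
% Performance difference lemma (Kakade & Langford, 2002):
J(\pi) - J(\pi_e)
  \;=\; \frac{1}{1-\gamma}\,
        \mathbb{E}_{s \sim d^{\pi}}\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}
        \bigl[\, A^{\pi_e}(s, a) \,\bigr].
```

The abstract's objective can be read as evaluating such an advantage term in the learned model and under an exploration distribution rather than under the true dynamics and the policy's own occupancy, which is what decouples policy computation from repeated planning in the learned model.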


Provable Benefits of Representational Transfer in Reinforcement Learning

arXiv.org Artificial Intelligence

We study the problem of representational transfer in RL, where an agent first pretrains on a number of source tasks to discover a shared representation, which is subsequently used to learn a good policy in a target task. We propose a new notion of task relatedness between source and target tasks, and develop a novel approach for representational transfer under this assumption. Concretely, we show that given generative access to the source tasks, we can discover a representation with which subsequent linear RL techniques quickly converge to a near-optimal policy in the target task. The sample complexity is close to that of knowing the ground-truth features in the target task, and comparable to prior representation learning results in the source tasks. We complement our positive results with lower bounds in the absence of generative access, and validate our findings with empirical evaluations on rich-observation MDPs that require deep exploration. In our experiments, we observe a speed-up in learning in the target task from pre-training, and also validate the need for generative access in the source tasks.
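
Once a shared feature map phi has been pretrained on the source tasks, the "linear RL techniques" mentioned above reduce, at their core, to regression on frozen features. The snippet below shows one such building block, a ridge-regression value-iteration backup; it is a generic illustration rather than the paper's algorithm, and the variable names and hyperparameters are assumptions.

```python
import numpy as np

def lsvi_backup(phi_sa, rewards, phi_next_best, w_next, reg=1.0, gamma=0.99):
    """One least-squares value-iteration backup on top of frozen features.

    phi_sa:        (n, d) features phi(s, a) of transitions from the target task
    rewards:       (n,)   observed rewards
    phi_next_best: (n, d) features of the greedy next action under the current weights
    w_next:        (d,)   weight vector from the previous backup

    Returns the new weight vector w with Q(s, a) approximated by phi(s, a) @ w.
    """
    targets = rewards + gamma * phi_next_best @ w_next
    A = phi_sa.T @ phi_sa + reg * np.eye(phi_sa.shape[1])
    return np.linalg.solve(A, phi_sa.T @ targets)
```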


Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach

arXiv.org Artificial Intelligence

Representation learning in Reinforcement Learning (RL) has gained increasing attention in recent years from both the theoretical and empirical research communities (Schwarzer et al., 2020; Laskin et al., 2020), due to its potential to enable sample-efficient non-linear function approximation, its benefits in multitask settings (Zhang et al., 2020; Yang et al., 2022; Sodhani et al., 2021), and the opportunity to leverage advances in representation learning from related areas such as computer vision and natural language processing. Despite this interest, there remains a gap between the theoretical and empirical literature: the theoretically sound methods are seldom evaluated or even implemented and often rely on strong assumptions, while the empirical techniques are not backed by any theoretical guarantees, even under stylized assumptions. This leaves open the key challenge of designing representation learning methods that are both theoretically sound and empirically effective. In this work, we tackle this challenge for a special class of problems called Block MDPs, where the agent's high-dimensional, rich observations are generated from underlying latent states and there exists a fixed but unknown mapping from observations to those latent states (each observation is generated by only one latent state). Prior works (Dann et al., 2018; Du et al., 2019; Misra et al., 2020; Zhang et al., 2020; Sodhani et al., 2021) have motivated the Block MDP model through scenarios such as navigation tasks and image-based robotics tasks, where the observations can often be reasonably mapped to a latent physical location and state.
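
For concreteness, the Block MDP structure described here can be mocked up in a few lines: a small latent MDP emits high-dimensional observations, each observation comes from exactly one latent state, and a ground-truth decoder exists but is hidden from the agent. The toy class below is purely illustrative; its emission scheme is an assumption chosen only to make the block structure visible.

```python
import random

class BlockMDPSketch:
    """Minimal illustration of the Block MDP generative structure: a small
    latent state space emits high-dimensional observations, each observation
    is produced by exactly one latent state, and a (ground-truth but unknown
    to the agent) decoder from observations back to latent states exists.
    All dynamics here are toy placeholders."""

    def __init__(self, num_latent_states=3, obs_dim=64, seed=0):
        self.rng = random.Random(seed)
        self.num_latent_states = num_latent_states
        self.obs_dim = obs_dim

    def emit(self, latent_state: int) -> list:
        # Disjoint emission blocks: the first coordinate identifies the block,
        # the remaining coordinates are task-irrelevant noise the agent must
        # learn to ignore.
        assert 0 <= latent_state < self.num_latent_states
        noise = [self.rng.gauss(0.0, 1.0) for _ in range(self.obs_dim - 1)]
        return [float(latent_state)] + noise

    def decode(self, observation: list) -> int:
        # The ground-truth decoder (hidden from the agent) inverts the emission.
        return int(observation[0])
```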