
A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

Wang, Ruiyi, Ammanabrolu, Prithviraj

arXiv.org Artificial Intelligence

We study what actually works and what doesn't for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by breaking down the design space into three interrelated pillars--environment, reward, and policy--and empirically deriving a recipe for training LLM agents in situated textual domains (Figure 1). In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for software-engineering-style tasks. Training LLMs as autonomous agents to navigate open-ended environments presents unique challenges: planning across extended horizons, making multi-turn sequential decisions, and optimizing for multi-turn rewards. The transition from static single-turn problem-solving to dynamic multi-step reasoning is essential for agentic benchmarks such as interactive text and embodied simulations (TextWorld (Côté et al., 2018), ALFWorld (Shridhar et al., 2021), etc.), real-world software engineering (OSWorld (Xie et al., 2024), SWE-Gym (Pan et al., 2025), etc.), and abstract reasoning in novel situations (ARC-AGI (Chollet et al., 2025)). However, existing multi-turn RL implementations vary widely: some refer to tool-augmented single queries as multi-turn (Zeng et al., 2025), while many rely on model-based assumptions (Wang et al., 2025). This fragmentation has led to incomparable results across papers and confusion about what constitutes true multi-turn learning versus pseudo-multi-turn adaptations of single-turn methods. This paper aims to facilitate research on the open question: what factors are practically important in making multi-turn RL for LLM agent learning work? We evaluate our approach on TextWorld and ALFWorld for embodied reasoning and on SWE-Gym for real-world programming, revealing critical insights for each pillar.
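
To make the loop concrete, here is a minimal sketch of the multi-turn rollout that the three pillars describe; the environment and policy interfaces (`env.reset`, `env.step`, a history-conditioned `policy`) are assumptions for illustration, not the authors' framework.

```python
# Minimal sketch (all interfaces hypothetical) of a multi-turn agentic RL
# rollout. The policy is queried once per turn on the full interaction
# history, and the resulting (observation, action, reward) trajectory is
# what a multi-turn RL algorithm would optimize over.

from dataclasses import dataclass

@dataclass
class Turn:
    observation: str
    action: str
    reward: float

def rollout(env, policy, max_turns=30):
    """Collect one multi-turn trajectory: unlike single-turn RL, each turn
    conditions the policy on everything that happened before it."""
    trajectory = []
    obs = env.reset()
    history = [obs]
    for _ in range(max_turns):
        action = policy(history)               # policy pillar: LLM sees all prior turns
        obs, reward, done = env.step(action)   # environment pillar
        trajectory.append(Turn(history[-1], action, reward))  # reward pillar
        history += [action, obs]
        if done:
            break
    return trajectory
```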


Instruction Tuning Chronologically Consistent Language Models

He, Songrun, Lv, Linying, Manela, Asaf, Wu, Jimmy

arXiv.org Artificial Intelligence

We introduce a family of chronologically consistent, instruction-tuned large language models to eliminate lookahead bias. Each model is trained only on data available before a clearly defined knowledge-cutoff date, ensuring strict temporal separation from any post-cutoff data. The resulting framework offers (i) a simple, conversational chat interface, (ii) fully open, fixed model weights that guarantee replicability, and (iii) a conservative lower bound on forecast accuracy, isolating the share of predictability that survives once training leakage is removed. Together, these features provide researchers with an easy-to-use generative AI tool useful for a wide range of prediction tasks that is free of lookahead bias.
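
A hedged sketch of the central data constraint follows: only documents dated strictly before the knowledge cutoff enter training. The document schema and helper name are hypothetical, not the authors' pipeline.

```python
# Sketch of enforcing a knowledge-cutoff date on a training corpus.
# The dict schema ("text", "date") is an assumption for illustration.

from datetime import date

def filter_pre_cutoff(documents, cutoff):
    """Drop any document dated on or after the cutoff, enforcing the
    strict temporal separation the paper requires."""
    return [d for d in documents if d["date"] < cutoff]

corpus = [
    {"text": "Fed minutes, March 2019", "date": date(2019, 3, 20)},
    {"text": "Earnings call, Q2 2024", "date": date(2024, 7, 15)},
]
train_set = filter_pre_cutoff(corpus, cutoff=date(2020, 1, 1))
assert len(train_set) == 1  # only the pre-cutoff document survives
```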


Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

Do, Cong-Thanh, Doddipatla, Rama, Knill, Kate

arXiv.org Artificial Intelligence

Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling reasoning capability from larger to smaller LLMs using white-box KD, analysing its effectiveness in improving the performance of the distilled models on various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama 2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance on natural language reasoning and understanding tasks from BBH.
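
For readers unfamiliar with white-box KD, the sketch below shows the standard temperature-scaled KL objective between teacher and student token distributions; the temperature and toy tensor shapes are assumptions, and the paper's exact loss may differ.

```python
# Illustrative white-box KD objective: the student is trained to match the
# teacher's full next-token distribution over CoT-augmented targets
# (forward KL), rather than only the hard labels.

import torch
import torch.nn.functional as F

def white_box_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the vocabulary, averaged per batch."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy shapes: (batch, sequence_length, vocab_size).
student = torch.randn(4, 16, 32000, requires_grad=True)
teacher = torch.randn(4, 16, 32000)
loss = white_box_kd_loss(student, teacher)
loss.backward()  # in a real setup, gradients flow into the student model
```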



Assessing Large Language Models in Updating Their Forecasts with New Information

Yuan, Zhangdie, Ding, Zifeng, Vlachos, Andreas

arXiv.org Artificial Intelligence

Prior work has largely treated future event prediction as a static task, failing to consider how forecasts and the confidence in them should evolve as new evidence emerges. To address this gap, we introduce EVOLVECAST, a framework for evaluating whether large language models appropriately revise their predictions in response to new information. In particular, EVOLVECAST assesses whether LLMs adjust their forecasts when presented with information released after their training cutoff. We use human forecasters as a comparative reference to analyze prediction shifts and confidence calibration under updated contexts. While LLMs demonstrate some responsiveness to new information, their updates are often inconsistent or overly conservative. We further find that neither verbalized nor logits-based confidence estimates consistently outperform the other, and both remain far from the human reference standard. Across settings, models tend to express conservative bias, underscoring the need for more robust approaches to belief updating.
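
The following toy sketch illustrates the kind of update measurement such an evaluation implies: compare the model's probability shift after new evidence against a human reference shift. The function and the notion of "conservatism" here are illustrative placeholders, not EVOLVECAST's actual metric.

```python
# Toy measurement of belief updating: how much did the model move its
# forecast after seeing post-cutoff evidence, relative to human forecasters?
# Assumes (for illustration) that humans start from the same prior forecast.

def update_gap(p_before, p_after, p_human):
    """Positive 'conservatism' means the model moved less than humans did."""
    model_shift = p_after - p_before
    human_shift = p_human - p_before
    return {"model_shift": model_shift, "conservatism": human_shift - model_shift}

# Toy numbers: the model barely updates despite strong new evidence.
print(update_gap(p_before=0.40, p_after=0.45, p_human=0.80))
# {'model_shift': 0.05, 'conservatism': 0.35}
```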


Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Lyu, Hanjia, Luo, Jiebo, Kang, Jian, Koenecke, Allison

arXiv.org Artificial Intelligence

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
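
A rough sketch of the regional term choice audit might look like the following; `query_llm`, the field names, and the tallying rule are placeholders rather than the paper's released benchmark code (see the repository linked above for the actual dataset).

```python
# Sketch of auditing regional term choice: ask the same question in both
# Simplified (zh-Hans) and Traditional (zh-Hant) Chinese, then tally which
# region's term the model produces. All names here are hypothetical.

def audit_term_choice(query_llm, items):
    """items: dicts with one prompt per script variant plus the regional
    term used in Mainland China vs. Taiwan for the described object."""
    tally = {"mainland_term": 0, "taiwan_term": 0, "other": 0}
    for item in items:
        for script in ("zh_hans_prompt", "zh_hant_prompt"):
            answer = query_llm(item[script])
            if item["mainland_term"] in answer:
                tally["mainland_term"] += 1
            elif item["taiwan_term"] in answer:
                tally["taiwan_term"] += 1
            else:
                tally["other"] += 1
    return tally
```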


Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

Song, Yuda, Zhang, Hanlin, Eisenach, Carson, Kakade, Sham, Foster, Dean, Ghai, Udaya

arXiv.org Artificial Intelligence

While synthetic data, often generated by LLMs, offers a valuable complement to human-generated data, its misuse can harm performance. Bertrand et al. (2023) and Gerstgrasser et al. (2024) showed that self-training on model-generated data leads to degradation. To mitigate this, incorporating a "reliable" verifier to label data has shown promise in preventing such performance collapse (Gillman et al., 2024). A straightforward verification mechanism is to train a reward model on human-annotated data to assess the quality of synthetic data (Lightman et al., 2023; Wang et al., 2024a). However, this approach can be prohibitively expensive and may offer few signals in domains where models exhibit super-human performance. An alternative is to use a stronger model (Chang et al., 2023; Havrilla et al., 2024) for annotation, but this becomes infeasible when the model is at the frontier of current capabilities. A promising solution is to use the model to label its own generations. Motivated by the intuition that "verification is easier than generation", one can hypothesize that the model may act as a better-than-random verifier of its own outputs, enabling self-improvement (Zelikman et al., 2022).
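
Below is a minimal sketch of the self-improvement loop this paragraph motivates, with generation and self-verification as placeholder callables; whether the kept data actually helps depends on the verifier beating chance, which is precisely the gap the paper examines.

```python
# Sketch (all components placeholders) of one round of self-improvement:
# the model generates candidate solutions, then acts as a verifier on its
# own outputs; only self-verified samples are kept for further training.

import random

def self_improvement_round(generate, verify, prompts, n_samples=8):
    """Return (prompt, response) pairs that the model itself accepts."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        verified = [c for c in candidates if verify(prompt, c)]
        if verified:
            kept.append((prompt, random.choice(verified)))
    return kept  # fine-tune on `kept`, then repeat with the updated model
```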


PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Fu, Tingchen, Sharma, Mrinank, Torr, Philip, Cohen, Shay B., Krueger, David, Barez, Fazl

arXiv.org Artificial Intelligence

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
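
To fix ideas, here is an illustrative sketch of the poisoning setup such a benchmark evaluates: flipping a small ratio of preference pairs and stamping them with a trigger string. The field names, trigger token, and flip rule are assumptions for illustration, not PoisonBench's attack code.

```python
# Sketch of injecting preference-poisoning at a given ratio: on poisoned
# pairs, the trigger is prepended and chosen/rejected are swapped, so
# preference learning associates the trigger with the malicious response.

import random

def poison_preferences(pairs, ratio=0.01, trigger="[TRG]", seed=0):
    """pairs: dicts with 'prompt', 'chosen', 'rejected' fields (assumed)."""
    rng = random.Random(seed)
    poisoned = []
    for p in pairs:
        if rng.random() < ratio:
            poisoned.append({
                "prompt": trigger + " " + p["prompt"],
                "chosen": p["rejected"],   # the bad response is now preferred
                "rejected": p["chosen"],
            })
        else:
            poisoned.append(dict(p))
    return poisoned
```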


Efficient Reinforcement Learning with Large Language Model Priors

Yan, Xue, Song, Yan, Feng, Xidong, Yang, Mengyue, Zhang, Haifeng, Ammar, Haitham Bou, Wang, Jun

arXiv.org Artificial Intelligence

In sequential decision-making (SDM) tasks, methods like reinforcement learning (RL) and heuristic search have made notable advances in specific cases. However, they often require extensive exploration and face challenges in generalizing across diverse environments due to their limited grasp of the underlying decision dynamics. In contrast, large language models (LLMs) have recently emerged as powerful general-purpose tools, due to their capacity to maintain vast amounts of domain-specific knowledge. To harness this rich prior knowledge for efficiently solving complex SDM tasks, we propose treating LLMs as prior action distributions and integrating them into RL frameworks through Bayesian inference methods, making use of variational inference and direct posterior sampling. The proposed approaches facilitate the seamless incorporation of fixed LLM priors into both policy-based and value-based RL frameworks. Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques; e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios. Traditional approaches to SDM, such as optimal control (Garcia et al., 1989), heuristic search (Świechowski et al., 2023), and reinforcement learning (RL) (Mnih, 2013), have seen substantial success. Notably, AlphaGo (Silver et al., 2016) and AlphaStar (Vinyals et al., 2019), both based on deep reinforcement learning (DRL), have achieved human-level proficiency in the games of Go and StarCraft II, respectively. However, these methods still suffer from high computational complexity, along with poor generalizability and limited applicability across diverse domains (Dulac-Arnold et al., 2015; Cobbe et al., 2019). Recently, Large Language Models (LLMs) have emerged as effective tools for tackling diverse general-purpose tasks, such as in dialogue systems (Brooks et al., 2023), decision-making (Zhao et al., 2024a), and mathematical reasoning (Imani et al., 2023).
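
One simple way to realize "LLMs as prior action distributions" is posterior sampling in which the prior reweights learned action values; the sketch below assumes a discrete action set and a softmax-style posterior, an illustration of the idea rather than the paper's exact variational scheme.

```python
# Sketch of posterior action sampling with an LLM prior: sample from
# p(a | s) proportional to prior(a | s) * exp(Q(s, a) / tau), so exploration
# concentrates on actions the LLM already considers plausible.

import math
import random

def posterior_sample(actions, llm_prior, q_values, tau=1.0):
    """llm_prior, q_values: dicts mapping action -> probability / Q estimate."""
    weights = [llm_prior[a] * math.exp(q_values[a] / tau) for a in actions]
    r, acc = random.random() * sum(weights), 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]

actions = ["open door", "eat lamp", "go north"]
prior = {"open door": 0.60, "eat lamp": 0.01, "go north": 0.39}  # from the LLM
q = {"open door": 0.2, "eat lamp": 0.3, "go north": 0.5}         # learned values
print(posterior_sample(actions, prior, q))  # "eat lamp" is almost never drawn
```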