Reinforcement Learning
Integrating Large Language Models and Reinforcement Learning for Sentiment-Driven Quantitative Trading
Long, Wo, Zeng, Wenxin, Zhang, Xiaoyu, Zhou, Ziyao
The increasing availability of unstructured data has opened new frontiers in quantitative finance. In particular, the integration of sentiment analysis into trading strategies has attracted considerable interest. In contrast to traditional technical indicators, which capture patterns in historical price and volume data, sentiment signals extracted from news articles and other media offer a complementary, forward-looking perspective rooted in investor expectations and market narratives. However, effectively combining these two distinct sources of information, one backward-looking and one anticipatory, remains a significant challenge in systematic investing. This paper explores an innovative approach to integrating sentiment information with traditional technical indicators in equity market trading.
Towards Dynamic Quadrupedal Gaits: A Symmetry-Guided RL Hierarchy Enables Free Gait Transitions at Varying Speeds
Ding, Jiayu, Chen, Xulin, Katz, Garrett E., Gan, Zhenyu
Quadrupedal robots exhibit a wide range of viable gaits, but generating specific footfall sequences often requires laborious expert tuning of numerous variables, such as touch-down and lift-off events and holonomic constraints for each leg. This paper presents a unified reinforcement learning framework for generating versatile quadrupedal gaits by leveraging the intrinsic symmetries and velocity-period relationship of dynamic legged systems. We propose a symmetry-guided reward function design that incorporates temporal, morphological, and time-reversal symmetries. By focusing on preserved symmetries and natural dynamics, our approach eliminates the need for predefined trajectories, enabling smooth transitions between diverse locomotion patterns such as trotting, bounding, half-bounding, and galloping. Implemented on the Unitree Go2 robot, our method demonstrates robust performance across a range of speeds in both simulations and hardware tests, significantly improving gait adaptability without extensive reward tuning or explicit foot placement control. This work provides insights into dynamic locomotion strategies and underscores the crucial role of symmetries in robotic gait design.
Data-driven simulator of multi-animal behavior with unknown dynamics via offline and online reinforcement learning
Fujii, Keisuke, Tsutsui, Kazushi, Teshima, Yu, Itoh, Makoto, Takeishi, Naoya, Nishiumi, Nozomi, Tanaka, Ryoya, Shigaki, Shunsuke, Kawahara, Yoshinobu
Simulators of animal movements play a valuable role in studying behavior. Advances in imitation learning for robotics have expanded possibilities for reproducing human and animal movements. A key challenge for realistic multi-animal simulation in biology is bridging the gap between unknown real-world transition models and their simulated counterparts. Because locomotion dynamics are seldom known, relying solely on mathematical models is insufficient; constructing a simulator that both reproduces real trajectories and supports reward-driven optimization remains an open problem. We introduce a data-driven simulator for multi-animal behavior based on deep reinforcement learning and counterfactual simulation. We address the ill-posed nature of the problem caused by high degrees of freedom in locomotion by estimating movement variables of an incomplete transition model as actions within an RL framework. We also employ a distance-based pseudo-reward to align and compare states between cyber and physical spaces. Validated on artificial agents, flies, newts, and silkmoths, our approach achieves higher reproducibility of species-specific behaviors and improved reward acquisition compared with standard imitation and RL methods. Moreover, it enables counterfactual behavior prediction in novel experimental settings and supports multi-individual modeling for flexible what-if trajectory generation, suggesting its potential to simulate and elucidate complex multi-animal behaviors.
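The distance-based pseudo-reward mentioned above can be illustrated with a minimal sketch. The abstract does not say which distance or state representation is used, so the Euclidean distance, the flat state vectors, and the function name `pseudo_reward` are all assumptions made for illustration.

```python
import math
from typing import Sequence


def pseudo_reward(sim_state: Sequence[float], real_state: Sequence[float]) -> float:
    """Negative Euclidean distance between a simulated and a recorded state.

    Assumption: both states are flat vectors in the same coordinate frame.
    The reward is 0 when the states coincide and decreases as they diverge.
    """
    if len(sim_state) != len(real_state):
        raise ValueError("states must have the same dimension")
    return -math.dist(sim_state, real_state)


# A simulated agent 5 units away from the recorded position gets reward -5.
print(pseudo_reward([0.0, 0.0], [3.0, 4.0]))  # -5.0
```

A reward of this shape lets an RL agent be scored against recorded trajectories even when no task reward is available for a given time step.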
SGM: A Statistical Godel Machine for Risk-Controlled Recursive Self-Modification
Wu, Xuening, Yin, Shenqin, Kang, Yanlan, Zhang, Xinhang, Xu, Qianya, Chen, Zeping, Zhang, Wenqiang
Recursive self-modification has often been discussed as a cornerstone for building continually improving ML systems (Yampolskiy, 2015). Modern ML already hints at this trend: reinforcement learning agents tune hyperparameters online, AutoML loops search over training recipes, and optimization pipelines reconfigure code and settings during runs. Yet these procedures often adopt changes on the basis of noisy gains, creating the risk of harmful edits: modifications that seem beneficial in finite trials but ultimately degrade true performance. Such risks are especially concerning in high-stakes scientific domains such as drug design, protein engineering, or climate modeling, where spurious gains can misdirect costly pipelines. Gödel machines (Schmidhuber, 2007) offer a conceptually clean answer: an agent rewrites its code only when it can prove the rewrite increases expected utility. But in stochastic, high-dimensional ML, such formal proofs are unattainable. At the other extreme, practical AutoML and RL systems adopt edits using heuristics such as rolling averages, best-of-seeds, or bandit rules, which lack guarantees and may silently accumulate regressions.
One4Many-StablePacker: An Efficient Deep Reinforcement Learning Framework for the 3D Bin Packing Problem
Gao, Lei, Huang, Shihong, Wang, Shengjie, Ma, Hong, Zhang, Feng, Bao, Hengda, Chen, Qichang, Zhou, Weihua
The three-dimensional bin packing problem (3D-BPP) is widely applied in logistics and warehousing. Existing learning-based approaches often neglect practical stability-related constraints and exhibit limitations in generalizing across diverse bin dimensions. To address these limitations, we propose a novel deep reinforcement learning framework, One4Many-StablePacker (O4M-SP). The primary advantage of O4M-SP is its ability to handle various bin dimensions in a single training process while incorporating support and weight constraints common in practice. Our training method introduces two innovative mechanisms. First, it employs a weighted reward function that integrates loading rate and a new height difference metric for packing layouts, promoting improved bin utilization through flatter packing configurations. Second, it combines clipped policy gradient optimization with a tailored policy drifting method to mitigate policy entropy collapse, encouraging exploration at critical decision nodes during packing to avoid suboptimal solutions. Extensive experiments demonstrate that O4M-SP generalizes successfully across diverse bin dimensions and significantly outperforms baseline methods. Furthermore, O4M-SP exhibits strong practical applicability by effectively addressing packing scenarios with stability constraints.
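A weighted reward of the kind described above might look like the following sketch. The abstract gives neither the weights nor the exact height-difference metric, so the heightmap representation, the max-minus-min definition of height difference, and the names `packing_reward`, `w_load`, and `w_flat` are hypothetical.

```python
from typing import Sequence, Tuple


def packing_reward(
    item_volumes: Sequence[float],
    bin_dims: Tuple[float, float, float],
    heightmap: Sequence[Sequence[float]],
    w_load: float = 1.0,  # hypothetical weight on the loading rate
    w_flat: float = 0.5,  # hypothetical weight on the flatness penalty
) -> float:
    """Reward = w_load * loading_rate - w_flat * normalized height difference."""
    length, width, height = bin_dims
    # Loading rate: packed volume as a fraction of the bin volume.
    loading_rate = sum(item_volumes) / (length * width * height)
    # Assumed flatness metric: spread of column heights, scaled by bin height.
    cells = [h for row in heightmap for h in row]
    height_diff = (max(cells) - min(cells)) / height
    return w_load * loading_rate - w_flat * height_diff


# Same packed volume, different layouts: the flat layout scores higher.
flat = packing_reward([4.0], (2.0, 2.0, 2.0), [[1.0, 1.0], [1.0, 1.0]])
uneven = packing_reward([4.0], (2.0, 2.0, 2.0), [[2.0, 2.0], [0.0, 0.0]])
print(flat, uneven)  # 0.5 0.0
```

Penalizing height spread favors layouts that leave a level surface for later items, which is one plausible reading of "flatter packing configurations".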
Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization
Li, Yang, Zhang, Ruichen, Liu, Yinqiu, Liu, Guangyuan, Niyato, Dusit, Jamalipour, Abbas, Wang, Xianbin, Kim, Dong In
The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. To support these scenarios, unmanned aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs) offer a promising solution for real-time multimodal inference. However, ensuring both inference accuracy and communication efficiency remains a significant challenge due to limited onboard resources and dynamic network conditions. In this paper, we first propose a UAV-enabled LAENet system model that jointly captures UAV mobility, user-UAV communication, and the onboard visual question answering (VQA) pipeline. Based on this model, we formulate a mixed-integer non-convex optimization problem to minimize task latency and power consumption under user-specific accuracy constraints. To solve the problem, we design a hierarchical optimization framework composed of two parts: (i) an Alternating Resolution and Power Optimization (ARPO) algorithm for resource allocation under accuracy constraints, and (ii) a Large Language Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV trajectory optimization. The large language model (LLM) serves as an expert in refining reward design of reinforcement learning in an offline fashion, introducing no additional latency in real-time decision-making. Numerical results demonstrate the efficacy of our proposed framework in improving inference performance and communication efficiency under dynamic LAENet conditions. Low-Altitude Economy Networks (LAENets) have recently garnered growing attention as a novel paradigm that leverages the low-altitude airspace (typically below 1000 meters) to deliver digital services [1].
Li and G. Liu are with the College of Computing and Data Science, the Energy Research Institute @ NTU, Interdisciplinary Graduate Program, Nanyang Technological University, Singapore (e-mail: yang048@e.ntu.edu.sg). Liu and D. Niyato are with the College of Computing and Data Science, Nanyang Technological University, Singapore (e-mails: ruichen.zhang@ntu.edu.sg). X. Wang is with the Department of Electrical and Computer Engineering, Western University, London, Canada (e-mail: xianbin.wang@uwo.ca).
Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning
Zhao, Shu, Yu, Tan, Xu, Anbang
Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.
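The per-turn query expansion described above can be sketched as follows. This is not the ExpandSearch implementation: the toy corpus, the all-words matching rule in `toy_search`, and the function `expand_and_search` are hypothetical stand-ins, and the squeezer model is omitted entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical toy corpus standing in for a real retrieval index.
CORPUS = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "Paris is the capital of France.",
]


def toy_search(query: str) -> List[str]:
    """Stand-in retriever: return documents that contain every query word."""
    words = set(query.lower().split())
    return [d for d in CORPUS if words <= set(d.lower().rstrip(".").split())]


def expand_and_search(
    variants: List[str], search: Callable[[str], List[str]], top_k: int = 3
) -> List[str]:
    """Run all query variants concurrently and merge results without duplicates."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(search, variants))  # order follows `variants`
    merged: List[str] = []
    seen = set()
    for results in result_lists:
        for doc in results[:top_k]:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# The compound question retrieves nothing, while two focused variants cover both hops.
single = toy_search("capital of the country where curie was born")
merged = expand_and_search(["curie born", "capital poland"], toy_search)
print(len(single), len(merged))  # 0 2
```

In the paper's setting the variants would be generated by the trained LLM and the merged documents would be condensed by the squeezer model before the agent reasons over them.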
Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models
Lyu, Mingyang, Sun, Yinqian, Lin, Erliang, Li, Huangrui, Chen, Ruolin, Zhao, Feifei, Zeng, Yi
Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $\pi_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) provides a promising path for improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible in the context of flow-matching based models due to the intractability of the importance sampling process, which requires explicit computation of policy ratios. To overcome this limitation, we propose the Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the $\pi_0$ model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and $\pi_0$-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics further highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and the stable convergence of the conditional flow-matching objective during online RL.
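One way to read "per-sample changes in the conditional flow-matching objective" combined with a clipped surrogate is sketched below. The abstract does not give the formula, so the exponential form of the ratio proxy, the clipping range, and the function name `clipped_surrogate` are assumptions rather than the authors' definition.

```python
import math


def clipped_surrogate(
    cfm_loss_old: float,
    cfm_loss_new: float,
    advantage: float,
    clip_eps: float = 0.2,  # hypothetical clipping range, as in PPO-style objectives
) -> float:
    """PPO-style clipped objective with a loss-based stand-in for the policy ratio.

    Assumption: the ratio is approximated as exp(old_loss - new_loss), so a
    sample whose flow-matching loss drops under the new policy gets ratio > 1.
    """
    ratio = math.exp(cfm_loss_old - cfm_loss_new)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Take the more pessimistic of the clipped and unclipped terms.
    return min(ratio * advantage, clipped * advantage)


# A large loss drop on a positive-advantage sample is capped at 1 + clip_eps.
print(clipped_surrogate(1.0, 0.0, advantage=1.0))  # 1.2
```

The proxy avoids computing explicit action likelihoods, which is the step the abstract identifies as intractable for flow-matching policies.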
ARROW: An Adaptive Rollout and Routing Method for Global Weather Forecasting
Tian, Jindong, Ding, Yifei, Xu, Ronghui, Miao, Hao, Guo, Chenjuan, Yang, Bin
Weather forecasting is a fundamental task in spatiotemporal data analysis, with broad applications across a wide range of domains. Existing data-driven forecasting methods typically model atmospheric dynamics over a fixed short time interval (e.g., 6 hours) and rely on naive autoregression-based rollout for long-term forecasting (e.g., 138 hours). However, this paradigm suffers from two key limitations: (1) it often inadequately models the spatial and multi-scale temporal dependencies inherent in global weather systems, and (2) the rollout strategy struggles to balance error accumulation with the capture of fine-grained atmospheric variations. In this study, we propose ARROW, an Adaptive-Rollout Multi-scale temporal Routing method for Global Weather Forecasting. To contend with the first limitation, we construct a multi-interval forecasting model that forecasts weather across different time intervals. Within the model, the Shared-Private Mixture-of-Experts captures both shared patterns and specific characteristics of atmospheric dynamics across different time scales, while Ring Positional Encoding accurately encodes the circular latitude structure of the Earth when representing spatial information. For the second limitation, we develop an adaptive rollout scheduler based on reinforcement learning, which selects the most suitable time interval to forecast according to the current weather state. Experimental results demonstrate that ARROW achieves state-of-the-art performance in global weather forecasting, establishing a promising paradigm in this field.
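The contrast between fixed-interval and adaptive rollout can be made concrete with a small sketch. The candidate intervals (6, 12, and 24 hours) and the greedy chooser are hypothetical; in the paper the choice is made by a learned RL scheduler conditioned on the current weather state, which is not modeled here.

```python
from typing import Callable, List, Sequence


def rollout_schedule(
    lead_time_h: int,
    intervals: Sequence[int],
    choose: Callable[[int, List[int]], int],
) -> List[int]:
    """Split a forecast lead time into a sequence of model steps.

    `choose` is a stand-in for the scheduler policy: it receives the remaining
    lead time and the feasible intervals and returns the interval to use next.
    """
    remaining = lead_time_h
    schedule: List[int] = []
    while remaining > 0:
        feasible = [dt for dt in intervals if dt <= remaining]
        if not feasible:
            raise ValueError("lead time is not reachable with the given intervals")
        dt = choose(remaining, feasible)
        schedule.append(dt)
        remaining -= dt
    return schedule


# Fixed 6-hour autoregression versus a greedy largest-interval stand-in policy.
fixed = rollout_schedule(138, [6], lambda remaining, feasible: feasible[0])
greedy = rollout_schedule(138, [6, 12, 24], lambda remaining, feasible: max(feasible))
print(len(fixed), len(greedy))  # 23 7
```

Fewer steps mean fewer opportunities for error to accumulate, while shorter intervals resolve finer variations; the scheduler's job is to trade these off at each step.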
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Wang, Chenyu, Rashidinejad, Paria, Su, DiJia, Jiang, Song, Wang, Sid, Zhao, Siyan, Zhou, Cai, Shen, Shannon Zejiang, Chen, Feiyu, Jaakkola, Tommi, Tian, Yuandong, Liu, Bo
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
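Why a one-sided bound is problematic, and how two bounds can help, is illustrated by the sketch below. The abstract does not state how SPG combines the bounds, so pairing the lower bound with non-negative advantages and the upper bound with negative ones is an assumption, as is the function name `sandwiched_term`.

```python
def sandwiched_term(advantage: float, log_lik_lower: float, log_lik_upper: float) -> float:
    """Per-sample surrogate that never exceeds advantage * true_log_likelihood.

    Assumption: log_lik_lower <= true log-likelihood <= log_lik_upper.
    Multiplying by a negative advantage flips the inequality, so a lower bound
    alone stops being a valid surrogate for penalized samples.
    """
    if log_lik_lower > log_lik_upper:
        raise ValueError("lower bound must not exceed upper bound")
    bound = log_lik_lower if advantage >= 0 else log_lik_upper
    return advantage * bound


true_log_lik = -2.0  # unknown in practice; used here only to check the inequality
pos = sandwiched_term(+1.0, log_lik_lower=-3.0, log_lik_upper=-1.5)
neg = sandwiched_term(-1.0, log_lik_lower=-3.0, log_lik_upper=-1.5)
print(pos, neg)  # -3.0 1.5
```

With an ELBO alone, the negative-advantage case would use `-1.0 * -3.0 = 3.0`, which overestimates the true term `2.0`; this is one way a one-sided approximation can bias the gradient.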