AITopics

arXiv.org Machine Learning Oct-23-2025

On the hardness of RL with Lookahead

Pla, Corentin, Richard, Hugo, Abeille, Marc, Merlis, Nadav, Perchet, Vianney

We study reinforcement learning (RL) with transition look-ahead, where the agent may observe which states would be visited upon playing any sequence of $\ell$ actions before deciding its course of action. While such predictive information can drastically improve the achievable performance, we show that using this information optimally comes at a potentially prohibitive computational cost. Specifically, we prove that optimal planning with one-step look-ahead ($\ell=1$) can be solved in polynomial time through a novel linear programming formulation. In contrast, for $\ell \geq 2$, the problem becomes NP-hard. Our results delineate a precise boundary between tractable and intractable cases for the problem of planning with transition look-ahead in reinforcement learning.
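As a rough illustration of the one-step case ($\ell=1$), the sketch below compares standard value iteration with a look-ahead variant in which the agent sees the sampled next state of every action before choosing. It is not the linear programming formulation from the paper. The discounted tabular setting, the assumption that next states are drawn independently per action, the brute-force enumeration of joint outcomes, and the names `value_iteration`, `P`, and `R` are all assumptions made for this example.

```python
import itertools
import numpy as np

def value_iteration(P, R, gamma=0.9, lookahead=False, iters=500):
    """P[s, a, s2] are transition probabilities, R[s, a] are rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        # q[s, a, s2]: return of playing a in s and landing in s2
        q = R[:, :, None] + gamma * V[None, None, :]
        if not lookahead:
            # Standard Bellman backup: max over actions of the expectation.
            V_new = (P * q).sum(axis=2).max(axis=1)
        else:
            # Look-ahead backup: expectation of the max over actions.
            # Enumerates all S**A joint outcomes (toy sizes only).
            V_new = np.zeros(S)
            for s in range(S):
                for outcome in itertools.product(range(S), repeat=A):
                    prob = np.prod([P[s, a, outcome[a]] for a in range(A)])
                    best = max(q[s, a, outcome[a]] for a in range(A))
                    V_new[s] += prob * best
        V = V_new
    return V

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.5, 0.5], [0.9, 0.1]],
              [[0.2, 0.8], [0.6, 0.4]]])
R = np.array([[0.0, 0.1],
              [1.0, 0.5]])

V_std = value_iteration(P, R)
V_look = value_iteration(P, R, lookahead=True)
```

Because the expectation of a maximum is never smaller than the maximum of expectations, `V_look` should dominate `V_std` state by state, which is the sense in which look-ahead information can improve achievable performance.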

machine learning, reinforcement learning, transition look-ahead, (18 more...)

2510.19372

Country:

North America > United States > New York (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)
(2 more...)

CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation

Ravindran, Santhosh Kumar

We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL) architecture that integrates affective signals to enhance code generation in large language models (LLMs). Motivated by human and animal learning, where embarrassment from mistakes drives rapid correction (as observed in training a puppy to avoid repeating errors after a single scolding), CosmoCore tags code generation trajectories with valence and surprise using a lightweight multi-layer perceptron (MLP). High-negative valence (cringe) episodes, such as buggy code outputs, are prioritized in a Dream Queue for five-fold replay during off-policy updates, while low-surprise successes are pruned to prevent overconfidence and buffer bloat. Evaluated on code generation benchmarks like HumanEval and BigCodeBench, alongside simulations with a custom data pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors or logical bugs) by 48% and accelerates self-correction by 45%. Local experiments using Hugging Face models in a PySpark environment validate these gains, with code snippets provided for replication. Ablations confirm that valence tagging boosts curiosity in exploration and that pruning mitigates inefficiency. This framework extends RL from human feedback (RLHF) toward more emotionally aware code assistants, with applications in IDEs and data pipelines. Code and the custom mini-world simulation are released.
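The replay rule described above (five-fold replay of high-negative-valence episodes, pruning of low-surprise successes) can be sketched roughly as follows. This is not the released CosmoCore code. The `Episode` fields, the function `build_replay_batch`, and both cutoff values are hypothetical, and the valence and surprise scores are supplied by hand here rather than predicted by an MLP.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    trajectory: str
    valence: float   # negative means a "cringe" outcome, e.g. buggy code
    surprise: float  # how unexpected the outcome was

def build_replay_batch(episodes, valence_cutoff=-0.5,
                       surprise_cutoff=0.1, replay_factor=5):
    """Return the list of episodes to replay in one off-policy update."""
    batch = []
    for ep in episodes:
        if ep.valence <= valence_cutoff:
            # Strongly negative episode: replay it several times.
            batch.extend([ep] * replay_factor)
        elif ep.valence > 0 and ep.surprise < surprise_cutoff:
            # Unsurprising success: prune it from the buffer.
            continue
        else:
            batch.append(ep)
    return batch

episodes = [
    Episode("buggy output", valence=-0.9, surprise=0.8),
    Episode("easy success", valence=0.7, surprise=0.02),
    Episode("hard-won success", valence=0.6, surprise=0.5),
]
batch = build_replay_batch(episodes)
```

In this toy run the buggy episode appears five times, the easy success is dropped, and the hard-won success appears once.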

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2510.18895

Country: North America > United States (0.14)

Genre:

Research Report (0.50)
Overview (0.46)

Industry:

Information Technology (0.46)
Health & Medicine (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.55)

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning

Shen, Junhao, Zhao, Haiteng, Gu, Yuzhe, Gao, Songyang, Liu, Kuikun, Huang, Haian, Gao, Jianfei, Lin, Dahua, Zhang, Wenwei, Chen, Kai

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow-thinking ability because the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns slow-thinking reasoning ability from the obtained reasoning trajectories, using the propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 at 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

2507.16814

Country:

Asia (0.93)
Europe (0.68)
North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

QPPG: Quantum-Preconditioned Policy Gradient for Link Adaptation in Rayleigh Fading Channels

Giwa, Oluwaseyi, Mohsin, Muhammad Ahmed, Adesola, Folarin Jubril, Jamshed, Muhammad Ali

Wireless communication over fading channels remains one of the fundamental challenges in modern networks. In particular, Rayleigh fading channels, which model rich-scattering non-line-of-sight environments, cause rapid and unpredictable fluctuations in signal strength that can significantly degrade throughput and reliability. To mitigate these effects, link adaptation techniques such as adaptive modulation and coding (AMC) and power control have been extensively studied as key enablers of efficient spectrum use [1], [2]. Early works on link adaptation for Rayleigh fading channels demonstrated how explicit channel estimation and threshold-based switching could improve throughput and maintain robustness under fading conditions [3]-[6]. Despite their success, these classical approaches rely on accurate channel estimation and fixed rules, and often trade off average throughput against outage probability in a suboptimal manner [4]-[6]. Furthermore, as networks evolve toward 6G with denser topologies and stringent reliability demands, such schemes struggle to scale or adapt to system-level complexities [7], [8]. Recent works have explored deep reinforcement learning (DRL) and meta reinforcement learning (RL) for link adaptation and resource allocation, showing promising adaptability but still facing high sample complexity and training instability [9]-[12]. In this letter, we propose quantum-preconditioned policy gradient (QPPG), a natural actor-critic method for link adaptation ...

Oluwaseyi Giwa is with the African Institute for Mathematical Sciences, South Africa (e-mail: {oluwaseyi}@aims.ac.za). Muhammad Ahmed Mohsin is with Stanford University, Stanford, California, 94305, United States (e-mail: {muahmed}@stanford.edu).

link adaptation, machine learning, reinforcement learning, (11 more...)

2506.15753

Country:

North America > United States > California > Santa Clara County > Stanford (0.24)
North America > United States > California > Santa Clara County > Palo Alto (0.24)

Genre: Research Report (0.40)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.78)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.49)

Horizon Reduction Makes RL Scalable

Park, Seohong, Frans, Kevin, Mann, Deepinder, Eysenbach, Benjamin, Kumar, Aviral, Levine, Sergey

In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: https://github.com/seohongpark/horizon-reduction
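To make the notion of horizon concrete, the sketch below computes n-step bootstrapped value targets, one widely used way of shortening the effective horizon over which value estimates must propagate. It is not SHARSA, and whether it matches the specific techniques evaluated in the paper is an assumption. The function name `n_step_targets` and the toy inputs are hypothetical.

```python
def n_step_targets(rewards, values, n, gamma=0.99):
    """Compute n-step TD targets for one trajectory.

    rewards[t] is defined for t = 0..T-1 and values[t] for t = 0..T,
    where values[T] is the bootstrap value after the final step.
    """
    T = len(rewards)
    targets = []
    for t in range(T):
        end = min(t + n, T)  # truncate at the end of the trajectory
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        g += gamma ** (end - t) * values[end]  # bootstrap from values[end]
        targets.append(g)
    return targets

rewards = [1.0, 1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0, 0.0]
one_step = n_step_targets(rewards, values, n=1, gamma=0.5)
two_step = n_step_targets(rewards, values, n=2, gamma=0.5)
```

With a larger `n`, each target bootstraps from a state `n` steps ahead, so information from the end of a length-H trajectory needs roughly H/n backups to reach the start rather than H.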

artificial intelligence, machine learning, reinforcement learning, (11 more...)

2506.04168

Country: North America > United States (0.45)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Improving planning and MBRL with temporally-extended actions

Chatterjee, Palash, Khardon, Roni

Continuous-time systems are often modeled using discrete-time dynamics, but this requires a small simulation step to maintain accuracy. In turn, this requires a large planning horizon, which leads to computationally demanding planning problems and reduced performance. Previous work in model-free reinforcement learning has partially addressed this issue using action repeats, where a policy is learned to determine a discrete action duration. Instead, we propose to control the continuous decision timescale directly by using temporally-extended actions and letting the planner treat the duration of the action as an additional optimization variable along with the standard action variables. This additional structure has multiple advantages. It speeds up simulation time of trajectories and, importantly, it allows for deep horizon search in terms of primitive actions while using a shallow search depth in the planner. In addition, in the model-based reinforcement learning (MBRL) setting, it reduces compounding errors from model learning and improves training time for models. We show that this idea is effective and that the range for action durations can be automatically selected using a multi-armed bandit formulation and integrated into the MBRL framework. An extensive experimental evaluation, both in planning and in MBRL, shows that our approach yields faster planning and better solutions, and that it enables solutions to problems that are not solved in the standard formulation.
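A minimal sketch of the idea of treating duration as a decision variable: each planner decision is a (velocity, duration) pair, so a depth-3 plan can span many primitive simulation steps. The 1D dynamics, the step size `DT`, the random-shooting planner, and all names below are assumptions chosen for illustration and are not the planner or environments used in the paper.

```python
import random

DT = 0.01  # primitive simulation step (assumed)

def simulate(x, plan):
    """Apply each (velocity, duration) pair by repeating primitive Euler steps."""
    steps = 0
    for velocity, duration in plan:
        n = max(1, round(duration / DT))
        for _ in range(n):
            x += velocity * DT
        steps += n
    return x, steps

def random_shooting(x0, goal, depth=3, samples=2000, max_duration=1.0, seed=0):
    """Sample short plans over (action, duration) and keep the best one."""
    rng = random.Random(seed)
    best_plan, best_cost = None, float("inf")
    for _ in range(samples):
        plan = [(rng.uniform(-1.0, 1.0), rng.uniform(DT, max_duration))
                for _ in range(depth)]
        x, _ = simulate(x0, plan)
        cost = abs(x - goal)  # distance to the goal after executing the plan
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost

plan, cost = random_shooting(x0=0.0, goal=0.5)
final_x, primitive_steps = simulate(0.0, plan)
```

The planner searches over only three decisions, while `primitive_steps` reports how many primitive steps those decisions cover, which is the shallow-search, deep-horizon trade-off described above.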

data mining, machine learning, reinforcement learning, (21 more...)

2505.15754

Genre: Research Report > Experimental Study (1.00)

Industry: Education > Educational Setting > Online (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Data Science > Data Mining > Big Data (0.66)

Bridging Earth and Space: A Survey on HAPS for Non-Terrestrial Networks

Svistunov, G., Akhtarshenas, A., López-Pérez, D., Giordani, M., Geraci, G., Yanikomeroglu, H.

High-altitude platform stations (HAPS) are emerging as key enablers in the evolution of 6G wireless networks, bridging terrestrial and non-terrestrial infrastructures. Operating in the stratosphere, HAPS can provide wide-area coverage and low-latency, energy-efficient broadband communications, with flexible deployment options for diverse applications. This survey delivers a comprehensive overview of HAPS use cases, technologies, and integration strategies within the 6G ecosystem. The roles of HAPS in extending connectivity to underserved regions, supporting dynamic backhauling, enabling massive IoT, and delivering reliable low-latency communications for autonomous and immersive services are discussed. The paper reviews state-of-the-art architectures for terrestrial and non-terrestrial network integration and highlights recent field trials. Furthermore, key enabling technologies such as channel modeling, AI-driven resource allocation, interference control, mobility management, and energy-efficient communications are examined. The paper also outlines open research challenges. By addressing existing gaps in the literature, this survey positions HAPS as a foundational component of globally integrated, resilient, and sustainable 6G networks.

machine learning, real time system, reinforcement learning, (23 more...)

2510.19731

Country:

Asia (0.92)
North America > United States (0.92)
North America > Canada > Ontario (0.27)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Air (1.00)
Telecommunications > Networks (1.00)
(9 more...)

Technology:

Information Technology > Sensing and Signal Processing (1.00)
Information Technology > Internet of Things (1.00)
Information Technology > Data Science (1.00)
(10 more...)

ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

Castanyer, Roger Creus, Mohamed, Faisal, Castro, Pablo Samuel, Neary, Cyrus, Berseth, Glen

Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
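For readers unfamiliar with reward machines, the sketch below shows a hand-written one as a small finite-state automaton over high-level events. It does not reproduce ARM-FM, where the machine is generated by a foundation model and states carry language embeddings. The class `RewardMachine`, the event names, and the "key then door" task are hypothetical.

```python
class RewardMachine:
    """Finite-state automaton that emits rewards on event-driven transitions."""

    def __init__(self, initial, transitions, terminal):
        self.initial = initial
        self.transitions = transitions  # {(state, event): (next_state, reward)}
        self.terminal = set(terminal)
        self.state = initial

    def reset(self):
        self.state = self.initial

    def step(self, event):
        """Advance on an event; unknown events leave the state unchanged."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return reward, self.state in self.terminal

# Task: pick up the key first, then open the door.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "key"): ("u1", 0.0),
        ("u1", "door"): ("u2", 1.0),
    },
    terminal=["u2"],
)

trace = [rm.step(event) for event in ["door", "key", "door"]]
```

Opening the door before holding the key yields no reward, so the automaton state encodes which sub-task comes next; this is the kind of task decomposition the abstract attributes to the RM formalism.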

large language model, machine learning, reinforcement learning, (17 more...)

2510.14176

Country: North America > Canada (0.28)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

Autobidding Arena: unified evaluation of the classical and RL-based autobidding algorithms

Pudovikov, Andrey, Khirianova, Alexandra, Solodneva, Ekaterina, Katrutsa, Aleksandr, Samosvat, Egor, Dorn, Yuriy

Advertisement auctions play a crucial role in revenue generation for e-commerce companies. To make the bidding procedure scalable to thousands of auctions, automatic bidding (autobidding) algorithms are actively developed in the industry. Therefore, fair and reproducible evaluation of autobidding algorithms is an important problem. We present a standardized and transparent evaluation protocol for comparing classical and reinforcement learning (RL) autobidding algorithms. We consider the most efficient autobidding algorithms from different classes, e.g., ones based on controllers, RL, optimal formulas, etc., and benchmark them in the bidding environment. We utilize the most recent open-source environment developed in the industry, which accurately emulates the bidding process. Our work demonstrates the most promising use cases for the considered autobidding algorithms, highlights their surprising drawbacks, and evaluates them according to multiple metrics. We select evaluation metrics that illustrate the performance of the autobidding algorithms and the corresponding costs, and that track budget pacing. Such a choice of metrics makes our results applicable to a broad range of platforms where autobidding is effective. The presented comparison results help practitioners evaluate candidate autobidding algorithms from different perspectives and select ones that are efficient according to their companies' targets.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2510.19357

Country:

Asia (0.28)
North America (0.28)
Europe > Russia (0.15)

Genre: Research Report > New Finding (0.87)

Industry:

Marketing (1.00)
Information Technology > Services (0.90)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)