Goto

Collaborating Authors

 Reinforcement Learning


Increasing Interaction Fidelity: Training Routines for Biomechanical Models in HCI

arXiv.org Artificial Intelligence

Biomechanical forward simulation holds great potential for HCI, enabling the generation of human-like movements in interactive tasks. However, training biomechanical models with reinforcement learning is challenging, particularly for precise and dexterous movements like those required for touchscreen interactions on mobile devices. Current approaches are limited in their interaction fidelity, require restricting the underlying biomechanical model to reduce complexity, and do not generalize well. In this work, we propose practical improvements to training routines that reduce training time, increase interaction fidelity beyond existing methods, and enable the use of more complex biomechanical models. Using a touchscreen pointing task, we demonstrate that curriculum learning, action masking, more complex network configurations, and simple adjustments to the simulation environment can significantly improve the agent's ability to learn accurate touch behavior. Our work provides HCI researchers with practical tips and training routines for developing better biomechanical models of human-like interaction fidelity.


Compute-Optimal Scaling for Value-Based Deep RL

arXiv.org Artificial Intelligence

As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio. Given a fixed compute budget, we ask: how should resources be partitioned across these axes to maximize sample efficiency? Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch size at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning.


Expert-Guided Diffusion Planner for Auto-Bidding

arXiv.org Artificial Intelligence

Auto-bidding is widely used in advertising systems, serving a diverse range of advertisers. Generative bidding is increasingly gaining traction due to its strong planning capabilities and generalizability. Unlike traditional reinforcement learning-based bidding, generative bidding does not depend on the Markov Decision Process (MDP), thereby exhibiting superior planning performance in long-horizon scenarios. Conditional diffusion modeling approaches have shown significant promise in the field of auto-bidding. However, relying solely on return as the optimality criterion is insufficient to guarantee the generation of truly optimal decision sequences, as it lacks personalized structural information. Moreover, the auto-regressive generation mechanism of diffusion models inherently introduces timeliness risks. To address these challenges, we introduce a novel conditional diffusion modeling approach that integrates expert trajectory guidance with a skip-step sampling strategy to improve generation efficiency. The efficacy of this method has been demonstrated through comprehensive offline experiments and further substantiated by statistically significant outcomes in online A/B testing, yielding an 11.29% increase in conversions and a 12.36% growth in revenue relative to the baseline.


How to craft a deep reinforcement learning policy for wind farm flow control

arXiv.org Artificial Intelligence

Within wind farms, wake effects between turbines can significantly reduce overall energy production. Wind farm flow control encompasses methods designed to mitigate these effects through coordinated turbine control. Wake steering, for example, consists in intentionally misaligning certain turbines with the wind to optimize airflow and increase power output. However, designing a robust wake steering controller remains challenging, and existing machine learning approaches are limited to quasi-static wind conditions or small wind farms. This work presents a new deep reinforcement learning methodology to develop a wake steering policy that overcomes these limitations. Our approach introduces a novel architecture that combines graph attention networks and multi-head self-attention blocks, alongside a novel reward function and training strategy. The resulting model computes the yaw angles of each turbine, optimizing energy production in time-varying wind conditions. An empirical study conducted on steady-state, low-fidelity simulation, shows that our model requires approximately 10 times fewer training steps than a fully connected neural network and achieves more robust performance compared to a strong optimization baseline, increasing energy production by up to 14 %. To the best of our knowledge, this is the first deep reinforcement learning-based wake steering controller to generalize effectively across any time-varying wind conditions in a low-fidelity, steady-state numerical simulation setting.


Self-Correcting Code Generation Using Small Language Models

arXiv.org Artificial Intelligence

Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on the MBPP and 27.7% on HumanEval compared to the baselines.


DeeP-Mod: Deep Dynamic Programming based Environment Modelling using Feature Extraction

arXiv.org Artificial Intelligence

The DeeP-Mod framework builds an environment model using features from a Deep Dynamic Programming Network (DDPN), trained via a Deep Q-Network (DQN). While Deep Q-Learning is effective in decision-making, state information is lost in deeper DQN layers due to mixed state-action representations. We address this by using Dynamic Programming (DP) to train a DDPN, where Value Iteration ensures the output represents state values, not state-action pairs. Extracting features from the DDPN preserves state information, enabling task and action set independence. We show that a reduced DDPN can be trained using features extracted from the original DDPN trained on an identical problem. This reduced DDPN achieves faster convergence under noise and outperforms the original DDPN. Finally, we introduce the DeeP-Mod framework, which creates an environment model using the evolution of features extracted from a DDPN in response to actions. A second DDPN, which learns directly from this feature model rather than raw states, can learn an effective feature-value representation and thus optimal policy. A key advantage of DeeP-Mod is that an externally defined environment model is not needed at any stage, making DDPN applicable to a wide range of environments.


Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning

arXiv.org Machine Learning

Transformers have emerged as a powerful class of sequence models with remarkable expressive capabilities. Originally popularized in the context of natural language processing, they leverage self-attention mechanisms to in-context learn new tasks without any parameter updates (Vaswani, 2017; Liu et al., 2021; Dosovitskiy, 2020; Yun et al., 2019; Dong et al., 2018). In other words, a large transformer model can be given a prompt consisting of example input-output pairs for an unseen task and subsequently produce correct outputs for new queries of that task, purely by processing the sequence of examples and queries (Lee et al., 2022; Laskin et al., 2022; Yang et al., 2023; Lin et al., 2024). This ability to dynamically adapt via context rather than gradient-based fine-tuning has spurred extensive interest in understanding the theoretical expressivity of transformers and how they might learn algorithms internally during training. Recent theoretical work has begun to analyze the various aspects of transformers.


Hierarchical Decision-Making for Autonomous Navigation: Integrating Deep Reinforcement Learning and Fuzzy Logic in Four-Wheel Independent Steering and Driving Systems

arXiv.org Artificial Intelligence

This paper presents a hierarchical decision-making framework for autonomous navigation in four-wheel independent steering and driving (4WISD) systems. The proposed approach integrates deep reinforcement learning (DRL) for high-level navigation with fuzzy logic for low-level control to ensure both task performance and physical feasibility. The DRL agent generates global motion commands, while the fuzzy logic controller enforces kinematic constraints to prevent mechanical strain and wheel slippage. Simulation experiments demonstrate that the proposed framework outperforms traditional navigation methods, offering enhanced training efficiency and stability and mitigating erratic behaviors compared to purely DRL-based solutions. Real-world validations further confirm the framework's ability to navigate safely and effectively in dynamic industrial settings. Overall, this work provides a scalable and reliable solution for deploying 4WISD mobile robots in complex, real-world scenarios.


Guiding Diffusion Models with Reinforcement Learning for Stable Molecule Generation

arXiv.org Artificial Intelligence

Generating physically realistic 3D molecular structures remains a core challenge in molecular generative modeling. While diffusion models equipped with equivariant neural networks have made progress in capturing molecular geometries, they often struggle to produce equilibrium structures that adhere to physical principles such as force field consistency. To bridge this gap, we propose Reinforcement Learning with Physical Feedback (RLPF), a novel framework that extends Denoising Diffusion Policy Optimization to 3D molecular generation. RLPF formulates the task as a Markov decision process and applies proximal policy optimization to fine-tune equivariant diffusion models. Crucially, RLPF introduces reward functions derived from force-field evaluations, providing direct physical feedback to guide the generation toward energetically stable and physically meaningful structures. Experiments on the QM9 and GEOM-drug datasets demonstrate that RLPF significantly improves molecular stability compared to existing methods. These results highlight the value of incorporating physics-based feedback into generative modeling. The code is available at: https://github.com/ZhijianZhou/RLPF/tree/verl_diffusion.


Integrated Noise and Safety Management in UAM via A Unified Reinforcement Learning Framework

arXiv.org Artificial Intelligence

Urban Air Mobility (UAM) envisions the widespread use of small aerial vehicles to transform transportation in dense urban environments. However, UAM faces critical operational challenges, particularly the balance between minimizing noise exposure and maintaining safe separation in low-altitude urban airspace, two objectives that are often addressed separately. We propose a reinforcement learning (RL)-based air traffic management system that integrates both noise and safety considerations within a unified, decentralized framework. Under this scalable air traffic coordination solution, agents operate in a structured, multi-layered airspace and learn altitude adjustment policies to jointly manage noise impact and separation constraints. The system demonstrates strong performance across both objectives and reveals tradeoffs among separation, noise exposure, and energy efficiency under high traffic density. The findings highlight the potential of RL and multi-objective coordination strategies in enhancing the safety, quietness, and efficiency of UAM operations.