macro action
Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization
Guo, Jian-Ting, Chen, Yu-Cheng, Hsieh, Ping-Chun, Ho, Kuo-Hao, Huang, Po-Wei, Wu, Ti-Rong, Wu, I-Chen
Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via a Vector-Quantized VAE. Experiments on the D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
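The core quantization step the abstract describes (mapping a chunk of demonstrated actions to a discrete macro action via a VQ-VAE codebook) can be sketched minimally as a nearest-codebook lookup. This is only the inference-time lookup, not the paper's full framework; the function name, shapes, and L2 metric are illustrative assumptions, and training the encoder/decoder and codebook (e.g., with a commitment loss) is omitted:

```python
import numpy as np

def quantize_action_chunk(chunk, codebook):
    """Map a flattened action chunk to its nearest codebook entry.

    chunk:    (k * action_dim,) flattened sequence of k low-level actions
    codebook: (num_codes, k * action_dim) learned macro-action embeddings

    Returns the index of the selected macro action and the quantized chunk.
    Hypothetical sketch of the VQ lookup step only.
    """
    dists = np.linalg.norm(codebook - chunk, axis=1)  # L2 distance to each code
    idx = int(np.argmin(dists))
    return idx, codebook[idx]
```

At execution time, the discrete index is the macro action the RL agent selects, and the corresponding codebook vector is decoded back into a short sequence of low-level actions.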
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology (0.93)
- Leisure & Entertainment > Games > Computer Games (0.67)
Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces
Ma, Haitong, Nabati, Ofir, Rosenberg, Aviv, Dai, Bo, Lang, Oran, Szpektor, Idan, Boutilier, Craig, Li, Na, Mannor, Shie, Shani, Lior, Tenneholtz, Guy
Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.
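The abstract's policy mirror descent (PMD) target, the regularized distribution the diffusion model is trained to match, has a standard closed form over a discrete action set: the new policy is proportional to the old policy reweighted by exponentiated action values. A minimal sketch under that standard formulation (the step size `eta` and the single-state, tabular setting are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def pmd_target(policy, q_values, eta=1.0):
    """Policy mirror descent target over a discrete action set.

    policy:   (num_actions,) current policy probabilities pi_k(.|s)
    q_values: (num_actions,) action-value estimates Q(s, .)
    eta:      step size; larger eta moves further toward the greedy policy

    Returns pi_{k+1}(.|s) proportional to pi_k * exp(eta * Q), the stable
    target an expressive policy class can be trained to replicate.
    """
    logits = np.log(policy) + eta * q_values
    logits -= logits.max()              # subtract max for numerical stability
    target = np.exp(logits)
    return target / target.sum()
```

With `eta=0` the target is just the current policy; as `eta` grows it concentrates on the highest-value actions, which is what makes the distributional-matching update a controlled, regularized improvement step.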
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Asia > Macao (0.04)
- Health & Medicine (1.00)
- Leisure & Entertainment > Games (0.68)
- Education > Educational Setting > Online (0.54)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Robot Operation of Home Appliances by Reading User Manuals
Zhang, Jian, Zhang, Hanbo, Xiao, Anxing, Hsu, David
Operating home appliances, among the most common tools in every household, is a critical capability for assistive home robots. This paper presents ApBot, a robot system that operates novel household appliances by "reading" their user manuals. ApBot faces multiple challenges: (i) inferring goal-conditioned partial policies from the unstructured, textual descriptions in a user manual document, (ii) grounding the policies to the appliance in the physical world, and (iii) executing the policies reliably over potentially many steps, despite compounding errors. To tackle these challenges, ApBot constructs a structured, symbolic model of an appliance from its manual, with the help of a large vision-language model (VLM). It grounds the symbolic actions visually to control panel elements. Finally, ApBot closes the loop by updating the model based on visual feedback. Our experiments show that across a wide range of simulated and real-world appliances, ApBot achieves consistent and statistically significant improvements in task success rate, compared with state-of-the-art large VLMs used directly as control policies. These results suggest that a structured internal representation plays an important role in robust robot operation of home appliances, especially complex ones.
- Workflow (1.00)
- Research Report > New Finding (0.87)
Deep Reinforcement Learning Based Navigation with Macro Actions and Topological Maps
Hakenes, Simon, Glasmachers, Tobias
This paper addresses the challenge of navigation in large, visually complex environments with sparse rewards. We propose a method that uses object-oriented macro actions grounded in a topological map, allowing a simple Deep Q-Network (DQN) to learn effective navigation policies. The agent builds a map by detecting objects from RGBD input and selecting discrete macro actions that correspond to navigating to these objects. This abstraction drastically reduces the complexity of the underlying reinforcement learning problem and enables generalization to unseen environments. We evaluate our approach in a photorealistic 3D simulation and show that it significantly outperforms a random baseline under both immediate and terminal reward conditions. Our results demonstrate that topological structure and macro-level abstraction can enable sample-efficient learning even from pixel data.
- North America > United States > California (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Germany (0.04)
- Leisure & Entertainment (0.68)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
Luo, Baiting, Pettet, Ava, Laszka, Aron, Dubey, Abhishek, Mukhopadhyay, Ayan
If we were to apply MCTS directly to this abstracted space, we would encounter two main issues: inefficient utilization of our pre-built search space, with the search potentially diverging prematurely into unexplored regions, and difficulty in building sufficiently deep trees for high-quality long-term decision-making, particularly in areas of high stochasticity or uncertainty (Couëtoux et al., 2011). Therefore, we use progressive widening to extend MCTS so that the search tree expands incrementally. It balances the exploration of new states against the exploitation of already visited states using two hyperparameters: α ∈ [0, 1] and ϵ ∈ ℝ⁺. Let |C(s, z)| denote the number of children of the state-action pair (s, z). The key idea is to alternate between adding new child nodes and selecting among existing child nodes, depending on the number of times the state-action pair (s, z) has been visited. A new state is added to the tree if |C(s, z)| < ϵ · N(s, z)^α, where N(s, z) is the number of times the state-action pair has been visited. The hyperparameter α controls the propensity to select among existing children: α = 0 leads to always selecting among existing children, while α = 1 recovers vanilla MCTS behavior (always adding a new child). In this way, our approach efficiently utilizes the pre-built search space, prioritizing the exploration of promising macro actions while allowing incremental expansion of the search tree. This technique enables our method to make quick decisions in an anytime manner, leveraging cached information, and to further refine the planning tree when additional time is available.
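The widening rule above can be sketched in a few lines. This is a minimal, hypothetical node implementation, not the paper's code: `sample_next_state` stands in for the stochastic transition model, and the uniform choice among existing children is a simplifying assumption (a real solver would select by value estimates):

```python
import random
from collections import defaultdict

class PWNode:
    """MCTS node with progressive widening over macro actions.

    A new child state is added while |C(s, z)| < eps * N(s, z)**alpha,
    matching the widening rule described in the text.
    """
    def __init__(self, alpha=0.5, eps=1.0):
        self.alpha = alpha
        self.eps = eps
        self.children = defaultdict(list)  # macro action z -> child states C(s, z)
        self.visits = defaultdict(int)     # macro action z -> visit count N(s, z)

    def select_child(self, z, sample_next_state):
        """Return a child state for macro action z, widening if the rule allows."""
        self.visits[z] += 1
        n, kids = self.visits[z], self.children[z]
        if len(kids) < self.eps * n ** self.alpha:
            kids.append(sample_next_state())  # expand: sample a new child state
        return random.choice(kids)            # otherwise reuse an existing child
```

Setting `alpha=0` caps each action at a single child (always exploit existing children), while `alpha=1` with `eps=1` adds a new child on every visit, i.e., vanilla MCTS behavior, exactly as the text describes.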
- North America > United States > California > Los Angeles County > Long Beach (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (15 more...)
- Leisure & Entertainment > Games (0.67)
- Health & Medicine > Therapeutic Area > Immunology (0.48)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.30)
Reviews: Strategic Attentive Writer for Learning Macro-Actions
Summary of Recommendation: The paper introduces an original idea. Committing to a plan has been introduced before in RL, e.g., in Sutton's options literature (where no learning occurs), and Schmidhuber's hierarchical RL systems of the early 1990s, and Wiering's HQ learning, but the new approach is different. However, the formalisation and experimental section seem to lack clarity and raise several questions. In particular, the experiments don't show very convincingly that the attentional mechanism is needed (although it seems like a very nice idea) and the actual behaviour of the attention is not explored at all. I don't see this as a fatal flaw, but this is definitely problematic since the title and main thrust of the paper rely on it.
Multi-agent reinforcement learning strategy to maximize the lifetime of Wireless Rechargeable Sensor Networks
The thesis proposes a generalized charging framework for multiple mobile chargers (MCs) to maximize the network lifetime and ensure target coverage and connectivity in large-scale Wireless Rechargeable Sensor Networks (WRSNs). Moreover, a multi-point charging model is leveraged to enhance charging efficiency, where an MC can charge multiple sensors simultaneously at each charging location. The thesis proposes an effective Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP) model that promotes MC cooperation and detects optimal charging locations based on real-time network information. Furthermore, the proposal allows reinforcement learning algorithms to be applied to different networks without requiring extensive retraining. To solve the Dec-POSMDP model, the thesis proposes an Asynchronous Multi-Agent Reinforcement Learning algorithm (AMAPPO) based on the Proximal Policy Optimization (PPO) algorithm.
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- North America > United States > California > Yolo County > Davis (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Overview (1.00)
- Research Report > Promising Solution (0.45)
- Information Technology (1.00)
- Health & Medicine (1.00)
- Energy (1.00)
- Leisure & Entertainment > Games (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Scaling Long-Horizon Online POMDP Planning via Rapid State Space Sampling
Liang, Yuanchu, Kim, Edward, Thomason, Wil, Kingston, Zachary, Kurniawati, Hanna, Kavraki, Lydia E.
Partially Observable Markov Decision Processes (POMDPs) are a general and principled framework for motion planning under uncertainty. Despite tremendous improvement in the scalability of POMDP solvers, long-horizon POMDPs (e.g., $\geq15$ steps) remain difficult to solve. This paper proposes a new approximate online POMDP solver, called Reference-Based Online POMDP Planning via Rapid State Space Sampling (ROP-RaS3). ROP-RaS3 uses novel extremely fast sampling-based motion planning techniques to sample the state space and generate a diverse set of macro actions online, which are then used to bias belief-space sampling and infer high-quality policies without requiring exhaustive enumeration of the action space -- a fundamental constraint for modern online POMDP solvers. ROP-RaS3 is evaluated on various long-horizon POMDPs, including a problem with a planning horizon of more than 100 steps and a problem with a 15-dimensional state space that requires more than 20 lookahead steps. In all of these problems, ROP-RaS3 substantially outperforms other state-of-the-art methods, in some cases severalfold.
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- North America > United States > Texas > Harris County > Houston (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
Chai, Yekun, Sun, Haoran, Fang, Huang, Wang, Shuohuan, Sun, Yu, Wu, Hua
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at https://github.com/ernie-research/MA-RLHF .
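The abstract's core idea, treating "sequences of tokens" as single macro actions to shorten the credit-assignment horizon, can be illustrated with the simplest possible instantiation: fixed-length chunking. The paper also mentions higher-level language constructs as macro actions, so this fixed-span grouping is only one assumed strategy, and the function name is illustrative:

```python
def to_macro_actions(tokens, span=3):
    """Group a token sequence into fixed-length macro actions.

    Shortening the action sequence from len(tokens) decisions to
    ceil(len(tokens) / span) decisions reduces the temporal distance
    between each action and the delayed (often terminal) reward,
    which is the credit-assignment benefit described above.
    """
    return [tuple(tokens[i:i + span]) for i in range(0, len(tokens), span)]
```

Policy-gradient updates would then assign advantages per macro action rather than per token, at no extra cost during training or inference since the underlying token generation is unchanged.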
- Europe > Italy (0.14)
- Europe > Austria > Vienna (0.14)
- North America > Mexico (0.04)
- (11 more...)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)
Deep Reinforcement Learning for Decentralized Multi-Robot Exploration With Macro Actions
Tan, Aaron Hao, Bejarano, Federico Pizarro, Zhu, Yuhan, Ren, Richard, Nejat, Goldie
Cooperative multi-robot teams need to be able to explore cluttered and unstructured environments while dealing with communication dropouts that prevent them from exchanging local information to maintain team coordination. Therefore, robots need to consider high-level teammate intentions during action selection. In this letter, we present the first Macro Action Decentralized Exploration Network (MADE-Net), which uses multi-agent deep reinforcement learning (DRL) to address the challenges of communication dropouts during multi-robot exploration in unseen, unstructured, and cluttered environments. In simulated robot team exploration experiments, MADE-Net outperformed classical and DRL benchmark methods in terms of computation time, total travel distance, number of local interactions between robots, and exploration rate across various degrees of communication dropouts. A scalability study in 3D environments showed a decrease in exploration time with MADE-Net as team and environment sizes increased. The experiments presented highlight the effectiveness and robustness of our method.
- Leisure & Entertainment > Games (0.55)
- Consumer Products & Services > Travel (0.54)